Chapter 3. Technical SEO

Index Your Website – How To Fix Page Indexing Issues


Ensuring your website is fully visible to search engines is essential for driving traffic and improving your online presence. In this learning section, we'll share tips to help you index your website properly and swiftly; we'll also explain how to fix page indexing issues when you encounter them.

What is an index?

In SEO, the index refers to the database of web pages maintained by a search engine.

When a page is "indexed", it means that the search engine has visited the page, analyzed its content, and stored it in its database. Being indexed is crucial for a web page because it is the first step toward appearing in search engine results pages (SERPs) when users search for related terms.

Search engines like Google use crawlers or bots to discover new pages and updates to existing pages, which are then analyzed for content, relevance, and quality before being added to the index. If a page is not indexed, it cannot be found through search engine queries, making indexing a fundamental aspect of SEO strategies to improve visibility and drive traffic to a website.

5 key things to know to index your website properly

Search engines follow specific rules when indexing your website and pages, so you need to know those rules. Here, we'll explain five key points that can help you get your website indexed.

Google Search Central provides a comprehensive guide about indexing pages. Please refer to Overview of crawling and indexing issues.

Implement sitemap.xml

A sitemap.xml file lists all important pages of your site to ensure Google can discover and crawl them. Submit your sitemap through Google Search Console to help Google index your content efficiently. The way to create a sitemap.xml file differs depending on your website's technology platform. If you use WordPress, the easiest way is to use a plugin such as Yoast SEO.

You can also create sitemap.xml manually using a text editor. When you make the sitemap file yourself, you must ensure it is placed in the website's root directory.

This is an example of a sitemap.xml file.

Sitemap.xml Example
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2024-02-20</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.example.com/about</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  :
</urlset>

Once the sitemap file is in place, Google needs to recognize it. You can check the sitemap status in Google Search Console by selecting Sitemaps on the left sidebar.

Google Search Console UI - Sitemaps

If Google has not yet recognized your sitemap, submit its URL through Search Console.

Google Search Console UI - Add a new sitemap

A sitemap.xml file is publicly accessible, so you can check whether a sitemap is live by visiting its URL. For example, this is Apple's sitemap.xml.

Apple's sitemap ( https://www.apple.com/sitemap.xml )
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.apple.com/</loc>
  </url>
  <url>
    <loc>https://www.apple.com/accessibility/</loc>
  </url>
  <url>
    <loc>https://www.apple.com/accessibility/assistive-technologies/</loc>
  </url>
  :
</urlset>

Sitemap.xml has size limitations: one sitemap file must be no larger than 50MB (uncompressed) and include no more than 50,000 URLs. If your site exceeds these limits, split the sitemap into multiple files and create a parent sitemap index file that references the child sitemaps.
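
Below is a minimal sketch of a parent sitemap index file; the child sitemap file names are illustrative placeholders.

Sitemap Index Example
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-pages.xml</loc>
    <lastmod>2024-02-20</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-posts.xml</loc>
    <lastmod>2024-02-20</lastmod>
  </sitemap>
</sitemapindex>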

Avoid misuse of the noindex tag (robots directive in the meta tag)

To prevent search engines from indexing a webpage, you can use the noindex meta tag. This tag tells search engine crawlers that the page should not be added to their index, so it won't appear in search results. Below is an example of the noindex tag. It is often combined with the nofollow directive, which instructs search engines not to follow the links on the page or pass authority through them.

Noindex and Nofollow Example
<meta name="robots" content="noindex, nofollow">

You should use the noindex tag only on the pages you don't want to show in search results, such as duplicate pages, private pages, or temporary content.

If you mistakenly add the noindex tag to pages you want indexed, those pages won't be indexed.

Also, the noindex tag doesn't address security concerns. The page remains publicly accessible, and visitors can still reach it through hyperlinks. If you want to prevent pages from being publicly accessible, you need another mechanism, such as requiring a user login.

Removal from Index

If a page was previously indexed and you've added a noindex tag, it may take some time for search engines to revisit the page, see the noindex directive, and remove it from their index.

If you want to remove pages from the index quickly, you can request removal through Search Console. Select Removals on the left sidebar in Search Console.

Removals Menu on Search Console

Press the NEW REQUEST button and submit a URL for removal.

New Request for Removals

Use robots.txt accurately

The robots.txt file provides instructions to search engine bots. Using the robots.txt file, you can disallow crawlers from crawling particular pages. You can also tell crawlers the location of your sitemap.xml files in the robots.txt file.

You can also check the robots.txt file online using a URL like example.com/robots.txt.

This is an example of a robots.txt file.

Robots.txt Example ( https://www.apple.com/robots.txt )
# robots.txt for http://www.apple.com/
User-agent: *
Disallow: /*/includes/*
Disallow: /*retail/availability*
Disallow: /*retail/availabilitySearch* 
:
Sitemap: https://www.apple.com/shop/sitemap.xml
Sitemap: https://www.apple.com/autopush/sitemap/sitemap-index.xml
Sitemap: https://www.apple.com/newsroom/sitemap.xml
Sitemap: https://www.apple.com/retail/sitemap/sitemap.xml
Sitemap: https://www.apple.com/today/sitemap.xml

Like the sitemap.xml file, the robots.txt file must be placed in the website's root directory.

To promote better indexing, ensure your robots.txt file doesn't inadvertently block access to important pages you want indexed.
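
For example, the difference between blocking a single directory and blocking the entire site is one path character; the directory names below are illustrative placeholders.

Robots.txt Pitfall Example
# Too broad: this blocks the entire site for all crawlers
User-agent: *
Disallow: /

# Intended: block only specific non-public directories
User-agent: *
Disallow: /admin/
Disallow: /tmp/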

Implement canonical tags (link tag)

Content duplication can be penalized by search engines. When you have similar or duplicate content across multiple URLs, use the canonical tag to specify which version of the page you want to be considered as the authoritative (canonical) one. This helps prevent duplicate content issues.

This is an example of the canonical tag.

Canonical tag example
<link rel="canonical" href="https://example.com/example-page">

There are several checkpoints when you implement the canonical tag.

  1. Use canonical URLs consistently: Ensure you use the same protocol (http vs. https) and subdomain (www vs. non-www) because search engines treat different protocols and subdomains as different URLs.
  2. Use absolute URLs: Always use the absolute URL (full path) in the href attribute of the canonical link, not a relative URL. This avoids confusing search engines.
  3. Use canonical URLs in sitemap.xml: Include only the canonical URLs of your content in your XML sitemap.

Setting the canonical tag on all indexed pages is not mandatory, but it is recommended because it helps prevent duplicate content issues and consolidates link signals for similar or identical content.

The canonical tag is also related to redirection settings and hreflang tags. For redirection settings, check the material below. For the hreflang tag, check Geographical SEO: Local SEO and International SEO.

Manage redirections correctly

As a website evolves, some pages may need to change their URLs. If web pages have been indexed for a while, they may already have good backlinks that add value (link juice) to the pages. If a page has been permanently moved, set a 301 or 308 redirect to avoid losing this accumulated value.

The method of setting redirections varies depending on the technology platform. Most platforms use 301 redirection, but some may offer 308 redirection settings.

Permanent Redirection - 301 and 308 Redirect
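
On an Apache server, for example, permanent redirects can be configured in the .htaccess file. This is a minimal sketch under that assumption; the paths are illustrative placeholders.

Apache .htaccess Redirect Example
# Permanently redirect a single moved page (301)
Redirect 301 /old-page https://www.example.com/new-page

# Permanently redirect a renamed directory with mod_rewrite
RewriteEngine On
RewriteRule ^old-blog/(.*)$ https://www.example.com/blog/$1 [R=301,L]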

How to fix page indexing issues?

When facing indexing issues, you should make full use of Search Console, which provides several features for diagnosing and fixing them.

Check indexing status

First, check your indexing status in Search Console. Click Pages under the Indexing section on the left sidebar. Search Console lists the reasons under "Why pages aren't indexed".

There are several issues that might be raised here, such as:

  • Page with redirect
  • Excluded by ‘noindex’ tag
  • Alternate page with proper canonical tag
  • Blocked by robots.txt
  • Crawled - currently not indexed
  • Not found (404)
  • Discovered - currently not indexed
Google Search Console Indexing Issues

When you click one of the issues, you can see how many URLs are facing the same issue.

Page Indexing Issue Example - Alternate Page with Proper Canonical Tag

Search Console Messages and Validate Fix

When Search Console detects issues, it notifies you by email and within the platform. You can check the history of messages by clicking the alert icon on the top left.

Accessing Google Search Console Messages - Alert Icon
Google Search Console Messages

Open one of the messages and click the Open indexing report button. You'll then be able to access the issue report page explained above.

Once you have solved the issue, press the VALIDATE FIX button on the corresponding issue page. Google will then start validating whether the problem is fixed.

Page Indexing Issue Example - Excluded by 'noindex' tag

URL inspection and request indexing

Even if no issue is detected, your pages may not be indexed if they are relatively new. It usually takes a few weeks for new pages to be indexed. This is because Google's crawlers have a crawl budget (the number of pages Google crawls on each website per day), so it takes time for them to crawl all the pages on your website.

You can also make requests through the Search Console to ensure crawlers crawl your pages. In the URL Inspection box, type the page URL.

URL Inspection UI Example

Search Console provides detailed information on the page's indexing status. If the page has not been indexed yet, press the REQUEST INDEXING link.

Request Indexing UI Example

Indexing requests also have a daily quota. If you have many pages to submit for indexing, spread the requests across multiple days.

Indexing API to accelerate Indexing

To accelerate indexing for many web pages, consider using the Indexing API, which directly notifies the search engine about new or updated web pages for faster indexing.

Implementing the Indexing API requires technical knowledge; if you are not an engineer, you need to work with someone who can handle the API. Google Search Central provides a detailed explanation of the Indexing API: Google Search Central Indexing API.

Here are the key steps to implement the Indexing API, followed by a minimal code sketch.

  1. Create a project on GCP (Google Cloud Platform)
  2. Create a service account on GCP (Google Cloud Platform) and download a private key (JSON)
  3. Add the service account as a site owner through Google Search Console using the service account ID (e.g., my-service-account@project-name.iam.gserviceaccount.com)
  4. Get an access token (set up OAuth token using libraries)
  5. Use the Indexing API - Send a list of URLs for indexing requests with type (URL_UPDATED)
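
Below is a minimal Python sketch of steps 4 and 5, assuming the google-auth and google-api-python-client libraries. The key file path and the URLs are illustrative placeholders.

Indexing API Example (Python)
# pip install google-auth google-api-python-client
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/indexing"]

# Step 4: obtain credentials from the service account's private key (JSON).
# "service-account.json" is a placeholder path.
credentials = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES
)
service = build("indexing", "v3", credentials=credentials)

# Step 5: notify Google about new or updated URLs (placeholder URLs).
urls = [
    "https://www.example.com/page-1",
    "https://www.example.com/page-2",
]

for url in urls:
    body = {"url": url, "type": "URL_UPDATED"}
    response = service.urlNotifications().publish(body=body).execute()
    print(response)  # notification metadata on success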

The Indexing API sets the default daily per-project quota at 200. If your website has more than 200 pages for indexing, you need to run the Indexing API code multiple times on different days.

Below is an example of the results achieved with the Indexing API.

In this case, indexing was initially slow (less than 30% of pages were indexed) until the Indexing API was executed. After running the Indexing API code twice, the site had more than 99% of its 443 pages indexed.

Result of Indexing API - Quickly Indexed Pages

Conclusion

Mastering the techniques of indexing your website and resolving page indexing issues is critical for enhancing your site's search engine visibility.

Implementing a sitemap.xml file, judiciously using the noindex tag, accurately configuring robots.txt, applying canonical tags, and managing redirections are essential. These actions ensure your content is discoverable, avoid duplicate content issues, and optimize your website for search engines.

Additionally, leveraging tools like Google Search Console for troubleshooting and employing the Indexing API for large-scale indexing can significantly improve your website's indexing speed.

Ultimately, these strategies facilitate better indexing and bolster your online presence, driving more traffic to your website.

