An XML sitemap is a simple list of URLs that is provided to search engines. It is a key tool that can help you identify indexing problems.
This article will show you how you can easily and quickly identify indexing problems using a good sitemap structure as well as how you can use OnPage.org Zoom to determine the causes of these problems.
In a Google Webmaster Central Hangout, John Mueller asserted that deep website structures are only a problem for websites with more than 10,000 URLs. Based on this statement, it would seem that there is no need to create an XML sitemap for small websites since Google can fully index such websites without problems.
However, SEOs should always provide one or more XML sitemaps for the different website sections regardless of the size of the website. This helps you to easily identify areas where indexing problems could arise and implement appropriate measures in due time.
The structure of the XML sitemap does not have any effect on the indexing. Nevertheless, a well-designed sitemap helps you to easily and quickly identify weak points in your website structure. This enables you to easily limit indexing problems to specific website sections.
Segment your sitemaps by page type, section, or product so that you can easily associate indexing problems with specific areas.
Do not exceed the maximum size of a sitemap (50,000 URLs or 50 MB uncompressed). Distribute the URLs across several sitemaps instead.
Create one sitemap index file if you have many sitemaps.
Keep the number of URLs per sitemap low in order to better identify indexing problems.
Only list URLs that are relevant to the index (status code "200 OK", meta robots: index, and either no canonical tag or a self-referencing canonical tag).
Only use URLs that are not blocked in the robots.txt file.
Create a separate sitemap or sitemap structure for each domain or subdomain.
Do not add extra information about URLs to the sitemap. Instead, provide it directly on the page itself, e.g., hreflang annotations or the last-modified date.
Only use absolute URLs.
Try to avoid using GET parameters. Tracking and filter parameters should not be in the sitemap.
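The eligibility rules in this list can be approximated in a few lines of Python. The following is a minimal sketch and not part of any tool mentioned in this article: the `Page` record, its field names, and the example URLs are assumptions made for illustration, and the robots.txt rules are supplied inline instead of being downloaded.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


@dataclass
class Page:
    """Hypothetical crawl record; the field names are assumptions."""
    url: str
    status: int
    meta_robots: str = "index, follow"
    canonical: Optional[str] = None  # None means the page has no canonical tag


def belongs_in_sitemap(page: Page, robots: RobotFileParser) -> bool:
    parts = urlsplit(page.url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return False  # only absolute URLs
    if parts.query:
        return False  # no GET parameters (tracking, filters, ...)
    if page.status != 200:
        return False  # only "200 OK" content
    if "noindex" in page.meta_robots.lower():
        return False  # meta robots must allow indexing
    # Simplification: exact string comparison, no URL normalization.
    if page.canonical is not None and page.canonical != page.url:
        return False  # canonical tag points to a different URL
    return robots.can_fetch("*", page.url)  # not blocked in robots.txt


# Example robots.txt rules, passed in as lines for the sake of the sketch.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /cart/"])

pages = [
    Page("https://example.com/shoes/", 200),
    Page("https://example.com/shoes/?sort=price", 200),
    Page("https://example.com/old/", 301),
    Page("https://example.com/cart/checkout", 200),
    Page("https://example.com/dupe/", 200, canonical="https://example.com/shoes/"),
    Page("https://example.com/private/", 200, meta_robots="noindex, follow"),
    Page("/relative/", 200),
]
eligible = [p.url for p in pages if belongs_in_sitemap(p, robots)]
```

In this example, only `https://example.com/shoes/` passes every check and ends up in `eligible`.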
A blog lives and grows through new posts. The last 10 posts are usually accessible from the homepage. Every new post pushes the oldest one to the next page. This happens not only on the homepage but also on category and tag pages. As a result, the latest article is always at the top, whereas old articles are gradually pushed to the back. Without appropriate internal linking, the click path to older articles therefore grows longer with each new post.
In order to identify indexing problems caused by internal linking, it is advisable to create a new sitemap for blog posts alone each month, e.g., sitemap-post-MMYY.xml.
If you publish static pages at regular intervals, you can do the same for them.
The sitemap can have the following name: sitemap-page-MMYY.xml
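The monthly naming scheme can be generated from the publication date. The sketch below assumes that each post is available as a (URL, publication date) pair; the function names and sample posts are hypothetical.

```python
from collections import defaultdict
from datetime import date


def monthly_sitemap_name(kind: str, published: date) -> str:
    """Build a name such as sitemap-post-MMYY.xml from a publication date."""
    return f"sitemap-{kind}-{published:%m%y}.xml"


def group_by_month(kind, items):
    """Map each monthly sitemap name to the URLs published in that month."""
    groups = defaultdict(list)
    for url, published in items:
        groups[monthly_sitemap_name(kind, published)].append(url)
    return dict(groups)


# Made-up sample posts for illustration.
posts = [
    ("https://example.com/blog/first-post/", date(2016, 7, 4)),
    ("https://example.com/blog/second-post/", date(2016, 8, 2)),
    ("https://example.com/blog/third-post/", date(2016, 8, 18)),
]
groups = group_by_month("post", posts)
```

Here the two August posts are assigned to `sitemap-post-0816.xml` and the July post to `sitemap-post-0716.xml`. Passing `"page"` instead of `"post"` gives the names for static pages.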
Online shops typically have a large number of products. Product descriptions often come from data feeds that supply product images and detailed texts. As a result, many online shops use the same product descriptions, which in turn leads to duplicate content. Pages that do not have unique content are not relevant for the index and should therefore be left out of the XML sitemap.
On the other hand, you should always create an XML sitemap for your unique pages that have individualized product information. It is advisable to cluster URLs based on different categories and then classify them in corresponding sitemaps. Examples of categories that could be suitable for online shops include:
Theme/category such as trousers, shoes, etc.
Brand such as Adidas, Puma, Nike, etc.
Page type or template: category page, landing page, etc.
Creation date based on day, week, month, or year
Top sellers
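As a rough illustration of such clustering, the following sketch groups URLs by their first path segment (standing in for a theme or category), writes one sitemap per cluster, splits clusters that exceed a maximum number of URLs, and adds a sitemap index. The file naming scheme, the helper names, and the example URLs are assumptions; a real shop would more likely cluster by brand, template, or sales data taken from its own database.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from urllib.parse import urlsplit

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # upper limit per sitemap file; a lower value eases analysis


def cluster_by_first_segment(urls):
    """Group URLs by first path segment, e.g. /shoes/... -> 'shoes'."""
    clusters = defaultdict(list)
    for url in urls:
        segments = [s for s in urlsplit(url).path.split("/") if s]
        clusters[segments[0] if segments else "root"].append(url)
    return dict(clusters)


def build_sitemaps(clusters, base_url, max_urls=MAX_URLS):
    """Return {filename: xml_string} for all sitemaps plus a sitemap index."""
    files = {}
    for name, urls in sorted(clusters.items()):
        # Split each cluster into parts of at most max_urls URLs.
        for part, start in enumerate(range(0, len(urls), max_urls), start=1):
            urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
            for url in urls[start:start + max_urls]:
                ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
            files[f"sitemap-{name}-{part}.xml"] = ET.tostring(urlset, encoding="unicode")
    # The index lists the absolute URL of every sitemap file created above.
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for filename in files:
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = base_url + filename
    files["sitemap-index.xml"] = ET.tostring(index, encoding="unicode")
    return files


# Made-up shop URLs; max_urls=2 only to demonstrate the splitting.
shop_urls = [
    "https://example.com/shoes/runner-1",
    "https://example.com/shoes/runner-2",
    "https://example.com/shoes/runner-3",
    "https://example.com/trousers/chino-1",
]
files = build_sitemaps(cluster_by_first_segment(shop_urls), "https://example.com/", max_urls=2)
```

With these inputs, the "shoes" cluster is split into two files and "trousers" gets one, so the index references three sitemaps.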
The additional effort you put into clustering URLs in detail and creating a complex sitemap structure pays off, at the latest, when you submit the sitemaps to the search engine.
If you add a sitemap in the Google Search Console, you soon receive information about the indexing rate of the URLs in the sitemap.
Figure 1: Number of submitted sitemaps and indexed URLs per sitemap in the Google Search Console
The more transparent the sitemap structure is, the easier it is to limit indexing problems to the corresponding website sections. If you classify the URLs on your website, e.g., based on directories, the Google Search Console immediately shows you the directories whose indexing took relatively long. You can therefore look into the causes of such indexing problems with a targeted and more effective approach if you already know the areas that are affected.
Tip: It is advisable to visualize the indexing rate using a radar chart in Excel.
Figure 2: Number of indexed URLs / submitted URLs = Indexing rate per sitemap
The goal should be to have all the submitted URLs indexed by Google. This would mean a 100% indexing rate. Here, a low indexing rate can have different causes.
A 100% indexing rate is only possible if the sitemap contains nothing but valid and indexable content. The better you maintain your XML sitemap, the higher the indexing rate will be.
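If you copy the submitted and indexed counts per sitemap out of the Google Search Console by hand, the indexing rate can also be computed and ranked with a short script. The sketch below assumes such a manually assembled dictionary; the sitemap names and figures are invented for illustration.

```python
def indexing_rates(report):
    """report maps a sitemap name to a (submitted, indexed) pair of URL counts."""
    rates = {}
    for name, (submitted, indexed) in report.items():
        # Indexing rate = indexed URLs / submitted URLs.
        rates[name] = indexed / submitted if submitted else 0.0
    return rates


def weakest_first(rates):
    """Sort sitemaps by indexing rate so that problem areas come first."""
    return sorted(rates.items(), key=lambda item: item[1])


# Invented sample figures, not real Search Console data.
report = {
    "sitemap-shoes-1.xml": (200, 150),
    "sitemap-trousers-1.xml": (100, 100),
    "sitemap-post-0816.xml": (40, 10),
}
ranking = weakest_first(indexing_rates(report))
```

In this example, `sitemap-post-0816.xml` comes first in `ranking` with a rate of 0.25, which marks it as the section to investigate first.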
OnPage.org Zoom can help you to easily identify common errors in the XML sitemap.
1. The content does not return the "200 OK" status code
A well-maintained XML sitemap provides search engines with a list of valid URLs that are relevant for the index. You should ensure that all content specified in the sitemap is always accessible. The "Status Codes" report in OnPage.org Zoom allows you to check the accessibility of the content specified in your XML sitemap.
You can easily analyze faulty or redirected pages using convenient filters. Simply click on the sections that are highlighted in yellow or red. This will display a table showing the corresponding URLs. In addition, you will also find information relating to the sitemap file with the URL.
Figure 3: Analyze the status codes returned by the content specified in the XML sitemap
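Independently of any reporting tool, the same kind of check can be sketched with Python's standard library. The code below is an assumption-laden illustration, not how OnPage.org Zoom works: it reads the `<loc>` entries of a sitemap, sends a HEAD request to each URL without following redirects, and collects every URL that does not answer with 200. The helper names are hypothetical, some servers handle HEAD differently from GET, and connection errors are not caught here. The example run uses a stub instead of real requests.

```python
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Extract all <loc> values from the <url> entries of a sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # do not follow; the 3xx then surfaces as an HTTPError


def fetch_status(url):
    """Return the HTTP status code of a HEAD request to url."""
    opener = urllib.request.build_opener(NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 3xx, 4xx, and 5xx responses end up here


def non_200(xml_text, fetch=fetch_status):
    """Map every sitemap URL that does not return 200 to its status code."""
    problems = {}
    for url in sitemap_urls(xml_text):
        code = fetch(url)
        if code != 200:
            problems[url] = code
    return problems


# Offline example: a stub lookup replaces real HTTP requests.
sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/a</loc></url>"
    "<url><loc>https://example.com/b</loc></url>"
    "</urlset>"
)
stub = {"https://example.com/a": 200, "https://example.com/b": 404}
problems = non_200(sample, fetch=stub.get)
```

Calling `non_200(xml_text)` without the `fetch` argument would issue real requests through `fetch_status`.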
2. The sitemap contains content that cannot be indexed
There is no reason to have non-indexable content in an XML sitemap. However, manually checking whether each URL in the sitemap is indexable is a very time-consuming process.
OnPage.org saves you this effort and easily shows you if your XML sitemap has any URLs that cannot be indexed. Simply select the "What is included" report under "Sitemaps" and activate a manual filter by clicking on "Add filter". Next, select the "Indexability" category and choose "Only non-indexable pages/assets". The report now shows you all URLs that cannot be indexed. Clicking on "What is included" shows you URLs that are given in the sitemap but whose content cannot be indexed.
Figure 4: Identify non-indexable URLs in the XML sitemap
3. Your XML sitemap does not have all the relevant content
The larger your website is and the more complex its structure, the harder it is to include all the relevant URLs in the XML sitemap. In particular, it becomes very easy to leave out or overlook new or insufficiently interlinked content. OnPage.org helps you to quickly identify pages that are not included in the XML sitemap.
To do this, go to the "What is included?" report, add an "Indexability" filter with the option to view "only indexable pages / assets", and click on the "Not included" graph. This shows you a list of all URLs that are relevant for the index but have not been included in the XML sitemap.
Figure 5: Identify URLs that are relevant for the indexing and have been left out of the sitemap
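Conceptually, errors 2 and 3 both come down to comparing two sets: the URLs listed in the sitemap and the URLs found by a crawl together with their indexability. The following sketch assumes you already have both as plain Python data, for example from a crawler export; the function name, the data layout, and the URLs are hypothetical.

```python
def compare_with_crawl(sitemap_urls, crawl):
    """crawl maps each crawled URL to True (indexable) or False (not indexable)."""
    listed = set(sitemap_urls)
    indexable = {url for url, ok in crawl.items() if ok}
    return {
        # Error 2: listed in the sitemap although the content cannot be indexed.
        "non_indexable_in_sitemap": sorted(u for u in listed if crawl.get(u) is False),
        # Error 3: relevant for the index but missing from the sitemap.
        "indexable_not_in_sitemap": sorted(indexable - listed),
    }


# Made-up example data.
listed_urls = ["https://example.com/a", "https://example.com/b"]
crawl = {
    "https://example.com/a": True,
    "https://example.com/b": False,
    "https://example.com/c": True,
}
result = compare_with_crawl(listed_urls, crawl)
```

In this example, `/b` is reported as non-indexable sitemap content and `/c` as an indexable page that the sitemap is missing. Sitemap URLs that the crawl never reached appear in neither list.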
An intelligently structured XML sitemap can help you determine the indexing rate of the various sections of your websites using the Google Search Console. This helps you to easily identify indexing problems on your website. The selected structure could be a simple copy of the information architecture of the website or a sophisticated structure with special sitemaps for online shops.
The important thing is to opt for a structure that suits your website and readily reveals its possible weak points.
Published on Aug 18, 2016 by Stephan Walcher