Sitemap Index Files


A sitemap index file is an XML file that lists other sitemaps, which makes a large number of sitemaps easier to manage. It can be viewed as a directory that tells search engines, in XML format, where the sitemaps of a website are located. Each entry describes one sitemap so that search engine crawlers and bots can quickly locate and process the sitemaps and the information they contain. Sitemap index files, like sitemaps, are based on the XML markup language and are classified under the subject area of information retrieval. They are intended to positively influence indexing by telling a crawler or bot where to find the sitemaps and when each sitemap was last modified.

General information

Sitemap index files specify individual properties of the XML files that are stored as sitemaps, in UTF-8 encoding, on a server. These properties are the location of each file and, optionally, the time of its last change. A sitemap index file may contain only references to sitemaps. It must not list individual pages or reproduce the content of the sitemaps themselves (for example, page URLs). Because a single sitemap cannot contain more than 50,000 entries, websites with a large number of URLs split their content across several sitemaps, depending on the scope of the site. The index file then serves the search engine as a reference to these sitemaps.
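The splitting step can be illustrated with a minimal Python sketch. It assumes the URLs are already available as a list; the function name `split_into_sitemaps` and the example URLs are hypothetical and only serve the illustration.

```python
from typing import Iterator, List

MAX_URLS_PER_SITEMAP = 50_000  # per-sitemap entry limit stated in the text


def split_into_sitemaps(urls: List[str],
                        limit: int = MAX_URLS_PER_SITEMAP) -> Iterator[List[str]]:
    """Yield consecutive slices of `urls`, each holding at most `limit` entries."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    for start in range(0, len(urls), limit):
        yield urls[start:start + limit]


# Hypothetical site with 120,000 URLs: three sitemaps are needed.
urls = [f"http://www.example.com/page{i}.html" for i in range(120_000)]
chunks = list(split_into_sitemaps(urls))
print(len(chunks), [len(c) for c in chunks])  # prints: 3 [50000, 50000, 20000]
```

Each chunk would become one sitemap file, and the index file would then reference the resulting files.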

As a rule, sitemap index files are used when many sitemaps exist, in order to organize the relevant information for crawlers and bots. Because they are based on XML, this information is arranged in a tree structure of nested elements, which a crawler can parse. At most 500 sitemap index files may be used per site. Each index file may reference at most 50,000 sitemaps and may not exceed 10 MB (the current sitemaps.org protocol gives the size limit as 50 MB uncompressed). Each sitemap index file must conform to XML syntax and can be compressed with gzip. If the files are to be validated beforehand, a schema must be specified against which the files can be checked.[1]
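A rough pre-submission check of these limits might look like the following Python sketch. The constants mirror the figures quoted in the text, the function names are hypothetical, and measuring the size on the uncompressed bytes is an assumption; the limits documented by the target search engine should take precedence.

```python
import gzip
from typing import List

MAX_SITEMAPS_PER_INDEX = 50_000      # entry limit quoted in the text
MAX_INDEX_BYTES = 10 * 1024 * 1024   # 10 MB figure quoted in the text; adjust as needed


def limit_problems(index_xml: bytes, sitemap_count: int) -> List[str]:
    """Return human-readable limit violations; an empty list means none were found."""
    problems = []
    if sitemap_count > MAX_SITEMAPS_PER_INDEX:
        problems.append(
            f"{sitemap_count} sitemaps listed, limit is {MAX_SITEMAPS_PER_INDEX}")
    # Assumption: the size limit applies to the uncompressed file.
    if len(index_xml) > MAX_INDEX_BYTES:
        problems.append(
            f"{len(index_xml)} bytes uncompressed, limit is {MAX_INDEX_BYTES}")
    return problems


def gzip_index(index_xml: bytes) -> bytes:
    """Compress the index file with gzip before it is stored on the server."""
    return gzip.compress(index_xml)


sample = (b'<?xml version="1.0" encoding="UTF-8"?>'
          b'<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
          b'</sitemapindex>')
print(limit_problems(sample, sitemap_count=0))
print(gzip.decompress(gzip_index(sample)) == sample)
```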

How it works

The following example shows the structure of a sitemap index file that contains only two sitemap entries:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
      <loc>http://www.example.com/sitemap1.xml.gz</loc>
      <lastmod>2004-10-01T18:23:17+00:00</lastmod>
   </sitemap>
   <sitemap>
      <loc>http://www.example.com/sitemap2.xml.gz</loc>
      <lastmod>2005-01-01</lastmod>
   </sitemap>
</sitemapindex>

[2]
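A file of this shape can also be generated programmatically. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module; the function name `build_sitemap_index` is hypothetical, and the entries simply repeat the two sitemaps from the example above. The `xml_declaration` argument requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(entries):
    """Build index XML from (loc, lastmod) pairs; lastmod may be None."""
    # Register the namespace with an empty prefix so it is written as a default xmlns.
    ET.register_namespace("", SITEMAP_NS)
    root = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for loc, lastmod in entries:
        sitemap = ET.SubElement(root, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(sitemap, f"{{{SITEMAP_NS}}}loc").text = loc
        if lastmod is not None:  # <lastmod> is optional
            ET.SubElement(sitemap, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


index_xml = build_sitemap_index([
    ("http://www.example.com/sitemap1.xml.gz", "2004-10-01T18:23:17+00:00"),
    ("http://www.example.com/sitemap2.xml.gz", "2005-01-01"),
])
print(index_xml.decode("utf-8"))
```

The output is not indented like the hand-written example, but it contains the same elements in the same nesting.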

Tag definitions specify the necessary and optional elements of the file:

  • <sitemapindex>: This tag is the root of the document. It encloses all other nodes and thus the entire content of the file. The namespace is specified in the xmlns attribute. It identifies the vocabulary of permitted tags and is also used to validate and verify the structure.
  • <sitemap>: Each sitemap that is to be referenced is enclosed in this tag, which is a child element of <sitemapindex>.
  • <loc>: Specifies the location (URL) of a sitemap. It is required and is a child element of <sitemap>.
  • <lastmod>: Specifies the time stamp of the last modification to the referenced sitemap file, in W3C Datetime format. It is optional and is a child element of <sitemap>.[3]
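How these elements relate can be seen by reading an index file back in. The sketch below, again using Python's standard `xml.etree.ElementTree`, extracts the `<loc>` and optional `<lastmod>` of every `<sitemap>` entry. The function name `read_sitemap_index` and the `sm` prefix are hypothetical choices made for this illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NS = {"sm": SITEMAP_NS}  # "sm" is an arbitrary prefix used only in the queries below


def read_sitemap_index(xml_bytes: bytes):
    """Return a list of (loc, lastmod) pairs; lastmod is None when the tag is absent."""
    root = ET.fromstring(xml_bytes)
    if root.tag != f"{{{SITEMAP_NS}}}sitemapindex":
        raise ValueError("root element is not <sitemapindex>")
    entries = []
    for sitemap in root.findall("sm:sitemap", NS):
        loc = sitemap.findtext("sm:loc", default=None, namespaces=NS)
        lastmod = sitemap.findtext("sm:lastmod", default=None, namespaces=NS)
        if not loc or not loc.strip():
            raise ValueError("<sitemap> entry without <loc>")
        entries.append((loc.strip(), lastmod.strip() if lastmod else None))
    return entries


example = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
      <loc>http://www.example.com/sitemap1.xml.gz</loc>
      <lastmod>2004-10-01T18:23:17+00:00</lastmod>
   </sitemap>
   <sitemap>
      <loc>http://www.example.com/sitemap2.xml.gz</loc>
   </sitemap>
</sitemapindex>"""

for loc, lastmod in read_sitemap_index(example):
    print(loc, lastmod)
```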

After a sitemap index file has been created, it is stored in the root directory of the host on the server. This directory should also contain all sitemaps, so that the crawler knows where to find them. Subsequently, the sitemap index file can be submitted to search engines. The requirements of each search engine have to be taken into account, as individual details can differ. For example, when an index file references sitemaps on different domains, Google requires site ownership to be confirmed in the Google Search Console. As a rule, index files for sitemaps are supported by all major search engines.[4]

Similar to sitemaps, index files promote the syndication of data. With constantly changing, dynamic content, sitemaps therefore need to be clearly structured. Because search engines parse this data automatically, their index can be kept up to date. Validation against a schema detects errors in the structure, such as missing or misplaced elements. Additional attributes on the root element of the index file name the definition to be checked against, which is referred to as a schema. An XML file is well-formed if it has no syntax errors. It is valid if it is well-formed and also conforms to the schema.
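The difference between a syntax check and a structural check can be sketched in Python. The standard library can test well-formedness, but it does not validate against an XML schema, so the second function below only checks the root element as a stand-in for illustration. Full schema validation would need a schema-aware library and is not shown. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def is_well_formed(xml_bytes: bytes) -> bool:
    """True if the bytes parse as XML, i.e. contain no syntax errors."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError:
        return False
    return True


def has_expected_root(xml_bytes: bytes) -> bool:
    """Partial structural check only: the root must be <sitemapindex> in the sitemap namespace."""
    root = ET.fromstring(xml_bytes)
    return root.tag == f"{{{SITEMAP_NS}}}sitemapindex"


good = (b'<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        b'<sitemap><loc>http://www.example.com/sitemap1.xml.gz</loc></sitemap>'
        b'</sitemapindex>')
broken = b'<sitemapindex><sitemap></sitemapindex>'  # <sitemap> is never closed
wrong_root = b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"></urlset>'

print(is_well_formed(good), has_expected_root(good))
print(is_well_formed(broken))
print(is_well_formed(wrong_root), has_expected_root(wrong_root))
```

The third sample parses without errors yet is not a sitemap index, which shows why a syntax check alone is not sufficient.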

Relevance to search engine optimization

Submitting a sitemap is generally recommended for any website. Sitemap index files, on the other hand, are intended only for special application scenarios. They primarily concern large websites with many URLs, whose content is structured by a sitemap index file. Both search engines and webmasters get an overview of all content, URLs and the entire information architecture. Search engines use this data for indexing, which is why a transparent approach is recommended.

However, webmasters can also use this data in other ways, for example, for reporting on and monitoring the website. Sitemap index files and sitemaps can be used to find duplicate content and to solve problems with indexing.[5] One published case study reports that using multiple XML sitemaps increased the indexation and traffic of a website.[6] Although this result cannot be generalized, a clear structure of sitemaps, URLs and content is advisable in any case.
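One narrow monitoring check of this kind is to look for URLs that are listed more than once across the sitemaps of a site. The Python sketch below assumes the URLs of each sitemap have already been extracted into a dictionary. The function name, the file names and the URLs are hypothetical, and the check covers repeated URL entries only, not pages with duplicate content under different URLs.

```python
from collections import defaultdict
from typing import Dict, List


def find_duplicate_urls(sitemaps: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each URL listed more than once to the sitemaps that list it."""
    seen = defaultdict(list)
    for sitemap_name, urls in sitemaps.items():
        for url in urls:
            seen[url].append(sitemap_name)
    return {url: names for url, names in seen.items() if len(names) > 1}


# Hypothetical, already-extracted sitemap contents.
sitemaps = {
    "sitemap1.xml.gz": ["http://www.example.com/a", "http://www.example.com/b"],
    "sitemap2.xml.gz": ["http://www.example.com/b", "http://www.example.com/c"],
}
print(find_duplicate_urls(sitemaps))
```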

References

  1. Inside Google Sitemaps: Using Sitemap Index Files sitemaps.blogspot.de. Accessed on 09/06/2016
  2. Sitemaps XML format sitemaps.org. Accessed on 09/06/2016
  3. Simplify multiple sitemap management support.google.com. Accessed on 09/06/2016
  4. Major Search Engines Unite to Support a Common Mechanism for Website Submission googlepress.blogspot.de. Accessed on 09/06/2016
  5. Optimize Your Sitemap Index blindfiveyearold.com. Accessed on 09/06/2016
  6. Multiple XML Sitemaps: Increased Indexation and Traffic moz.com. Accessed on 09/06/2016
