The robots.txt file instructs search engines how to crawl your site. In this article, we explain the most common errors, and how you can avoid them.
Every webmaster knows that there are certain aspects of a website that you don't want crawled or indexed. The robots.txt file gives you the opportunity to specify these sections and convey this to the search engine crawlers. In this article, we will show common errors that can occur when creating a robots.txt file, how you can avoid them, and how you can monitor your robots.txt file with Ryte's software.
There are many reasons why website operators may want to exclude certain parts of a website from the search engine index, for example if pages are hidden behind a login, are archived, or are test versions that have not yet gone live. "A Standard for Robot Exclusion" was published in 1994 to make this possible. This protocol specifies that, before beginning the crawl, a search engine crawler should first look for the robots.txt file in the root directory of the website and read the instructions in that file.
Many possible errors can occur when creating the robots.txt file, for example syntax errors if an instruction is not written correctly, or errors resulting from unintentional blocking of a directory. Here are some of the most common robots.txt errors:
robots.txt is a simple text file and can easily be created using a text editor. An entry in the robots.txt file always consists of two parts: the first part specifies the user agent to which the instruction should apply (e.g. Googlebot), and the second part contains a command, such as "Disallow", followed by the paths that should not be crawled. For the instructions in the robots.txt file to take effect, the correct syntax should be used, as shown below.
User-agent: Googlebot
Disallow: /example_directory/
In the above example, Google’s crawler is forbidden from crawling the directory /example_directory/. If you want this to apply to all crawlers, you should use the following code in your robots.txt file:
User-agent: *
Disallow: /example_directory/
The asterisk (also known as a wildcard) stands for all crawlers. Similarly, you can use a single slash (/) to prevent the entire website from being crawled (e.g. for a test version before a relaunch).
User-agent: *
Disallow: /
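A quick way to verify how such rules are interpreted before deploying them is Python’s built-in robots.txt parser. The following is a minimal sketch, assuming the /example_directory/ rule from the earlier example; the tested URLs are hypothetical.

from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /example_directory/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Blocked: the path begins with the disallowed directory.
print(parser.can_fetch("Googlebot", "https://www.your-website.com/example_directory/page.html"))  # False
# Allowed: the path lies outside the disallowed directory.
print(parser.can_fetch("Googlebot", "https://www.your-website.com/contact.html"))  # True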
When excluding a directory from crawling, always remember to add a slash at the end of the directory’s name. For example,
Disallow: /directory (without the trailing slash) blocks not only /directory/, but also /directory-one.html.
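The difference is easy to demonstrate with the same standard library parser. In the sketch below, the helper function blocks() and the tested paths are hypothetical examples.

from urllib.robotparser import RobotFileParser

def blocks(rule_path, url_path):
    """Return True if a single Disallow rule blocks the given path for all crawlers."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", "Disallow: " + rule_path])
    return not parser.can_fetch("*", "https://www.your-website.com" + url_path)

# Without the trailing slash, the rule matches every path that starts with "/directory".
print(blocks("/directory", "/directory-one.html"))    # True  -> blocked as well
print(blocks("/directory", "/directory/page.html"))   # True
# With the trailing slash, only the directory itself is affected.
print(blocks("/directory/", "/directory-one.html"))   # False -> still crawlable
print(blocks("/directory/", "/directory/page.html"))  # True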
If you want to exclude several directories or pages from crawling, add each path on a separate line. Listing multiple paths on the same line usually leads to unwanted errors.
User-agent: googlebot
Disallow: /example-directory/
Disallow: /example-directory-2/
Disallow: /example-file.html
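As a quick sanity check, you can feed such a multi-line block into Python’s standard parser and confirm which paths are blocked for the specified user agent. This is only a sketch using the example paths above and a hypothetical domain.

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: googlebot",
    "Disallow: /example-directory/",
    "Disallow: /example-directory-2/",
    "Disallow: /example-file.html",
])

base = "https://www.your-website.com"
for path in ("/example-directory/", "/example-directory-2/page.html",
             "/example-file.html", "/allowed-page.html"):
    print(path, parser.can_fetch("googlebot", base + path))
# The first three paths print False (blocked); /allowed-page.html prints True.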
Before the robots.txt file is uploaded to the root directory of the website, you should always check that its syntax is correct. Even the smallest error could result in the crawler ignoring the instructions in the file and crawling pages that should not be indexed. Always make sure that directories which should not be crawled are listed after the Disallow: command.
Even when your website’s page structure changes, e.g. due to a relaunch, you should check the robots.txt file for errors. You can easily do this using the free testing tool on Ryte.
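If you want an additional, very basic automated pre-check, a short script can at least catch malformed lines and unknown fields before upload. The sketch below is deliberately minimal and not a full validator; the function name check_robots_txt and the list of known fields are assumptions made for illustration.

KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def check_robots_txt(text):
    """Report lines that do not follow the basic "Field: value" form."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.split("#", 1)[0].strip()   # ignore comments and surrounding whitespace
        if not line:
            continue
        field, separator, _value = line.partition(":")
        if not separator:
            problems.append("line %d: missing ':' separator" % number)
        elif field.strip().lower() not in KNOWN_FIELDS:
            problems.append("line %d: unknown field '%s'" % (number, field.strip()))
    return problems

print(check_robots_txt("User-agent: *\nDisallow: /example_directory/"))  # []
print(check_robots_txt("Disalow: /example_directory/"))                  # unknown field 'Disalow'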
The most common error associated with the robots.txt file is failing to save the file in the website’s root directory. Sub-directories are ignored, since user agents only search for the robots.txt file in the root directory.
The correct URL for a website’s robots.txt file should have the following format:
http://www.your-website.com/robots.txt
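As a sketch of how you might confirm this programmatically, the following derives the root-level robots.txt URL from an arbitrary page URL and requests it. robots_txt_url is a hypothetical helper, and the domain is the example domain from above.

from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen

def robots_txt_url(page_url):
    """Build the root-level robots.txt URL for any page of the same site."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

url = robots_txt_url("http://www.your-website.com/blog/some-article.html")
print(url)  # http://www.your-website.com/robots.txt

with urlopen(url, timeout=10) as response:   # raises an error for non-2xx responses
    print(response.status)                   # expect 200 if the file is reachable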
If pages that are blocked in your robots.txt file redirect to other pages, the crawler might not recognize the redirects. In the worst case, this could cause the page to still be displayed in the search results, but under an incorrect URL. In addition, the Google Analytics data for your project may also be incorrect.
The new robots.txt Monitoring on Ryte helps you avoid such errors. In "Monitoring" >> "robots.txt Monitoring", the accessibility of your robots.txt file is checked every hour (status 200). If the file cannot be accessed, you are automatically sent an email notification that your robots.txt is currently inaccessible.
Figure 1: robots.txt monitoring with Ryte
Even if your robots.txt file returns the status code 200 (accessible), Ryte also checks whether the file has changed. If it has, the tool checks the number of changes; if more than 5 changes are identified, you are automatically sent an email asking you to check the file and confirm whether or not the changes were intended.
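Conceptually, this kind of monitoring boils down to polling the file, alerting when it is unreachable, and flagging content changes. The following rough sketch illustrates the idea; it is not Ryte’s implementation, and the URL, interval and alert handling are placeholder assumptions.

import hashlib
import time
from urllib.error import URLError
from urllib.request import urlopen

ROBOTS_URL = "http://www.your-website.com/robots.txt"   # hypothetical example domain
CHECK_INTERVAL = 60 * 60                                 # once per hour, in seconds

last_hash = None
while True:
    try:
        with urlopen(ROBOTS_URL, timeout=10) as response:
            body = response.read()                       # urlopen raises for non-2xx status codes
    except URLError:
        print("ALERT: robots.txt is not accessible")     # e.g. send an email notification here
    else:
        current_hash = hashlib.sha256(body).hexdigest()
        if last_hash is not None and current_hash != last_hash:
            print("NOTICE: robots.txt has changed since the last check")
        last_hash = current_hash
    time.sleep(CHECK_INTERVAL)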
Find out more about how to monitor your robots.txt file with Ryte here.
It is important to note that excluding pages in the robots.txt file does not necessarily mean that those pages will not be indexed. This can happen, for example, if a URL that is excluded from crawling in the robots.txt file is linked from an external page. The robots.txt file only gives you control over what the user agent crawls; it does not control indexing. However, the following text often appears in the search results instead of the meta description, since the bot is prohibited from crawling the page:
"A description for this result is not available because of this site's robots.txt."
Figure 4: Snippet example of a page that is blocked using the robots.txt file but still indexed
As you can see, just one link to the respective page is enough to result in the page being indexed, even if the URL is set to "Disallow" in the robots.txt file. Similarly, a "noindex" robots meta tag cannot prevent indexing in this case, because the crawler never gets to read this part of the code due to the Disallow rule in the robots.txt file.
In order to prevent certain URLs from showing up in the Google index, you should use the robots meta tag with the value "noindex" (<meta name="robots" content="noindex">), but still give the crawler access to the page so that it can read the tag.
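To double-check such a setup for a single URL, you could verify that the page is crawlable and that its HTML contains a robots meta tag with "noindex". The sketch below makes simplifying assumptions: a hypothetical page URL and a naive regular-expression check.

import re
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

PAGE_URL = "http://www.your-website.com/private-page.html"   # hypothetical example URL

parser = RobotFileParser("http://www.your-website.com/robots.txt")
parser.read()
crawlable = parser.can_fetch("Googlebot", PAGE_URL)

with urlopen(PAGE_URL, timeout=10) as response:
    html = response.read().decode("utf-8", errors="replace")
has_noindex = re.search(r'<meta[^>]+name=["\']robots["\'][^>]*noindex', html, re.I) is not None

print("Crawler may fetch the page:", crawlable)           # should be True
print("Page carries a noindex directive:", has_noindex)   # should be True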
Published on Apr 26, 2016 by Eva Wagner