Anyone wanting to build a website will sooner or later come across the term “robots.txt”. This text file tells search engine crawlers which areas of a domain may be crawled and which may not.
The creation and proper placement of a robots.txt file isn’t magic – assuming your web directory is structured logically. In this article, we’ll show you how to create a robots.txt file and what you need to watch out for.
The robots.txt is a small text file that can be created very simply by using a text editor and is uploaded into a website’s root directory ("root"). Most web crawlers adhere to the robots exclusion standard protocol. This establishes that search engine robots (also: user agents) first search for a file with the designation robots.txt in the root directory and read that information before they begin indexing the site. Webmasters create a robots.txt file to better control which areas of their website may be crawled by bots and which may not.
In the robots.txt file, you define instructions for user agents. These can be browsers, but also a search engine’s robots (spiders, crawlers). The most common user agents are Googlebot, Googlebot-Image (Google image search), AdsBot-Google (Google AdWords), Slurp (Yahoo), and Bingbot (Bing).
Figure 1: Google user agents
Entries in the robots.txt consist of two parts. In the following example, they are two lines one after the other, but they can also consist of several lines, depending on the number of directives and user agents. In the first part, you address the user agent by name. In the second, you tell it what to do.
With the following command, for example, the Googlebot is directed to exclude only the directory /cms/ from crawling:
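A minimal sketch of such an entry, reconstructed from the description above because the original listing is not shown:

```text
User-agent: Googlebot
Disallow: /cms/
```

The first line names the crawler, and the second gives the path prefix that this crawler should skip. Crawlers that are not named remain unaffected by this entry.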
If the instructions should apply to all crawlers, it should appear as:
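As a sketch, the same rule addressed to every crawler would use an asterisk in place of a name (the /cms/ directory is carried over from the previous example):

```text
User-agent: *
Disallow: /cms/
```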
If you want crawlers to ignore the entire web presence rather than only individual areas, simply insert a forward slash:
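A sketch of such a site-wide block, assuming it should apply to all crawlers:

```text
User-agent: *
Disallow: /
```

Because every path on a domain begins with a forward slash, this single rule covers the whole site.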
If it’s only a special subpage or a picture that should be excluded (in this case, example file or example image), insert:
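The entry might look like the following sketch. The file names and the /images/ directory are placeholders chosen for illustration, and addressing Googlebot is an assumption:

```text
User-agent: Googlebot
Disallow: /examplefile.html
Disallow: /images/exampleimage.jpg
```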
If all images on your web presence are of a private nature and should be excluded, you can work with a dollar sign: the "$" sign anchors a rule to the end of a URL. Crawlers that support this syntax do not crawl any URL that ends with the given character string. All .jpg files can therefore be excluded as follows:
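A sketch of such a rule, assuming it should apply to all crawlers that understand the "*" and "$" placeholders:

```text
User-agent: *
Disallow: /*.jpg$
```

Here "/*" stands for any path, and ".jpg$" requires the URL to end in .jpg. A URL such as /photo.jpg?size=large would therefore not match, because it does not end on that string.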
If a directory should be blocked and only one of its subdirectories should be released for crawling, there is a solution. Supplement the code with the following lines:
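One way to sketch this uses an Allow line next to the Disallow line. The directory names /shop/ and /shop/magazine/ are hypothetical:

```text
User-agent: *
Disallow: /shop/
Allow: /shop/magazine/
```

Crawlers that apply the most specific matching rule, as Google documents for Googlebot, will skip everything under /shop/ except the /shop/magazine/ subdirectory. Crawlers that do not support Allow may treat the whole directory as blocked.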
If you want to keep AdWords ads out of the organic index, you can insert an exception into the code.
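The original listing is missing here, so the following is only one plausible reading: ad landing pages are blocked for ordinary crawlers but stay reachable for Google's ads crawler. The /landing/ directory is a hypothetical name:

```text
# Ads crawler may fetch the landing pages
User-agent: AdsBot-Google
Allow: /landing/

# All other crawlers should skip them
User-agent: *
Disallow: /landing/
```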
Tip: The XML sitemap should also be shown in the robots.txt file to communicate which URL structure a website has to crawlers. This referral can appear as follows:
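For illustration, such a reference is a single line that can stand anywhere in the file. The domain and file name below are placeholders:

```text
Sitemap: https://www.example.com/sitemap.xml
```

The sitemap location is given as a full URL, including the protocol.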
The robots’ exclusion standard protocol allows for no regular expression (wildcards) in the strictest sense. However, it recognizes two placeholder symbols for path information:
The symbols * and $.
These are used with the Disallow directive to exclude entire websites or individual files and directories.
The "*" sign is a placeholder for any sequence of characters. Crawlers that support wildcard syntax do not crawl URLs that match the resulting pattern. In the User-agent line, the same sign means that the directive applies to all crawlers, without any of them having to be named.
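Both uses of the asterisk can be seen in the following sketch. The path rule is a hypothetical example, not taken from the original article:

```text
# "*" in the User-agent line: the group applies to every crawler
User-agent: *
# "*" in a path: matches any run of characters, so this rule
# covers every URL that contains a question mark (query string)
Disallow: /*?
```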
Figure 2: Section from the robots.txt file of Amazon
Tip: If wildcards and programming are a new area for you and everything sounds too complicated, simply use the robots.txt generator from OnPage.org to create your robots.txt file.
There are obligatory requirements for the correct functioning of a robots.txt file. Before you place the file online, you must absolutely check whether the following basic rules are being met:
With the practical OnPage.org robots.txt testing tool, you can check whether your website contains a robots.txt file in just a few steps. Alternatively, you can work directly in the Google Search Console. In the main menu on the start page, you will find the sub-item robots.txt tester in the section "Crawling".
If someone else has created your web directory and you are not sure whether you have a robots.txt file at all, you will see this in the tester after entering your URL. If you get "robots.txt file not found (404)", you’ll first have to create the file before you can tell Google that some areas of your site should be ignored.
Figure 3: Website contains no robots.txt file
If you click on "send" in the lower right side of the robots.txt editor, a dialog field will open. Download the edited robots.txt code here by choosing "download".
Figure 4: Uploading and updating the robots.txt file
You must upload the new robots.txt file into your root directory, and you can then check whether the file is being crawled by Google by clicking on the button "View live available robots.txt". At the same time, you will communicate to Google that the robots.txt file was changed and should now be crawled.
If there is already a robots.txt file, scroll through the code to see whether there are syntax warnings or logic errors.
Figure 5: Example of a robots.txt file
Under the tester, you will see a text field in which you should enter the URL of a page on your website and click on "test".
You can also choose the user agent you would like to simulate in the drop-down list to the right of this field. By default, the menu is set on "Googlebot".
Figure 6: The Google user agent
If "approved" appears after testing the URL, the page can be crawled. But if the test result shows "blocked", the given URL is blocked for the Google web crawler.
If the result isn’t what you wanted, correct the error in the file and carry out the test again. Always edit the robots.txt file on your website because changes cannot be made with the tester.
For large companies as well as for operators of smaller sites, it is important to check whether the robots.txt is always reachable and whether its content has changed. This is possible with the help of the robots.txt monitoring from OnPage.org.
The report can be found in the navigation of the module "OnPage.org monitoring".
Figure 7: OnPage.org’s robots.txt monitoring
OnPage.org pings the website’s robots.txt file every hour. In this way, it checks whether this is reachable (status 200) and whether the content of the file has changed since the previous inquiry. Likewise, the loading time of the file is examined and deviations such as timeouts are registered.
With OnPage.org monitoring, all versions of the robots.txt file are listed, including their average loading time and download errors. If you want to take a closer look at a specific version, you can initiate a detailed view by clicking on the magnifying glass on the right image border.
The correct programming and placing of the robots.txt file is of great importance for your technical search engine optimization. Even the smallest syntax errors can lead a user agent to react differently than desired. Pages that you want excluded may be crawled, or vice versa.
Consider whether you really want to exclude pages using a robots.txt file. Your instructions are only guidelines for the crawlers, which may not be followed as planned. Moreover, the robots.txt file can be read incorrectly by crawlers that expect a special syntax. Use the tips above to check regularly that the file can always be reached.
Published on 03/15/2017 by Eva Wagner.
Eva is an experienced content marketer. Until May 2018 she was a member of the online marketing team at Ryte. Using her creativity and knowledge of current topics, she was responsible for the German Ryte Magazine and the Ryte Wiki. She also organized Ryte’s presence at major trade fairs such as the dmexco in Cologne.