One of the topics that has received a lot of attention in SEO in recent months is the question of how to deal with the Noindex directive in the robots.txt. Many SEOs use noindex for crawling and indexing control, but this will soon be a thing of the past.
On its Webmaster Central Blog, Google announced on 2 July 2019 that it would no longer support the Noindex directive in robots.txt from September onwards. Based on this announcement, some websites have already started to clean up their robots.txt and remove the Noindex entries. However, in this article we present a case which shows that removing the Noindex can have immediate consequences. We find ourselves asking whether it really is a good idea for Google to remove this solution.
In SEO, there are many different opinions about how to deal with the robots.txt, and there have been many discussions about this topic over the years. Some rely on the approach presented by Matt Cutts in 2008, while others have been rather sceptical about the actual benefits of this directive.
The Noindex directive in the robots.txt (not to be confused with the Robots tag in the HTML head (Meta Robots) or in the HTTP response header (X-Robots)) works similarly to the Disallow directive. The Noindex is, however, better at keeping blocked URLs out of the Google index. In principle, you would expect that if a URL cannot be crawled, it also cannot be indexed. Experience has shown, though, that the Disallow cannot prevent the indexing of URLs, and this is also backed up by Google's own documentation.
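To illustrate the difference, here is a minimal robots.txt sketch (the path is purely illustrative):

    User-agent: *
    # Blocks crawling, but the URL can still end up in the index,
    # e.g. via external links - typically without any page information
    Disallow: /internal-search/
    # The unofficial directive this article is about: it also kept the
    # blocked URLs out of Google's index (support ends in September 2019)
    Noindex: /internal-search/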
Figure 1: Exemplary Google search result without page information due to a robots.txt block
Those who have worked with the Noindex specification in the robots.txt in the past have found it can prevent such unwanted indexing. Experience has shown that this directive has removed indexed URLs from the search engine index. However, this was not a guarantee.
The Noindex in robots.txt was one of the best ways to simultaneously optimize the crawling and indexing of a website for specific use cases. Ideally, optimal crawling and indexing should be achieved through a clean technical implementation rather than through additional signals in the source code. However, experience shows that this usually cannot be achieved within a short space of time.
Therefore, it is regrettable that Google will withdraw the Noindex directive in robots.txt in September. Google suggests the following alternatives:
Figure 2: Alternatives to the Noindex in the robots.txt according to Google
In my opinion, the above options, which Google stated in its Webmaster Central Blog post, are either not well suited to the task or not communicated very well:
Meta-Robots-Tag “noindex”: This ensures that the URLs are not included in Google’s index, or are removed from it (see the snippets after this list). However, the URLs can still be crawled and in the worst case can influence the crawl budget negatively. This option is therefore out of the question in many cases.
404 or 410 HTTP status code: If a URL should no longer exist, using a 404 or, even better, a 410 HTTP status code is a possibility. However, this only applies to removed content, so it does not make sense to present an error code as a general alternative to the Noindex in the robots.txt.
Password protection: This recommendation is not applicable in the vast majority of cases. For a website in its staging environment, password protection is of course the best way to avoid indexing. However, Google should not recommend this as a general alternative without providing an example of when it applies.
URL removal function: This function in the Google Search Console is in principle a great tool for removing URLs from the index at short notice. However, the removals are only temporary, so the tool would have to be expanded further before unwanted URLs could be kept out of the search results permanently.
Disallow in robots.txt: As described above, the current Disallow directive does not completely prevent the indexing of URLs. Hopefully, Google's statement that “we aim to make such pages less visible in the future” means the Disallow directive will be improved in this respect.
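As a reminder, the two on-page variants of the Robots directive that Google refers to look like this (illustrative snippets; the header variant is useful for non-HTML files such as PDFs):

    <!-- Meta robots tag in the HTML head -->
    <meta name="robots" content="noindex">

    HTTP/1.1 200 OK
    X-Robots-Tag: noindex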
Even if you may not be satisfied with the existing alternatives, as a website operator you will have to adjust your robots.txt in the next few weeks and do without the Noindex directive. This is very important, because if a robots.txt cannot be interpreted properly, a crawler may ignore all of its other directives. Whether this also applies to Google is not yet known.
If you follow a clear order in your robots.txt instructions, i.e. first Disallow, then Allow, and the Noindex only at the end, you can certainly consider keeping the status quo and continue to offer this directive to other search engine crawlers, provided they still work with the Noindex in robots.txt (see the sketch below). Either way, you should definitely set up appropriate monitoring to keep an eye on changes in crawling and indexing.
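Such a robots.txt might look like the following sketch (the paths are purely illustrative):

    User-agent: *
    Disallow: /filter/
    Allow: /filter/popular-category/
    # Kept only for crawlers that still evaluate it;
    # Google will ignore this line from September 2019
    Noindex: /filter/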
The crucial question between now and September is: what actually happens if you remove the Noindex directives from the robots.txt straight away and rely only on the Disallow? One of our customers adjusted their robots.txt just one week after Google published the announcement, keeping only the Disallow entries that had previously duplicated the Noindex rules.
Relatively quickly after the change to the robots.txt, Google began to include some URLs in the index that had previously been blocked by the Noindex directive. The Disallow directive still blocks these URLs from being crawled, but it is not sufficient to prevent indexing in Google. This case showed us once again how effective the Noindex in robots.txt actually is!
This case shows the importance of the Noindex directive, which makes it all the more regrettable that we will no longer be able to work with it from September onwards. Hopefully, Google will transfer the advantages of the Noindex in the robots.txt to the Disallow directive in the future.
As the presented case shows, adjusting the robots.txt early and removing the existing Noindex directives is not recommended. In the worst case, URLs blocked with the Disallow directive will end up being indexed.
Even though we don't know exactly what will change on September 1st, 2019, we recommend that you adjust your robots.txt as close as possible to that date. We also recommend monitoring the indexing of the URLs blocked in the robots.txt, for example along the lines of the sketch below.
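Such monitoring can be as simple as regularly checking a list of URLs that Google reports as indexed (for example an export from the Search Console coverage report) against the Disallow rules of the live robots.txt. The following Python sketch uses only the standard library; the domain and the file name are assumptions for illustration:

    # Sketch: flag URLs that are blocked in robots.txt but indexed anyway.
    # Assumes "indexed_urls.txt" contains one indexed URL per line,
    # e.g. exported from the Search Console coverage report.
    import urllib.robotparser

    ROBOTS_URL = "https://www.example.com/robots.txt"  # assumption: your domain

    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(ROBOTS_URL)
    parser.read()  # fetches and parses the live robots.txt (Disallow/Allow rules)

    with open("indexed_urls.txt") as f:
        indexed_urls = [line.strip() for line in f if line.strip()]

    # URLs that Googlebot is not allowed to crawl but that are indexed anyway
    blocked_but_indexed = [
        url for url in indexed_urls
        if not parser.can_fetch("Googlebot", url)
    ]

    for url in blocked_but_indexed:
        print("Blocked by robots.txt but indexed:", url)

Run regularly, for example as a daily cron job, this gives an early warning if formerly Noindex-protected URLs start to reappear in the index.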
Furthermore, the solution for an optimal balance between crawling and indexing continues to be a technically clean implementation at website level. The problem in the case presented here is that there are a number of signals within the source code that cause Googlebot to want to crawl or index the URLs in question, and these are not necessarily href or form links.
Logfile analyses have shown that URL snippets in scripts, comments or self-defined HTML attributes can also cause Google to crawl and index these URL patterns. This makes it clear once again that clean programming is important in SEO and that you should not rely solely on robots.txt as a means of crawling control. In order to be technically optimally positioned, measures such as link masking, PRG patterns and the like are therefore still necessary steps.
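To give an idea of what a PRG pattern looks like: instead of a crawlable href link, the filter is triggered by a form that POSTs to the server, which then answers with a redirect to the target URL. Below is a minimal sketch with Flask, where the endpoint and parameter names are assumptions for illustration:

    # Minimal PRG (Post/Redirect/Get) sketch with Flask.
    # The form posts the desired filter URL; the server answers with a redirect.
    # Crawlers that only follow GET links do not discover these URLs as links.
    from flask import Flask, redirect, request

    app = Flask(__name__)

    @app.route("/go", methods=["POST"])  # assumption: endpoint name
    def post_redirect_get():
        target = request.form.get("target", "/")
        # In production, validate "target" against a whitelist of internal
        # paths to avoid creating an open redirect.
        return redirect(target, code=302)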
Published on Jul 23, 2019 by Darius Erdt