Crawling and indexing are two requirements for ensuring that your website is listed in the search results.
Both tasks are carried out by the Googlebot. However, the bot has only a limited crawl budget available. To ensure that a website is crawled and indexed as completely as possible, this budget should be spent efficiently.
Google defines the crawl budget as the combination of crawl rate (crawling frequency) and crawl demand. The budget itself is thus the number of URLs that the Googlebot can and wants to crawl. Gary Illyes of Google gives a more exact definition in his well-regarded article "What Crawl Budget Means for Googlebot", which was published on the Google Webmaster Central Blog on 16 January 2017. The following sections go through his explanations and add practical tips.
Googlebot refers to a program that collects documents on the web, follows URLs, and indexes the web documents of the sites visited. In general, the work of the Googlebot is based on four main steps:
1. Google finds a URL, for example via internal links, an XML sitemap, or detailed links.
2. A list is created from those links and the individual URLs are prioritized for crawling.
3. Next, the Googlebot is assigned the so-called "crawl budget". This determines how quickly the URLs of a website can be crawled.
4. A program known as a "scheduler" controls the Googlebot and allows the URLs to be processed according to their priority and the crawl budget.
This entire process takes place continuously. While the Googlebot is crawling and indexing URLs, more and more URLs are being placed on the list, so the crawl budget is readjusted every time.
It is important that the Googlebot does not exhaust its crawl budget, because the crawl demand also plays a role in addition to the crawling frequency. If Google doesn’t prioritize certain URLs, it might not crawl them, thus freeing up more resources for other URLs.
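The interplay of prioritized list, crawl budget, and scheduler described above can be illustrated with a minimal sketch. Google's actual scheduler is not public, so the function name `schedule_crawl`, the numeric priorities, and the idea of a fixed budget per round are assumptions made purely for illustration.

```python
import heapq

def schedule_crawl(discovered, budget):
    """Return the URLs to crawl in this round, highest priority first.

    discovered: iterable of (priority, url) pairs; a higher number means
                a more important URL (hypothetical scoring).
    budget:     maximum number of URLs that may be fetched in this round.
    """
    # heapq is a min-heap, so priorities are negated to pop the largest first.
    queue = [(-priority, url) for priority, url in discovered]
    heapq.heapify(queue)

    crawled = []
    while queue and len(crawled) < budget:
        _, url = heapq.heappop(queue)
        crawled.append(url)
    return crawled
```

With a budget of 2, only the two URLs with the highest priority are processed; everything else stays on the list for a later round, which is exactly why low-priority URLs may wait a long time.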
Before websites can be ranked, they must first be crawled and indexed. They must be visited by the Googlebot before they will appear in the search results.
So, the webmaster must ensure that the URLs can be found. Moreover, Google has to consider the URL valuable enough to warrant a high priority on the crawling list. For example, Google crawls less-visited and low-content sites less frequently and less extensively than high-quality sites.
Nevertheless, it is important to note that Google can crawl sites with fewer than 1,000 URLs extensively without a problem, independent of the crawl budget which it has available. For sites with more than 1,000 URLs, it is therefore even more important that all content and URLs are kept up to date, because even though Google will allow the largest possible crawl budget, it will nevertheless concentrate on the main, most-visited URLs.
The Googlebot is limited by, among other things, the so-called "crawl rate limit" when it crawls URLs. The Googlebot itself sets this limit. It is assumed that it adjusts the crawling rate according to server responses and possible error messages caused by too many simultaneous or rapid requests. The extent of this limit depends on two factors: the number of simultaneous connections and the waiting time between requests.
Example: Google has decided that the crawl rate limit is 10 simultaneous connections and 3 seconds between requests. In this case, Google can crawl 200 URLs in a minute.
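The arithmetic behind this example can be written out as a small sketch. The function name `urls_per_minute` is hypothetical, and the calculation assumes that each connection sends exactly one request per waiting interval and that the time needed to download a page is negligible.

```python
def urls_per_minute(connections, delay_seconds):
    """Upper bound on URLs fetched per minute for a given crawl rate limit."""
    # Each connection can issue one request every `delay_seconds` seconds.
    requests_per_connection = 60 / delay_seconds
    return int(connections * requests_per_connection)

print(urls_per_minute(10, 3))  # 200
```

Each of the 10 connections manages 20 requests per minute (60 / 3), which gives the 200 URLs from the example.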
Change settings in the Google Search Console: Webmasters can control the crawl rate limit directly via the Google Search Console. In the website settings, you can choose faster or slower crawling.
Figure 1: Setting the Google crawl rate via the Search Console.
It is important not to choose a crawl frequency that is too high so that the server does not slow down. Google does not specify how long the Googlebot is actually on the site.
Optimize server speed: Independently of the Search Console settings, the webmaster must ensure that the server responds quickly. In this way, the crawling rate can be significantly improved. Google recommends keeping the server response time under 200 milliseconds. This, however, is not the same as "page speed". The server speed depends on the reaction time of the server and the possible number of simultaneous connections. The loading time of the website, by contrast, depends on further factors such as the source code, scripts, and CSS files.
Check for server errors: In the Google Search Console, server errors in crawling can be checked in a separate report (Crawling -> Crawling errors). Here, you can see the errors, including the appropriate status code.
Figure 2: Server errors shown in the Google Search Console.
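A rough way to check the response time against the 200 millisecond recommendation is sketched below. The names `response_time_ms` and `is_fast_enough` are hypothetical. Note the assumption: the measurement is taken from the client side and therefore also includes DNS lookup, connection setup, and network latency, so it can only approximate the pure server response time.

```python
import time
import urllib.request

def response_time_ms(url, opener=urllib.request.urlopen):
    """Milliseconds from sending the request until the first body byte arrives."""
    start = time.perf_counter()
    with opener(url) as response:
        response.read(1)  # read only the first byte, not the whole page
    return (time.perf_counter() - start) * 1000

def is_fast_enough(elapsed_ms, threshold_ms=200):
    """Compare a measurement with the threshold recommended in the text."""
    return elapsed_ms < threshold_ms
```

Because a single request can be an outlier, it makes sense to repeat the measurement several times and at different times of day before drawing conclusions.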
The crawling rate of a website by the Googlebot is limited by technological boundaries. But even if the limit is not reached, the Googlebot may crawl far fewer URLs than the limit would allow. The so-called "crawl demand" is responsible for this. Briefly, the Googlebot decides whether it is worth it to crawl a website or whether the crawl budget should instead be saved.
In the above-named blog post on the crawl budget, Google says that more heavily visited sites, for example, are crawled more often. Prioritization also plays a role in deciding how high the crawl demand is. The "scheduler" classifies URLs on its list according to priority. Here are some possible gradations:
You can find a precise overview of the Googlebot's requests in the evaluation of server log files.
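A log file evaluation of this kind could look like the following sketch. It assumes that the server writes the common "combined" access log format and identifies the Googlebot only by its user agent string; the function name `googlebot_hits` is hypothetical. Since a user agent can be faked, a reliable analysis would additionally verify the requesting IP address, for example via a reverse DNS lookup.

```python
import re
from collections import Counter

# Matches the request, status code, referrer, and user agent of a
# combined-format log line (assumed format, adjust to your server).
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits(lines):
    """Count Googlebot requests per (path, status code) pair."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            hits[(match.group("path"), int(match.group("status")))] += 1
    return hits
```

The resulting counter shows which URLs the Googlebot actually requests and how often it runs into error status codes, which helps to spot wasted crawl budget.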
Avoid orphaned pages: Orphaned pages are URLs that cannot be reached through the website's internal linking. They are just as useless for the Googlebot as they are for users.
List URLs in an XML sitemap: With the help of an XML sitemap, webmasters can submit all relevant URLs of a domain to the Google Search Console. In this way, the Googlebot can recognize which URLs are available and can hand them over to the scheduler.
Use robots.txt: With the help of the robots.txt file, the Googlebot can be steered towards all important website areas. Using the robots.txt, for example, the crawling of contact forms can be avoided. Note that robots.txt only blocks crawling; a blocked URL can still end up in the index if other pages link to it.
Check the site's cache: By using Google Site Search, individual URLs of a domain can be called up. By clicking on "cache" you can check when the site was last recorded in the index. If the cache was set up a long time ago and important content on the website has changed, the URL can also be manually sent to the index via the Search Console.
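Such unreachable URLs can be found by comparing the list of all known URLs with the URLs that can be reached by following internal links from the homepage. The sketch below assumes that both inputs already exist, for example from a sitemap export and a crawl of the site; the function name `find_orphans` and the data layout are hypothetical.

```python
from collections import deque

def find_orphans(known_urls, links, start):
    """Return known URLs that cannot be reached from `start` via internal links.

    known_urls: iterable of all URLs that exist on the site.
    links:      dict mapping each URL to the list of URLs it links to.
    start:      URL where the walk begins, typically the homepage.
    """
    reachable = {start}
    queue = deque([start])
    while queue:  # breadth-first walk over the internal link graph
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in reachable:
                reachable.add(target)
                queue.append(target)
    return sorted(set(known_urls) - reachable)
```

Every URL in the result should either receive an internal link or be removed if it is no longer needed.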
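A minimal XML sitemap can be generated with a few lines of code. The sketch below uses only the `urlset`, `url`, and `loc` elements of the sitemap protocol; the function name `build_sitemap` is hypothetical, and optional elements such as `lastmod` are left out for brevity.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return a sitemap XML string that lists the given absolute URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes special characters such as "&" automatically.
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")
```

The returned string can be saved as a file on the server (prefixed with an XML declaration) and then submitted in the Search Console.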
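Before a robots.txt file goes live, its rules can be tested locally. The following sketch uses the robots.txt parser from the Python standard library; the paths `/contact/` and `/search` and the helper name `can_crawl` are hypothetical examples. The standard library parser may interpret edge cases differently from the Googlebot, so the result is only an approximation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep all bots away from contact forms and search results.
ROBOTS_TXT = """\
User-agent: *
Disallow: /contact/
Disallow: /search
"""

def can_crawl(robots_txt, user_agent, url):
    """Check whether `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Calling `can_crawl(ROBOTS_TXT, "Googlebot", "https://example.com/contact/form")` returns `False`, while a product page on the same domain is still allowed.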
Figure 3: Send URLs to the Google index.
Check faceted navigation: A faceted navigation can generate countless URLs through filter options. These "filter URLs" are mostly of little value for the Googlebot. Therefore, faceted navigation frequently wastes crawl budget. To avoid this, the structure of this navigation should be checked and defined as precisely as possible. Thus, for example, superfluous URLs can be provided with a canonical tag that points to the "original site". Likewise, it is possible to insert a "noindex, follow" meta tag into the <head> area of the unnecessary URL. Search parameters in the URLs can also be excluded from crawling and indexing using the parameter tool of the Google Search Console.
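How a filter URL maps to its "original site" can be sketched as follows. The parameter names in `FILTER_PARAMS` and the function names are hypothetical; which parameters count as pure filters has to be decided for each shop individually.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical filter parameters that do not change the main content.
FILTER_PARAMS = {"color", "size", "sort"}

def canonical_url(url, filter_params=FILTER_PARAMS):
    """Strip filter parameters and the fragment from a faceted URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in filter_params
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

def canonical_tag(url):
    """Build the canonical link element for the <head> of a filter URL."""
    return f'<link rel="canonical" href="{escape(canonical_url(url))}">'
```

All filter combinations of a category page then point to the same canonical URL, so the Googlebot can concentrate on one version instead of many near-duplicates.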
Avoid endless URLs: This type of URL can arise from site-wide search functions as well as through "further" links on the site. Exclusion of internal search results can bring significant savings in the crawl budget.
Use 404 error pages: To avoid the endless crawling of soft 404 pages, URLs that are not available should return the 404 status code (not found). In this way, you can stop the Googlebot from crawling these URLs further and thereby protect your crawl budget.
The Googlebot only has a limited amount of time to crawl your site. You can improve the crawling by remedying technical errors. At the same time, it is important that Google recognizes a crawl demand, and this is the point where the crawl budget becomes a core topic in terms of search engine optimization. After all, the quality of your website determines how frequently the Googlebot visits your website. Through unique and high-quality content, you can make sure the crawl budget is used in the most efficient way possible.
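A soft 404 is a page that tells the visitor that the content is missing but still answers with the status code 200. The sketch below flags such pages with a simple heuristic. The phrases in `ERROR_PHRASES` and the function name `classify_response` are assumptions for illustration; Google's own soft 404 detection is not public and certainly works differently.

```python
# Hypothetical phrases that typically appear on error pages.
ERROR_PHRASES = ("page not found", "no longer available", "does not exist")

def classify_response(status_code, body):
    """Return 'not found', 'soft 404', or 'other' for a fetched page."""
    if status_code in (404, 410):
        return "not found"  # correct signal: the URL is gone
    text = body.lower()
    if status_code == 200 and any(phrase in text for phrase in ERROR_PHRASES):
        return "soft 404"   # error text delivered with a success code
    return "other"
```

URLs classified as "soft 404" are candidates for a server configuration fix so that they return a real 404 status code.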
Published on 03/29/2017 by Eva Wagner.
Eva is an experienced content marketer. Until May 2018 she was a member of the online marketing team at Ryte. Using her creativity and her knowledge of current topics, she was responsible for the German Ryte Magazine and the Ryte Wiki. She also organized Ryte’s presence at major trade fairs such as the dmexco in Cologne.