Crawl Budget

Crawl budget is defined as the maximum number of pages that Google crawls on a website.


Google itself determines how many subpages it crawls per URL. This is not the same for all websites; according to Matt Cutts, it is based primarily on the PageRank of a page: the higher the PageRank, the greater the crawl budget. The crawl budget also determines how often the most important pages of a website get crawled and how often an in-depth crawl is executed.


Differentiation of the index budget

The term index budget is distinct from crawl budget: it determines how many URLs can be indexed. The difference becomes apparent when a website contains multiple pages that return a 404 error code. Each requested page counts against the crawl budget, but because pages that return an error message cannot be indexed, the index budget is not fully utilized.
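The distinction can be sketched in a few lines of Python. This is an illustrative helper, not any real Google API: every fetched URL consumes crawl budget, but only successfully returned pages (status 200 here, as a simplification) can go on to consume index budget.

```python
def budget_usage(status_codes):
    """Given the HTTP status codes of crawled URLs, return how much
    crawl budget and index budget those requests consumed."""
    crawl_budget_used = len(status_codes)  # every request costs crawl budget
    # Only pages that return successfully can be indexed at all
    index_budget_used = sum(1 for code in status_codes if code == 200)
    return crawl_budget_used, index_budget_used

# Three pages crawled, one of them a 404 error:
print(budget_usage([200, 404, 200]))  # → (3, 2)
```

The 404 page costs a crawl but contributes nothing to the index, which is exactly why such pages waste budget.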


The crawl budget poses a problem for larger websites with many subpages: not all subpages will be crawled, only a portion of them. Accordingly, not all subpages can be indexed. This in turn means that site operators might be losing traffic because relevant pages were not indexed.

Importance for search engine optimization

A whole section of search engine optimization is devoted to this situation. The aim is to direct the Googlebot so that the existing crawl budget gets used wisely and the high-quality pages that matter most to the website operator get indexed. Pages of minor importance must be identified first; these include pages with poor content or little information, as well as faulty pages that return a 404 error code. Such pages must be excluded from the crawl so that the crawl budget remains available for the better-quality pages. The important subpages then have to be designed so that spiders crawl them as a priority. Possible actions as part of crawl optimization include:

  • Implementation of a flat page architecture in which the subpage paths are as short as possible and require only a few clicks
  • Internal linking of pages with a lot of backlinks to pages that are supposed to be crawled more frequently
  • Very good internal linking of the most important pages
  • Exclusion of unimportant pages from crawling through the robots.txt (such as log-in pages, contact forms, images)
  • Exclude crawling by using metadata (noindex, nofollow)
  • Offering an XML sitemap with a URL list of the most important subpages
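The robots.txt exclusion from the list above could look roughly like the following sketch; the paths and domain are hypothetical examples, not recommendations for any specific site.

```
# Hypothetical robots.txt keeping the Googlebot away from low-value pages
User-agent: *
Disallow: /login/
Disallow: /contact-form/
Disallow: /images/

# Point crawlers at the list of the most important subpages
Sitemap: https://www.example.com/sitemap.xml
```

Note that robots.txt only blocks crawling; the metadata approach from the list (a `<meta name="robots" content="noindex, nofollow">` tag in the page head) requires the page to remain crawlable, since the Googlebot must fetch the page to see the tag.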

If the portfolio of crawled and indexed pages is improved through crawl optimization, the ranking may be improved as well. Pages with a good ranking are crawled more frequently, which in turn brings benefits.

An informative lecture on “Crawl Budget Best Practices” was given by Jan Hendrik Jacob Merlin at SEOkomm 2015.
