Googlebot needs to crawl your website before it can be displayed to users in the search results. Although this is an essential step, it doesn’t get as much attention as many other topics. I think that’s partly because Google doesn’t share a great deal of information on exactly how Googlebot crawls the web.
After seeing many of our clients struggle to get their websites properly crawled and indexed, Bartosz Góralewicz and I dug through a pile of Google’s patents on crawling, rendering, and indexing to better understand the whole process.
Some of our findings were extremely surprising, while others confirmed our previous theories.
Here are 5 things I learned that you may not know about how Googlebot works.
Googlebot won’t visit every URL it finds on the web. The bigger a website is, the more at risk it is of some of its URLs not getting crawled and indexed.
Why won’t Googlebot simply visit every URL it can find on the web? There are two reasons: the web is practically infinite, and Google’s crawling resources are limited.
The mechanism for choosing which URLs to visit is described in Google’s patent "Method and apparatus for managing a backlog of pending URL crawls":
"The pending URL crawl is rejected from the backlog if the priority of the pending URL crawl fails the priority threshold"
"Various criteria are applied to the requested URL crawls, so that less important URL crawls are rejected early from the backlog data structure."
These quotes suggest that Google is assigning a crawling priority to every URL, and may reject crawling some URLs that don’t meet the priority criteria.
The priority assigned to URLs is determined by two factors:
"The priority can be higher based on the popularity of the content or IP address/domain name, and the importance of maintaining the freshness of the rapidly changing content such as breaking news. Because crawl capacity is a scarce resource, crawl capacity is conserved with the priority scores."
What exactly makes a URL popular? Google’s "Minimizing visibility of stale content in web searching including revising web crawl intervals of documents" patent defines the URL’s popularity as a combination of two factors: view rate, and PageRank.
PageRank is mentioned in this context in other patents too, such as "Scheduler for search engine crawler".
But there’s one more thing you should know: when your server responds slowly, the priority threshold that your URLs have to meet goes up.
"The priority threshold is adjusted, based on an updated probability estimate of satisfying requested URL crawls. This probability estimate is based on the estimated fraction of requested URL crawls that can be satisfied. The fraction of requested URL crawls that can be satisfied has as the numerator the average request interval, or the difference in arrival time between URL crawl requests."
To summarize, Googlebot may skip crawling some of your URLs if they don’t meet a priority threshold that’s based on the URL’s PageRank and the number of views it gets.
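This mechanism can be sketched in a few lines of code. Note that the patents name the signals (PageRank, view rate, and the fraction of requested crawls a server can satisfy) but not the actual formula, so the scoring and weights below are purely hypothetical:

```python
from dataclasses import dataclass

@dataclass
class CrawlRequest:
    url: str
    pagerank: float   # normalized 0.0-1.0 (hypothetical scale)
    view_rate: float  # views per day (hypothetical scale)

def priority(req: CrawlRequest) -> float:
    # Hypothetical weighting of the two popularity signals the
    # patents mention; Google's real formula is not public.
    return 0.7 * req.pagerank + 0.3 * min(req.view_rate / 1000, 1.0)

def schedule(backlog: list[CrawlRequest], threshold: float) -> list[CrawlRequest]:
    # URLs whose priority fails the threshold are rejected from the backlog.
    return [r for r in backlog if priority(r) >= threshold]

def adjusted_threshold(threshold: float, satisfied_fraction: float) -> float:
    # When a server satisfies fewer of the requested crawls (e.g. it
    # responds slowly), the bar every URL has to clear goes up.
    return threshold / max(satisfied_fraction, 0.01)
```

In this toy model, a slow server that satisfies only half of the requested crawls doubles the threshold, so borderline URLs drop out of the backlog first.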
This has strong implications for every big website.
If a page is not crawled, it won’t be indexed and won’t be displayed in the search results.
Google wants the search results to be as fresh and up-to-date as possible. This is only possible when there’s a mechanism in place for recrawling content that’s already indexed.
In the "Minimizing visibility of stale content in web searching" patent, I found information on how this mechanism is structured.
Google is dividing pages into tiers based on how often the algorithm decides they need to be recrawled.
"In one embodiment, documents are partitioned into multiple tiers, each tier including a plurality of documents sharing similar web crawl intervals."
So if your pages aren’t crawled as often as you’d like, they are most likely in a tier of documents with a longer crawl interval.
However, do not despair! Your pages don’t need to stay in that tier forever - they can be moved.
Every crawl is a chance for you to show Google that the page is worth recrawling more frequently in the future.
"After each crawl, the search engine re-evaluates a document's web crawl interval and determines if the document should be moved from its current tier to another tier."
It’s clear that if Google sees a page is changing frequently, it could be moved to a different tier. But it’s not enough to change some minor cosmetic elements - Google is analyzing both the quality and quantity of changes you make to your pages.
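One way to picture the tier system: each tier maps to a crawl interval, and after each crawl a document can move up or down. The intervals and the movement rule below are illustrative assumptions; the patent only says that documents sharing similar intervals are grouped together and re-evaluated after each crawl:

```python
# Illustrative tiers; the patent doesn't publish the actual intervals.
CRAWL_INTERVAL_DAYS = [1, 7, 30, 90]  # tier 0 = crawled daily

def reevaluate_tier(tier: int, changed_materially: bool) -> int:
    """After a crawl, move a document toward a shorter interval if its
    content changed materially, or toward a longer one if it didn't."""
    if changed_materially:
        return max(tier - 1, 0)                          # crawl more often
    return min(tier + 1, len(CRAWL_INTERVAL_DAYS) - 1)   # crawl less often
```

Under this model, a page that keeps changing materially climbs toward the daily tier, while a stale page drifts toward the 90-day tier.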
According to the "Minimizing visibility of stale content in web searching including revising web crawl intervals of documents" patent, Google doesn’t reindex a page after every crawl.
"If the document has changed materially since the last crawl, the scheduler sends a notice to a content indexer (not shown), which replaces index entries for the prior version of the document with index entries for the current version of the document. Next, the scheduler computes a new web crawl interval for the document based on its old interval and additional information, e.g., the document's importance (as measured by a score, such as PageRank), update rate and/or click rate. If the document's content has not been changed or if the content changes are non-critical, there is no need to re-index the document."
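The patent doesn’t define what "changed materially" means, but the decision can be approximated by fingerprinting the page’s main text and ignoring cosmetic differences. This is a simplified stand-in for illustration, not Google’s actual change detection:

```python
import hashlib
import re

def content_fingerprint(html: str) -> str:
    # Strip markup and normalize whitespace/case, so cosmetic edits
    # (tags, spacing, capitalization) don't change the fingerprint.
    text = re.sub(r"<[^>]+>", " ", html)
    text = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def should_reindex(old_html: str, new_html: str) -> bool:
    # Reindex only when the normalized content actually differs.
    return content_fingerprint(old_html) != content_fingerprint(new_html)
```

Swapping a `<p>` for a `<div>` leaves the fingerprint unchanged, while rewriting the text changes it.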
I’ve seen it in the wild multiple times.
Moreover, I did some experiments on existing pages at Onely.com. I noticed that if I changed only a small part of the content, Google was not reindexing it.
If you have a news website and update your posts frequently, check if Google reindexes them quickly enough. If that’s not the case, you can be sure that there’s unused potential for you in Google News.
In the previous quote, did you notice how click rate was mentioned?
"Next, the scheduler computes a new web crawl interval for the document based on its old interval and additional information, e.g., the document's importance (as measured by a score, such as PageRank), update rate and/or click rate"
This quote suggests that click rate influences how often a URL is crawled.
Let’s imagine we have two URLs. One is visited by Google users 100 times a month, the other 10,000 times a month. All other things being equal, Google should revisit the second one more frequently.
According to the patent, PageRank is an important part of this, too. This is one more reason for you to make sure you’re properly using internal linking to connect various parts of your domain.
We just covered how, according to Google’s patents, PageRank heavily affects crawling.
The first implementation of the PageRank algorithm was not sophisticated, at least judging by current standards. It was relatively straightforward - if you got a link from an *important* page, you would rank higher than other pages.
However, the first implementation of PageRank was released over 20 years ago. Google has changed a lot since then.
I found interesting patents, such as "Ranking documents based on user behavior and/or feature data", showing that Google is well aware that some links on a given page are more prominent than others. As a result, Google may treat these links differently.
"This reasonable surfer model reflects the fact that not all of the links associated with a document are equally likely to be followed. Examples of unlikely followed links may include "Terms of Service" links, banner advertisements, and links unrelated to the document."
So Google is analyzing links based on various features. For instance, it may look at the font size and link placement.
"For example, model generating unit may generate a rule that indicates that links with anchor text greater than a particular font size have a higher probability of being selected than links with anchor text less than the particular font size. Additionally, or alternatively, model generating unit may generate a rule that indicates that links positioned closer to the top of a document have a higher probability of being selected than links positioned toward the bottom of the document."
It even seems that Google may create rules for assessing links on a website level. For instance, Google can see that links under “More Top Stories” are clicked more frequently so it can put more weight on them.
"(...) model generating unit may generate a rule that indicates that a link positioned under the "More Top Stories" heading on the cnn.com web site has a high probability of being selected. Additionally, or alternatively, model generating unit may generate a rule that indicates that a link associated with a target URL that contains the word "domainpark" has a low probability of being selected. Additionally, or alternatively, model generating unit may generate a rule that indicates that a link associated with a source document that contains a popup has a low probability of being selected."
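These rules can be read as a feature-based scoring model: each link feature nudges the probability of the link being followed up or down. The features below come from the patent’s examples (font size, position, footer-style links), but the weights and the 0-to-1 scale are invented for illustration:

```python
def link_follow_probability(font_size_px: int,
                            position_from_top: float,
                            is_footer_link: bool) -> float:
    """Toy 'reasonable surfer' score from 0.0 (ignored) to 1.0 (very
    likely followed). Weights are hypothetical, not from the patent."""
    p = 0.5
    if font_size_px >= 16:          # prominent anchor text
        p += 0.2
    if position_from_top < 0.3:     # upper third of the page
        p += 0.2
    if is_footer_link:              # "Terms of Service"-style links
        p -= 0.4
    return min(max(p, 0.0), 1.0)
```

In this sketch, a large in-content link near the top of the page scores far higher than a small footer link, which matches the patent’s intuition even though the numbers are made up.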
As a side note, in a conversation with Barry Schwartz and Danny Sullivan in 2016, Gary Illyes confirmed that Google applies labels to links, such as Penguin-impacted or footer links.
"Basically, we have tons of link labels; for example, it’s a footer link, basically, that has a lot lower value than an in-content link. Then another label would be a Penguin real-time label."
As you can see, crawling is far from a simple process of following every link that Googlebot can find. It’s genuinely complicated, and it has a direct impact on every website’s search visibility. I hope this article helped you understand crawling a little better, and that you can use this knowledge to improve how Googlebot crawls your website and rank better as a consequence.
Tomek and his team are always investigating exciting new topics around all things search. You can find articles like this and more over on their Onely blog.
Published on 08/25/2020 by Tomek Rudzki.