Deepbot
Deepbot is one part of the Googlebot web crawler, which crawls the Internet at regular intervals in order to add as many sites and as much content as possible to the search index (indexing). The Deepbot focuses on the depth of websites and follows all links known to it through the existing index. It moves automatically from link to link, captures various data (see ranking factors) and feeds it into the larger system. All content that the Deepbot encounters during this deep crawling process is added to the index step by step. This database is the basis for the algorithmic calculations that ultimately produce the ranking. These processes directly affect the ranking of sites in the SERPs because the system works with new records; this phenomenon is also known as the Google Dance and amounts to a data refresh. The Deepbot currently visits websites at an interval of about one month and then spends roughly a week crawling millions of web documents on the Internet.
General information
Websites can be represented as tree structures or graphs, which can be traversed automatically by a computer program. The program, also described with the terms bot, spider, or crawler, scans the structure of the website and the content located at its branches (links). The home page acts as the root node from which several subpages are reachable; the links that lead to these subpages are called edges. The crawler consists to a large extent of algorithms that describe the possible routes through these structures and determine which data is significant for possible changes in the ranking. The two parts of the Googlebot, the Freshbot and the Deepbot, are currently implemented at the infrastructure level (see Google Caffeine).[1]
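To illustrate this graph model, the following minimal Python sketch traverses a single site breadth-first, starting at the root node (the home page) and following the edges (links) to the subpages. It only illustrates the principle, not Google's implementation; the parsing, the error handling, and the limit of 50 pages are simplifying assumptions.

# Minimal breadth-first crawl of one site, treating pages as nodes and links as edges.
# Illustrative sketch only; real crawlers handle robots.txt, politeness delays,
# duplicates and errors far more carefully.
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(root_url, max_pages=50):
    seen = {root_url}            # nodes already discovered
    queue = deque([root_url])    # crawl frontier
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except OSError:
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)
            # stay within the same host, i.e. within this site's tree structure
            if urlparse(link).netloc == urlparse(root_url).netloc and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

Every reachable subpage is discovered exactly once, which corresponds to walking the tree or graph structure described above.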
How it works
The Googlebot basically consists of two components:
- Freshbot: The Freshbot focuses on new content and thus on websites that update their content at very short intervals, for example online magazines, news sites, or blogs.
- Deepbot: The Deepbot examines the depth structures of websites and collects as many links as possible for the index. The Deepbot harvests links and follows them as far as it can.
While a Freshcrawl targets sites with constantly changing content, the Deepcrawl is characterized by the fact that all the subpages of a website are read: the crawl goes into the depth structure of the website. The subpages do not necessarily have to provide new content; they are simply captured in their entirety by the Deepbot and listed in an inverted index. The goal of the Deepbot is to get a vertical overview of the structure and the content of a website so that results relevant to search queries can later be displayed within a very short time. Thanks to this index structure, Google can access the datasets matching a search query within milliseconds.
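The idea of the inverted index can be sketched as follows: instead of storing for each page which terms it contains, the index stores for each term which pages contain it, so that a query term leads directly to the matching documents. The URLs and texts below are hypothetical, and the structure is a deliberate simplification of a real search index.

# Toy inverted index: maps each term to the set of documents containing it.
from collections import defaultdict

documents = {                                            # hypothetical crawled pages
    "example.com/":      "deep crawling of websites",
    "example.com/blog":  "fresh content and news",
    "example.com/about": "crawling and indexing of content",
}

inverted_index = defaultdict(set)
for url, text in documents.items():
    for term in text.lower().split():
        inverted_index[term].add(url)

# A query term now leads directly to the matching documents.
print(sorted(inverted_index["crawling"]))
# -> ['example.com/', 'example.com/about']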
The Deepbot partly gets its instructions from the other part of the Google crawler: the Freshbot constantly crawls the Internet and adds links to the index that the Deepbot can then work through. When this new content is indexed, fluctuations in the ranking can occur, which experts call the Everflux effect. This, too, is a data refresh and not an algorithm update, as Matt Cutts once stressed.[2] The indexing results settle over time, as Google collects the data for the index through regular Deepcrawls while the Freshcrawl updates it continuously. The working principle of this search for links is called incremental search: small steps improve the system continuously. The Deepbot and the Freshbot are also active simultaneously at various points in the Internet infrastructure.
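The incremental principle can be pictured roughly as follows; the strict division of labour between a "fresh" discovery step and a "deep" processing step shown here is a simplifying assumption for illustration, not a description of Google's internal architecture.

# Simplified incremental crawl state: newly discovered URLs are merged into a
# persistent frontier in small steps instead of rebuilding everything at once.
known_urls = set()        # everything the index already knows about
frontier = []             # URLs still waiting for a deep crawl

def discover(urls):
    """Freshbot-like step: register new links as they are found."""
    for url in urls:
        if url not in known_urls:
            known_urls.add(url)
            frontier.append(url)

def deep_crawl_step(batch_size=2):
    """Deepbot-like step: work through a small batch of the frontier."""
    batch, remaining = frontier[:batch_size], frontier[batch_size:]
    frontier[:] = remaining
    for url in batch:
        print("deep crawling", url)   # fetching, parsing and indexing would happen here

discover(["example.com/", "example.com/news"])
deep_crawl_step()
discover(["example.com/news/today"])
deep_crawl_step()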
Relevance to practice
Because each crawl is a communication between a client (bot, crawler, spider) and a server, these processes can at least partially be reconstructed. As soon as a bot accesses a website, the server registers the access and notes it in the log files. The IP address and the user agent indicate which bot it is. The bot acts like a browser without a graphical user interface; the term headless crawling has become prevalent for this. How the Googlebot sees a website can be viewed using the “Fetch as Google” tool.
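As an illustration, the following short Python sketch reads a hypothetical access log line in the widespread combined log format and extracts exactly the two pieces of information mentioned above, the IP address and the user agent. A user agent that claims to be the Googlebot can be forged, which is why the DNS verification described next is useful.

import re

# Hypothetical log line in combined log format; real logs vary by server setup.
line = ('66.249.66.1 - - [15/Sep/2016:10:00:00 +0200] "GET / HTTP/1.1" 200 1234 '
        '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')

match = re.match(r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "(?P<agent>[^"]*)"', line)
if match:
    ip, agent = match.group("ip"), match.group("agent")
    claims_googlebot = "Googlebot" in agent
    print(ip, claims_googlebot)   # 66.249.66.1 True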
The Googlebot can also be verified by performing a DNS lookup in both directions. This is advisable, for example, to rule out spambots or spoofing. Permanently blocking certain IP addresses is not a reliable solution, because Google can change the address ranges of the Googlebot.[3]
- Reverse DNS lookup: Using the IP address from the server log files, the associated host name can be retrieved with the command host.
- It is then checked whether this host name ends in googlebot.com or google.com.
- Regular (forward) DNS lookup: With the command host and the host name retrieved in step one, the IP address of that host name can be output and compared with the address from the log files.
> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
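The same two-way check can be automated, for example with Python's standard socket module; the IP address used here is only the example address from the log output above.

import socket

def is_real_googlebot(ip):
    """Reverse lookup of the IP, check of the host name, then forward lookup again."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]              # reverse DNS lookup
    except OSError:
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]   # forward DNS lookup
    except OSError:
        return False

print(is_real_googlebot("66.249.66.1"))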
If the data matches, it is indeed the Googlebot. Other entries in the log files can be examined according to the same scheme and excluded if necessary. There are several ways to control crawling and indexing: the robots.txt file, which should be regarded as a non-binding instruction for crawlers, or the nofollow value, which can be set in the robots meta tag for a whole page or as a rel attribute on an individual link and tells the crawler not to follow the links concerned. In general, it is beneficial to submit a sitemap to search engines to give them an overview of the structure of the website and its content.
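A generic example of such instructions; the path is a placeholder, and the directives are requests that well-behaved crawlers such as the Googlebot follow rather than technical barriers. The Sitemap line points crawlers to the submitted sitemap.

User-agent: *
Disallow: /internal/

Sitemap: https://www.example.com/sitemap.xml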
Relevance to search engine optimization
Only the two components Deepbot and Freshbot and their specific procedures make the inclusion of web documents in the Google search index possible. In this way, websites and their subpages are made available to Google users and all content is updated as promptly as possible. Crawling and indexing draw on data and methods from the following subject areas:
- Information retrieval
- Data mining
- Web scraping
- Knowledge representation in information systems
However, it can be assumed that Google keeps these procedures, methods, and the underlying infrastructure confidential. The way Google analyzes and evaluates websites is a key part of the search engine giant's business model and is constantly being developed further in line with the latest research. The technology is now so advanced that experts speak of instant indexing.
However, these procedures also consume part of the bandwidth of the Internet connection, since HTTP communication is necessary. Frequent accesses by bots can increase server load, so that the resources available for real users are sometimes inadequate during these periods. It can therefore be advisable to cap the crawling frequency: webmasters can limit the number of queries per second so that crawling does not take up too many resources.[4]
Moreover, webmasters and analysts can receive incorrect data in Google Analytics if the fine-grained settings for crawling and indexing have not been made. Excluding certain bots from data views is advisable, for example, in order to distinguish real user visits from those made by bots. In general, search engines can be told in various ways which websites and content they should crawl and index and which they should not.
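As a rough illustration of this separation, and assuming a server log in the combined format shown earlier, the following sketch counts how many requests come from bots versus presumably real users. Identifying bots only by common user-agent keywords is a simplification; the log file name is a placeholder.

# Hypothetical separation of bot traffic from user traffic based on the user agent.
BOT_KEYWORDS = ("googlebot", "bingbot", "crawler", "spider", "bot")

def count_visits(log_lines):
    bots = users = 0
    for line in log_lines:
        agent = line.rsplit('"', 2)[-2].lower()   # last quoted field = user agent
        if any(keyword in agent for keyword in BOT_KEYWORDS):
            bots += 1
        else:
            users += 1
    return bots, users

with open("access.log") as f:                     # hypothetical log file name
    print(count_visits(f))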
References
- ↑ Our new search index: Caffeine googleblog.blogspot.de. Accessed on 09/15/2016
- ↑ Explaining algorithm updates and data refreshes mattcutts.com. Accessed on 09/15/2016
- ↑ Verifying Googlebot support.google.com. Accessed on 09/15/2016
- ↑ Change Googlebots Crawl Rate support.google.com. Accessed on 09/15/2016