A crawler is a computer program that automatically searches documents on the Web. Crawlers are primarily programmed for repetitive actions so that browsing is fully automated. Search engines use crawlers most frequently to browse the WWW and build an index. Other crawlers search for different types of information, such as RSS feeds and e-mail addresses. The term comes from WebCrawler, the first search engine on the Internet. Synonyms are also "bot" or "spider." The best-known web crawler is the Googlebot.
In principle, a crawler is like a librarian: it looks for information on the Web, assigns it to certain categories, and then indexes and catalogues it so that the crawled information is retrievable and can be evaluated. Unlike a librarian, however, who works self-sufficiently and organizes the work on his own, a crawler is not independent.
The operations of these computer programs must be defined before a crawl is initiated; every task is thus specified in advance. The crawler then executes these instructions automatically. Classically, the crawler's results are compiled into an index that can be accessed through output software.
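This workflow, fetching a page, following its links, and indexing what was found, can be sketched in a few lines of Python. This is a minimal illustration rather than a production crawler; the injected `fetch` callable and the tiny two-page "web" in the demo are assumptions made for the example:

```python
from collections import deque
from html.parser import HTMLParser

class LinkAndTextParser(HTMLParser):
    """Collects href targets and visible text from one HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text.append(data)

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl: fetch pages, follow links, build a word index.

    `fetch` maps a URL to its HTML; it is injected so the sketch runs
    without real network access (a real crawler would do an HTTP GET).
    """
    index = {}                    # word -> set of URLs containing it
    queue = deque([start_url])
    seen = {start_url}
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue              # unreachable page: skip it
        pages += 1
        parser = LinkAndTextParser()
        parser.feed(html)
        for word in " ".join(parser.text).lower().split():
            index.setdefault(word, set()).add(url)
        for link in parser.links:
            if link not in seen:  # visit every page only once
                seen.add(link)
                queue.append(link)
    return index

# Demo on a hypothetical two-page web (dict instead of real HTTP):
fake_web = {
    "a.html": '<p>search engines</p><a href="b.html">next</a>',
    "b.html": '<p>crawler index</p>',
}
index = crawl("a.html", fake_web.__getitem__)
print(index["crawler"])  # {'b.html'}
```

The index produced this way is exactly the kind of structure an output layer (such as a search results page) would query.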
What information a crawler gathers from the Web depends on its particular instructions. Graphics can visualize the link relationships that a crawler has uncovered:
Graphic illustration of a crawler (source: neuroproductions.be)
The classic goal of a crawler is to create an index; crawlers are thus the basis for the work of search engines. They first scour the Web for content and then make the results available to users. Focused crawlers, for example, concentrate on current, topically relevant websites when indexing.
But web crawlers are also used in other areas:
Unlike a scraper, a crawler only collects and prepares data. Scraping, by contrast, is a black hat technique that copies content from other sites and publishes it, unchanged or slightly modified, on one's own website. While a crawler mostly deals with metadata that is not immediately visible to the user, a scraper extracts tangible content.
If you don't want certain crawlers to browse your website, you can exclude their user agents via robots.txt. However, that cannot prevent content from being indexed by search engines; the noindex meta tag or the canonical tag serves this purpose better.
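As a sketch of these two mechanisms (the bot name and path are hypothetical examples): a robots.txt entry that excludes a crawler's user agent from part of a site looks like this:

```
# robots.txt at the site root: ask "ExampleBot" not to crawl /private/
User-agent: ExampleBot
Disallow: /private/
```

Keeping a page out of the search engine index, on the other hand, is done in the page itself:

```html
<!-- in the <head> of a page that should not appear in the index -->
<meta name="robots" content="noindex">
```

Note that a crawler blocked by robots.txt never sees the page, and therefore never sees a noindex tag on it, which is why the two instruments should not be combined on the same URL.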
Web crawlers like the Googlebot achieve their purpose of ranking websites in the SERPs through crawling and indexing. They follow permanent links in the WWW and on websites. Each crawler has a limited timeframe and budget available per website. By optimizing the website structure, such as the navigation and file sizes, website operators can make better use of the Googlebot's crawl budget. At the same time, the budget increases with a large number of inbound links and high traffic to the website. The important instruments for controlling crawlers such as the Googlebot are the robots.txt file and the XML sitemap submitted in the Google Search Console. In the GSC, it can also be tested whether all relevant areas of a website can be reached and indexed by the Googlebot.
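For illustration, a minimal XML sitemap that could be submitted in the Google Search Console might look like this (the URL and date are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
</urlset>
```

Listing every relevant URL here helps the crawler spend its limited budget on the pages that matter rather than on discovering them by following links alone.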