Headless crawling is the automated browsing of the Internet and of individual domains using a headless browser, i.e. a web browser without a graphical user interface. Headless crawling covers many approaches and methods for extracting, storing, analyzing, and further processing data. Websites, web applications, and individual website features can also be tested and checked automatically. Headless crawling overlaps thematically with fields such as information retrieval, data mining, scraping, and test automation.
Until recently, Google recommended the use of headless browsers for crawling dynamic websites. Operators had to provide an HTML snapshot of their website so that Google could read and assess its content. This so-called AJAX crawling scheme is now obsolete and no longer used. Instead, web content is expected to be served regardless of the technology used, including device, browser, and Internet connection, an approach known as progressive enhancement. Headless crawling is essentially part of any search engine: web content gets browsed, but not fully rendered or displayed graphically to a user.
At the center of headless crawling is the headless browser, a program that reads web content, passes it on to other programs, or outputs it in text form as files, lists, or matrices. Such browsers access websites from within a server infrastructure; optionally, a virtual server or a proxy server can be used. From there, the headless browser attempts to access a URL; this is the starting point of the crawling process, which is initiated with a command-line or script command. Depending on the configuration, the browser can then discover further URLs. The content stored there can be processed; even outputting the positions of links on a page is possible. Often, however, an API that transfers the data to the processing program is necessary for this purpose.
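The link-discovery step described above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a production crawler: the inline HTML string stands in for a page a headless browser would have fetched from a start URL, and the class name is hypothetical.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects every href found in <a> tags, together with its
    position (line and column) in the document source."""
    def __init__(self):
        super().__init__()
        self.links = []  # list of (line, column, url) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    line, col = self.getpos()
                    self.links.append((line, col, value))

# Inline HTML stands in for content fetched from a start URL;
# a real crawler would download it first, e.g. with urllib.request.
html = """<html><body>
<a href="/about">About</a>
<a href="https://example.com/docs">Docs</a>
</body></html>"""

parser = LinkExtractor()
parser.feed(html)
for line, col, url in parser.links:
    print(f"line {line}, col {col}: {url}")
```

Each discovered URL could then be queued for the next crawl step, so the process continues until no new URLs are found or a configured limit is reached.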
What makes headless crawling special is machine-to-machine (M2M) communication. Neither the requested URLs nor the retrieved web content are displayed to an end user, as would be the case with a conventional browser. Instead, the headless browser forwards the retrieved data in formats that are defined in advance and can be processed automatically later. If extensively implemented, a headless browser can usually handle different programming languages, scripts, and processes thanks to an API that communicates with other programs or infrastructures via HTTP requests or TCP. This principle is often used to extract large amounts of data, which ultimately raises the question of how legal it is to collect and process such data. In principle, copyright, privacy agreements, and the privacy of users could be violated. The same applies to price comparison portals, search engines, and meta-search providers.
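Handing crawl results to another program in a predefined format can be sketched as follows. The record fields and the JSON Lines layout are assumptions chosen for illustration, not a standard schema; any agreed machine-readable format would serve the same purpose.

```python
import json

# Hypothetical records a headless crawl might yield; the field
# names here are illustrative assumptions, not a fixed standard.
records = [
    {"url": "https://example.com/", "status": 200, "links_found": 12},
    {"url": "https://example.com/about", "status": 200, "links_found": 4},
]

# Serialize into a predefined, machine-readable format (JSON Lines:
# one JSON object per line) that a downstream program can consume,
# e.g. read from a file or received via an HTTP request.
payload = "\n".join(json.dumps(r, sort_keys=True) for r in records)
print(payload)
```

Because the format is fixed in advance, the receiving side can parse each line independently, which is what makes the M2M hand-off automatic rather than something a human has to read off a screen.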
Headless crawling is not only used by search engines; it also appears in other use cases, for example automated testing of websites and web applications.
Headless crawling is an important aspect of SEO. As already mentioned, the principle is (most likely) used by various search engines to crawl websites and web applications, even though the AJAX crawling scheme is now obsolete. In several places in its Quality Guidelines, Google recommends using a text-based browser such as Lynx to view websites the way Google sees them. It can be assumed that the crawling capabilities of Google and other search engines go far beyond both text-based browsers and what is officially communicated. Accordingly, it makes sense to study headless crawling in detail: with this principle, websites can be tested thoroughly, and SEOs can venture a look behind the scenes of the search engine operators without losing sight of their users.
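What a text-based browser such as Lynx "sees" can be approximated by stripping markup and non-visible elements from a page. The sketch below, using only the Python standard library, is a rough approximation under that assumption; real text-based browsers do considerably more (layout, link numbering, forms).

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Keeps visible text content and drops tags, scripts, and
    styles -- a rough approximation of a text-based browser's view."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

html = """<html><head><script>var x = 1;</script></head>
<body><h1>Welcome</h1><p>Visible copy.</p></body></html>"""

t = TextOnly()
t.feed(html)
print(" ".join(t.parts))
```

Comparing this stripped-down view with the fully rendered page quickly shows which content depends on scripts or styling to appear at all, which is exactly the perspective the guideline aims at.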