Headless Crawling


Headless crawling is the automated browsing of the Internet and of individual domains using a headless browser, i.e. a web browser without a graphical user interface. Headless crawling comprises many approaches and methods for extracting, storing, analyzing, and further processing data. Websites, web applications, and individual website features can also be tested and checked automatically. Headless crawling overlaps thematically with topics such as information retrieval, data mining, scraping, and test automation.

General information

Until recently, Google recommended the use of headless browsers to crawl dynamic websites. Operators had to provide an HTML snapshot of their website so that Google could read and assess its content. This so-called AJAX crawling scheme is now deprecated and no longer used. Instead, web content should be provided regardless of the technology used, including the device, browser, and Internet connection, a principle known as progressive enhancement.[1] Headless crawling is essentially part of any search engine: web content is browsed, but not fully rendered or displayed graphically to a user.

What happens to the detected data is a question of approach. It is assumed that Google's search engine has used headless crawling since 2004, and that JavaScript has no longer been an obstacle for it since October 2015. Search engines can use headless crawling to evaluate websites: insofar as the crawler simulates a call to a website through a non-graphical interface, search engines can draw conclusions from this information and rate websites based on their behavior in the headless browser.[2]

How it works

At the center of headless crawling is the headless browser, a program that reads web content and passes it on to other programs or outputs it text-based in the form of files, lists, and matrices. Such browsers gain access to websites by being deployed in a server infrastructure; optionally, a virtual server or a proxy server can be used. From there, the headless browser attempts to access a URL, which is the starting point of the crawling process, initiated via the command line or a script.[3] Depending on the configuration, the browser can then discover further URLs, process the content stored there, and even determine the positions of links on the page. For this purpose, however, an API that transfers the data to the processing program is often necessary.
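The crawl step described above — fetch a page, read its content without rendering it, and derive the next URLs to visit — can be sketched with Python's standard library alone. The `LinkExtractor` class and the sample HTML below are hypothetical stand-ins for a full headless browser, which would fetch the start URL over the network instead:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags -- the URLs a
    headless crawler would visit in the next iteration."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page content; a real crawler would retrieve this
# from the start URL, e.g. with urllib.request.urlopen().
sample_html = """
<html><body>
  <a href="https://example.com/page1">Page 1</a>
  <a href="/page2">Page 2</a>
</body></html>
"""

parser = LinkExtractor()
parser.feed(sample_html)
print(parser.links)  # the frontier for the next crawl step
```

Nothing is rendered or displayed here; the page exists only as parsed text, which is exactly the "headless" property the article describes.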

What makes headless crawling special is machine-to-machine (M2M) communication. Neither the called URLs nor the web content that has been found is displayed to an end user, as would be the case with a conventional browser. Instead, the headless browser forwards the retrieved data in formats that must be defined in advance but can be processed automatically later. If extensively implemented, a headless browser can usually handle different programming languages, scripts, and processes thanks to an API that communicates with other programs or infrastructures via HTTP requests or TCP. This principle is often used to extract large amounts of data, which ultimately raises the question of how legal it is to collect and process such data; in principle, copyright, privacy agreements, and the privacy of users could be violated.[4] The same applies to price comparison portals, search engines, and meta-search providers.
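A minimal sketch of this M2M hand-off: crawl results are serialized into a format agreed on in advance (here JSON) so that a downstream program can process them automatically, for example after receiving them via an HTTP request. The record structure below is a hypothetical example, not a standardized format:

```python
import json

# Hypothetical crawl results; the field names must be agreed on
# in advance so downstream programs can process them automatically.
crawl_results = [
    {"url": "https://example.com/", "status": 200,
     "links_found": ["https://example.com/about"]},
    {"url": "https://example.com/about", "status": 200,
     "links_found": []},
]

# Serialize to JSON -- a typical machine-readable exchange format
# that could be POSTed to a processing API over HTTP.
payload = json.dumps(crawl_results, indent=2)
print(payload)
```

Because the format is fixed beforehand, the receiving side can parse the payload without human involvement — the defining trait of M2M communication.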

Practical relevance

Headless crawling is applied not only by search engines but also in other use cases. Two examples:

  • Test automation: Testing websites, website elements, and functions is a common use of headless crawling. In this way, broken links, redirects, interactive elements, and individual components (units) and modules can be checked with regard to their function; performance characteristics and the generation of website content from databases can also be tested. With an extensive implementation, websites can be tested relatively comprehensively and, above all, automatically. Test scenarios that use headless crawling thus go far beyond merely testing a system for crashes, system errors, and unwanted behavior. Testing with headless crawling is similar to acceptance testing because the headless browser can simulate the behavior of a website from a user's perspective and, for example, click links.[5] However, profound programming and scripting skills are required for this scenario. Because testing is performed either at a customer's request or on a selected test object whose rights belong to the site owner, test automation with headless crawling is usually not objectionable. Well-known headless browsers with a framework (API, programming-language support, or DOM handling) include Selenium, PhantomJS, and HtmlUnit. Headless browsers usually use a layout engine that is also integrated into conventional browsers and search engine crawlers; examples of layout engines are WebKit, Gecko, and Trident.
  • Web scraping: Scraping is a crawling technique in which data is extracted and aggregated for further use; sometimes large amounts of data from one or more sources are collected, read, and processed. Scraping can be damaging and is classified as a black-hat or cracker technique in many usage scenarios. Denial-of-service (DoS) and distributed denial-of-service (DDoS) attacks use the principle of headless crawling to access a website or web application.[6] Illegal methods are often used as well, for example hiding the IP address (IP spoofing) to distract from the actual attack on the network, or infiltrating the communication between a server and several clients via TCP (hijacking).
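The broken-link check from the test-automation example above can be sketched as follows. The `find_broken_links` helper and the status map are hypothetical: in a real headless test run, the status of each link would come from an actual HTTP request issued by the headless browser, not from a lookup table:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class HrefCollector(HTMLParser):
    """Gathers every link target on a page, as a headless test
    runner would before verifying each one."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)

def find_broken_links(base_url, html, fetch_status):
    """Returns the absolute URLs whose (simulated) fetch did not
    return HTTP 200. fetch_status stands in for a real request."""
    collector = HrefCollector()
    collector.feed(html)
    absolute = [urljoin(base_url, h) for h in collector.hrefs]
    return [u for u in absolute if fetch_status(u) != 200]

# Hypothetical site under test, with one intact and one dead link.
page = '<a href="/ok">fine</a> <a href="/missing">broken</a>'
statuses = {"https://example.com/ok": 200,
            "https://example.com/missing": 404}

broken = find_broken_links("https://example.com/", page,
                           lambda u: statuses.get(u, 404))
print(broken)
```

Frameworks such as Selenium wrap this loop — collect elements, act on them, assert on the result — behind a browser-level API, so the test can also click links and run JavaScript rather than only parsing static HTML.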

Relevance to search engine optimization

Headless crawling is an important aspect of SEO. As already mentioned, the principle is (most likely) used by various search engines to crawl websites and web applications, even though the AJAX crawling scheme is now obsolete. At several points in its Quality Guidelines, Google recommends using a text-based browser such as Lynx to view websites the way Google sees them. It can be assumed that the capacities of Google and other search engines go far beyond text-based browsers and what is officially communicated. Accordingly, it makes sense to study headless crawling in detail: with this principle, websites can be tested thoroughly, and from this perspective SEOs can venture a look behind the scenes of the search engine operators without losing sight of the users.

References

  1. Deprecating our AJAX crawling scheme googlewebmastercentral.blogspot.de. Accessed on 01/26/2016
  2. Just How Smart Are Search Robots? moz.com. Accessed on 01/26/2016
  3. Design of a Crawler for Online Social Networks Analysis wseas.org. Accessed on 01/26/2016
  4. Is Web Scraping Illegal? Depends on What the Meaning of the Word Is Is. resources.com. Accessed on 01/26/2016
  5. Headless Functional Testing with Selenium and PhantomJS code.tutsplus.com. Accessed on 01/26/2016
  6. Headless browsers: legitimate software that enables attack itproportal.com. Accessed on 01/26/2016
