AJAX Crawling Scheme


The AJAX crawling scheme is a method by which Google and other search engines crawl websites that serve dynamically generated content. Google used this procedure from 2009 onward. On October 15, 2015, however, Google announced that the crawling scheme was no longer recommended and deprecated it. Instead, progressive enhancement and the capabilities of HTML5 (such as the History API) are meant to be used to make such content accessible to crawlers.


General information

AJAX-based web applications generate the content to be displayed through asynchronous data transfer between browser and server. Using JavaScript and the XMLHttpRequest object, HTTP requests can be sent during a browsing session to load content from the server or from a database without reloading the entire HTML page. Parts of the content and the user interface can therefore be updated by the browser without a complete new page load, as would be required with static HTML webpages. For example, the browser can retain data already entered into a form, or refresh values that need to stay current (such as dates).
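
As a minimal illustration, the following TypeScript sketch sends such an asynchronous request with XMLHttpRequest and updates a single page element without reloading the page. The endpoint /api/latest-dates and the element id dates are purely hypothetical.

    // Load a content fragment asynchronously and inject it into the page
    // without a full reload; only the targeted element changes.
    function refreshDates(): void {
      const xhr = new XMLHttpRequest();
      xhr.open("GET", "/api/latest-dates"); // asynchronous by default
      xhr.onload = () => {
        if (xhr.status === 200) {
          const target = document.getElementById("dates");
          if (target) {
            target.textContent = xhr.responseText; // update only this element
          }
        }
      };
      xhr.send(); // the rest of the page stays untouched
    }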

The connection between client and server is not interrupted. Typically, the user triggers the creation of dynamic content by clicking an element on the page. This action executes a script that sits between the HTTP communication of server and client and loads the selected content. The AJAX engine detects the call of the script (an asynchronous request) and sends a request to the server or a database to retrieve the content. The selected elements are then loaded into the page or executed dynamically by the script.
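
The same flow can be sketched from the user's perspective: a click triggers an asynchronous request, and the response is inserted into the existing page. The selectors and the fragment URL below are assumptions for the example; it uses the browser's fetch API in TypeScript.

    // A click on a page element triggers an asynchronous request; the
    // response is inserted into the page without any navigation.
    document.querySelector("#load-details")?.addEventListener("click", async (event) => {
      event.preventDefault(); // no full page load
      const response = await fetch("/fragments/details.html");
      if (response.ok) {
        const container = document.querySelector("#details");
        if (container) {
          container.innerHTML = await response.text(); // dynamically loaded content
        }
      }
    });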

How it works

The AJAX crawling scheme was designed to make dynamically generated content readable for crawlers, bots, and spiders. Because these programs, which continuously analyze the web, could not interpret dynamically generated content or execute scripts, the scheme stores an HTML snapshot of the current content on the server. The content thus exists in two versions: the dynamic page for users and a static HTML copy that even text-based crawlers can read. Several steps are necessary to prepare a site for the crawling scheme:[1]

  • The first step is to indicate on the website that the AJAX crawling scheme is supported. A conventional website might have the following URL:
http://www.my-domain.com/ajax.html#key=value

In this URL, the pound sign (#) marks the beginning of the hash fragment, i.e. everything that is generated to handle a dynamic state (usually attribute-value pairs or IDs). To opt in to the crawling scheme, an exclamation mark (!) is added directly after the # sign. A URL of this type is referred to as an AJAX URL, and the combination of # and ! is often called a hashbang:

http://www.my-domain.com/ajax.html#!key=value

This notation informs the crawler that the website supports the AJAX crawling scheme.

  • In the second step, a different URL format is derived for each dynamically generated URL, because the hash fragment of a URL is never sent to the server and the server must therefore be given a request it can map to the correct HTML snapshot. The crawler transmits this URL
    http://www.my-domain.com/ajax.html#!key=value
    in a different format:
    http://www.my-domain.com/ajax.html?_escaped_fragment_=key=value
    Only in this way does the server know that the crawler is requesting the content behind the URL
    http://www.my-domain.com/ajax.html#!key=value
    and that it must return an HTML snapshot. With the original URL format, the fragment would never reach the server and no crawlable content could be delivered (see the server-side sketch after this list).
  • HTML snapshots are created in the third step. An HTML snapshot is created and stored on the server for each dynamically generated URL. It is essentially a crawler-readable copy of the content that would otherwise be produced by executing JavaScript. Various options exist for this purpose, depending on the technology or scripting language used. Browsers without a user interface (headless browsers) such as HtmlUnit can be utilized, and tools like Crawljax or watij.com may also help to create an HTML snapshot. These options are particularly useful when a lot of content is generated by JavaScript. If server-side technologies such as PHP or ASP.NET are used, the existing source code can be used to generate the HTML on the server side or to replace the JavaScript elements with static code. The most widely used method, however, is to create a static HTML page offline for each AJAX URL.
  • In the fourth step, the search engine indexes the HTML snapshot for each URL, but not the dynamic content that it cannot read. In the SERPs, the AJAX URLs are displayed, i.e. the URLs with a hash fragment such as #!key=value.
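
The following sketch, written in TypeScript for Node.js, illustrates one possible server-side implementation of steps two and three. The snapshots/ directory, the file naming, the ajax.html entry page, and the port are assumptions for the example only; the scheme itself does not prescribe how snapshots are stored or served.

    import { createServer } from "node:http";
    import { readFile } from "node:fs/promises";

    // Hypothetical storage: one pre-generated snapshot file per hash fragment value.
    async function loadSnapshot(fragment: string): Promise<string | null> {
      try {
        return await readFile(`snapshots/${encodeURIComponent(fragment)}.html`, "utf8");
      } catch {
        return null; // no snapshot has been prepared for this state
      }
    }

    createServer(async (req, res) => {
      const url = new URL(req.url ?? "/", "http://www.my-domain.com");
      const fragment = url.searchParams.get("_escaped_fragment_");

      if (fragment !== null) {
        // The crawler requested ajax.html?_escaped_fragment_=key=value, which stands
        // for the AJAX URL ajax.html#!key=value, so the static HTML snapshot is returned.
        const snapshot = await loadSnapshot(fragment);
        res.writeHead(snapshot ? 200 : 404, { "Content-Type": "text/html" });
        res.end(snapshot ?? "Snapshot not found");
        return;
      }

      // Ordinary browsers receive the normal AJAX page; JavaScript then builds the
      // dynamic content client-side from the #! hash fragment.
      res.writeHead(200, { "Content-Type": "text/html" });
      res.end(await readFile("ajax.html", "utf8"));
    }).listen(8080);

A request such as http://www.my-domain.com/ajax.html?_escaped_fragment_=key=value would then receive the snapshot prepared for #!key=value, while regular visitors receive the normal AJAX page.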

Importance for search engine optimization

Google has been working for several years to enable the Googlebot to interpret JavaScript elements. The announcement that the AJAX crawling scheme is no longer recommended should be seen as a step forward in the crawler's ability to interpret scripts. The new recommendation is especially significant given that more and more content is responsive and device-dependent.

But many webmasters still use AJAX-based applications. The key points of the most frequently asked questions are summarized here:[2]

  • Older websites that use the AJAX crawling scheme will continue to be indexed by Google. As a rule, however, the crawler now requests the URL format with the #! hash fragment.
  • Moving a website away from the AJAX crawling scheme is not regarded as a site move or as cloaking. Nevertheless, the URL format with “_escaped_fragment_” should be avoided when implementing new web projects or relaunches.
  • If websites with a JavaScript framework are pre-rendered for the Googlebot, then the content for users should be pre-rendered as well in order to avoid cloaking.

Moreover, Google recommends progressive enhancement when creating websites. The basic idea is that the presentation of content in the browser should be independent of the technology used: certain features of a website are implemented in separate resources (JavaScript and CSS files), and these elements are executed depending on the client or device. In HTML5, for example, the History API can be used to manage URLs instead of hash fragments. This approach also requires that the crawler has access to all necessary files, because the Googlebot can now render websites and add them to the index even when JavaScript is used. It must therefore be ensured that JavaScript, CSS, and image files are not blocked by the robots.txt file. In general, accessibility should be ensured in accordance with Google's quality guidelines.[3]
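
As an illustration of this recommended approach, the following TypeScript sketch loads content asynchronously but uses the History API to expose clean, crawlable URLs instead of hash fragments. The paths and selectors are assumptions for the example and not part of any particular framework.

    // Load a section asynchronously and record a real URL in the address bar
    // via the History API, instead of encoding the state in a #! fragment.
    async function showSection(path: string): Promise<void> {
      const response = await fetch(path); // e.g. "/products/shoes" (hypothetical path)
      if (!response.ok) return;
      const main = document.querySelector("main");
      if (main) {
        main.innerHTML = await response.text();
        history.pushState({ path }, "", path); // clean URL, no hash fragment
      }
    }

    // Navigating back or forward restores the previously shown state.
    window.addEventListener("popstate", (event) => {
      const state = event.state as { path?: string } | null;
      if (state?.path) {
        void showSection(state.path);
      }
    });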

References

  1. AJAX Crawling (Deprecated). developers.google.com. Accessed on 06/11/2015
  2. Deprecating our AJAX crawling scheme. googlewebmastercentral-de.blogspot.de. Accessed on 11/06/2015
  3. Webmaster Guidelines. support.google.com. Accessed on 11/06/2015
