Scraping


Scraping usually refers to screen scraping or, more precisely, “web scraping.” In this practice, the content of websites is extracted, copied, and stored, either manually or with the aid of software, and, if necessary, reused in modified form on another website. Used constructively, web scraping can add value to a website by enriching it with content from other websites. If misused, however, scraping infringes copyright and is considered spam.

Techniques

Scraping can be done with different techniques. The most prevalent are briefly described here:

  • HTTP manipulation: the content of static or dynamic websites is copied by means of HTTP requests.
  • Data mining: different content is identified by the templates and scripts in which it is embedded. The content is converted using a wrapper and made available to a different website. The wrapper acts as a kind of interface between the two systems.
  • Scraping tools: these perform multiple scraping tasks, either automated or manually controlled. The results range from copied content to copied structures or functionality.
  • HTML parsers: like those used in browsers, they retrieve data from other websites and convert it for other purposes.
  • Manual copying: copying content by hand is often also referred to as scraping. It ranges from simple copying of text to copying entire source code snippets. Manual scraping is often used when scraping programs are blocked, for example by the robots.txt file.
  • Microformats: scanning for microformats is also part of scraping. With the continuing development of the semantic web, microformats have become popular components of websites.
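
As an illustration of the HTML parser and wrapper techniques listed above, the following is a minimal sketch in Python that uses only the standard library. The sample markup, the class name LinkAndTitleParser, and the function scrape are assumptions made for this example. A real scraper would first fetch the page with an HTTP request, which is left out here so that the sketch runs without network access.

```python
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would first fetch this markup over HTTP.
SAMPLE_HTML = """
<html>
  <head><title>Example Shop</title></head>
  <body>
    <h1>Blue Widget</h1>
    <p class="price">19.99 EUR</p>
    <a href="/widgets/blue">Details</a>
    <a href="https://example.com/contact">Contact</a>
  </body>
</html>
"""


class LinkAndTitleParser(HTMLParser):
    """Collects the page title and all link targets from an HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title> is kept; all other text is ignored.
        if self._in_title:
            self.title += data


def scrape(html_text):
    """Acts as a simple wrapper: turns raw HTML into a structured dict."""
    parser = LinkAndTitleParser()
    parser.feed(html_text)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}


if __name__ == "__main__":
    print(scrape(SAMPLE_HTML))
```

The function scrape plays the role of the wrapper: it hides the parsing details and hands the extracted data to another system in a neutral structure.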

Common applications

Scraping is used for many purposes. Here are just a few examples:

  • Web analytics tools: these retrieve rankings from Google and other search engines and prepare the data for their customers. This practice was heavily debated in 2012, when Google blocked some of these services.
  • RSS services: content provided via RSS feeds is reused on other websites.
  • Weather data: many websites, such as travel portals, use weather data from large weather websites to extend their own functionality.
  • Timetables and flight schedules: for example, Google uses data from public transport services to supplement the itinerary function in Google Maps.
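
The RSS case can be sketched in a few lines of Python. The feed content, the URLs, and the function name read_feed_items below are hypothetical and serve only as an illustration; the example assumes a plain RSS 2.0 feed and parses it with the standard library instead of downloading it.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed; a real service would download it from the publisher.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First headline</title>
      <link>https://example.com/news/1</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/news/2</link>
    </item>
  </channel>
</rss>
"""


def read_feed_items(feed_xml):
    """Returns the title and link of every <item> in the feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
        })
    return items


if __name__ == "__main__":
    for entry in read_feed_items(SAMPLE_FEED):
        print(entry["title"], "->", entry["link"])
```

Because a feed is published precisely so that it can be consumed by others, this kind of reuse is usually the least controversial form of scraping, provided the source is named.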

Scraping as a spam method

Within the context of content syndication, content from websites can be distributed to other publishers. Scraping, however, often disregards the rules that govern such syndication. Some websites consist solely of content that has been scraped from other websites. Pages containing information copied directly from Wikipedia without naming the source are very common. Another form of spam scraping occurs when online stores copy product descriptions from successful competitors, often keeping even the formatting unchanged.

It is important for webmasters to know whether their content is being copied by other websites, because in extreme cases Google may hold the original author responsible for the scraping, which could lead to the scraped domain being ranked lower in the SERPs. Alerts can be set up in Google Analytics to monitor whether content is being copied by other websites.

Google as scraper

Search engines such as Google use scraping to enhance their own content with relevant information from other sources. Google, in particular, uses scraping methods to populate its OneBox and to build the Knowledge Graph. Google also scrapes the web to add entries to Google Maps for businesses that have not yet claimed them. Moreover, Google collects relevant data from websites that mark up their content with microformats in order to create rich snippets.
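
How microformat data can be read from a page is sketched below. This is not Google's actual method, only an illustration under simplifying assumptions: the sample markup uses the microformats2 h-card vocabulary, the names HCardParser and extract_hcard are made up for this example, and the parser collects every "p-" property on the page without checking that it sits inside an h-card element.

```python
from html.parser import HTMLParser

# Hypothetical page fragment marked up with the microformats2 h-card vocabulary.
SAMPLE_HCARD_HTML = """
<div class="h-card">
  <span class="p-name">Example Bakery</span>
  <span class="p-street-address">1 Sample Street</span>
  <span class="p-tel">+49 000 000000</span>
</div>
"""

# Elements without a closing tag must not be pushed onto the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class HCardParser(HTMLParser):
    """Collects the text of elements whose class starts with 'p-'."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._stack = []  # one entry per open element: property name or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        prop = next((c[2:] for c in classes if c.startswith("p-")), None)
        self._stack.append(prop)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Text is stored only when the innermost open element is a property.
        if self._stack and self._stack[-1]:
            prop = self._stack[-1]
            self.properties[prop] = self.properties.get(prop, "") + data.strip()


def extract_hcard(html_text):
    parser = HCardParser()
    parser.feed(html_text)
    parser.close()
    return parser.properties


if __name__ == "__main__":
    print(extract_hcard(SAMPLE_HCARD_HTML))
```

Because the class names carry the meaning of each value, a scraper does not have to guess which text is the name and which is the phone number. This is what makes marked-up pages easy to turn into structured results.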

How to prevent scraping

There are several simple measures webmasters can use to protect their websites from scraping:

  • Blocking bots with the robots.txt file
  • Inserting CAPTCHA queries on the site
  • Using CSS to display phone numbers or email addresses
  • Tightening firewall rules for the server
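
The first measure can be illustrated with a short Python sketch. The robots.txt content, the bot names, and the function is_allowed are hypothetical. The example uses the standard library's urllib.robotparser to show how a well-behaved crawler would interpret the rules. Note that robots.txt is only a request: bots that ignore it have to be stopped by the other measures.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is blocked entirely,
# all other bots are only kept out of /private/.
SAMPLE_ROBOTS_TXT = """
User-agent: BadScraperBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Returns True if the given user agent may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("BadScraperBot", "FriendlyBot"):
        for url in ("https://example.com/products", "https://example.com/private/data"):
            print(agent, url, is_allowed(SAMPLE_ROBOTS_TXT, agent, url))
```

In this sketch BadScraperBot is refused everywhere, while any other bot may fetch /products but not /private/data.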
