Hadoop


Apache Hadoop is an open-source, scalable framework that enables the storage and processing of very large volumes of data, on the order of petabytes, in distributed computing systems, i.e. computer clusters. Companies such as Facebook, AOL, Adobe, Amazon, Yahoo, Baidu, and IBM use it for computation-intensive tasks in the most diverse areas, for example web analytics, behavioral targeting, and various big data problems. It is above all the problems posed by big data that Hadoop is intended to solve.

General information

Hadoop is based on Java and is maintained and developed by the Apache Software Foundation as a top-level project. The framework is currently available in version 2.7.0. As a Java-based framework, it can run on standard hardware rather than on expensive SAN solutions (storage area networks). The larger the amount of data, the more difficult it is to process. The approach taken by Hadoop therefore differs fundamentally from approaches that try to process big data on a single classic high-performance computer: Hadoop breaks the data down and distributes the work across the cluster.

Hadoop was developed by Doug Cutting and Mike Cafarella in 2004 and 2005 for the Nutch search engine. The background to the Hadoop project was the publication of a paper by Google presenting the MapReduce algorithm, which enables the simultaneous processing of large amounts of data by distributing the computing load across clusters of computers. Google's file system also plays a central role, since the parallel processing of data requires special file management for storage. Cutting recognized the importance of these developments and used the findings as inspiration for the Hadoop technology.[1]

Operation

The system architecture of Hadoop consists of four core modules that are interlinked and work together in the processing and storage of data, complemented by a range of optional projects that build on them.

  • HDFS: One of these modules is the Hadoop Distributed File System (HDFS), which is a prerequisite for the MapReduce algorithm, because distributed computing systems require special file management to make optimal use of all resources (a short usage sketch follows this list).
  • MapReduce: MapReduce, originally described by Google, is one of the central components of Hadoop and handles the distributed computation. It consists of a JobTracker, which delegates tasks to available resources, and TaskTrackers, which accept the tasks and run them on the nodes. A TaskTracker is usually placed as close as possible to where the data to be processed is stored, thus reducing the network load in the cluster. Each node is a computing unit, typically a conventional Linux-based computer. A special feature of MapReduce is the organization of the JobTracker and the TaskTrackers: the JobTracker acts as the master, the TaskTrackers as slaves. The master delegates tasks; the slaves process them (see the word-count sketch after this list).
  • YARN: The latest version of MapReduce, also called YARN or MapReduce 2.0 (MRv2), provides improved resource and task management. With YARN (Yet Another Resource Negotiator), the responsibilities of the JobTracker are split between two components: the ResourceManager (RM), which is responsible for the global coordination of cluster resources, and the ApplicationMaster (AM), which governs the tasks of an individual application.
  • Hadoop Common: The aforementioned core Hadoop modules are complemented by a set of shared utilities. Hadoop Common contains the libraries and supporting tools on which HDFS, MapReduce, and YARN all rely.
  • Hadoop Projects: Depending on the requirements of the big data problem, different projects can be used in addition. For example, Hive extends Hadoop with functionality for use as a data warehouse. Ambari, Avro, Cassandra, Chukwa, HBase, Mahout, Pig, Spark, Tez, and ZooKeeper extend Hadoop with special features and applications. All of these projects are maintained by the Apache Software Foundation and are available under Apache licenses. Proprietary extensions developed by vendors such as Oracle and IBM also exist: Oracle's Big Data SQL extends the Hadoop framework to handle queries against relational SQL databases, and with IBM's InfoSphere BigInsights, clusters can be used by multiple clients.
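
The following is a minimal sketch of how an application can read and write files in HDFS through Hadoop's Java FileSystem API. The NameNode address and the paths used here are placeholder assumptions, not part of any actual cluster configuration.

    // Minimal HDFS read/write sketch using org.apache.hadoop.fs.FileSystem.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; normally taken from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:9000");
            FileSystem fs = FileSystem.get(conf);

            // Write a small file; HDFS splits files into blocks and replicates
            // them across DataNodes according to the replication factor.
            Path file = new Path("/user/demo/hello.txt");
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }

            // List the directory, similar to "hdfs dfs -ls /user/demo".
            for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }

            fs.close();
        }
    }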
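
The division of work between the map and the reduce phase can be illustrated with the classic word-count job. The sketch below uses Hadoop's standard org.apache.hadoop.mapreduce Java API; the class name and the input and output paths are illustrative, not taken from any particular deployment.

    // Word count: the map phase emits (word, 1) pairs, the reduce phase sums them.
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every word in the input split.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The framework runs many mapper instances in parallel across the nodes that hold the input blocks, shuffles the intermediate pairs by key, and passes each key with its values to a reducer, which is the behavior the JobTracker/TaskTracker (or, under YARN, the ResourceManager/ApplicationMaster) coordinates.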

Examples

Facebook uses Hadoop to store copies of internal reports and data sources that exist in dimensional data warehouses. The company uses this data as a source for reporting and for machine-learning-based analysis. With two large clusters of 12 petabytes and 3 petabytes, Facebook is a heavy user of Hadoop, similar to eBay, Yahoo, and Twitter.[2]

Relevance to search engine optimization

Hadoop is a modern big data environment that is suitable for many applications and can be extended with modules. Whether for data mining, information retrieval, business intelligence, or predictive modeling, Hadoop can deliver reliable results with very large amounts of data and automate computing processes. Originally, Hadoop was designed as part of a search engine: Nutch was the web crawler, and Hadoop was later split off into a separate project together with the Nutch infrastructure, index, and databases, because the far-reaching benefits of the framework became clear to the people involved during the course of the project.[3] Such benefits include web analytics, cross-selling in online shops, and ad placement based on user behavior in affiliate networks.

Due to its properties, Hadoop is not recommended for small projects. While Hadoop is quite flexible thanks to its distributed computing model, a certain amount of data should exist to justify the development effort, because Hadoop is an environment with multiple libraries, programs, and applications, and these must be adapted to the specific requirements.[4]

References

  1. Hadoop, a Free Software Program, Finds Uses Beyond Search. nytimes.com. Accessed on 04/27/2015
  2. Hadoop PoweredBy. wiki.apache.org. Accessed on 04/27/2015
  3. Hadoop: What it is and why it matters. sas.com. Accessed on 04/27/2015
  4. It's Complicated: Why The Hadoop Skills Gap Will Widen in 2015. forbes.com. Accessed on 04/27/2015