Apache Hadoop is an open-source, scalable framework that enables the storage and processing of large volumes of data, on the order of petabytes, in distributed computing systems, i.e. computer clusters. Companies such as Facebook, AOL, Adobe, Amazon, Yahoo, Baidu, and IBM use it to perform computation-intensive tasks in the most diverse areas, for example web analytics, behavioral targeting, and various big data problems. Indeed, Hadoop is designed precisely to address the challenges posed by big data.
Hadoop is based on Java and is maintained and developed by the Apache Software Foundation as a top-level project. The framework is currently available in version 2.7.0. As a Java-based framework, it can run on standard commodity hardware rather than requiring expensive SAN (storage area network) solutions. The larger the amount of data, the more difficult it is to process. Hadoop's approach is therefore fundamentally different from approaches that try to process big data on a single classic high-performance computer: Hadoop breaks the data down and distributes the work across the cluster.
Hadoop was developed by Doug Cutting and Mike Cafarella in 2004 and 2005 for the search engine Nutch. The background for the Hadoop project was the publication of a paper by Google presenting the MapReduce algorithm, which enables the simultaneous processing of large amounts of data by distributing the work across clusters of computers. Google's file system plays a central role as well, since the parallel processing of data requires special file management on the storage side. Cutting recognized the importance of these developments and used the findings as inspiration for the Hadoop technology.
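The MapReduce idea can be illustrated with the classic word-count example. The following is a minimal in-memory sketch in plain Java of the three phases, map, shuffle, and reduce; it deliberately does not use Hadoop's actual Mapper/Reducer API, and all class and method names here are illustrative:

```java
import java.util.*;
import java.util.stream.*;

// Illustrative sketch of the MapReduce pattern (not Hadoop's real API):
// map each line to (word, 1) pairs, shuffle (group) by key, reduce by summing.
public class MapReduceSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Reduce phase: sum all counts grouped under one key.
    static int reduce(List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        // Shuffle phase: group all emitted pairs by key (the word).
        // On a real cluster, the map calls below would run in parallel on many nodes.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        // Reduce phase: one reducer call per distinct key.
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((word, counts) -> result.put(word, reduce(counts)));
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("big data", "big clusters process big data");
        System.out.println(wordCount(lines)); // prints {big=3, clusters=1, data=2, process=1}
    }
}
```

Because each map call depends only on its own input line, the map phase parallelizes naturally across machines, which is exactly what makes the pattern attractive for cluster computing.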
The system architecture of Hadoop consists of four interlinked major modules that work together in the processing and storage of data: Hadoop Common (the shared utilities), the Hadoop Distributed File System (HDFS), YARN (resource management and job scheduling), and MapReduce (parallel data processing).
Facebook uses Hadoop to store copies of internal reports and data sources that exist in dimensional data warehouses, and draws on this data for reporting, evaluation, and machine learning. With two large clusters of 12 petabytes and 3 petabytes, Facebook is a heavy user of Hadoop, as are eBay, Yahoo, and Twitter.
Hadoop is a modern big data environment that is suitable for many applications and can be extended with modules. Whether for data mining, information retrieval, business intelligence, or predictive modelling, Hadoop can provide reliable results even with very large amounts of data and automate computing processes. Originally, Hadoop grew out of a search engine: Nutch was the web crawler, and Hadoop was later spun off into a separate project along with the Nutch infrastructure, index, and databases, because the far-reaching benefits of the framework became clear to the people involved during the course of the project. Such applications include web analytics, cross-selling in online shops, and ad placement based on user behavior in affiliate networks.
Due to these properties, Hadoop is not recommended for small projects. While Hadoop is quite flexible because of its distributed computing model, a certain amount of data should exist to justify the development effort: Hadoop is an environment with multiple libraries, programs, and applications, all of which must be adapted to a project's specific requirements.