Apache™ Hadoop is an open source project governed by the Apache Software Foundation (ASF) that allows you to gain insight from massive amounts of structured and unstructured data quickly and without significant investment.
Hadoop is designed to run on commodity hardware and can scale up or down without system interruption. It consists of three main functions: storage, processing and resource management.
- Processing – MapReduce
Computation in Hadoop is based on the MapReduce paradigm, which distributes tasks across a cluster of coordinated “nodes.” Like Hadoop as a whole, it was designed to run on commodity hardware and to scale up or down without system interruption.
- Storage – HDFS
Storage is accomplished with the Hadoop Distributed File System (HDFS) – a reliable and distributed file system that allows large volumes of data to be stored and rapidly accessed across large clusters of commodity servers.
- Resource Management – YARN (New in Hadoop 2.0)
YARN performs the resource management function in Hadoop 2.0 and extends MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models. The YARN-based architecture of Hadoop 2.0 is the most significant change introduced to the Hadoop project.
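The MapReduce paradigm described above can be illustrated with a small, single-process sketch. This is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the cluster's "nodes" are simulated by an ordinary loop. It only shows the shape of the computation: map each input to key/value pairs, group the pairs by key, then reduce each group.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # On a real cluster, different input splits would be mapped on different
    # nodes; here that distribution is simulated by a plain loop.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both phases can in principle be spread across many machines, which is what lets the model scale.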
A Hadoop Distribution
A number of supporting ASF projects enable the integration of core Apache Hadoop into a data center environment. Typically, these projects are packaged into a Hadoop “distribution”, which is a tested and hardened set of projects that simplifies a Hadoop implementation.
The distribution package is crucial because it ensures version compatibility among projects and, more importantly, is typically subjected to significant testing to ensure that it is reliable and stable.
The Ecosystem of Hadoop-Related Projects
There are numerous ASF projects included in a distribution. Each of them has been developed to deliver an explicit function, and each has its own community of developers and its own release cycle. Below is an outline of the Hadoop-related Apache projects.
- Apache Pig
A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs, paired with the MapReduce framework for processing these programs.
- Apache HCatalog
A table and metadata management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.
- Apache Hive
Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad-hoc queries via an SQL-like interface for large datasets stored in HDFS.
- Apache HBase
A column-oriented NoSQL data storage system that provides random real-time read/write access to big data for user applications.
- Apache ZooKeeper
A highly available system for coordinating distributed processes. Distributed applications use ZooKeeper to store and mediate updates to important configuration information.
- Apache Ambari
An open source installation lifecycle management, administration and monitoring system for Apache Hadoop clusters.
- Apache Sqoop
Sqoop is a tool that speeds and eases movement of data in and out of Hadoop. It provides a reliable parallel load for various popular enterprise data sources.
- Apache Oozie
Oozie is a Java web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work.
- Apache Mahout
Mahout provides scalable machine learning algorithms for Hadoop that support data science work in clustering, classification and batch-based collaborative filtering.
- Apache Flume
Flume allows you to efficiently aggregate and move large amounts of log data from many different sources to Hadoop.
The future of Hadoop is almost here. Other crucial projects to deliver Hadoop 2.0 include:
- Apache YARN
Part of the core Hadoop project, Apache YARN is a next-generation framework for Hadoop data processing that extends MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models.
- Apache Tez
Apache Tez generalizes the MapReduce paradigm to a more powerful framework for executing a complex DAG (directed acyclic graph) of tasks for near real-time big data processing.
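To make the idea of executing a DAG of tasks more concrete, here is a minimal sketch in plain Python. It does not use the Tez API. The helper `run_dag` and the task names (`load_a`, `load_b`, `join`, `total`) are hypothetical, and the ordering relies on the standard library's `graphlib` module (Python 3.9 or later). Each task runs only after every task it depends on has finished, and it receives their results as inputs.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, dependencies):
    """Run every task after all of its dependencies; return results by task name.

    tasks:        maps a task name to a callable
    dependencies: maps a task name to the list of task names it depends on
    """
    results = {}
    # static_order() yields each node only after all of its predecessors.
    for name in TopologicalSorter(dependencies).static_order():
        inputs = [results[dep] for dep in dependencies.get(name, ())]
        results[name] = tasks[name](*inputs)
    return results


# Hypothetical four-task graph: two independent loads feed a join,
# and the join feeds a final aggregation.
tasks = {
    "load_a": lambda: [1, 2],
    "load_b": lambda: [3],
    "join": lambda a, b: a + b,
    "total": lambda rows: sum(rows),
}
dependencies = {
    "join": ["load_a", "load_b"],
    "total": ["join"],
}

results = run_dag(tasks, dependencies)
print(results["total"])
```

A MapReduce job is the special case of a two-stage graph (map, then reduce). A general DAG lets a multi-stage computation be expressed as one job instead of a chain of separate ones.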