2 minute read

Overview

Hadoop provides a software framework for distributed data storage (HDFS) & processing (MapReduce). Data and processing jobs are distributed across a large network or ‘cluster’ of computers.

All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

The term Hadoop is often used to refer not only to the base modules and sub-modules but also to the ecosystem: the collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm.

The Hadoop framework itself is mostly written in Java, with some native code in C and command line utilities written as shell scripts. Hadoop requires Java Runtime Environment (JRE) 1.6 or higher. The standard startup and shutdown scripts require that Secure Shell (SSH) be set up between nodes in the cluster.

The base Apache Hadoop framework is composed of the following modules:

Hadoop Common

Contains libraries and utilities needed by other Hadoop modules.

Hadoop Distributed File System (HDFS)

A distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

HDFS has five services as follows:

  1. Name Node - A master node that tracks files and manages the file system. It holds details such as the number of blocks, which Data Nodes hold the data, where the replicas are stored, and other metadata.

  2. Secondary Name Node - Stores checkpoints of the file system metadata held by the Name Node. These checkpoints can be used to rebuild a Name Node if it fails.

  3. Job Tracker - Receives requests to run MapReduce jobs and communicates with the Name Node to find the location of the data to be processed.

  4. Data Node - Stores the actual data.

  5. Task Tracker - Takes commands and code from the Job Tracker and applies them to the data.

The first three are master services/daemons/nodes and the last two are slave services.
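
To illustrate how a client interacts with these services, here is a minimal sketch using the HDFS Java API: the Name Node resolves paths to block locations, while Data Nodes store and serve the bytes. The file system address and path below are placeholders, not part of any real cluster.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address: point this at your cluster's Name Node.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/tmp/hello.txt");

            // Write: the Name Node allocates blocks, Data Nodes store the bytes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the Name Node returns block locations, data streams from Data Nodes.
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```

Note that the client never talks to the Secondary Name Node; it only checkpoints metadata in the background.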

Hadoop YARN

YARN is the addition that took Hadoop from version 1 to version 2. It strives to allocate cluster resources to applications effectively and can be used to govern, maintain, monitor and schedule jobs of many shapes and sizes (not just MapReduce).

YARN runs two daemons that take care of two different tasks: the Resource Manager, which handles job tracking and resource allocation to applications, and the Application Master, which monitors the progress of execution.
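
As a rough sketch of the Resource Manager's role (assuming a reachable cluster and the standard YARN client libraries on the classpath), the snippet below asks it for the list of applications it is currently tracking:

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        // Reads yarn-site.xml from the classpath to locate the Resource Manager.
        YarnConfiguration conf = new YarnConfiguration();

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // The Resource Manager tracks every application running on the cluster.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.printf("%s %s %s%n",
                    app.getApplicationId(), app.getName(), app.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}
```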

Hadoop v2 also introduced caching, which can significantly increase speed, especially with smaller datasets.

Hadoop MapReduce

An implementation of the MapReduce programming model for large-scale data processing.
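
The classic word-count job is a minimal sketch of the model: the map function emits a (word, 1) pair for every word it sees, and the reduce function sums the counts for each word. The input and output paths are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for each input line, emit (word, 1) for every word it contains.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum all the counts emitted for the same word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Illustrative HDFS paths; replace with real input/output locations.
        FileInputFormat.addInputPath(job, new Path("/input"));
        FileOutputFormat.setOutputPath(job, new Path("/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

When submitted to a Hadoop v2 cluster, waitForCompletion hands the job to YARN, which schedules the map and reduce tasks across the worker nodes.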

Architecture

A small Hadoop cluster includes a single master and multiple worker nodes. The master node consists of a Job Tracker, Task Tracker, NameNode, and DataNode. A slave or worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only and compute-only worker nodes.

In a larger cluster, HDFS nodes are managed through a dedicated NameNode server that hosts the file system index, plus a secondary NameNode that can generate snapshots of the NameNode's memory structures, thereby preventing file-system corruption and loss of data.
