This article talks about the fundamentals of Apache Hadoop software library and covers various components of Apache Hadoop infrastructure such as Hadoop Distributed File System, MapReduce and YARN.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
Apache Hadoop software library profile contains the following components, each of which can be utilized independently by an application:
- Hadoop Distributed File System (HDFS)
- Hadoop MapReduce
- Hadoop YARN
HDFS is a distributed, scalable, fault tolerant and portable file system written in Java for the Apache Hadoop framework. It is well suited for distributed storage using commodity hardware. Apache Hadoop provides shell-like commands to interact with HDFS directly.
Some of other distributed file systems are FTP file system, Amazon S3 file system and Windows Azure Storage Blobs (WASB). However none of these file systems are rack-aware which enables processing architecture to allocate the data stored near to them and thus provides increased performance.
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
Here is the diagram showing the HDFS architecture -
Before coming to Hadoop MapReduce, let's first get to MapReduce to get better understanding of it.
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model.
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
Below diagram shows how MapReduce could be utilized to count the word frequencies across various documents:
Moving on to Hadoop MapReduce, it internally uses MapReduce programming model to enable processing of vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousand of nodes) of commodity hardware in a reliable, fault-tolerance manner.
A Hadoop MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.
Applications specify the input/output locations along with map and reduce functions via implementations of appropriate interfaces and/or abstract-classes. These, and other job parameters, comprise the job configuration.
Hadoop YARN, also known as MapReduce v2, is second and improved version of Apache MapReduce framework. In Apache MapReduce cluster, there used to be a JobTracker node which was responsible for scheduling/tracking map, reduce jobs along with managing the slave nodes. This used to pose problems in case of thousand of nodes in a cluster, as JobTracker used to be overloaded with either one of two tasks resulting into the performance degradation.
YARN was designed to segregare two tasks of scheduling/tracking the jobs and resource management using separate daemons. YARN does it by having a ResourceManager which acts as Master node and per-slave node NodeManager responsible for containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.
YARN ResourceManager is only responsible for managing the resources and scheduling the tasks. However it is no longer responsible for tracking the jobs status or restarting the failed jobs as it is the responsibility of ApplicationMaster.
There is one more component called ApplicationMaster that is responsible for accepting job-submissions, negotiating the first container for executing the application specific ApplicationMaster and provides the service for restarting the ApplicationMaster container on failure.
Below diagram shows how ResourceManager, NodeManager and ApplicationMaster are placed into a cluster:
Thank you for reading through the tutorial. In case of any feedback/questions/concerns, you can communicate same to us through your comments and we shall get back to you as soon as possible.