Hadoop Architecture is divided into 2 core layers: one for storage and the other for the programming, or computational, part of Hadoop. The first is a framework written in Java that allows the system to store the various forms of data generated at a very quick pace, collectively called Big Data; the second is the programming engine of Hadoop, which gives the user control to access the data and perform analysis on it. As discussed in my previous article “What is Hadoop?”, Hadoop consists of 2 layers.
- Hadoop Distributed File System (HDFS): HDFS is responsible for the storage in Hadoop.
- MapReduce (Execution Engine): MapReduce is the execution engine responsible for processing large datasets in a distributed environment.
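To make the MapReduce layer concrete, here is a minimal sketch of its idea, word count, the canonical MapReduce example, simulated in plain Java with no Hadoop dependencies. Real jobs implement Hadoop's `Mapper` and `Reducer` classes instead; this only illustrates the map / shuffle / reduce flow.

```java
import java.util.*;
import java.util.stream.*;

// Word count, the canonical MapReduce example, simulated in plain Java.
// Map phase: emit one word per occurrence. Shuffle: group equal words.
// Reduce phase: count the occurrences in each group.
public class WordCountSketch {
    public static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                // map: split each input line into words
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(w -> !w.isEmpty())
                // shuffle + reduce: group identical words and count them
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        // "big" appears twice, so its count is 2
        System.out.println(wordCount(List.of("big data big cluster")));
    }
}
```

In a real cluster the map phase runs in parallel on the nodes where the data blocks live, and the shuffle moves intermediate pairs across the network; here everything runs in one JVM purely for illustration.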
Because Hadoop's architecture is so large, I am writing these articles as a series of references. As part of it, we will discuss the core architecture of Hadoop in this article.
Master – Slave Architecture:
HDFS has a Master – Slave architecture. This is a type of architecture in computer networking where communication is established between 2 or more devices and one device controls the others. In this kind of system, once the relationship is established, control always flows from the master devices to the slave devices. Hadoop follows the same pattern.
The image below shows the Master – Slave architecture of Hadoop, with the nodes differentiated to give an understanding of which are masters and which are slaves.
Hadoop comprises 5 daemons. These processes are always available and each runs on its own JVM. They are segregated into two parts, master and slave.

Master daemons:
- Name Node
- Secondary Name Node
- Job Tracker

Slave daemons:
- Data Node
- Task Tracker
What is a Daemon?
A daemon is a background process or thread that is always available to service requests. The term comes from UNIX / Linux; one familiar example of a daemon thread is the Java garbage collector. In Hadoop, each daemon is a never-ending process that runs on its own JVM.
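The garbage-collector analogy can be shown directly in Java: a thread marked as a daemon runs in the background and never blocks JVM shutdown. This is a minimal sketch of that mechanism, not Hadoop code; the `startHeartbeat` name is made up for illustration.

```java
// Sketch: a daemon thread in Java, analogous to the garbage collector.
// The JVM exits as soon as only daemon threads remain, so the endless
// loop below never prevents shutdown.
public class DaemonSketch {
    public static Thread startHeartbeat() {
        Thread t = new Thread(() -> {
            while (true) {
                // a real Hadoop daemon would service requests here
                try { Thread.sleep(1000); } catch (InterruptedException e) { return; }
            }
        });
        t.setDaemon(true); // must be set before start(): marks it a daemon
        t.start();
        return t;
    }

    public static void main(String[] args) {
        Thread hb = startHeartbeat();
        System.out.println(hb.isDaemon()); // prints "true"
        // main ends here; the JVM exits despite the infinite loop above
    }
}
```

Note the difference in scale: a Java daemon thread lives inside one JVM, whereas each Hadoop daemon is a whole JVM process that stays up for the life of the cluster.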
Each of the above daemons has its own importance in the functioning of Hadoop as a system; Hadoop would never have become successful without every one of them. Each item in the above list is briefly explained below.
Name Node:
Name Node is the master of the HDFS file system. It maintains the file system namespace and the metadata of the blocks in which the data is stored, along with their locations. It is the most essential part of Hadoop. It is also a single point of failure: if it crashes, the whole Hadoop cluster goes down. The Name Node stores its metadata in two files, FSImage and EditLog.
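The two mappings the Name Node keeps in memory can be sketched in a few lines of plain Java. This is a hypothetical illustration, not Hadoop's actual classes: `NameNodeMetadata`, `blockReport`, and `locate` are made-up names standing in for which blocks make up each file, and which Data Nodes hold a replica of each block.

```java
import java.util.*;

// Hypothetical sketch (not Hadoop's real API) of the in-memory metadata
// a Name Node maintains: file -> block IDs, and block ID -> replica hosts.
public class NameNodeMetadata {
    private final Map<String, List<Long>> fileToBlocks = new HashMap<>();
    private final Map<Long, Set<String>> blockLocations = new HashMap<>();

    // Record a file's block layout in the namespace
    // (persisted via FSImage and EditLog in real HDFS).
    public void addFile(String path, List<Long> blockIds) {
        fileToBlocks.put(path, blockIds);
    }

    // A block report from a Data Node: "I hold a replica of this block."
    public void blockReport(String dataNode, long blockId) {
        blockLocations.computeIfAbsent(blockId, k -> new HashSet<>()).add(dataNode);
    }

    // Answer a client's question: which Data Nodes serve this block?
    public Set<String> locate(long blockId) {
        return blockLocations.getOrDefault(blockId, Set.of());
    }

    public static void main(String[] args) {
        NameNodeMetadata nn = new NameNodeMetadata();
        nn.addFile("/logs/web.log", List.of(1L, 2L));
        nn.blockReport("datanode-1", 1L);
        nn.blockReport("datanode-2", 1L);
        System.out.println(nn.locate(1L)); // the replicas of block 1
    }
}
```

The key point the sketch captures is that the actual file contents never pass through the Name Node; it answers only the "where" question, and clients then read blocks directly from the Data Nodes.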
Secondary Name Node (Hadoop 1.0) / Checkpoint Node (Hadoop 2.0):
The Secondary Name Node does the housekeeping activities of the Name Node: it keeps checkpoint updates of the Name Node's files. It is a poorly named component; despite what the name suggests, it is not a complete backup of the Name Node. The name Secondary Name Node is deprecated; in Hadoop 2.0 it is called the Checkpoint Node.
Job Tracker:
Job Tracker is the master service that runs in the background on the master node, alongside the Name Node. It is responsible for breaking MapReduce jobs into pieces, deploying those pieces to the slave nodes, and tracking their execution status along with the health of the Task Trackers on the slave / data nodes.
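The "break a job into pieces and hand them to slaves" step can be sketched as a tiny scheduler. This is a hypothetical illustration, not Hadoop's scheduler: the real Job Tracker also weighs data locality and Task Tracker heartbeats, while this sketch just assigns task numbers round-robin.

```java
import java.util.*;

// Hypothetical sketch of the Job Tracker's core scheduling idea:
// split a job into numbered tasks and spread them across slave nodes.
public class JobTrackerSketch {
    public static Map<String, List<Integer>> assign(int numTasks, List<String> slaves) {
        Map<String, List<Integer>> plan = new LinkedHashMap<>();
        for (String s : slaves) plan.put(s, new ArrayList<>());
        for (int task = 0; task < numTasks; task++) {
            // simple round-robin; real Hadoop prefers nodes holding the data
            String node = slaves.get(task % slaves.size());
            plan.get(node).add(task);
        }
        return plan;
    }

    public static void main(String[] args) {
        // 5 tasks over 2 slaves: slave-1 gets tasks 0,2,4 and slave-2 gets 1,3
        System.out.println(assign(5, List.of("slave-1", "slave-2")));
    }
}
```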
Data Node:
Data Node is the slave daemon, which performs the tasks assigned by the Name Node. The actual data is stored on the data nodes; the Name Node stores only the metadata, which it learns through block reports sent by the data nodes.
Task Tracker:
Task Tracker is the process that actually runs the MapReduce tasks on the data nodes. It receives a piece of work from the Job Tracker, executes it, and continuously updates its status back to the Job Tracker.
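That receive–execute–report loop can be sketched as follows. This is again a hypothetical illustration, not Hadoop code: `runTask` and the status strings are made up, standing in for the heartbeat messages a Task Tracker sends to the Job Tracker.

```java
import java.util.*;
import java.util.function.Supplier;

// Hypothetical sketch of the Task Tracker loop: run an assigned piece
// of work and report status transitions back, as a Task Tracker does
// via heartbeats to the Job Tracker.
public class TaskTrackerSketch {
    public static String runTask(String taskId, Supplier<String> work, List<String> statusLog) {
        statusLog.add(taskId + ":RUNNING");
        try {
            String result = work.get();          // execute the assigned code
            statusLog.add(taskId + ":SUCCEEDED");
            return result;
        } catch (RuntimeException e) {
            statusLog.add(taskId + ":FAILED");   // the master would reschedule
            throw e;
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        runTask("task_001", () -> "done", log);
        System.out.println(log); // RUNNING, then SUCCEEDED
    }
}
```

Reporting failures as well as successes is the important part: it is what lets the master reschedule a failed piece of work on another healthy slave.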
Since the last two processes execute on the command of the master daemons, the nodes hosting them are called the slave nodes.
Thank you so much for reading our articles; my next article will be on the key features of HDFS.