Hadoop is the buzzword heard everywhere on the World Wide Web. The moment you open your social networking profile, there are countless ads from educational and training institutes about their expertise in teaching you Hadoop. One might wonder whether it is just hype circulating around the web. In this series of articles, we will clear up the doubts about Hadoop, covering its technical aspects and the uncertainties stopping you from learning it.
Refer to my article on Big Data for a better understanding.
Hadoop definitions on the web are very technical:
Hadoop is an open-source framework from the Apache Software Foundation that can process data without any schema limitations. It can run on clusters of thousands of machines (several computers connected together) with high throughput.
Yes, it is definitely painful to understand on a first read. Most articles on the web are aimed at readers who are already familiar with it. So I decided to simplify the terms in my own way.
Demystifying the technical keywords:
Open Source Software: Software developed by an enthusiastic group of programmers and made available on the web for free, along with its source code. Anyone can use it, rebuild it from the available source code, or improve it.
Framework: A framework is a set of commonly used programs bundled together when designing a technology, to make developers' lives easier. It allows development to be done quickly, as one doesn't have to write the low-level functions and can concentrate on the logic.
Hadoop Cluster: A collection of systems connected to each other over a shared network, forming a cluster.
How does Hadoop work?
Hadoop is free, open-source software that can process huge amounts of data, also called Big Data. Big Data is data so large that it cannot be handled by traditional databases or off-the-shelf software. Maintaining such data was causing organizations a lot of trouble, and there was a need for a more sophisticated tool that could do the trick. Hadoop was created to solve this problem. Hence we can define the relation between Hadoop and Big Data as below.
“Big Data is a problem. Hadoop is the solution”
Factors that made Hadoop the Darling of Data Nerds:
Hadoop is a complex software system built on a simple concept. It is just like a work-rate problem from a math aptitude test. If you can answer the question below, then you already know the way Hadoop works.
If a person takes 10 hours to build a wall, the time taken by 10 people to do the same work is ….. ?
Answer: 1 hour
It is very simple, isn’t it? How does Hadoop make use of this principle?
Yes, Hadoop does the same. If a single machine takes 1 hour to process 1 TB of data in a traditional database system, Hadoop uses a network of systems joined together into a cluster and distributes the workload among them to achieve the result at a much faster pace. These systems share the workload among themselves, which results in faster execution.
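The wall-building principle can be sketched in plain Python. This is only an illustration of divide-and-conquer on a single machine, not Hadoop itself, and the function names are my own: split the data into chunks, let several workers process the chunks at the same time, and add up the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    # Each "worker" counts the words in its own slice of the data.
    return sum(len(line.split()) for line in chunk)

def distributed_count(lines, workers=4):
    # Split the data into roughly equal chunks, one per worker.
    size = max(1, len(lines) // workers)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    # Threads stand in for cluster machines in this toy example.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(count_words, chunks))
    # Combine the partial results into the final answer.
    return sum(partial_results)
```

The distributed answer is the same as the serial one; the only difference is that the chunks are processed side by side, just as the wall is built by ten people instead of one.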
Motivation for Hadoop:
Hadoop is based on two white papers published by Google: one describing the Google File System in 2003 and one describing MapReduce in 2004. As the World Wide Web grew rapidly in the 1990s, there was a need to automate search, which had earlier been curated by humans. This resulted in the creation of web crawlers and search engines like Yahoo!.
One such project was Nutch, an open-source search engine developed by Doug Cutting and Mike Cafarella. Later, Doug Cutting moved to Yahoo! and developed the Hadoop project based on his previous work on Nutch. Hadoop was then released as an open-source project, and the Apache Software Foundation now maintains Hadoop and its supporting projects, which will be explained in further articles.
Doug Cutting named the project Hadoop after his son's stuffed toy elephant. You can see Doug Cutting, the father of Hadoop, with his son's stuffed toy (Hadoop) in the image below.
Factors that make Hadoop stand out from the crowd:
Distributed Computing: The Hadoop framework divides the data and stores it across multiple computers. Computations run in parallel across all connected systems.
Scalability: Scalability is Hadoop's ability to scale out to a large number of systems and process huge volumes of data without any difficulty.
Schema-Less: Hadoop accepts all formats of data: structured, semi-structured, and unstructured. It also provides the environment to analyze all of these data types.
Fault-Tolerant: Computers tend to fail; they are machines after all. However, Hadoop doesn't give a damn about system failure. Why should we worry about a machine failing when Hadoop is equipped with a superlative fault-tolerance mechanism? It keeps redundant copies of the data and switches over to them without human intervention when a failure occurs.
Commodity Hardware: Hadoop can run on simple hardware that is very cheap. The server-grade machines normally used in production are very expensive; Hadoop doesn't need such high-end devices because of its fault-tolerant nature.
Move the Code to the Data: This is one feature that really makes Hadoop a better tool to work with. In traditional systems, when we execute a piece of code, the data is fetched from the server to the machine where the code runs. Hadoop does exactly the opposite: it sends the code to where the data lives, executes it locally, and gets the results back. Because the large data never has to travel across the network, execution becomes faster.
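The "move the code to the data" idea can be sketched in a few lines of Python. These are hypothetical names and a toy simulation, not Hadoop's real API: each node keeps its own shard of the data, we ship a small function to every node, and only the small partial results travel back.

```python
class Node:
    """A toy stand-in for a cluster machine that owns a shard of the data."""
    def __init__(self, shard):
        self.shard = shard  # the data stays put on this node

    def run(self, code):
        # Execute the shipped code locally; only the small result travels back.
        return code(self.shard)

def total_on_cluster(nodes):
    # Ship the code (a tiny function) to each node instead of pulling the data.
    partial_sums = [node.run(lambda shard: sum(shard)) for node in nodes]
    # Only the small partial results cross the "network" to be combined.
    return sum(partial_sums)
```

Shipping a function of a few bytes is far cheaper than shipping terabytes of data, which is exactly why Hadoop takes this approach.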
The Hadoop framework consists of two main layers:
Hadoop Distributed File System (HDFS): HDFS is responsible for storage in Hadoop.
MapReduce (Execution Engine): MapReduce is the execution engine responsible for processing large datasets in a distributed environment.
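To make the MapReduce idea concrete, here is a minimal single-machine Python sketch of the classic word-count example. Real Hadoop MapReduce jobs are typically written in Java and run across a cluster; the function names below are my own, chosen to mirror the map, shuffle, and reduce phases.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.lower().split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by their key (the word).
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    # Reduce: sum the grouped counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

def word_count(lines):
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle(pairs))
```

In a real cluster, the map tasks run on the nodes holding the data, the shuffle moves the grouped pairs between nodes, and the reduce tasks produce the final counts.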
Hadoop follows a master–slave architecture: a master node coordinates the slave nodes that perform the operations.
This is just an overview of the Hadoop architecture. The key concepts will be discussed in the next articles of this series, to keep this article short and interesting.
Note: This article is the first in a series of Hadoop articles.