Tuesday 9 June 2015

Top 10 Myths about Hadoop

Apache Hadoop is considered one of the best of the newer technologies designed to extract meaning out of "Big Data".
Hadoop was created by Doug Cutting, who named it after his son's toy elephant. It is a library of open source software used to create a distributed computing environment.
Although Hadoop has been around for a long time, many people still have misconceptions about it that need to be corrected.

Let's look at some of the most common myths about Hadoop and big data that companies should know before committing to a Hadoop project.

Myth-1: Hadoop is a single product. 

Fact: Hadoop consists of multiple products.


People often assume that Hadoop is a single product, but it is actually made up of multiple products.
"Hadoop is the brand name of a family of open source products; those products are incubated and administered by the Apache Software Foundation."

The Apache Hadoop library includes:


  • The Hadoop Distributed File System (HDFS)
  • MapReduce
  • Pig
  • Hive
  • HBase
  • HCatalog
  • Mahout and so on.

When people think of Hadoop, they typically think of the Hadoop Distributed File System (HDFS), which acts as a foundation that other products, such as MapReduce, build on.

Myth-2: Hadoop is only about data volume. 
Fact: Hadoop is also about data diversity, not just data volume. 


Some people think of Hadoop as a technology designed only for high volumes of data, but Hadoop's real value is its ability to handle diverse data.

Theoretically, HDFS can manage the storage and access of any data type as long as you can put the data in a file and copy that file into HDFS. But the main advantage of Hadoop is the ability to analyze and extract meaningful information from the huge data volume.
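Here is a minimal sketch of that idea using the Hadoop FileSystem Java API; the NameNode address and file paths are placeholders invented for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode URI (the key is fs.default.name on Hadoop 1.x, fs.defaultFS on 2.x).
        conf.set("fs.default.name", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        // Any file (CSV, JSON, images, logs) is copied into HDFS the same way.
        fs.copyFromLocalFile(new Path("/tmp/clickstream.json"),
                             new Path("/data/raw/clickstream.json"));
        fs.close();
    }
}

Once the file is in HDFS, tools like MapReduce, Hive, Pig or HBase can work on it regardless of its format.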


Myth-3: All the components of Hadoop are open source only. 
Fact: Hadoop is open source but available from proprietary vendors too.


Apache Hadoop's open-source software library is available from the Apache Software Foundation and can be downloaded for free from apache.org.
But vendors like IBM, Cloudera and EMC Greenplum have also made Hadoop available through their own distributions.

Those distributions tend to come with added features, such as administrative tools not offered by Apache Hadoop, as well as support and maintenance.
A handful of vendors also offer their own non-Hadoop-based implementations of MapReduce.


Myth-4: Hadoop is the only answer to "Big Data". 
Fact: Big data does not always require Hadoop. 


Big Data and Hadoop have become almost synonymous, but Hadoop is not the only answer to Big Data.
In fact, some companies were working on Big Data even before Hadoop existed, and there are other platforms for Big Data too, such as Teradata and Vertica.


Myth-5: HDFS is the database management system of Hadoop. 
Fact: HDFS is a file system, not a database management system (DBMS). 


At its core, Hadoop provides a distributed file system (HDFS), which does not have the capabilities of a database management system (DBMS) such as indexing, random access to data, support for standard SQL, or query optimization.
To get minimal DBMS functionality, we can use HBase and Hive on top of HDFS.
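For example, here is a minimal sketch of keyed reads and writes using the (Hadoop 1.x era) HBase Java client API; the table and column names are invented for illustration, and the table is assumed to already exist:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users");   // hypothetical, pre-created table

        // Random write by row key, something a plain HDFS file cannot offer.
        Put put = new Put(Bytes.toBytes("user-1001"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
        table.put(put);

        // Random read by row key.
        Result result = table.get(new Get(Bytes.toBytes("user-1001")));
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));

        table.close();
    }
}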


Myth-6: Hive is nothing but a SQL-based tool. 
Fact: Hive resembles SQL but is not standard SQL. 


Hive uses a SQL-like language (HiveQL), which means that people who are proficient in SQL can quickly learn Hive. But this does not solve compatibility issues with standard SQL-based tools.
In the future, Hadoop might support standard SQL and integrate with standard SQL-based tools, but currently it does not.
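To illustrate, here is a minimal sketch that runs HiveQL through the Hive JDBC driver; the connection URL and table are placeholders. Note the Hive-specific clauses (ROW FORMAT, STORED AS) that standard SQL does not have:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, database and table are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "", "");
        Statement stmt = con.createStatement();

        // HiveQL looks like SQL but adds storage clauses that standard SQL lacks.
        stmt.execute("CREATE TABLE IF NOT EXISTS web_logs (ip STRING, url STRING) "
                   + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
                   + "STORED AS TEXTFILE");

        // The familiar-looking query below is compiled into MapReduce jobs,
        // not executed by a traditional SQL engine.
        ResultSet rs = stmt.executeQuery("SELECT ip, COUNT(*) FROM web_logs GROUP BY ip");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        con.close();
    }
}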


Myth-7: MapReduce is an integral part of HDFS. 
Fact: HDFS and MapReduce are related but don't require each other. 


MapReduce was developed by Google before HDFS existed. Although HDFS and MapReduce are a good combination, each can be used independently.
Some vendors such as MapR are creating variations of MapReduce that do not need HDFS. Some users deploy HDFS with Hive or HBase, but not MapReduce.


Myth-8: Hadoop is mainly used by Internet companies for analyzing Web logs and other Web data.
Fact: Hadoop enables many types of analytics, not just Web analytics. 


We hear a lot about how Internet companies use Hadoop for analyzing Web logs and other Web data, but other use cases exist.

Railroad companies are, for example, using sensors to detect unusually high temperatures on rail cars, which can signal an impending failure.
Older analytic applications that need large data samples - such as customer base segmentation, fraud detection, and risk analysis - can benefit from the additional big data managed by Hadoop.


Myth-9: Hadoop is totally free. 
Fact: Hadoop is open source but there are also deployment costs involved. 


Because Hadoop is open source, people think there is no cost involved at all. This is not true.

The lack of features such as administrative tools and support can create additional costs.
There is also the hardware cost of a Hadoop cluster, as well as the real estate and power it takes to make that cluster operational.


Myth-10: Hadoop is an alternative/replacement for the Data Warehouse. 
Fact: Hadoop helps a Data Warehouse. It is not a replacement. 


Data warehouses still do their work, and Hadoop actually complements the data warehouse by becoming "an edge system".
Many data warehouses were designed only for structured, relational data, which makes it difficult for them to handle unstructured data. Hadoop lends a helping hand in such cases.

Happy Learning :) ...

MapReduce - The Heart of Hadoop

In this article, we will learn:

  1. What is MapReduce
  2. A few interesting facts about MapReduce
  3. MapReduce components and architecture
  4. How MapReduce works in Hadoop


MapReduce:

MapReduce is a programming model used to process large data sets in a batch-processing manner.
A MapReduce program is composed of

  • a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name)
  • and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies); a minimal Java sketch of this example follows below.
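The students-by-name example above can be sketched as a Hadoop MapReduce program in Java; the class names, input format (one CSV record per student, first name in the first column) and types below are assumptions made for illustration:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class NameFrequency {

    // Map(): emit (firstName, 1) for every student record -- the "sorting into queues".
    public static class NameMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String firstName = value.toString().split(",")[0];   // assumes CSV input
            ctx.write(new Text(firstName), ONE);
        }
    }

    // Reduce(): count the students in each queue, yielding name frequencies.
    public static class NameReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text name, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(name, new IntWritable(sum));
        }
    }
}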


Few Important Facts about MapReduce:


  • Apache Hadoop MapReduce is an open source implementation of Google's MapReduce framework.
  • There are other MapReduce implementations developed for distributed systems, such as Dryad from Microsoft and Disco from Nokia, but Hadoop is the most popular among them, offering an open source implementation of the MapReduce framework.
  • The Hadoop MapReduce framework works on a master/slave architecture.


MapReduce Architecture:



Hadoop 1.x MapReduce is composed of two components.

  1. The Job Tracker, which plays the role of master and runs on the master node (NameNode)
  2. The Task Tracker, which plays the role of slave, one per data node, and runs on the DataNodes

Job Tracker:



  1. The Job Tracker is the component to which client applications submit MapReduce programs (jobs).
  2. The Job Tracker schedules client jobs and allocates tasks to the slave Task Trackers that run on individual worker machines (data nodes).
  3. The Job Tracker manages the overall execution of a MapReduce job.
  4. The Job Tracker manages the resources of the cluster, for example:
    • Managing the data nodes, i.e. the Task Trackers.
    • Keeping track of consumed and available resources.
    • Keeping track of already running tasks and providing fault tolerance for tasks.

Task Tracker:


  1. Each Task Tracker is responsible for executing and managing the individual tasks assigned by the Job Tracker.
  2. The Task Tracker also handles the data motion between the map and reduce phases.
  3. One prime responsibility of the Task Tracker is to constantly communicate the status of its tasks to the Job Tracker.
  4. If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster.

How the MapReduce Engine Works:

Let us understand how exactly a MapReduce program gets executed in Hadoop, and what the relationship is between the different entities involved in this whole process. 

The entire process can be listed as follows:

  1. Client applications submit jobs to the JobTracker.
  2. The JobTracker talks to the NameNode to determine the location of the data.
  3. The JobTracker locates TaskTracker nodes with available slots at or near the data.
  4. The JobTracker submits the work to the chosen TaskTracker nodes.
  5. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
  6. A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
  7. When the work is completed, the JobTracker updates its status.
  8. Client applications can poll the JobTracker for information.

Let us see these steps in more detail.

1. Client submits MapReduce job to Job Tracker: 

Whenever a client/user submits a MapReduce job, it goes straight to the Job Tracker. The client program contains all the information such as the map, combine and reduce functions, and the input and output paths of the data. 
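A minimal sketch of such a client program is shown below, reusing the hypothetical NameFrequency mapper and reducer from the earlier sketch; the input and output paths are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitNameFrequency {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "name-frequency");          // Hadoop 1.x style constructor
        job.setJarByClass(SubmitNameFrequency.class);

        // The client bundles the map, (optional) combine and reduce functions...
        job.setMapperClass(NameFrequency.NameMapper.class);
        job.setCombinerClass(NameFrequency.NameReducer.class);
        job.setReducerClass(NameFrequency.NameReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // ...and the input and output paths of the data.
        FileInputFormat.addInputPath(job, new Path("/data/students"));
        FileOutputFormat.setOutputPath(job, new Path("/data/name-counts"));

        // Submit to the JobTracker and wait for completion.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}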



2. Job Tracker Manage and Control Job: 

  • The JobTracker puts the job in a queue of pending jobs and then executes them on a FCFS (first come, first served) basis.
  • The Job Tracker first determines the number of splits from the input path and assigns map and reduce tasks to the Task Trackers in the cluster. There will be one map task for each split. For example, a 640 MB input file stored in 64 MB HDFS blocks would typically be divided into 10 splits and therefore get 10 map tasks.
  • The Job Tracker talks to the NameNode to determine the location of the data, i.e. to determine the data nodes which contain the data.




3. Task Assignment to Task Tracker by Job Tracker: 

  • Each Task Tracker is pre-configured with a number of slots, which indicates how many tasks it can accept. For example, a TaskTracker may be able to run two map tasks and two reduce tasks simultaneously.
  • When the Job Tracker tries to schedule a task, it looks for an empty slot in the TaskTracker running on the same server that hosts the data node where the data for that task resides. If none is found, it looks for a machine in the same rack. System load is not considered during this allocation.



4. Task Execution by Task Tracker: 

  • When a task is assigned to a Task Tracker, the Task Tracker creates a local environment to run the task.
  • The Task Tracker needs resources to run the job, so it copies any files the application needs from the distributed cache to the local disk and localizes the job JARs by copying them from the shared file system to the Task Tracker's file system.
  • The Task Tracker can also spawn multiple JVMs to handle many map or reduce tasks in parallel.
  • The Task Tracker then initiates the map or reduce tasks and reports progress back to the JobTracker.




5. Send notification to Job Tracker: 

  • When all the map tasks are done, the Task Trackers notify the Job Tracker. The Job Tracker then asks the selected Task Trackers to begin the reduce phase.

6. Task recovery in failover situation: 

  • Although there is a single Task Tracker on each node, the Task Tracker spawns a separate Java Virtual Machine process for each task, so that the Task Tracker itself does not fail if a running task crashes its JVM due to bugs in the user-written map or reduce functions.

7. Monitor Task Tracker: 

  • The TaskTracker nodes are monitored. A heartbeat is sent from the TaskTracker to the JobTracker every few seconds to report its status.
  • If a Task Tracker does not submit heartbeat signals often enough, it is deemed to have failed and its work is scheduled on a different TaskTracker.
  • A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.

8. Job Completion: 

  • When the work is completed, the JobTracker updates its status.
  • Client applications can poll the JobTracker for information; see the polling sketch below.
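As a minimal sketch (building on the hypothetical job configuration from step 1), a client can submit the job asynchronously and poll its progress instead of blocking on waitForCompletion():

import org.apache.hadoop.mapreduce.Job;

public class PollJobStatus {
    // 'job' is assumed to be configured as in the earlier submission sketch.
    public static void runAndPoll(Job job) throws Exception {
        job.submit();                               // hand the job to the JobTracker
        while (!job.isComplete()) {                 // poll the JobTracker for status
            System.out.printf("map %.0f%%  reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000);
        }
        System.out.println(job.isSuccessful() ? "Job succeeded" : "Job failed");
    }
}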