Apache Hadoop is widely regarded as one of the newer and better technologies designed to extract meaning out of "Big Data".
Hadoop was created by Doug Cutting, who named it after his son's toy elephant. It is a library of open source software used to create a distributed computing environment.
Although Hadoop has been around for a long time, many people still hold misconceptions about it that need to be corrected.
Let's look at some of the most common myths about Hadoop and big data that companies should know before committing to a Hadoop project.
Myth-1: Hadoop is a single product.
Fact: Hadoop consists of multiple products.
"Hadoop is the brand name of a family of open source products; those products are incubated and administered by Apache software."
The Apache Hadoop family of products includes:
- The Hadoop Distributed File System (HDFS)
- MapReduce
- Pig
- Hive
- HBase
- HCatalog
- Mahout, among others.
When people think of Hadoop, they typically think only of the Hadoop Distributed File System (HDFS), which serves as a foundation for other products such as MapReduce.
Myth-2: Hadoop is only about data volume.
Fact: Hadoop is also about data diversity, not just data volume.
Some people think of Hadoop as a technology designed only for high volumes of data, but its real value lies in its ability to handle diverse data.
Theoretically, HDFS can manage the storage and access of any data type, as long as you can put the data in a file and copy that file into HDFS. The main advantage of Hadoop, however, is the ability to analyze and extract meaningful information from huge volumes of such data.
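To make the idea of diverse data a little more concrete, here is a minimal Python sketch (plain Python, not Hadoop code) of deciding how to interpret each record only at the moment it is read. The function name parse_record, the three sample records, and the simple format-detection rules are all made up for illustration.

```python
import json
from typing import Any, Dict


def parse_record(raw: str) -> Dict[str, Any]:
    """Guess the format of one raw record and parse it accordingly."""
    raw = raw.strip()
    if raw.startswith("{"):
        # Looks like a JSON object: parse it into a dictionary.
        return {"format": "json", "fields": json.loads(raw)}
    if "," in raw:
        # Looks like a comma-separated row: split it into columns.
        return {"format": "csv", "fields": raw.split(",")}
    # Anything else is kept as free-form text.
    return {"format": "text", "fields": [raw]}


# Hypothetical records of three different shapes stored side by side.
records = [
    '{"user": "u1", "clicks": 3}',
    "u2,7,2012-05-01",
    "free-form log message",
]
parsed = [parse_record(r) for r in records]
```

The storage layer never needs to know about these formats; the interpretation happens in the code that reads the data.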
Myth-3: All the components of Hadoop are open source only.
Fact: Hadoop is open source but available from proprietary vendors too.
Apache Hadoop's open-source software library is available from the Apache Software Foundation and can be downloaded for free from apache.org.
But vendors like IBM, Cloudera and EMC Greenplum have also made Hadoop available through their own distributions.
Those distributions tend to come with added features, such as administrative tools not offered by Apache Hadoop, as well as support and maintenance.
A handful of vendors also offer their own non-Hadoop-based implementations of MapReduce.
Myth-4: Hadoop is the only answer to "Big Data".
Fact: Big data does not always require Hadoop.
Big Data and Hadoop have become nearly synonymous, but Hadoop is not the only answer to Big Data.
In fact, some companies were working on Big Data even before Hadoop existed. There are other options for Big Data too, such as Teradata and Vertica.
Myth-5: HDFS is the database management system of Hadoop.
Fact: HDFS is a file system, not a database management system (DBMS).
HDFS is a distributed file system, and it does not have the capabilities of a database management system (DBMS), such as indexing, random access to data, support for standard SQL, and query optimization.
To get minimal DBMS functionality, we can use HBase and Hive on top of HDFS.
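The gap between a file system and a DBMS can be sketched in a few lines of Python. The in-memory string below merely stands in for a file stored in HDFS; no Hadoop, HBase, or Hive API is used, and the names RAW_FILE, scan_lookup, and build_index are hypothetical. A plain file can only be read from the top, whereas an index (one of the things a DBMS layers on top of storage) allows a direct lookup by key.

```python
import csv
import io
from typing import Dict, Optional

# Stand-in for a small file stored in a file system (illustrative data only).
RAW_FILE = "id,name\n1,alice\n2,bob\n3,carol\n"


def scan_lookup(raw: str, wanted_id: str) -> Optional[str]:
    """File-system style access: read rows in order until the key is found."""
    for row in csv.DictReader(io.StringIO(raw)):
        if row["id"] == wanted_id:
            return row["name"]
    return None


def build_index(raw: str) -> Dict[str, str]:
    """DBMS style access: build an index once so later lookups skip the scan."""
    return {row["id"]: row["name"] for row in csv.DictReader(io.StringIO(raw))}


index = build_index(RAW_FILE)
found_by_scan = scan_lookup(RAW_FILE, "2")   # walks the file row by row
found_by_index = index.get("3")              # direct lookup by key
```

On a three-row file the difference is invisible, but on a file spread across a cluster the full scan is exactly the cost that indexing is meant to avoid.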
Myth-6: Hive is nothing but a SQL-based tool.
Fact: Hive resembles SQL but is not standard SQL.
Hive is a SQL-like tool. This means that people who are proficient in SQL can quickly learn Hive. But this does not solve the compatibility issues with SQL-based tools.
In the future, Hadoop might support standard SQL and SQL-based tools, but currently it does not.
Myth-7: MapReduce is an integral part of HDFS.
Fact: HDFS and MapReduce are related but don't require each other.
MapReduce was developed by Google before HDFS existed. Although HDFS and MapReduce are a good combination, each can be used independently of the other.
Some vendors such as MapR are creating variations of MapReduce that do not need HDFS. Some users deploy HDFS with Hive or HBase, but not MapReduce.
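Because MapReduce is a programming model rather than a feature of HDFS, its three steps (map, shuffle, reduce) can be sketched in plain Python with no file system involved at all. This is a toy, single-process word count written under that assumption, not Hadoop's implementation, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce over in-memory lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# The input is an ordinary Python list, not a file in HDFS.
counts = word_count(["big data is big", "hadoop handles big data"])
```

The input here comes from a list in memory, which is the point: the map and reduce functions do not care where the records are stored.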
Myth-8: Hadoop is mainly used by Internet companies for analyzing Web logs and other Web data.
Fact: Hadoop enables many types of analytics, not just Web analytics.
We hear a lot of news about how Internet companies use Hadoop to analyze Web logs and other Web data, but other use cases exist.
Railroad companies, for example, are using sensors to detect unusually high temperatures on rail cars, which can signal an impending failure.
Older analytic applications that need large data samples - such as customer base segmentation, fraud detection, and risk analysis - can benefit from the additional big data managed by Hadoop.
Myth-9: Hadoop is totally free.
Fact: Hadoop is open source, but deploying it still involves costs.
People think that because Hadoop is open source, there is no cost involved at all. This is not true.
The lack of features such as administrative tools and support can create additional costs.
There is also the hardware cost of a Hadoop cluster, as well as the real estate and the power it takes to make that cluster operational.
Myth-10: Hadoop is an alternative to, or a replacement for, the Data Warehouse.
Fact: Hadoop helps a Data Warehouse. It is not a replacement.
Data warehouses still do the work, and Hadoop actually complements the data warehouse by becoming "an edge system".
Many Data Warehouses were designed only for structured, relational data, which makes it difficult to use unstructured data. Hadoop becomes a helping hand in this case.
Happy Learning :) ...