The state of Hadoop distributions
By David WORMS
May 11, 2013
Apache Hadoop is of course available for download from its official web page. However, downloading and installing the several components that make up a Hadoop cluster is a daunting task. Below is a list of the main distributions that include Hadoop. This follows an article published a few days ago about the Hadoop ecosystem.
A Hadoop cluster is not limited to HDFS and Map/Reduce. ZooKeeper, HBase, Hive, HCatalog, Oozie, Pig and Sqoop all seem indispensable, as they address various and complementary concerns. Today, you will want to let your users run YARN while maintaining compatibility with the older, original Map/Reduce framework. Additionally, you will need more tools, such as Ganglia and Nagios, to monitor your cluster.
There are many pieces in this puzzle. Writing scripts to deploy and upgrade all those components is not an easy task. Not all versions are compatible with one another. To make it harder, the versioning strategy is a bit cryptic. It has become a little easier but is not yet simple. Below is a diagram published by Cloudera in April 2012 illustrating the situation at the time.
All of this is to say that considering the use of a Hadoop distribution is not an esoteric decision.
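To make the compatibility problem described above concrete, here is a minimal sketch of the kind of check a hand-written deployment script ends up needing. Everything in it is an assumption made for illustration: the `SUPPORTED` table, the `find_conflicts` helper and all version labels are hypothetical placeholders, not real release data from any project or vendor.

```python
from typing import Dict, List, Set

# Hypothetical compatibility table: for each Hadoop line, the component
# versions assumed to work with it. All labels are invented placeholders.
SUPPORTED: Dict[str, Dict[str, Set[str]]] = {
    "hadoop-A": {"hbase": {"hbase-1"}, "hive": {"hive-1", "hive-2"}},
    "hadoop-B": {"hbase": {"hbase-2"}, "hive": {"hive-2"}},
}


def find_conflicts(hadoop: str, components: Dict[str, str]) -> List[str]:
    """Return the sorted names of components whose version is not listed
    as supported for the given Hadoop line."""
    allowed = SUPPORTED.get(hadoop, {})
    return sorted(
        name
        for name, version in components.items()
        if version not in allowed.get(name, set())
    )


# A planned upgrade that keeps an HBase version unknown to the new Hadoop line.
planned = {"hbase": "hbase-1", "hive": "hive-2"}
print(find_conflicts("hadoop-A", planned))  # no conflicts
print(find_conflicts("hadoop-B", planned))  # hbase is flagged
```

A distribution essentially ships a tested row of such a table, so that nobody has to maintain it by hand.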
Today, the oldest and most popular distribution is Cloudera’s Distribution. It is a good choice, which I recommend to my customers. However, as of today, I would bet that the 100% Open Source Hortonworks Data Platform, including Ambari, is the most promising distribution, and it is the one I use personally on my laptop.
Other distributions include the commercial MapR and InfoSphere BigInsights. Lately, we have seen new distributions appear, such as WANdisco Hadoop WDD, the Intel Distribution for Hadoop and Pivotal HD from EMC Greenplum.
Finally, it is worth mentioning the appliances including Apache Hadoop:
- Oracle’s Big Data appliance featuring Cloudera’s Distribution of Hadoop
- NetApp’s Hadooplers (the link no longer exists)
- EMC Greenplum DCA
- Teradata Aster Discovery Platform featuring the Hortonworks Data Platform
- Data Direct Networks (DDN)
More about Cloudera’s Distribution
The large majority of Cloudera’s effort is open sourced through the Cloudera GitHub account until it eventually lands in the Apache Incubator before graduating as a top-level Apache project. Flume and Sqoop are examples of such top-level Apache projects. Cloudera Manager is the only project I can think of that isn’t open sourced. Also, while Open Source, Hue seems limited to Cloudera’s Distribution but should soon be distributed with the Hortonworks platform.
More about the Hortonworks Data Platform
Looking at what happened in the past 2 years and at what is expected in the next 2 years, Hortonworks is at the core of Hadoop development. Projects like YARN, HCatalog, Ambari and Tez originate from Hortonworks.
Being in Apache incubation, Ambari is in a good position to become the standard for Hadoop deployment over competitors such as Cloudera Manager. Among the components it manages, HCatalog, Ganglia and Nagios are of particular interest.
More about Pivotal HD
Pivotal HD will incorporate Project Hawq, an SQL database layer that rides atop HDFS rather than trying to replace it with a NoSQL data store. It takes the parallel guts of the Greenplum database and reworks them to transform the Hadoop Distributed File System (HDFS) into something that speaks perfectly fluent SQL.
More about the Intel Distribution for Hadoop
The Intel Distribution is the first to provide complete encryption with support of Intel® AES New Instructions (Intel® AES-NI) in the Intel® Xeon® processor. By incorporating silicon-based encryption support of the Hadoop Distributed File System, organizations can now more securely analyze their data sets without compromising performance. The optimizations made for the networking and IO technologies in the Intel Xeon processor platform also enable new levels of analytic performance. Analyzing one terabyte of data, which would previously take more than 4 hours to fully process, can now be done in 7 minutes thanks to the data-crunching combination of Intel’s hardware and the Intel Distribution.
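To put the quoted figures in perspective, the following back-of-the-envelope calculation converts them into a speedup and an aggregate throughput. It is only a sketch: it assumes a decimal terabyte (10^12 bytes), which the text does not specify, and it takes "more than 4 hours" as exactly 4 hours, so the speedup is a lower bound on the vendor's claim rather than a measured result.

```python
# Assumption: a decimal terabyte; the original text does not define the unit.
TERABYTE_BYTES = 10**12


def throughput_mb_per_s(total_bytes: int, seconds: float) -> float:
    """Average throughput in megabytes (10^6 bytes) per second."""
    return total_bytes / seconds / 1e6


before_seconds = 4 * 60 * 60  # "more than 4 hours", taken as exactly 4 hours
after_seconds = 7 * 60        # 7 minutes

speedup = before_seconds / after_seconds
print(f"speedup: at least {speedup:.1f}x")                                  # about 34.3x
print(f"before: {throughput_mb_per_s(TERABYTE_BYTES, before_seconds):.0f} MB/s")  # about 69 MB/s
print(f"after:  {throughput_mb_per_s(TERABYTE_BYTES, after_seconds):.0f} MB/s")   # about 2381 MB/s
```

The throughput numbers are totals for the whole job, whatever the number of nodes involved, which the text does not state.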
More about Oracle’s Big Data appliance
The rack configuration is pre-integrated with 18 nodes that include InfiniBand and Ethernet connectivity. It includes Cloudera’s Distribution and the Oracle NoSQL Database Community Edition to acquire data.
More about NetApp’s Hadooplers
Networked storage configurations familiar to NetApp are not common in most Hadoop clusters. Therefore, NetApp’s first Hadoopler is pre-configured with Serial Attached SCSI (SAS) ports directly attached to each data node. There is no switch involved in the storage configuration of this Hadoopler. Local disk semantics and performance are what every Hadoop data node expects today, and that is precisely how the Hadoopler is configured.
Shared DAS addresses the inevitable storage capacity growth requirements of Hadoop nodes in a cluster by placing disks in an external shelf shared by multiple directly attached hosts (aka Hadoop compute nodes). The connectivity from host to disk can be SATA, SAS, SCSI or even Ethernet, but always in a direct rather than networked storage configuration. Therefore Shared DAS does not use a storage switch.
NetApp is committed to the open Apache Distribution of Hadoop, which it believes will serve as a long-term unifying force in the Hadoop community and the foundation for durable future innovation in the Big Data ecosystem. I read in the last few days that EMC is moving in the same direction.