State of the Hadoop open-source ecosystem in early 2013
By David WORMS
Jul 8, 2013
Hadoop is already a large ecosystem and my guess is that 2013 will be the year when it grows even larger. Some pieces no longer need an introduction: ZooKeeper, HBase, Hive, Pig, Flume, Oozie, Avro, Sqoop, Cascading, Cascalog, HUE, Mahout, to name a few. At the same time, many Open Source projects are appearing on GitHub and elsewhere or being introduced to the Apache Incubator.
- Ingest/Collect: Flume, Scribe, Sqoop, HIHO, Chukwa
- Workflow, Coordination, ETL: Oozie, NiFi, Cascading, Luigi, Azkaban, Hamake, Crunch, Sprunch, Falcon, Morphlines (edit: now Kite)
- Map/Reduce Processing: Pig, Lipstick, Cascalog, Scoobi, Scalding, Parkour, PigPen
- SQL querying: Hive, HCatalog, Tajo, Impala, Presto, Lingual, Splout, Phoenix, Drill, Shark, MRQL, BigSQL
- Graph processing: Giraph, Titan, Pegasus
- Statistics, text mining: Mahout, DataFu, PredictionIO, Cloudera ML, RHadoop, Mavuno, Bixo, SnowPlow, Gertrude, Oryx, Myrrix
- Compute engine, Search: Weave, Hama, Pangool, Kiji, Lily
- Stream: Storm, Summingbird, Samza, Jubatus
- Storage: HDFS, Tachyon
- Databases: HBase, ElephantDB, Accumulo, Tephra, Omid, Blur, Gora, Culvert, Panthera ASE, Panthera DOT, OpenTSDB, Kylin
- MapReduce libraries, Hive UDFs, drivers: GIS Tools for Hadoop, Hive map udaf, Hivemall, Mongo-Hadoop Adapter, Quest Sqoop plugin, Hourglass
- File Formats: Trevni, Parquet, Hive ORC file, forqlift
- Publish/Subscribe System: BookKeeper, Kafka
- Engines: Tez, Cubert
- Describe, Develop: Elephant Bird, Twill, Flink, HDT, BIRT, Karmasphere, hRaven
- Analytics: Pulsar, Knime
- Job scheduling: Mesos, Corona
- Deploy, Administer: Hue, Ambari, Ryba, SmartFrog, Oink, Hannibal, Genie, Serengeti, Whirr, Knox, Rhino, Sentry, Pallet, Helix, Sahara, Exhibitor, osquery
- Monitor, Benchmark: Ganglia, Nagios, White Elephant, Timberlake, HiTune, HiBench, HDFS-DU
- Book and documentation: Data-Intensive Text Processing with MapReduce
Ingest/Collect
Flume
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
Scribe
Scribe is a server for aggregating log data that’s streamed in real time from clients. It is designed to be scalable and reliable.
Apache Sqoop
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
HIHO: Hadoop In, Hadoop Out
HIHO, for Hadoop In, Hadoop Out, is a data integration tool connecting Hadoop with various databases, FTP servers, and Salesforce. It lets you incrementally update, dedup, append, and merge your data on Hadoop.
Chukwa
Chukwa is an Open Source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.
Workflow, Coordination, ETL
Oozie
Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.
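As an illustration, a workflow can be submitted programmatically through the Oozie Java client API; this is a minimal sketch, assuming a local Oozie server and a hypothetical workflow path and parameter:

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.OozieClientException;

public class SubmitWorkflow {
    public static void main(String[] args) throws OozieClientException {
        // Point the client at the Oozie server (URL is an assumption).
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        // Job properties: the application path locates workflow.xml on HDFS.
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/alice/my-workflow");
        conf.setProperty("inputDir", "/user/alice/input"); // hypothetical workflow parameter

        // Submit and start the workflow, then poll its status.
        String jobId = oozie.run(conf);
        System.out.println("Workflow status: " + oozie.getJobInfo(jobId).getStatus());
    }
}
```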
NiFi
Apache NiFi is a dataflow system based on the concepts of flow-based programming. Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.
Cascading
Cascading is a data processing API and processing query planner used for defining, sharing, and executing robust Data Analytics and Data Management applications on Apache Hadoop. Cascading adds an abstraction layer over the Hadoop API, greatly simplifying Hadoop application development, job creation, and job scheduling.
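To give a flavor of that abstraction layer, here is a word-count flow adapted from the canonical Cascading example; a sketch assuming the Cascading 2.x class names and hypothetical HDFS paths:

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class WordCount {
    public static void main(String[] args) {
        // Source and sink taps bind the assembly to HDFS paths.
        Tap source = new Hfs(new TextLine(new Fields("line")), "input/docs");
        Tap sink = new Hfs(new TextLine(), "output/wordcount", SinkMode.REPLACE);

        // Split each line into words, group by word, count each group.
        Pipe assembly = new Pipe("wordcount");
        assembly = new Each(assembly, new Fields("line"),
                new RegexGenerator(new Fields("word"), "\\w+"));
        assembly = new GroupBy(assembly, new Fields("word"));
        assembly = new Every(assembly, new Count(new Fields("count")));

        // Plan the assembly into MapReduce jobs and run it.
        Flow flow = new HadoopFlowConnector(new Properties())
                .connect("word-count", source, sink, assembly);
        flow.complete();
    }
}
```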
Luigi
Luigi is a Python package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
The purpose of Luigi is to address all the plumbing typically associated with long-running batch processes. You want to chain many tasks, automate them, and failures will happen. These tasks can be anything, but are typically long running things like Hadoop jobs, dumping data to/from databases, running machine learning algorithms, or anything else.
Azkaban
Azkaban is a batch workflow job scheduler created at LinkedIn to run their Hadoop Jobs. Often times there is a need to run a set of jobs and processes in a particular order within a workflow. Azkaban will resolve the ordering through job dependencies and provide an easy to use web user interface to maintain and track your workflows.
Hamake
Hamake is a lightweight utility based on the dataflow programming model. The data dependency graph is expressed using just two data flow instructions, fold and foreach, providing a clear processing model, similar to MapReduce but at the dataset level. Another design goal was to create a simple-to-use utility that developers can start using right away without complex installation or extensive learning.
Crunch
Running on top of Hadoop MapReduce and Apache Spark, the Apache Crunch™ library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns.
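The canonical example is a word count; a minimal sketch of the Crunch API, with hypothetical input and output paths:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;

public class CrunchWordCount {
    public static void main(String[] args) {
        // An MRPipeline plans and runs the computation on Hadoop MapReduce.
        Pipeline pipeline = new MRPipeline(CrunchWordCount.class, new Configuration());
        PCollection<String> lines = pipeline.readTextFile("input/docs");

        // Split lines into words, then count occurrences of each word.
        PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
            @Override
            public void process(String line, Emitter<String> emitter) {
                for (String word : line.split("\\s+")) {
                    emitter.emit(word);
                }
            }
        }, Writables.strings());

        PTable<String, Long> counts = words.count();
        pipeline.writeTextFile(counts, "output/wordcount");
        pipeline.done();
    }
}
```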
Sprunch
Sprunch is a minimalist Scala API on top of Apache Crunch (http://crunch.apache.org) with the aim of removing boilerplate from Java + Crunch whilst adding as little complexity as possible.
Apache Falcon
Falcon is a data processing and management solution for Hadoop designed for data motion, coordination of data pipelines, lifecycle management, and data discovery. Falcon enables end consumers to quickly onboard their data and its associated processing and management tasks on Hadoop clusters.
Cloudera Morphlines
For now distributed as part of the Cloudera Development Kit, Morphlines is a new Open Source framework that reduces the time and skills necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards.
Map/Reduce Processing
Pig
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
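Pig Latin scripts can also be embedded in Java through the PigServer API; a minimal sketch, with hypothetical paths and schema:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPig {
    public static void main(String[] args) throws Exception {
        // Run Pig Latin statements from Java; MAPREDUCE mode targets the cluster.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("logs = LOAD '/data/access_logs' AS (user:chararray, url:chararray);");
        pig.registerQuery("grouped = GROUP logs BY user;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(logs);");
        pig.store("counts", "/data/url_counts"); // triggers execution and writes the result
    }
}
```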
Lipstick
Lipstick combines a graphical depiction of a Pig workflow with information about the job as it executes.
Cascalog
Cascalog is a fully-featured data processing and querying library for Clojure or Java. The main use cases for Cascalog are processing “Big Data” on top of Hadoop or doing analysis on your local computer. Cascalog is a replacement for tools like Pig, Hive, and Cascading and operates at a significantly higher level of abstraction than those tools.
Scoobi
Scoobi focuses on making you more productive at building Hadoop applications. Scoobi is a library that leverages the Scala programming language to provide a programmer friendly abstraction around Hadoop’s MapReduce to facilitate rapid development of analytics and machine-learning algorithms.
Scalding
Scalding is a Scala library that makes it easy to specify Hadoop MapReduce jobs. Scalding is built on top of Cascading, a Java library that abstracts away low-level Hadoop details. Scalding is comparable to Pig, but offers tight integration with Scala, bringing advantages of Scala to your MapReduce jobs.
Parkour
Hadoop MapReduce in idiomatic Clojure. Parkour takes your Clojure code’s functional gymnastics and sends it free-running across the urban environment of your Hadoop cluster.
PigPen
PigPen is map-reduce for Clojure. It compiles to Apache Pig, but you don’t need to know much about Pig to use it.
SQL querying
Apache Hive
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
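For illustration, HiveQL can be issued from Java through the HiveServer2 JDBC driver; a minimal sketch assuming a local HiveServer2 and a hypothetical access_logs table:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver and connect.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");

        // HiveQL looks like SQL but is compiled into MapReduce jobs.
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(
            "SELECT url, COUNT(*) AS hits FROM access_logs GROUP BY url ORDER BY hits DESC LIMIT 10");
        while (rs.next()) {
            System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
        }
        conn.close();
    }
}
```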
Apache Tajo
Tajo is a relational and distributed data warehouse system for Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation and ETL on large data sets by leveraging advanced database techniques. It supports SQL standards. Tajo uses HDFS as a primary storage layer and has its own query engine which allows direct control of distributed execution and data flow. As a result, Tajo has a variety of query evaluation strategies and more optimization opportunities. In addition, Tajo will have native columnar execution and its own optimizer.
HCatalog
Apache HCatalog is a table and storage management service for data created using Apache Hadoop. This includes providing a shared schema and data type mechanism, providing a table abstraction so that users need not be concerned with where or how their data is stored and providing interoperability across data processing tools such as Pig, Map Reduce, and Hive.
Cloudera Impala
Cloudera Impala is a distributed query execution engine that runs against data stored natively in Apache HDFS and Apache HBase.
Presto
Presto is an Open Source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.
Lingual
Lingual executes ANSI SQL queries as Cascading applications on Apache Hadoop clusters.
Splout (edit: now defunct)
Splout serves partitioned SQL databases which are generated and indexed by Hadoop. It is to Hadoop + SQL what Voldemort or ElephantDB are to Hadoop + Key/Value.
Phoenix JDBC Driver for HBase
Phoenix is a SQL layer over HBase delivered as an embedded JDBC driver targeting low latency queries over HBase data. Tables are created and updated through DDL statements and stored and versioned on the server in an HBase table. Columns are defined as either being part of a multi-part row key or as key value cells. The Phoenix query engine transforms your SQL query into one or more HBase scans and orchestrates their execution to produce standard JDBC result sets.
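Since Phoenix is delivered as a JDBC driver, usage looks like plain SQL; a minimal sketch assuming a local ZooKeeper quorum and a hypothetical metrics table:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixExample {
    public static void main(String[] args) throws Exception {
        // The connection string names the ZooKeeper quorum of the HBase cluster.
        Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
        Statement stmt = conn.createStatement();

        // DDL and upserts are translated into HBase tables, puts and scans.
        stmt.execute("CREATE TABLE IF NOT EXISTS metrics (host VARCHAR NOT NULL PRIMARY KEY, cpu DECIMAL)");
        stmt.executeUpdate("UPSERT INTO metrics VALUES ('web01', 42.5)");
        conn.commit(); // Phoenix buffers mutations until commit

        ResultSet rs = stmt.executeQuery("SELECT host, cpu FROM metrics WHERE cpu > 10");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getBigDecimal(2));
        }
        conn.close();
    }
}
```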
Drill
Apache Drill is a distributed system for interactive analysis of large-scale datasets, based on Google’s Dremel. Its goal is to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.
Shark (Hive on Spark)
Shark is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can answer Hive QL queries up to 100 times faster than Hive without modification to the existing data or queries. Shark supports Hive’s query language, metastore, serialization formats, and user-defined functions.
Spark is an Open Source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.
Apache MRQL
MRQL (pronounced miracle) is a query processing and optimization system for large-scale, distributed data analysis. MRQL (the MapReduce Query Language) is an SQL-like query language for large-scale data analysis on a cluster of computers. The MRQL query processing system can evaluate MRQL queries in two modes: in MapReduce mode on top of Apache Hadoop or in Bulk Synchronous Parallel (BSP) mode on top of Apache Hama. The MRQL query language is powerful enough to express most common data analysis tasks over many forms of raw in-situ data, such as XML and JSON documents, binary files, and CSV documents. MRQL is more powerful than other current high-level MapReduce languages, such as Hive and PigLatin, since it can operate on more complex data and supports more powerful query constructs, thus eliminating the need for using explicit MapReduce code. With MRQL, users will be able to express complex data analysis tasks, such as PageRank, k-means clustering, matrix factorization, etc, using SQL-like queries exclusively, while the MRQL query processing system will be able to compile these queries to efficient Java code.
BigSQL (edit: now defunct)
BigSQL™ provides a simple way to try out a complete Big SQL bundle of PostgreSQL and Hadoop.
Graph processing
Apache Giraph
Apache Giraph implements a graph-processing framework that is launched as a typical Hadoop job to leverage existing Hadoop infrastructure, such as Amazon’s EC2. Giraph builds upon the graph-oriented nature of Pregel but additionally adds fault tolerance to the coordinator process with the use of ZooKeeper as its centralized coordination service.
Titan
Titan is a highly scalable graph database optimized for storing and querying large graphs with billions of vertices and edges distributed across a multi-machine cluster. Titan is a transactional database that can support thousands of concurrent users. It supports various storage backends (Apache Cassandra, Apache HBase, Oracle BerkeleyDB) and various indexing backends (ElasticSearch, Apache Lucene).
Pegasus
PEGASUS is a Peta-scale graph mining system, fully written in Java. It runs in a parallel, distributed manner on top of Hadoop. Existing work on graph mining has limited scalability: usually, the maximum graph size is on the order of millions of nodes. PEGASUS breaks the limit by scaling up the algorithms to billion-scale graphs.
Statistics, text mining
Apache Mahout
The Apache Mahout™ machine learning library’s goal is to build scalable machine learning libraries.
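A common entry point is Mahout’s Taste collaborative-filtering API; a minimal sketch of a user-based recommender, assuming a hypothetical ratings.csv file of userID,itemID,preference lines:

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        // Each line of ratings.csv: userID,itemID,preference
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for user 42.
        List<RecommendedItem> items = recommender.recommend(42L, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```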
DataFu
DataFu is a collection of user-defined functions for working with large-scale data in Hadoop and Pig. This library was born out of the need for a stable, well-tested library of UDFs for data mining and statistics.
PredictionIO
PredictionIO is designed for developers to rapidly add predictive features, such as recommendation, personalization and user behavior prediction, into their applications. We built PredictionIO in Scala, with Play Framework as the REST API server backbone. It comes with Mahout algorithms as well as those written in Cascading and Scalding. Hadoop, MongoDB and Quartz make its infrastructure scalable.
Cloudera ML
Cloudera ML is a collection of Java libraries and commandline tools for performing certain data preparation and analysis tasks that are often referred to as “advanced analytics” or “machine learning.” Our focus is on simplicity, reliability, easy model interpretation, minimal parameter tuning, and integration with other tools for data preparation and analysis.
RHadoop
RHadoop is a collection of three R packages that allow users to manage and analyze data with Hadoop. The packages have been implemented and tested with Cloudera’s distribution of Hadoop (CDH3 and CDH4) and R 2.15.0. The packages have also been tested with Revolution R 4.3, 5.0, and 6.0.
Mavuno
Mavuno is an Open Source, modular, scalable text mining toolkit built upon Hadoop. It supports basic natural language processing tasks (e.g., part of speech tagging, chunking, parsing, named entity recognition), is capable of large-scale distributional similarity computations (e.g., synonym, paraphrase, and lexical variant mining), and has information extraction capabilities (e.g., instance and semantic relation mining). It can easily be adapted to new input formats and text mining tasks.
Bixo
Bixo is an Open Source web mining toolkit that runs as a series of Cascading pipes on top of Hadoop. By building a customized Cascading pipe assembly, you can quickly create specialized web mining applications that are optimized for a particular use case.
SnowPlow
SnowPlow is an Open Source web-scale analytics platform powered by Hadoop and Hive. It does three things: identifies users, and tracks the way they engage with a website or app; stores the associated behavioural data in a scalable “clickstream” data warehouse; makes it possible to leverage a Big Data toolset (e.g. Hive, Pig, Mahout) to analyse that event data.
Gertrude
Gertrude is a Java implementation of the overlapping experiments infrastructure used at Google and first described in Tang et al. (2010). It is designed to be powerful enough to support the types of experiments that Data Scientists, machine learning researchers, and software engineers need to run when developing data products (e.g., recommendation engines, search ranking algorithms, and large-scale classifiers), although it can also be used for testing new features and UI treatments.
Oryx
The Oryx Open Source project provides simple, real-time large-scale machine learning / predictive analytics infrastructure. It implements a few classes of algorithm commonly used in business applications: collaborative filtering / recommendation, classification / regression, and clustering. It can continuously build models from a stream of data at large scale using Apache Hadoop. It also serves queries of those models in real-time via an HTTP REST API, and can update models approximately in response to streaming new data. This two-tier design, composed of the Computation Layer and the Serving Layer respectively, implements a lambda architecture. Models are exchanged in PMML format.
Myrrix (discontinued)
Myrrix is a complete, real-time, scalable clustering and recommender system, evolved from Apache Mahout™. Just as we take for granted easy access to powerful, economical storage and computing today, Myrrix will let you take for granted easy access to large-scale “Big Learning” from data.
Compute engine, Search
Weave
Weave simplifies the writing of distributed applications on YARN and aims at making it as simple as running threads on the JVM.
Hama
Apache Hama is a pure BSP (Bulk Synchronous Parallel) computing framework on top of HDFS (Hadoop Distributed File System) for massive scientific computations such as matrix, graph and network algorithms.
Pangool
Pangool aims to simplify Hadoop development without losing the performance and flexibility that the low-level Hadoop API provides.
Kiji
The Kiji Project is a modular, Open-Source framework based on HBase that enables developers and analysts to collect, analyze and use data in real-time applications. Product and content recommendation systems, risk analysis monitoring, customer profile and segmentation applications, and energy usage analytics and reporting are examples of Kiji use cases.
Lily (defunct)
Lily is a data management platform based on HBase combining planet-sized data storage, indexing and search with on-line, real-time usage tracking, audience analytics and content recommendations.
Stream
Storm
Storm is a free and Open Source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
Not directly a Hadoop component, Storm is often used in conjunction with Hadoop and is more and more integrated with it. More specifically, it supports Hadoop-style security mechanisms and is being integrated into Hadoop YARN for resource management.
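The programming model is a topology of spouts (sources) and bolts (processors); a minimal sketch using the era’s backtype.storm Java API, run on an in-process LocalCluster for testing:

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class ExclaimTopology {
    // A bolt that appends "!" to each word flowing through the stream.
    public static class ExclaimBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            collector.emit(new Values(tuple.getString(0) + "!"));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new TestWordSpout(), 2); // test source of word tuples
        builder.setBolt("exclaim", new ExclaimBolt(), 4).shuffleGrouping("words");

        // In-process cluster; a real deployment would use StormSubmitter instead.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("demo", new Config(), builder.createTopology());
        Thread.sleep(10000);
        cluster.shutdown();
    }
}
```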
Summingbird
Summingbird is a library that lets you write MapReduce programs that look like native Scala or Java collection transformations and execute them on a number of well-known distributed MapReduce platforms, including Storm and Scalding. You can execute a Summingbird program in “batch mode” (using Scalding), in “realtime mode” (using Storm), or on both Scalding and Storm in a hybrid batch/realtime mode that offers your application very attractive fault-tolerance properties.
Samza
Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.
Jubatus
Not directly a Hadoop component, Jubatus is the first Open Source platform for online distributed machine learning on the data streams of Big Data. It uses a loose model sharing architecture for efficient training and sharing of machine learning models, by defining three fundamental operations, Update, Mix, and Analyze, in a way similar to the Map and Reduce operations in Hadoop.
Storage
HDFS
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data.
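Applications reach HDFS through the Hadoop FileSystem API; a minimal sketch, assuming a hypothetical NameNode URI:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS points at the NameNode; usually picked up from core-site.xml.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Write a file, then read it back.
        Path file = new Path("/tmp/hello.txt");
        FSDataOutputStream out = fs.create(file);
        out.writeUTF("Hello, HDFS");
        out.close();

        FSDataInputStream in = fs.open(file);
        System.out.println(in.readUTF());
        in.close();
    }
}
```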
Tachyon
Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks, such as Spark and MapReduce. It achieves high performance by leveraging lineage information and using memory aggressively. Tachyon caches working set files in memory, thereby avoiding going to disk to load datasets that are frequently read. This enables different jobs/queries and frameworks to access cached files at memory speed.
Databases
Apache HBase
Apache HBase is an Open-Source, distributed, versioned, column-oriented store modeled after Google’s Bigtable.
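The client API reflects the Bigtable model, addressing cells by row key, column family and qualifier; a minimal sketch using the era-appropriate 0.94-style client, with a hypothetical metrics table:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for the ZooKeeper quorum.
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "metrics");

        // Cells are addressed by (row key, column family, qualifier).
        Put put = new Put(Bytes.toBytes("web01"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("hits"), Bytes.toBytes(42L));
        table.put(put);

        Result result = table.get(new Get(Bytes.toBytes("web01")));
        long hits = Bytes.toLong(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("hits")));
        System.out.println("hits = " + hits);
        table.close();
    }
}
```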
ElephantDB
ElephantDB is a database that specializes in exporting key/value data from Hadoop. ElephantDB is composed of two components. The first is a library that is used in MapReduce jobs for creating an indexed key/value dataset that is stored on a distributed filesystem. The second component is a daemon that can download a subset of a dataset and serve it in a read-only, random-access fashion.
Apache Accumulo
The Apache Accumulo sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system. Apache Accumulo is based on Google’s BigTable design and is built on top of Apache Hadoop, Zookeeper, and Thrift. Apache Accumulo features a few novel improvements on the BigTable design in the form of cell-based access control and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.
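The cell-based access control shows up directly in the client API, where every mutation can carry a column visibility expression; a minimal sketch against a hypothetical events table and ZooKeeper quorum:

```java
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.security.ColumnVisibility;

public class AccumuloExample {
    public static void main(String[] args) throws Exception {
        // Connect through ZooKeeper to the named Accumulo instance.
        Connector conn = new ZooKeeperInstance("accumulo", "zk1:2181")
                .getConnector("alice", new PasswordToken("secret"));

        // Each cell carries a visibility expression that is evaluated
        // against the scanning user's authorizations.
        BatchWriter writer = conn.createBatchWriter("events", new BatchWriterConfig());
        Mutation m = new Mutation("event-001");
        m.put("meta", "owner", new ColumnVisibility("admin|analyst"), "alice");
        writer.addMutation(m);
        writer.close();
    }
}
```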
Tephra
Continuuity Tephra provides globally consistent transactions on top of Apache HBase.
Omid
The Omid project provides transactional support for key-value stores using Snapshot Isolation. Omid stands for Optimistically transactional Management in Datastores. At this stage of the project, HBase is the only supported data-store.
Blur
Blur combines the speed of document-oriented databases with the ability to build rich data models to deliver lightning-fast answers to complex queries. It leverages multiple Open Source projects including Hadoop, HDFS, Lucene, Thrift and ZooKeeper to create an environment where structured data can be transformed into a sharded index that runs on a distributed (cloud) computing environment.
Gora
The Apache Gora Open Source framework provides an in-memory data model and persistence for Big Data. Gora supports persisting to column stores, key value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support.
Culvert
Culvert is a secondary indexing platform for BigTable, which means it provides everything you need to write indexes and use them to access and, well, index your data. Currently Culvert supports HBase and Accumulo, though new adapters are in the pipeline.
Intel Panthera ASE (edit: defunct)
Panthera ASE, a new analytical SQL engine for Hadoop, provides full SQL support for OLAP applications on Hadoop. It enables complex and sophisticated SQL queries on Hadoop through a series of advanced transformations and optimizations (e.g., subquery unnesting), and currently uses Hive as its execution backend (with necessary enhancements to the underlying stack being added and contributed back to the existing Apache projects).
Intel Panthera DOT
Panthera DOT, a document-oriented table on HBase, provides an efficient storage engine for relational SQL queries with high update rates by leveraging the SQL relational model to provide document semantics, and is currently implemented as an HBase co-processor application.
OpenTSDB
OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of HBase which helps to store, index and serve metrics collected from computer systems at a large scale. Like Ganglia, OpenTSDB can be used to monitor various systems including HBase.
Kylin
Apache Kylin is an Open Source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
MapReduce libraries, Hive UDFs, drivers
GIS Tools for Hadoop
GIS Tools for Hadoop is an Open Source toolkit intended for Big Spatial Data Analytics. The toolkit provides different libraries:
- Esri Geometry API for Java: A generic geometry library, can be used to extend Hadoop core with vector geometry types and operations, and enables developers to build MapReduce applications for spatial data.
- Spatial Framework for Hadoop: Extends Hive and is based on the Esri Geometry API, to enable Hive Query Language users to leverage a set of analytical functions and geometry types, in addition to some utilities for JSON used in ArcGIS.
- Geoprocessing Tools for Hadoop: Contains a set of ready-to-use ArcGIS Geoprocessing tools, based on the Esri Geometry API and the Spatial Framework for Hadoop. Developers can download the source code of the tools and customize it; they can also create new tools and contribute them to the Open Source project. Through these tools ArcGIS users can move their spatial data and execute a pre-defined workflow inside Hadoop.
Hive map udaf
The project contains two Hive UDAFs which convert an aggregation into a map and into an ordered map.
Hivemall
Hivemall is a scalable machine learning library that runs on Apache Hive, licensed under the LGPL 2.1. Hivemall is designed to be scalable to the number of training instances as well as the number of training features.
Mongo-Hadoop Adapter
The mongo-hadoop adapter is a library which allows MongoDB (or backup files in its data format, BSON) to be used as an input source, or output destination, for Hadoop MapReduce tasks. It is designed to allow greater flexibility and performance and make it easy to integrate data in MongoDB with other parts of the Hadoop ecosystem.
Quest Sqoop plugin
Quest® Data Connector for Oracle and Hadoop is an optional plugin to Sqoop. It facilitates the movement of data between Oracle and Hadoop. Quest® Data Transporter for Hive is distributed with Quest® Data Connector for Oracle and Hadoop. Quest® Data Transporter for Hive is a Java command-line utility that allows you to execute a Hive query and insert the results into an Oracle table.
Hourglass
Hourglass is designed to make computations over sliding windows more efficient. For these types of computations, the input data is partitioned in some way, usually according to time, and the range of input data to process is adjusted as new data arrives. Hourglass works with input data that is partitioned by day, as this is a common scheme for partitioning temporal data. Two types of sliding window computations are extremely common in practice:
- Fixed-length: the length of the window is set to some constant number of days and the entire window moves forward as new data becomes available. Example: a daily report summarizing the number of visitors to a site from the past 30 days.
- Fixed-start: the beginning of the window stays constant, but the end slides forward as new input data becomes available. Example: a daily report summarizing all visitors to a site since the site launched.
File Formats
Trevni
Trevni defines a column-major file format for datasets, achieving higher query performance. To permit scalable, distributed query evaluation, datasets are partitioned into row groups, containing distinct collections of rows. Each row group is organized in column-major order, while row groups form a row-major partitioning of the entire dataset.
Parquet
Parquet is created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. Parquet is built from the ground up with complex nested data structures in mind, and uses the repetition/definition level approach to encoding such data structures, as popularized by Google Dremel. We believe this approach is superior to simple flattening of nested name spaces.
Hive ORC file
The ORC file format is a new columnar file format introduced by Hortonworks as part of the Stinger Initiative.
forqlift
forqlift helps you manage Hadoop SequenceFiles. forqlift’s primary goals are as follows: create a SequenceFile from files on your local disk; extract data from a SequenceFile back to local disk; list the contents of SequenceFiles.
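For context, this is roughly what forqlift automates: a SequenceFile is a container of key/value records, often used to pack many small files into one. A minimal sketch writing one with the classic Hadoop API, using hypothetical file names:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Pack small files into one SequenceFile: key = file name, value = bytes.
        Path path = new Path("archive.seq");
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, path, Text.class, BytesWritable.class);
        byte[] contents = "hello".getBytes("UTF-8");
        writer.append(new Text("file-001.txt"), new BytesWritable(contents));
        writer.close();
    }
}
```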
Publish/Subscribe System
BookKeeper
The Apache BookKeeper subproject of ZooKeeper is made up of a distributed logging service called BookKeeper and a distributed publish/subscribe system built on top of BookKeeper called Hedwig.
Apache Kafka
Apache Kafka is a distributed publish-subscribe messaging system. It is designed to support the following: persistent messaging with O(1) disk structures that provide constant-time performance even with many TB of stored messages; high throughput, such that even with very modest hardware Kafka can support hundreds of thousands of messages per second; explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics; support for parallel data load into Hadoop.
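Publishing a message looks like this in the 0.8-era Java producer API; a minimal sketch, with hypothetical broker list, topic and payload:

```java
import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class KafkaProducerExample {
    public static void main(String[] args) {
        // Bootstrap the producer with a list of brokers and a serializer.
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092,broker2:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");

        Producer<String, String> producer = new Producer<String, String>(new ProducerConfig(props));
        // The key ("user42") determines the partition, preserving per-key ordering.
        producer.send(new KeyedMessage<String, String>("page_views", "user42", "{\"url\":\"/home\"}"));
        producer.close();
    }
}
```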
Engines
Tez
Tez is an effort to develop a generic application framework which can be used to process arbitrarily complex data-processing tasks and also a re-usable set of data-processing primitives which can be used by other projects.
Cubert (edit: defunct)
Cubert is a fast and efficient batch computation engine for complex analysis and reporting of massive datasets on Hadoop. Cubert is ideally suited for the following application domains: Statistical Calculations, Joins and Aggregations; Cubes and Grouping Set Aggregations; Time range calculation and Incremental computations; Graph computations; When performance or resources are a matter of concern.
Describe, Develop
Elephant Bird
Elephant Bird is Twitter’s Open Source library of LZO, Thrift, and/or Protocol Buffer-related Hadoop InputFormats, OutputFormats, Writables, Pig LoadFuncs, Hive SerDe, HBase miscellanea, etc. The majority of these are in production at Twitter running over data every day.
Twill
Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic. Apache Twill allows you to use YARN’s distributed capabilities with a programming model that is similar to running threads. Many distributed applications have common needs such as application lifecycle management, service discovery, distributed process coordination and resiliency to failure, which are not out of the box features from YARN.
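The thread-like programming model looks roughly like this; a sketch assuming an early Twill release with a Guava-style service lifecycle and a hypothetical ZooKeeper connect string:

```java
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.twill.api.AbstractTwillRunnable;
import org.apache.twill.api.TwillController;
import org.apache.twill.api.TwillRunnerService;
import org.apache.twill.yarn.YarnTwillRunnerService;

public class HelloTwill {
    // Like a Runnable, but executed inside a YARN container.
    public static class EchoRunnable extends AbstractTwillRunnable {
        @Override
        public void run() {
            System.out.println("Hello from a YARN container");
        }
    }

    public static void main(String[] args) {
        TwillRunnerService runner =
                new YarnTwillRunnerService(new YarnConfiguration(), "zk1:2181");
        runner.startAndWait(); // early Twill exposed a Guava Service lifecycle
        TwillController controller = runner.prepare(new EchoRunnable()).start();
        System.out.println("Started run " + controller.getRunId());
    }
}
```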
Flink
Flink features powerful programming abstractions in Java and Scala, a high-performance runtime, and automatic program optimization. It has native support for iterations, incremental iterations, and programs consisting of large DAGs of operations. Flink runs independently from Hadoop, but integrates seamlessly with YARN (Hadoop’s next-generation scheduler). Various file systems (including the Hadoop Distributed File System) can act as data sources.
Apache Hadoop Development Tools (HDT)
The Hadoop Development Tools (HDT) is a set of plugins for the Eclipse IDE for developing against the Hadoop platform.
BIRT
From the Eclipse Foundation, BIRT is an Open Source Eclipse-based reporting system that integrates with your Java/Java EE application to produce compelling reports. Starting with the Kepler version of Eclipse, BIRT provides support to query HDFS, Hive, MongoDB and Cassandra.
Karmasphere
Natively designed for Hadoop, Karmasphere is a unified workspace for full-fidelity Big Data Analytics that provides access to all the data in its original form to preserve richness and flexibility. An open, standards-based solution for teams of data and business analysts who need to quickly and easily ingest, analyze and visualize Big Data on Hadoop, Karmasphere does not require users to be retrained in exotic technologies. Exploratory questions can be asked using familiar SQL. Data can also be visually explored using a drag and drop interface. Industry standard pre-packaged analytics (UDFs) can be used as building blocks to speed time to insight. Unlike other solutions, you do not need to prepare the data or know the answer before asking the questions.
Twitter’s hRaven
hRaven collects run time data and statistics from map reduce jobs running on Hadoop clusters and stores the collected job history in an easily queryable format. For the jobs that are run through frameworks (Pig or Scalding/Cascading) that decompose a script or application into a DAG of map reduce jobs for actual execution, hRaven groups job history data together by an application construct. This allows for easier visualization of all of the component jobs’ execution for an application and more comprehensive trending and analysis over time.
Analytics
Pulsar
Pulsar is an Open-Source, real-time analytics platform and stream processing framework. Pulsar can be used to collect and process user and business events in real time, providing key insights and enabling systems to react to user activities within seconds. In addition to real-time sessionization and multi-dimensional metrics aggregation, Pulsar uses a SQL-like event processing language to offer custom stream creation through data enrichment, mutation, and filtering.
Knime
Build predictive analytics applications without coding? Yes you can, thanks to this handy toolkit from Swiss analytics vendor KNIME. KNIME’s Eclipse-based visual workbench lets you drag and configure nodes on a canvas, then wire them together into workflows. KNIME supplies thousands of nodes to handle everything from data source connectivity (databases, Hive, Excel, flat files) to SQL queries, ETL transforms, data mining (supporting Weka and R), and visualization.
Job scheduling
Mesos
Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark (a new framework for low-latency interactive and iterative jobs), and other applications.
Corona
Hadoop Corona is the next version of Map-Reduce from Facebook. The current Map-Reduce has a single Job Tracker that reached its limits at Facebook. The Job Tracker manages the cluster resource and tracks the state of each job. In Hadoop Corona, the cluster resources are tracked by a central Cluster Manager.
Deploy, Administer
Hue
Hue is an Open Source web-based application for making it easier to use Apache Hadoop.
Hue features a file browser for HDFS, an Oozie application for creating workflows and coordinators, a job designer/browser for MapReduce, Hive and Cloudera Impala query editors, a shell, and a collection of Hadoop APIs.
Ambari
Apache Ambari is a tool for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari consists of a set of RESTful APIs and browser-based management console UI.
Ryba
Ryba is a tool to deploy and manage Hadoop clusters, focusing on security and multi-tenancy. It uses Masson, an automation tool which runs the instructions to install and maintain your systems. It is written for Node.js and leverages the SSH protocol to distribute the deployment of your cluster.
SmartFrog
SmartFrog is a powerful and flexible Java-based software framework for configuring, deploying and managing distributed software systems. SmartFrog helps you to encapsulate and manage systems so they are easy to configure and reconfigure, and so that they can be automatically installed, started and shut down. It provides orchestration capabilities so that subsystems can be started (and stopped) in the right order. It also helps you to detect and recover from failures.
Oink
Oink provides a REST-based interface for Pig execution. Oink is Pig on Servlet, which provides the following functionalities: register/unregister/view a Pig script, register/unregister/view a UDF, execute a Pig job, view the status/stats of a Pig job, cancel a Pig job.
Hannibal
Hannibal is a tool to help monitor and maintain HBase-Clusters that are configured for manual splitting. While HBase provides metrics to monitor overall cluster health via JMX or Ganglia, it lacks the ability to monitor single regions in an easy way. This information is essential when your cluster is configured for manual splits, especially when the data growth is not uniform.
Genie
Genie provides job and resource management for the Hadoop ecosystem in the cloud. From the perspective of the end-user, Genie abstracts away the physical details of various (potentially transient) Hadoop resources in the cloud, and provides a REST-ful Execution Service to submit and monitor Hadoop, Hive and Pig jobs without having to install any Hadoop clients. And from the perspective of a Hadoop administrator, Genie provides a set of Configuration Services, which serve as a registry for clusters, and their associated Hive and Pig configurations.
Serengeti
Serengeti is an Open Source project initiated by VMware to enable the rapid deployment of an Apache Hadoop cluster (HDFS, MapReduce, Pig, Hive, etc.) on a virtual platform.
Apache Whirr
Apache Whirr is a set of libraries for running cloud services. Whirr provides: a cloud-neutral way to run services; a common service API; smart defaults for services.
Knox
Developed jointly by Hortonworks and Microsoft, the Knox Gateway (“Gateway” or “Knox”) is a system that provides a single point of authentication and access for Apache Hadoop services in a cluster. The goal is to simplify Hadoop security for both users (i.e. who access the cluster data and execute jobs) and operators (i.e. who control access and manage the cluster). The Gateway runs as a server (or cluster of servers) that serves one or more Hadoop clusters.
Rhino
As Hadoop extends into new markets and sees new use cases with security and compliance challenges, the benefits of processing sensitive and legally protected data with all Hadoop projects and HBase must be coupled with protection for private information that limits performance impact. Project Rhino is an Intel Open Source effort to enhance the existing data protection capabilities of the Hadoop ecosystem to address these challenges, and contribute the code back to Apache.
Cloudera Sentry
Sentry is an authorization module for Hadoop that provides the granular, role-based authorization required to provide precise levels of access to the right users and applications. Its new support for role-based authorization, fine-grained authorization, and multi-tenant administration allows Hadoop operators to: Store more sensitive data in Hadoop, give more end-users access to that data in Hadoop, create new use cases for Hadoop, enable multi-user applications, and comply with regulations (e.g., SOX, PCI, HIPAA, EAL3).
Pallet
Pallet is a Hadoop cluster management tool with Intelligent Defaults.
Helix
Helix is a generic cluster management framework used for the automatic management of partitioned, replicated and distributed resources hosted on a cluster of nodes.
Sahara
The Sahara project provides a simple means to provision a Hadoop cluster on top of OpenStack. It is the former Savanna project, renamed due to potential trademark issues.
Exhibitor
Exhibitor is a supervisor system for ZooKeeper.
osquery
With osquery, you can use SQL to query low-level operating system information. Under the hood, instead of querying static tables, these queries dynamically execute high-performance native code. The results of the SQL query are transparently returned to you quickly and easily.
Monitor, Benchmark
Ganglia
Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters.
Nagios
Nagios is a powerful monitoring system that enables organizations to identify and resolve IT infrastructure problems before they affect critical business processes.
White Elephant
White Elephant is a Hadoop log aggregator and dashboard which enables visualization of Hadoop cluster utilization across users.
Timberlake
Timberlake is a Job Tracker for Hadoop. It improves on existing Hadoop job trackers by providing a lightweight realtime view of your running and finished MapReduce jobs. Timberlake exposes the counters and configuration that are the most useful, allowing you to get a quick overview of the whole cluster or dig into the performance and behavior of a single job.
HiTune
HiTune is a Hadoop performance analyzer. It consists of three major components: Tracker - lightweight agents running on each node in the Hadoop cluster to collect runtime information, including sysstat, Hadoop metrics and job history, and Java instrumentations; Aggregation Engine - distributed framework that merges the results of the trackers, which is currently implemented using Chukwa; Analysis Engine - the program that conducts the performance analysis and generates the analysis report, which is implemented as a series of Hadoop jobs.
HiBench
The Hadoop Benchmark Suite (HiBench) contains 10 typical Hadoop workloads (including micro benchmarks, HDFS benchmarks, web search benchmarks, machine learning benchmarks, and data analytics benchmarks). This benchmark suite also has options for users to enable input/output compression for most workloads with default compression codec (zlib).
HDFS-DU
HDFS-DU is an interactive visualization of the Hadoop distributed file system. The project aims to monitor different snapshots for the entire HDFS system in an interactive way, showing the size of the folders, the rate at which the size increases / decreases, and to highlight inefficient file storage.
Book and documentation
Data-Intensive Text Processing with MapReduce
This book focuses on MapReduce algorithm design, with an emphasis on text processing algorithms common in natural language processing, information retrieval, and machine learning. We introduce the notion of MapReduce design patterns, which represent general reusable solutions to commonly occurring problems across a variety of problem domains. This book not only intends to help the reader “think in MapReduce”, but also discusses limitations of the programming model as well.