Hadoop is already a large ecosystem, and my guess is that 2013 will be the year it grows even larger. Some pieces no longer need an introduction: ZooKeeper, HBase, Hive, Pig, Flume, Oozie, Avro, Sqoop, Cascading, Cascalog, HUE, and Mahout, to name a few. At the same time, many open source projects are appearing on GitHub and elsewhere, or being introduced to the Apache Incubator.
- Ingest/Collect: Flume, Scribe, Sqoop, HIHO, Chukwa
- Workflow, Coordination, ETL: Oozie, Cascading, Crunch, Falcon, Morphlines
- Map/Reduce Processing: Pig, Lipstick, Cascalog, Scoobi, Scalding
- SQL querying: Hive, Tajo, Impala, Lingual, Splout, Phoenix, Drill, Shark, MRQL
- Graph processing: Giraph, Faunus, Titan, Pegasus
- Statistics, text mining: Mahout, PredictionIO, Cloudera ML, RHadoop, Mavuno, Bixo, SnowPlow
- Compute engine, Search: Weave, Hama, Pangool, Kiji, Lily
- Stream: Storm, Summingbird, Samza, Jubatus
- Databases: HBase, ElephantDB, Accumulo, Omid, Blur, Gora, Culvert, Panthera ASE, Panthera DOT, OpenTSDB
- MapReduce libraries, Hive UDFs, drivers: GIS Tools for Hadoop, Hive map udaf, Mongo-Hadoop Adapter, Quest Sqoop plugin
- File Formats: Trevni, Parquet, Hive ORC file
- Publish/Subscribe System: BookKeeper, Kafka
- Describe, Develop: DataFu, Elephant Bird, Tez, HCatalog, HDT, BIRT, Karmasphere, hRaven
- Job scheduling: Mesos, Corona
- Deploy, Monitor, Administer: Hue, Ganglia, Ambari, Hannibal, Genie, Serengeti, Whirr, Knox, Rhino, Sentry, Pallet, Helix, White Elephant, HiTune
- Book and documentation: Data-Intensive Text Processing with MapReduce
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
Scribe is a server for aggregating log data that’s streamed in real time from clients. It is designed to be scalable and reliable.
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
HIHO, for Hadoop In, Hadoop Out, is a data integration tool connecting Hadoop with various databases, FTP servers, and Salesforce. It lets you incrementally update, dedup, append, and merge your data on Hadoop.
Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.
Workflow, Coordination, ETL
Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.
Cascading is a data processing API and processing query planner used for defining, sharing, and executing robust Data Analytics and Data Management applications on Apache Hadoop. Cascading adds an abstraction layer over the Hadoop API, greatly simplifying Hadoop application development, job creation, and job scheduling.
The Apache Crunch (incubating) Java library provides a framework for writing, testing, and running MapReduce pipelines, and is based on Google’s FlumeJava library. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
Falcon is a data processing and management solution for Hadoop designed for data motion, coordination of data pipelines, lifecycle management, and data discovery. Falcon enables end consumers to quickly onboard their data and its associated processing and management tasks on Hadoop clusters.
For now distributed as part of the Cloudera Development Kit, Morphlines is a new open source framework that reduces the time and skills necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards.
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
Lipstick combines a graphical depiction of a Pig workflow with information about the job as it executes.
Cascalog is a fully-featured data processing and querying library for Clojure or Java. The main use cases for Cascalog are processing “Big Data” on top of Hadoop or doing analysis on your local computer. Cascalog is a replacement for tools like Pig, Hive, and Cascading and operates at a significantly higher level of abstraction than those tools. Cascalog queries are subject to substantial parallelization, which in turn enables them to handle very large data sets.
Scoobi focuses on making you more productive at building Hadoop applications. Scoobi is a library that leverages the Scala programming language to provide a programmer friendly abstraction around Hadoop’s MapReduce to facilitate rapid development of analytics and machine-learning algorithms.
Scalding is a Scala library that makes it easy to specify Hadoop MapReduce jobs. Scalding is built on top of Cascading, a Java library that abstracts away low-level Hadoop details. Scalding is comparable to Pig, but offers tight integration with Scala, bringing the advantages of Scala to your MapReduce jobs.
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
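To make this concrete, here is a hedged HiveQL sketch; the table, columns, and paths are hypothetical, not from any real deployment:

```sql
-- Hypothetical table: raw web logs stored as tab-delimited text files in HDFS.
CREATE EXTERNAL TABLE logs (ts STRING, user_id STRING, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs';

-- Ad-hoc summarization; Hive compiles this into one or more MapReduce jobs.
SELECT url, COUNT(*) AS hits
FROM logs
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```

The `EXTERNAL` keyword is what "projects structure onto" data that already lives in HDFS: dropping the table removes only the metadata, not the files.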
Tajo is a relational and distributed data warehouse system for Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation and ETL on large data sets by leveraging advanced database techniques. It supports SQL standards. Tajo uses HDFS as a primary storage layer and has its own query engine which allows direct control of distributed execution and data flow. As a result, Tajo has a variety of query evaluation strategies and more optimization opportunities. In addition, Tajo will have native columnar execution and its own optimizer.
Cloudera Impala is a distributed query execution engine that runs against data stored natively in Apache HDFS and Apache HBase.
Lingual executes ANSI SQL queries as Cascading applications on Apache Hadoop clusters.
Phoenix is a SQL layer over HBase delivered as an embedded JDBC driver targeting low latency queries over HBase data. Tables are created and updated through DDL statements and stored and versioned on the server in an HBase table. Columns are defined as either being part of a multi-part row key or as key value cells. The Phoenix query engine transforms your SQL query into one or more HBase scans and orchestrates their execution to produce standard JDBC result sets.
Apache Drill is a distributed system for interactive analysis of large-scale datasets, based on Google’s Dremel. Its goal is to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.
Shark is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can answer Hive QL queries up to 100 times faster than Hive without modification to the existing data or queries. Shark supports Hive’s query language, metastore, serialization formats, and user-defined functions.
Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.
MRQL (pronounced miracle) is a query processing and optimization system for large-scale, distributed data analysis. MRQL (the MapReduce Query Language) is an SQL-like query language for large-scale data analysis on a cluster of computers. The MRQL query processing system can evaluate MRQL queries in two modes: in MapReduce mode on top of Apache Hadoop or in Bulk Synchronous Parallel (BSP) mode on top of Apache Hama. The MRQL query language is powerful enough to express most common data analysis tasks over many forms of raw in-situ data, such as XML and JSON documents, binary files, and CSV documents. MRQL is more powerful than other current high-level MapReduce languages, such as Hive and PigLatin, since it can operate on more complex data and supports more powerful query constructs, thus eliminating the need for using explicit MapReduce code. With MRQL, users will be able to express complex data analysis tasks, such as PageRank, k-means clustering, matrix factorization, etc, using SQL-like queries exclusively, while the MRQL query processing system will be able to compile these queries to efficient Java code.
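As a rough illustration of the map/reduce semantics that languages like MRQL, Hive, and Pig Latin ultimately compile down to, here is a toy word count in plain Python. The function names are invented for illustration, and everything runs in one process; real Hadoop shuffles data across machines.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, mapper):
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def reduce_phase(pairs, reducer):
    """Group pairs by key (the 'shuffle' step) and reduce each group."""
    shuffled = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(shuffled, key=itemgetter(0)):
        yield reducer(key, [v for _, v in group])

# Word count: the canonical MapReduce example.
def tokenize(line):
    for word in line.split():
        yield (word.lower(), 1)

def total(word, counts):
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog"]
counts = dict(reduce_phase(map_phase(lines, tokenize), total))
# counts["the"] == 2
```

The appeal of the higher-level languages is precisely that users write the query and never this plumbing.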
[Apache Giraph] implements a graph-processing framework that is launched as a typical Hadoop job to leverage existing Hadoop infrastructure, such as Amazon’s EC2. Giraph builds upon the graph-oriented nature of Pregel but additionally adds fault tolerance to the coordinator process with the use of ZooKeeper as its centralized coordination service.
Faunus is a property graph analytics engine based on Hadoop. A breadth-first version of the graph traversal language Gremlin operates on a vertex-centric property graph data structure. Faunus can be extended with new operations written using MapReduce and Blueprints.
Titan is a highly scalable graph database optimized for storing and querying large graphs with billions of vertices and edges distributed across a multi-machine cluster. Titan is a transactional database that can support thousands of concurrent users. It supports various storage backends (Apache Cassandra, Apache HBase, Oracle BerkeleyDB) and various indexing backends (ElasticSearch, Apache Lucene).
PEGASUS is a Peta-scale graph mining system, fully written in Java. It runs in a parallel, distributed manner on top of Hadoop. Existing work on graph mining has limited scalability: usually, the maximum graph size is on the order of millions. PEGASUS breaks this limit by scaling up the algorithms to billion-scale graphs.
Statistics, text mining
The Apache Mahout™ machine learning library’s goal is to build scalable machine learning libraries.
PredictionIO is designed for developers to rapidly add predictive features, such as recommendation, personalization and user behavior prediction, into their applications. We built PredictionIO in Scala, with Play Framework as the REST API server backbone. It comes with Mahout algorithms as well as those written in Cascading and Scalding. Hadoop, MongoDB and Quartz make its infrastructure scalable.
Cloudera ML is a collection of Java libraries and command-line tools for performing certain data preparation and analysis tasks that are often referred to as “advanced analytics” or “machine learning.” Our focus is on simplicity, reliability, easy model interpretation, minimal parameter tuning, and integration with other tools for data preparation and analysis.
RHadoop is a collection of three R packages that allow users to manage and analyze data with Hadoop. The packages have been implemented and tested on Cloudera’s distributions of Hadoop (CDH3 and CDH4) and R 2.15.0. The packages have also been tested with Revolution R 4.3, 5.0, and 6.0.
Mavuno is an open source, modular, scalable text mining toolkit built upon Hadoop. It supports basic natural language processing tasks (e.g., part of speech tagging, chunking, parsing, named entity recognition), is capable of large-scale distributional similarity computations (e.g., synonym, paraphrase, and lexical variant mining), and has information extraction capabilities (e.g., instance and semantic relation mining). It can easily be adapted to new input formats and text mining tasks.
Bixo is an open source web mining toolkit that runs as a series of Cascading pipes on top of Hadoop. By building a customized Cascading pipe assembly, you can quickly create specialized web mining applications that are optimized for a particular use case.
SnowPlow is an open-source, web-scale analytics platform powered by Hadoop and Hive. It does three things: identifies users, and tracks the way they engage with a website or app; stores the associated behavioural data in a scalable “clickstream” data warehouse; makes it possible to leverage a big data toolset (e.g. Hive, Pig, Mahout) to analyse that event data.
Compute engine, Search
Weave simplifies the writing of distributed applications on YARN and aims to make it as simple as running threads in a JVM.
Apache Hama is a pure BSP (Bulk Synchronous Parallel) computing framework on top of HDFS (Hadoop Distributed File System) for massive scientific computations such as matrix, graph and network algorithms.
The Kiji Project is a modular, open-source framework based on HBase that enables developers and analysts to collect, analyze and use data in real-time applications. Product and content recommendation systems, risk analysis monitoring, customer profile and segmentation applications, and energy usage analytics and reporting are examples of Kiji use cases.
Lily is a data management platform based on HBase combining planet-sized data storage, indexing and search with on-line, real-time usage tracking, audience analytics and content recommendations.
Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
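To illustrate the tuple-at-a-time processing model that distinguishes Storm from batch MapReduce, here is a toy Python analogue of a Storm bolt. The class and method names are illustrative only; Storm’s actual API is Java/Clojure, with emit and ack semantics this sketch omits.

```python
from collections import defaultdict

class WordCountBolt:
    """Toy model of a Storm bolt: processes one tuple at a time and
    keeps running state, instead of batching like MapReduce."""
    def __init__(self):
        self.counts = defaultdict(int)

    def execute(self, tup):
        word = tup["word"]
        self.counts[word] += 1
        # A real bolt would emit downstream and ack the tuple here.
        return (word, self.counts[word])

bolt = WordCountBolt()
stream = [{"word": w} for w in ["hadoop", "storm", "hadoop"]]
emitted = [bolt.execute(t) for t in stream]
# emitted == [("hadoop", 1), ("storm", 1), ("hadoop", 2)]
```

The key difference from batch processing: results are updated and emitted continuously as tuples arrive, rather than once at the end of a job.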
Not directly a Hadoop component, Storm is often used alongside Hadoop and is increasingly integrated with it. More specifically, it supports Hadoop-style security mechanisms and is being integrated with Hadoop YARN for resource management. See this Yahoo! post for more details.
Summingbird is a library that lets you write MapReduce programs that look like native Scala or Java collection transformations and execute them on a number of well-known distributed MapReduce platforms, including Storm and Scalding. You can execute a Summingbird program in “batch mode” (using Scalding), in “realtime mode” (using Storm), or on both Scalding and Storm in a hybrid batch/realtime mode that offers your application very attractive fault-tolerance properties.
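The hybrid batch/realtime mode works because Summingbird restricts aggregations to monoids: batch and realtime partial results can be merged associatively. A minimal Python sketch, with invented numbers, of that merge step:

```python
from collections import Counter

# Batch (Scalding) and realtime (Storm) layers each produce partial sums;
# because counting is a monoid, the partials merge associatively.
batch_counts = Counter({"hadoop": 100, "storm": 40})  # from the last batch run
realtime_counts = Counter({"hadoop": 3, "kafka": 1})  # tuples seen since then

merged = batch_counts + realtime_counts
# merged["hadoop"] == 103: batch history plus the realtime delta
```

If the realtime layer ever drifts (dropped or duplicated tuples), the next batch run recomputes from raw data and the error is bounded to the realtime window, which is the fault-tolerance property the text refers to.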
Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.
(not directly a Hadoop component) Jubatus is the first open source platform for online distributed machine learning on the data streams of Big Data. It uses a loose model sharing architecture for efficient training and sharing of machine learning models, by defining three fundamental operations: Update, Mix, and Analyze, in a manner similar to the Map and Reduce operations in Hadoop.
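A simplified sketch of the Mix idea, assuming linear models represented as weight dictionaries; this glosses over Jubatus’s actual protocol and is purely illustrative:

```python
def mix(models):
    """Average the weight vectors of locally trained linear models --
    a much-simplified version of Jubatus's Mix operation, which merges
    models trained independently on different nodes."""
    keys = set().union(*models)
    n = len(models)
    return {k: sum(m.get(k, 0.0) for m in models) / n for k in keys}

# Two nodes train on their own slice of the stream (Update),
# then periodically exchange and average their models (Mix).
node_a = {"w0": 0.2, "w1": 0.8}
node_b = {"w0": 0.4, "w1": 0.0}
shared = mix([node_a, node_b])
# shared == {"w0": 0.3, "w1": 0.4}
```

The loose coupling is the point: nodes never share training data, only compact model summaries, so mixing is cheap relative to the stream volume.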
Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google’s Bigtable.
ElephantDB is a database that specializes in exporting key/value data from Hadoop. ElephantDB is composed of two components. The first is a library that is used in MapReduce jobs for creating an indexed key/value dataset that is stored on a distributed filesystem. The second component is a daemon that can download a subset of a dataset and serve it in a read-only, random-access fashion.
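The split between the index-building MapReduce job and the read-only serving daemon works because both sides agree on key placement. A minimal Python sketch of such deterministic sharding; this is not ElephantDB’s actual hash function, just the idea:

```python
import hashlib

def shard_for(key, num_shards):
    """Deterministically assign a key to a shard, so a MapReduce job can
    build each shard's index independently and a serving daemon knows
    which shard file holds a given key. (Illustrative hash choice.)"""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Writer (batch job) and reader (daemon) agree on placement of each key.
writer_side = shard_for("user:42", 8)
reader_side = shard_for("user:42", 8)
```

Because shards are immutable once written, serving is simple random-access reads with no write path to coordinate.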
The Apache Accumulo sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system. Apache Accumulo is based on Google’s BigTable design and is built on top of Apache Hadoop, Zookeeper, and Thrift. Apache Accumulo features a few novel improvements on the BigTable design in the form of cell-based access control and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.
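A toy Python model of the cell-based access control mentioned above. Accumulo’s real visibility labels are boolean expressions over a user’s authorizations; this sketch only handles the simple case of a conjunction of required labels:

```python
def visible(cell_visibility, user_auths):
    """Simplified cell-level access check: the cell carries a set of
    required labels, and the user must hold all of them. Accumulo's real
    visibility expressions also support & and | operators."""
    return cell_visibility <= user_auths

# Each cell carries its own visibility, independent of its row or column.
cells = [
    ("alice", "balance", {"bank"},        "100"),
    ("alice", "ssn",     {"bank", "pii"}, "xxx-xx-1234"),
]
auths = {"bank"}  # this reader lacks the "pii" authorization
readable = [val for _, col, vis, val in cells if visible(vis, auths)]
# readable == ["100"]: the ssn cell is filtered out server-side
```

The notable design point is that filtering happens in the tablet server during the scan, so unauthorized cells never reach the client.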
The Omid project provides transactional support for key-value stores using Snapshot Isolation. Omid stands for Optimistically transactional Management in Datastores. At this stage of the project, HBase is the only supported data-store.
Blur combines the speed of document-oriented databases with the ability to build rich data models to deliver lightning-fast answers to complex queries. It leverages multiple open source projects including Hadoop, HDFS, Lucene, Thrift and Zookeeper to create an environment where structured data can be transformed into a sharded index that runs on a distributed (cloud) computing environment.
The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support.
Culvert is a secondary indexing platform for BigTable-style stores: it provides everything you need to write indexes and use them to access your data. Currently Culvert supports HBase and Accumulo, though new adapters are in the pipeline.
Panthera ASE, a new analytical SQL engine for Hadoop, provides full SQL support for OLAP applications on Hadoop. It enables complex and sophisticated SQL queries on Hadoop through a series of advanced transformations and optimizations (e.g., subquery unnesting), and currently uses Hive as its execution backend (with necessary enhancements to the underlying stack being added and contributed back to the existing Apache projects).
Panthera DOT, a document-oriented table on HBase, provides an efficient storage engine for relational SQL queries with high update rates by leveraging the relational SQL model to provide document semantics, and is currently implemented as an HBase coprocessor application.
OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of HBase which helps to store, index and serve metrics collected from computer systems at a large scale. Like Ganglia, OpenTSDB can be used to monitor various systems including HBase.
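The core OpenTSDB idea is a row-key layout that clusters one metric’s data points by time, so range scans over HBase are contiguous. A much-simplified Python sketch; real OpenTSDB keys use fixed-width binary UIDs rather than strings:

```python
def row_key(metric, timestamp, tags):
    """Sketch of OpenTSDB's row-key layout: metric + hour-aligned base
    time + sorted tags, so all points for one metric/hour sort together
    in HBase. (Illustrative; real keys are fixed-width binary IDs.)"""
    base_time = timestamp - (timestamp % 3600)  # align to the hour
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{metric}:{base_time}:{tag_part}"

# Two samples 9 minutes apart fall into the same hourly row,
# with the per-second offset stored in the column qualifier instead.
k1 = row_key("sys.cpu.user", 1356998460, {"host": "web01"})
k2 = row_key("sys.cpu.user", 1356999000, {"host": "web01"})
# k1 == k2: both points share one HBase row
```

Packing an hour of points into one row is what keeps OpenTSDB’s storage compact and its time-range scans fast at large scale.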
MapReduce libraries, Hive UDFs, drivers
GIS Tools for Hadoop is an open source toolkit intended for Big Spatial Data Analytics. The toolkit provides different libraries:
- Esri Geometry API for Java: a generic geometry library that can be used to extend Hadoop core with vector geometry types and operations, and enables developers to build MapReduce applications for spatial data.
- Spatial Framework for Hadoop: extends Hive, based on the Esri Geometry API, to enable Hive Query Language users to leverage a set of analytical functions and geometry types, in addition to some utilities for the JSON used in ArcGIS.
- Geoprocessing Tools for Hadoop: contains a set of ready-to-use ArcGIS Geoprocessing tools, based on the Esri Geometry API and the Spatial Framework for Hadoop. Developers can download the source code of the tools and customize it; they can also create new tools and contribute them to the open source project. Through these tools, ArcGIS users can move their spatial data and execute a pre-defined workflow inside Hadoop.
The project contains two Hive UDAFs which convert an aggregation into a map and an ordered map.
The mongo-hadoop adapter is a library which allows MongoDB (or backup files in its data format, BSON) to be used as an input source, or output destination, for Hadoop MapReduce tasks. It is designed to allow greater flexibility and performance and make it easy to integrate data in MongoDB with other parts of the Hadoop ecosystem.
Quest® Data Connector for Oracle and Hadoop is an optional plugin to Sqoop. It facilitates the movement of data between Oracle and Hadoop. Quest® Data Transporter for Hive is distributed with Quest® Data Connector for Oracle and Hadoop. Quest® Data Transporter for Hive is a Java command-line utility that allows you to execute a Hive query and insert the results into an Oracle table.
Trevni defines a column-major file format for datasets, achieving higher query performance. To permit scalable, distributed query evaluation, datasets are partitioned into row groups, containing distinct collections of rows. Each row group is organized in column-major order, while row groups form a row-major partitioning of the entire dataset.
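A minimal Python sketch of the row-group, column-major layout just described. This is purely illustrative: on disk, formats like Trevni and Parquet also carry headers, compression, encodings, and checksums.

```python
def to_row_groups(rows, group_size):
    """Partition rows into row groups, then store each group
    column-major -- the layout that lets queries read only the
    columns they actually need."""
    groups = []
    for i in range(0, len(rows), group_size):
        group = rows[i:i + group_size]
        # transpose the group: one list of values per column
        columns = {name: [row[name] for row in group] for name in group[0]}
        groups.append(columns)
    return groups

rows = [{"id": i, "name": f"n{i}"} for i in range(4)]
groups = to_row_groups(rows, 2)
# groups[0]["id"] == [0, 1]; a query on "id" never touches "name" bytes
```

Row groups also give distributed engines a natural unit of parallelism: each worker can scan a different group independently.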
Parquet is created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. Parquet is built from the ground up with complex nested data structures in mind, and uses the repetition/definition level approach to encoding such data structures, as popularized by Google Dremel. We believe this approach is superior to simple flattening of nested name spaces.
The ORC file format is a new columnar file format introduced by Hortonworks as part of the Stinger Initiative.
Apache Kafka is a distributed publish-subscribe messaging system. It is designed to support the following:
- persistent messaging with O(1) disk structures that provide constant-time performance even with many TB of stored messages;
- high throughput: even with very modest hardware, Kafka can support hundreds of thousands of messages per second;
- explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics;
- support for parallel data load into Hadoop.
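A toy Python sketch of keyed partitioning and the per-partition ordering it preserves. Kafka’s real partitioner uses its own stable hash; Python’s built-in `hash` is only stable within one process, so this is strictly illustrative:

```python
def partition_for(key, num_partitions):
    """Hash-partition messages by key so all messages with the same key
    land in one partition, preserving their relative order. (Sketch:
    Kafka uses a stable hash, unlike Python's process-local hash.)"""
    return hash(key) % num_partitions

partitions = [[] for _ in range(3)]
for key, msg in [("user1", "a"), ("user2", "b"), ("user1", "c")]:
    partitions[partition_for(key, 3)].append((key, msg))

# Both of user1's messages share one partition, in send order,
# even though different keys may be consumed concurrently elsewhere.
user1_msgs = [m for k, m in partitions[partition_for("user1", 3)]
              if k == "user1"]
```

This is why Kafka can scale consumption across a cluster while still giving ordering guarantees per key: ordering is promised per partition, not globally.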
DataFu is a collection of user-defined functions for working with large-scale data in Hadoop and Pig. This library was born out of the need for a stable, well-tested library of UDFs for data mining and statistics.
Elephant Bird is Twitter’s open source library of LZO, Thrift, and/or Protocol Buffer-related Hadoop InputFormats, OutputFormats, Writables, Pig LoadFuncs, Hive SerDe, HBase miscellanea, etc. The majority of these are in production at Twitter running over data every day.
Tez is an effort to develop a generic application framework which can be used to process arbitrarily complex data-processing tasks and also a re-usable set of data-processing primitives which can be used by other projects.
Apache HCatalog is a table and storage management service for data created using Apache Hadoop. This includes providing a shared schema and data type mechanism, providing a table abstraction so that users need not be concerned with where or how their data is stored and providing interoperability across data processing tools such as Pig, Map Reduce, and Hive.
The Hadoop Development Tools (HDT) is a set of plugins for the Eclipse IDE for developing against the Hadoop platform.
From the Eclipse Foundation, BIRT is an open source Eclipse-based reporting system that integrates with your Java/Java EE application to produce compelling reports. Starting with the Kepler release of Eclipse, BIRT provides support to query HDFS, Hive, MongoDB and Cassandra.
Natively designed for Hadoop, Karmasphere is a unified workspace for full-fidelity Big Data Analytics that provides access to all the data in its original form to preserve richness and flexibility. An open, standards-based solution for teams of data and business analysts who need to quickly and easily ingest, analyze and visualize Big Data on Hadoop, Karmasphere does not require users to be retrained in exotic technologies. Exploratory questions can be asked using familiar SQL. Data can also be visually explored using a drag and drop interface. Industry standard pre-packaged analytics (UDFs) can be used as building blocks to speed time to insight. Unlike other solutions, you do not need to prepare the data or know the answer before asking the questions.
hRaven collects run time data and statistics from map reduce jobs running on Hadoop clusters and stores the collected job history in an easily queryable format. For the jobs that are run through frameworks (Pig or Scalding/Cascading) that decompose a script or application into a DAG of map reduce jobs for actual execution, hRaven groups job history data together by an application construct. This allows for easier visualization of all of the component jobs’ execution for an application and more comprehensive trending and analysis over time.
Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark (a new framework for low-latency interactive and iterative jobs), and other applications.
Hadoop Corona is the next version of Map-Reduce from Facebook. The current Map-Reduce has a single Job Tracker that reached its limits at Facebook. The Job Tracker manages the cluster resource and tracks the state of each job. In Hadoop Corona, the cluster resources are tracked by a central Cluster Manager.
Deploy, Monitor, Administer
Hue is an open source web-based application for making it easier to use Apache Hadoop.
Hue features a file browser for HDFS, an Oozie application for creating workflows and coordinators, a job designer/browser for MapReduce, Hive and Cloudera Impala query editors, a shell, and a collection of Hadoop APIs.
(not directly a Hadoop component) Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters.
Apache Ambari is a tool for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari consists of a set of RESTful APIs and browser-based management console UI.
Hannibal is a tool to help monitor and maintain HBase-Clusters that are configured for manual splitting. While HBase provides metrics to monitor overall cluster health via JMX or Ganglia, it lacks the ability to monitor single regions in an easy way. This information is essential when your cluster is configured for manual splits, especially when the data growth is not uniform.
Genie provides job and resource management for the Hadoop ecosystem in the cloud. From the perspective of the end-user, Genie abstracts away the physical details of various (potentially transient) Hadoop resources in the cloud, and provides a REST-ful Execution Service to submit and monitor Hadoop, Hive and Pig jobs without having to install any Hadoop clients. And from the perspective of a Hadoop administrator, Genie provides a set of Configuration Services, which serve as a registry for clusters, and their associated Hive and Pig configurations.
Serengeti is an open source project initiated by VMware to enable the rapid deployment of an Apache Hadoop cluster (HDFS, MapReduce, Pig, Hive, etc.) on a virtual platform.
Apache Whirr is a set of libraries for running cloud services. Whirr provides: a cloud-neutral way to run services; a common service API; smart defaults for services.
Developed jointly by Hortonworks and Microsoft, the Knox Gateway (“Gateway” or “Knox”) is a system that provides a single point of authentication and access for Apache Hadoop services in a cluster. The goal is to simplify Hadoop security for both users (i.e. those who access the cluster data and execute jobs) and operators (i.e. those who control access and manage the cluster). The Gateway runs as a server (or cluster of servers) that serves one or more Hadoop clusters.
As Hadoop extends into new markets and sees new use cases with security and compliance challenges, the benefits of processing sensitive and legally protected data with all Hadoop projects and HBase must be coupled with protection for private information that limits performance impact. Project Rhino is an Intel open source effort to enhance the existing data protection capabilities of the Hadoop ecosystem to address these challenges, and contribute the code back to Apache.
Sentry is an authorization module for Hadoop that provides the granular, role-based authorization required to provide precise levels of access to the right users and applications. Its new support for role-based authorization, fine-grained authorization, and multi-tenant administration allows Hadoop operators to: Store more sensitive data in Hadoop, give more end-users access to that data in Hadoop, create new use cases for Hadoop, enable multi-user applications, and comply with regulations (e.g., SOX, PCI, HIPAA, EAL3).
Pallet is a Hadoop cluster management tool with Intelligent Defaults.
Helix is a generic cluster management framework used for the automatic management of partitioned, replicated and distributed resources hosted on a cluster of nodes.
White Elephant is a Hadoop log aggregator and dashboard which enables visualization of Hadoop cluster utilization across users.
HiTune is a Hadoop performance analyzer. It consists of three major components: Tracker - lightweight agents running on each node in the Hadoop cluster to collect runtime information, including sysstat, Hadoop metrics and job history, and Java instrumentations; Aggregation Engine - distributed framework that merges the results of the trackers, which is currently implemented using Chukwa; Analysis Engine - the program that conducts the performance analysis and generates the analysis report, which is implemented as a series of Hadoop jobs.
Book and documentation
This book focuses on MapReduce algorithm design, with an emphasis on text processing algorithms common in natural language processing, information retrieval, and machine learning. We introduce the notion of MapReduce design patterns, which represent general reusable solutions to commonly occurring problems across a variety of problem domains. This book not only intends to help the reader “think in MapReduce”, but also discusses limitations of the programming model as well.