MapReduce introduction

MapReduce introduction

By David WORMS

Jun 26, 2010

Categories: Big Data | Tags: MapReduce, Big Data, Java, JavaScript

Information systems have more and more data to store and process. Companies like Google, Facebook, Twitter and many others store astronomical amounts of information from their customers and must be able to serve them with the best recommendations while ensuring the sustainability of their systems.

Description

MapReduce is a way of modeling a program to handle large volumes of data. By broad, we mean massive, for example of the order of petabytes. Originally created by Google and described in detail in the publication ”MapReduce: Simplified Data Processing on Large Clusters” published in 2004, an Open Source implementation exists at through Hadoop and its ecosystem inside the Apache Foundation.

A MapReduce task consists of 2 phases. The developer implements a “map” function that decomposes data into a key and values and another “reduce” function that merges all the values associated with the same key. Combined, this paradigm makes it possible to express a large number of problems.

The great advantage of this method is its ability to break a process into multiple distributable tasks on a very large number of normal machines. By normal machines, we mean servers whose price can vary between 3000 and 5000 euros. To take a concrete example, in 2010, we ordered 4500 € servers consisting of 2 processors AMD Optéron 8 hearts, 32 GB of RAM and 4 disks of 1T at 7500 turns. Two machines at 2000 euros each could have done the trick. We opted for this configuration because of the saving of space generated.

Data processing is distributed across all servers in the cluster with limited scale penalty. If your data doubles, you double the number of machines. If you need more computing power, ditto.

For the programmer, the work is limited to creating MapReduce tasks that are easy to understand and write. The system takes care of details including partitioning of data, execution and coordination of tasks, duplication of information in case of machine failure and communication between them.

I will end this article with an example in JavaScript to illustrate the concept. The purpose of the exercise is to count the number of users for the same postal code, starting from CSV data including 3 users with their name and postal code as fields.

The source file (CSV format)

Hadoop, 75006
Cassandra, 75019
Hive 75006

The “map” function

The argument provided, “value” corresponds to a line of our source file. This line is converted into an array and consists of 2 elements: the user’s name and postal code. Finally, the “emit” function takes 2 arguments which are the key to be issued and the associated value is the postal code for the key and 1 to signify that this postal code has been met once.

function map (row) {
  row = row.split (',');
  this.emit (row [1], 1);
}

The “reduce” function

The arguments provided are a key and the values associated with that key. They result from the “map” function called previously. The “key” argument is therefore a postal code and the “values” argument is an array of numbers. Be careful though, the reduce function can be called multiple times and its write must take this into account. Since MapReduce is intended for very large volumes, values could otherwise be too large. Here, values are made up of numbers “1” the first time but this number will be different if the method is called again.

function reduce (key, values) {
  return values.reduce (function (previous, current) {
    return previous + current
  })
}

The final result

75006, 2
75019, 1

Canada - Morocco - France

International locations

10 rue de la Kasbah
2393 Rabbat
Canada

We are a team of Open Source enthusiasts doing consulting in Big Data, Cloud, DevOps, Data Engineering, Data Science…

We provide our customers with accurate insights on how to leverage technologies to convert their use cases to projects in production, how to reduce their costs and increase the time to market.

If you enjoy reading our publications and have an interest in what we do, contact us and we will be thrilled to cooperate with you.