MapReduce introduction

Information systems have more and more data to store and process. Companies like Google, Facebook, Twitter and many others store astronomical amounts of information from their customers and must be able to serve them with the best recommendations while ensuring the sustainability of their systems.

Description

MapReduce is a way of modeling a program to handle large volumes of data. By broad, we mean massive, for example of the order of petabytes. Originally created by Google and described in detail in the publication ”MapReduce: Simplified Data Processing on Large Clusters” published in 2004, an Open Source implementation exists at through Hadoop and its ecosystem inside the Apache Foundation.

A MapReduce task consists of 2 phases. The developer implements a “map” function that decomposes data into a key and values and another “reduce” function that merges all the values associated with the same key. Combined, this paradigm makes it possible to express a large number of problems.

The great advantage of this method is its ability to break a process into multiple distributable tasks on a very large number of normal machines. By normal machines, we mean servers whose price can vary between 3000 and 5000 euros. To take a concrete example, in 2010, we ordered 4500 € servers consisting of 2 processors AMD Optéron 8 hearts, 32 GB of RAM and 4 disks of 1T at 7500 turns. Two machines at 2000 euros each could have done the trick. We opted for this configuration because of the saving of space generated.

Data processing is distributed across all servers in the cluster with limited scale penalty. If your data doubles, you double the number of machines. If you need more computing power, ditto.

For the programmer, the work is limited to creating MapReduce tasks that are easy to understand and write. The system takes care of details including partitioning of data, execution and coordination of tasks, duplication of information in case of machine failure and communication between them.

I will end this article with an example in JavaScript to illustrate the concept. The purpose of the exercise is to count the number of users for the same postal code, starting from CSV data including 3 users with their name and postal code as fields.

The source file (CSV format)

Hadoop, 75006
Cassandra, 75019
Hive 75006

The “map” function

The argument provided, “value” corresponds to a line of our source file. This line is converted into an array and consists of 2 elements: the user’s name and postal code. Finally, the “emit” function takes 2 arguments which are the key to be issued and the associated value is the postal code for the key and 1 to signify that this postal code has been met once.

function map (row) {
  row = row.split (',');
  this.emit (row [1], 1);
}

The “reduce” function

The arguments provided are a key and the values associated with that key. They result from the “map” function called previously. The “key” argument is therefore a postal code and the “values” argument is an array of numbers. Be careful though, the reduce function can be called multiple times and its write must take this into account. Since MapReduce is intended for very large volumes, values could otherwise be too large. Here, values are made up of numbers “1” the first time but this number will be different if the method is called again.

function reduce (key, values) {
  return values.reduce (function (previous, current) {
    return previous + current
  })
}

The final result

75006, 2
75019, 1

Share this article