Monday, July 23, 2018

Map Reduce - Reducer

In the last blog post, we went through the basics of the mapper. Now we need to understand the reducer. The output of the mapper acts as the input to the reducer, in the form of <Key,Value> pairs. The output of the reducer is the final output of the job, and it is stored in HDFS. The reducer is the second phase of the MapReduce process.
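To make the <Key,Value> flow concrete, below is a minimal word-count style reducer sketch using the org.apache.hadoop.mapreduce API (the class name WordCountReducer and the word-count use case are illustrative, not something from this post):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives <word, [1, 1, ...]> from the mappers and emits <word, total count>.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // All values for the same key arrive together, thanks to the shuffle and sort phases.
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result); // the output format writes this pair to HDFS
    }
}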

The reducer mainly comprises three phases :-
1.) Shuffle
2.) Sort
3.) Reduce



Anatomy of the Reducer :-

The map tasks run on different machines across the cluster, and their outputs sit on the machines where they ran. These outputs need to be moved to the machine where the reduce task will run. Since the reduce task needs output from every map task, it starts copying a map task's output as soon as that map task finishes. This is known as the copy phase.

The reduce task has a small number of copier threads so that it can fetch outputs from many map tasks in parallel. Each map task notifies the application master when it finishes, so the master knows which map outputs are ready, and the reducers learn the output locations from the master. A map output is not deleted as soon as a reducer has copied it; it is kept until the application master signals that the job has completed, in case a reduce task fails and the output needs to be fetched again. When all the map outputs have been copied, the reduce task moves into the sort phase.
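The number of copier threads is configurable. Here is a sketch of tuning it in a job driver (the value 10 is an arbitrary example; the property mapreduce.reduce.shuffle.parallelcopies defaults to 5 in MRv2):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Each reduce task uses 5 copier threads by default; raising it lets the
        // reducer fetch from more map hosts in parallel (example value only).
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);
        Job job = Job.getInstance(conf, "shuffle tuning example");
        // ... configure mapper, reducer, and input/output paths as usual ...
    }
}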

In the sort phase, the outputs coming from the different mappers are merged together, maintaining the sort order produced by the mappers. This is done in rounds. For example, if there were 50 map outputs and the merge factor was 10, there would be five rounds: each round would merge 10 files into 1, so at the end there would be 5 intermediate files. Rather than have a final round that merges these five files into a single sorted file, the merge saves a trip to disk by feeding the reduce function directly in what is the last phase: the reduce phase.
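The round arithmetic above can be checked with a quick calculation. This is a simplification of Hadoop's actual merge planning (the first round may merge fewer files); the merge factor corresponds to the mapreduce.task.io.sort.factor property, whose default is 10:

public class MergeRoundsExample {
    public static void main(String[] args) {
        int mapOutputs = 50;  // sorted segments to merge, per the example above
        int mergeFactor = 10; // mapreduce.task.io.sort.factor (default 10)

        // Each round merges up to mergeFactor files into one.
        int rounds = (int) Math.ceil((double) mapOutputs / mergeFactor); // 5
        System.out.println(rounds + " rounds leave " + rounds
                + " intermediate files, which feed the reduce phase directly");
    }
}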

This final merge can come from a mixture of in-memory and on-disk segments. In the reduce phase, the reduce function is invoked once for each key in the sorted output, and the results are written to HDFS.
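For completeness, here is a sketch of the driver that wires the reducer in and points its final output at HDFS (WordCountMapper is assumed to come from the earlier mapper post, and the /input and /output paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);   // mapper from the previous post (assumed)
        job.setReducerClass(WordCountReducer.class); // the reducer sketched above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/input"));    // HDFS input (placeholder)
        FileOutputFormat.setOutputPath(job, new Path("/output")); // reducer output lands here in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}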

In short, the Hadoop reducer performs aggregation or summation-style computation across its three phases (shuffle, sort, and reduce), and HDFS stores the final output of the reducer.

