In our previous blog post, we went through groupByKey. In this
post we will go through reduceByKey.
reduceByKey is very similar to groupByKey. It is a wide transformation and
is lazily evaluated. Because it is wide, data may be shuffled
across partitions. It works only with RDDs whose elements are
key-value pairs.
The final result in the case of reduceByKey is similar to what you get by using groupByKey
and then aggregating the values, but the difference lies in the way the calculation is done.
reduceByKey uses a combiner, which merges the values for each key locally within a partition.
Only this pre-aggregated data is then shuffled across the network and merged into the final result.
This leads to less network traffic compared to groupByKey.
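To make the effect of the combiner concrete, here is a minimal sketch in plain Python. It is not Spark code and only simulates the idea: the two "partitions" are ordinary lists, and the helper names (`shuffle_without_combine`, `shuffle_with_combine`, `merge`) are made up for this illustration.

```python
def shuffle_without_combine(partitions):
    """groupByKey-style: every (key, value) record leaves its partition."""
    return [pair for part in partitions for pair in part]


def shuffle_with_combine(partitions, func):
    """reduceByKey-style: merge values per key inside each partition first."""
    shuffled = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        # Only one record per key per partition is sent over the network.
        shuffled.extend(local.items())
    return shuffled


def merge(records, func):
    """Final merge of the shuffled records into one value per key."""
    result = {}
    for key, value in records:
        result[key] = func(result[key], value) if key in result else value
    return result


# Two hypothetical partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]
add = lambda x, y: x + y

plain = shuffle_without_combine(partitions)        # 6 records shuffled
combined = shuffle_with_combine(partitions, add)   # 4 records shuffled

print(len(plain), len(combined))
print(merge(plain, add) == merge(combined, add))
```

Both paths end with the same totals per key, but the combined path moves fewer records between partitions. On real data with many repeated keys the saving is usually much larger.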
Syntax :- reduceByKey(function)
Snippet to illustrate the reduceByKey :-
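A minimal sketch of a word count with reduceByKey in PySpark could look like this. The function name `word_counts`, the sample words, and the app name are assumptions for this illustration; `sc` is expected to be an existing SparkContext.

```python
from operator import add


def word_counts(sc, words):
    """Count how often each word occurs, using reduceByKey."""
    # Build an RDD of (word, 1) pairs.
    pairs = sc.parallelize(words).map(lambda w: (w, 1))
    # Transformation: nothing runs yet, Spark only records the plan.
    counts = pairs.reduceByKey(add)
    # Action: collect() triggers the actual computation.
    return dict(counts.collect())


# Usage, assuming pyspark is installed and running in local mode:
#   from pyspark import SparkContext
#   sc = SparkContext("local[2]", "reduceByKeyExample")
#   print(word_counts(sc, ["spark", "rdd", "spark"]))
#   sc.stop()
```

Here `operator.add` plays the role of the function argument: it takes two values of the same key and returns their sum.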
reduceByKey is a transformation operation in Spark, hence it is lazily evaluated.
It is a wide operation, as it shuffles data from multiple partitions and creates another RDD.
Before sending data across the partitions, it merges the data locally using the same associative function, which reduces the amount of data shuffled.
It can only be used with RDDs whose elements are key-value pairs.
It accepts a commutative and associative function as an argument.
The function must take two arguments of the same data type.
The return type of the function must also be the same as the argument type.
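Why the function has to be commutative and associative can be shown with a small plain Python sketch (again, not Spark itself). The helper `reduce_in_partitions` is a made-up name. It reduces each partition on its own and then reduces the partial results, which is roughly what happens when values are merged locally first and across partitions afterwards.

```python
from functools import reduce


def reduce_in_partitions(func, partitions):
    """Reduce each partition locally, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


# The same values, split across partitions in two different ways.
split_a = [[1, 2], [3, 4]]
split_b = [[1], [2, 3, 4]]

add = lambda a, b: a + b   # commutative and associative
sub = lambda a, b: a - b   # neither commutative nor associative

add_a = reduce_in_partitions(add, split_a)  # (1 + 2) + (3 + 4) = 10
add_b = reduce_in_partitions(add, split_b)  # 1 + (2 + 3 + 4) = 10
sub_a = reduce_in_partitions(sub, split_a)  # (1 - 2) - (3 - 4) = 0
sub_b = reduce_in_partitions(sub, split_b)  # 1 - ((2 - 3) - 4) = 6

print(add_a, add_b)  # same result for both splits
print(sub_a, sub_b)  # result depends on how the data was split
```

With addition the result does not depend on how the data is partitioned. With subtraction it does, so a function like that would give unpredictable results in reduceByKey.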
You can find the code in my GitHub id :-