In our previous blog post, we went through groupByKey. In this post we will go through reduceByKey.
reduceByKey is quite similar to groupByKey. It is a wide transformation and is lazily evaluated, and data shuffling may happen across partitions. It works only with RDDs whose elements are in the form of (key, value) pairs.
The final output of reduceByKey is similar to that of groupByKey, but the difference lies in how the calculation is done. reduceByKey uses a combiner that merges the values for each key locally within each partition, and only this merged data is then shuffled across the network. This leads to less network congestion compared to groupByKey.
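To make the contrast concrete, here is a small sketch, assuming an existing SparkContext `sc` (as created in the full snippet later in this post); the sample data is illustrative:

```scala
// Assuming an existing SparkContext `sc` (see the full snippet below)
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// groupByKey shuffles every (key, value) pair across the network,
// and the values are only summed after the shuffle
val viaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey first merges values per key inside each partition (the combiner step),
// so only one partial sum per key per partition crosses the network
val viaReduce = pairs.reduceByKey(_ + _)
```

Both RDDs hold the same final counts; the difference is only in how much data crosses the network during the shuffle.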
Syntax :- reduceByKey(function)
Code Snippet to illustrate the reduceByKey :-
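Below is a minimal, self-contained sketch in Scala, assuming Spark running in local mode; the object name and sample data are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReduceByKeyExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ReduceByKeyExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // An RDD of (word, count) pairs
    val words = sc.parallelize(Seq(("spark", 1), ("scala", 1), ("spark", 1), ("rdd", 1), ("spark", 1)))

    // Sum the counts for each key; the function must be commutative and associative
    val wordCounts = words.reduceByKey((x, y) => x + y)

    wordCounts.collect().foreach(println)

    sc.stop()
  }
}
```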
Output :-
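With the sketch above, the collected pairs would be the following (the ordering of keys is not guaranteed):

(spark,3)
(scala,1)
(rdd,1)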
- reduceByKey is a transformation operation in Spark, hence it is lazily evaluated
- It is a wide operation as it shuffles data from multiple partitions and creates another RDD
- Before sending data across the partitions, it merges the data locally using the same associative function, which optimizes the data shuffling
- It can only be used with RDDs whose elements are key-value pairs
- It accepts a commutative and associative function as an argument (see the sketch after this list)
- The function passed as a parameter should have two arguments of the same data type
- The return type of the function must also be the same as the argument types
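As a sketch of those signature constraints, the function passed to reduceByKey takes two values of the value type V and returns a V; the data below is illustrative and reuses the `sc` from the snippet above:

```scala
// Values are Doubles, so the function must be (Double, Double) => Double
val prices = sc.parallelize(Seq(("apple", 1.5), ("banana", 0.5), ("apple", 2.0)))

// math.max is commutative and associative, so merging partial results per partition is safe
val maxPrice = prices.reduceByKey((a, b) => math.max(a, b))

maxPrice.collect().foreach(println)   // e.g. (apple,2.0), (banana,0.5)
```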
You can find the code in my GitHub repository :-
https://github.com/sangam92/Spark_tutorials