Wednesday, February 13, 2019

PySpark - reduceByKey


In our previous blog post, we went through groupByKey. In this post we will go through reduceByKey.

reduceByKey is very similar to groupByKey. It is a wide transformation and is lazily evaluated, so data shuffling may happen across partitions. It works only with RDDs whose elements are (key, value) pairs.

The final output of reduceByKey is similar to that of groupByKey, but the difference lies in how the calculation is done. reduceByKey uses a combiner that merges the values for each key locally on every partition before the data is shuffled to the partitions that produce the final result for each key. This leads to less network congestion as compared to groupByKey.

Syntax :- reduceByKey(function)

Code Snippet to illustrate the reduceByKey :-
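
Here is a minimal sketch of such a snippet; the sample data, variable names, and app name are assumptions made purely for illustration, not the original code.

from pyspark import SparkContext

# Entry point (the app name is an assumption for this sketch)
sc = SparkContext(appName="reduceByKeyExample")

# A small (key, value) pair RDD -- word counting is the classic use case
pairs = sc.parallelize([("apple", 1), ("banana", 1), ("apple", 1),
                        ("orange", 1), ("banana", 1), ("apple", 1)])

# reduceByKey first merges the values for each key locally on every
# partition, then shuffles only the partially combined results
counts = pairs.reduceByKey(lambda x, y: x + y)

print(counts.collect())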



Output :-
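
For the sketch above, collect() would return something like the following; the ordering of the pairs can vary across runs and partitions.

[('orange', 1), ('apple', 3), ('banana', 2)]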

  • reduceByKey is a transformation operation in Spark hence it is lazily evaluated
  • It is a wide operation as it shuffles data from multiple partitions and creates another RDD
  • Before sending data across the partitions, it also merges the data locally using the same associative function for optimized data shuffling
  • It can only be used with RDDs whose elements are key-value pairs
  • It accepts a commutative and associative function as an argument (see the sketch after this list)
    • The function should take two arguments of the same data type
    • The return type of the function must also be the same as the argument type
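
As a quick sketch of the last two points, any commutative and associative function over values of a single type can be passed, not just addition. For example, keeping the maximum value per key (the data below is assumed purely for illustration, reusing the sc context from the snippet above):

scores = sc.parallelize([("alice", 70), ("bob", 85), ("alice", 92), ("bob", 60)])

# max is commutative and associative, so it is a valid reduce function
best = scores.reduceByKey(lambda a, b: max(a, b))

print(best.collect())   # e.g. [('alice', 92), ('bob', 85)] -- order may vary
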
You can find the code in my GitHub repository :- https://github.com/sangam92/Spark_tutorials
