Wednesday, February 13, 2019

PySpark - reduceByKey


In our previous blog post, we went through groupByKey. In this post we will go through reduceByKey.

reduceByKey is very similar to groupByKey. It is a wide transformation and is lazily evaluated, so data shuffling may happen across partitions. It works only with RDDs whose elements are (key, value) pairs.

The final output of reduceByKey is similar to that of groupByKey, but the difference lies in how the calculation is done. reduceByKey uses a combiner that merges the values for each key locally on every partition before the data is shuffled to the partitions that produce the final result for each key. This leads to less network congestion as compared to groupByKey.

Syntax :- reduceByKey(function)

Code Snippet to illustrate the reduceByKey :-
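
Here is a minimal sketch of such a snippet; the sample data, variable names, and app name are assumptions made purely for illustration, not the original code.

from pyspark import SparkContext

# Entry point (the app name is an assumption for this sketch)
sc = SparkContext(appName="reduceByKeyExample")

# A small (key, value) pair RDD -- word counting is the classic use case
pairs = sc.parallelize([("apple", 1), ("banana", 1), ("apple", 1),
                        ("orange", 1), ("banana", 1), ("apple", 1)])

# reduceByKey first merges the values for each key locally on every
# partition, then shuffles only the partially combined results
counts = pairs.reduceByKey(lambda x, y: x + y)

print(counts.collect())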



Output :-
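
For the sketch above, collect() would return something like the following; the ordering of the pairs can vary across runs and partitions.

[('orange', 1), ('apple', 3), ('banana', 2)]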

  • reduceByKey is a transformation operation in Spark hence it is lazily evaluated
  • It is a wide operation as it shuffles data from multiple partitions and creates another RDD
  • Before sending data across the partitions, it also merges the data locally using the same associative function for optimized data shuffling
  • It can only be used with RDDs whose elements are key-value pairs
  • It accepts a commutative and associative function as an argument (see the sketch after this list)
    • The function should take two arguments of the same data type
    • The return type of the function must also be the same as the argument type
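
As a quick sketch of the last two points, any commutative and associative function over values of a single type can be passed, not just addition. For example, keeping the maximum value per key (the data below is assumed purely for illustration, reusing the sc context from the snippet above):

scores = sc.parallelize([("alice", 70), ("bob", 85), ("alice", 92), ("bob", 60)])

# max is commutative and associative, so it is a valid reduce function
best = scores.reduceByKey(lambda a, b: max(a, b))

print(best.collect())   # e.g. [('alice', 92), ('bob', 85)] -- order may vary
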
You can find the code in my GitHub repository :- https://github.com/sangam92/Spark_tutorials
