Tuesday, February 12, 2019

Pyspark - groupByKey

groupByKey() operates on pair RDDs and is used to group all the values associated with a given key. It takes (k, v) key-value pairs as input and produces an RDD of keys paired with the collection of their values. It is a wide transformation, so data is shuffled across many partitions. Like all transformations, it is evaluated lazily.


Let us take a file country.txt and check the number of records in the file.




Code snippet to illustrate the groupByKey operation in Spark.




Output of the above code :-




Some of the key points of the groupByKey() operation :-
    • Apache Spark's groupByKey is a transformation operation, hence its evaluation is lazy
    • It is a wide operation: it shuffles data across multiple partitions and creates another RDD
    • It is costly because it does not use a partition-local combiner to reduce the amount of data transferred
    • It is not recommended when you need to do further aggregation on the grouped data; reduceByKey or aggregateByKey are cheaper for that
    • groupByKey always results in hash-partitioned RDDs


The code and sample data are available on my GitHub :-
https://github.com/sangam92/Spark_tutorials
