Tuesday, February 12, 2019

Pyspark - groupByKey

groupByKey() operates on pair RDDs and is used to group all the values associated with a given key. It takes (k, v) key-value pairs as input and produces an RDD of keys paired with the collection of their values. It is a wide transformation, so data is shuffled across many partitions. Like all transformations, it is evaluated lazily.


Let us take a file country.txt and check the number of records in the file.




Code snippet to illustrate the groupByKey operation in Spark.




Output of the above code :-




Some of the key points of the groupByKey() operation :-
    • Apache Spark's groupByKey is a transformation operation, hence its evaluation is lazy
    • It is a wide operation: it shuffles data across multiple partitions and creates another RDD
    • It is costly because it does not use a partition-local combiner to reduce the amount of data transferred
    • It is not recommended when you need to do further aggregation on the grouped data; reduceByKey or aggregateByKey are cheaper for that
    • groupByKey always results in hash-partitioned RDDs


The code and sample data are available on my GitHub :-
https://github.com/sangam92/Spark_tutorials
