groupByKey() operates on pair RDDs and is used to group all the values associated with a given key. It takes (K, V) key-value pairs as input and produces an RDD of keys paired with lists of their values. It is a wide transformation, so data is shuffled across partitions, and like all transformations it is evaluated lazily.
Let us take a file, country.txt, and check the number of values in it.
Code snippet to illustrate the groupByKey operation in Spark:
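The original snippet was embedded as an image and is not reproduced here. As a stand-in, the following pure-Python sketch emulates what groupByKey does to (key, value) pairs — collecting every value for a key into a list. The sample (country, city) pairs are invented for illustration; the actual code in the repository may differ.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Emulate Spark's groupByKey: (k, v) pairs -> (k, [v1, v2, ...])."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

# Hypothetical (country, city) pairs, as might be parsed from country.txt
pairs = [("India", "Delhi"), ("USA", "NYC"),
         ("India", "Mumbai"), ("USA", "Chicago")]

print(group_by_key(pairs))
# {'India': ['Delhi', 'Mumbai'], 'USA': ['NYC', 'Chicago']}
```

In Spark itself the equivalent pipeline would map each input line to a (key, value) pair and then call `.groupByKey()` on the resulting pair RDD.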
Output of the above code:
Some key points about the groupByKey() operation:
• Apache Spark's groupByKey is a transformation, hence its evaluation is lazy
• It is a wide operation: it shuffles data across multiple partitions and creates another RDD
• It is costly because it does not use a combiner local to each partition to reduce the data transferred
• It is not recommended when you need to do further aggregation on the grouped data (prefer reduceByKey or aggregateByKey)
• groupByKey always results in hash-partitioned RDDs
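To see why the missing combiner matters (third bullet above), compare grouping-then-counting with a reduce-style count. This plain-Python sketch uses invented data; in Spark, the first pattern ships every individual value across the shuffle, while reduceByKey combines values locally so only one partial result per key per partition is shuffled.

```python
from collections import defaultdict

pairs = [("India", 1), ("USA", 1), ("India", 1), ("India", 1)]

# groupByKey style: collect every value, then aggregate the full lists
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)          # every value is retained (and, in Spark, shuffled)
counts_grouped = {k: len(vs) for k, vs in groups.items()}

# reduceByKey style: fold values into a running total as they arrive
counts_reduced = defaultdict(int)
for k, v in pairs:
    counts_reduced[k] += v       # only one running number kept per key

print(counts_grouped)            # {'India': 3, 'USA': 1}
print(dict(counts_reduced))      # {'India': 3, 'USA': 1}
```

Both produce the same counts, but the reduce-style path never materializes the full list of values per key, which is exactly the saving a combiner provides.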
The code and sample data are available in my GitHub repository:
https://github.com/sangam92/Spark_tutorials