Friday, February 22, 2019

Spark - cogroup


Spark cogroup is a transformation that works on key-value pair RDDs and is primarily used to group these RDDs by key.


Syntax :- cogroup(otherDataset, [numPartitions])


When cogroup is called on datasets of (K, V) and (K, W) pairs, the result is a dataset of
(K, (Iterable<V>, Iterable<W>)) tuples: one tuple per key, holding all the values for that key from each dataset.
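
To make the shape of the result concrete, here is a minimal plain-Python sketch that models what cogroup produces. It is not Spark code: the `cogroup` function below is a hypothetical helper working on ordinary lists, and the country and player names are made-up sample data, not the contents of the files used in this post.

```python
from collections import defaultdict

def cogroup(left, right):
    """Plain-Python model of cogroup on two lists of (key, value) pairs.

    Returns a dict mapping each key to a pair of lists:
    (values from left, values from right).
    """
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return dict(grouped)

# Hypothetical sample data: (country, player) pairs.
batsmen = [("India", "batsman_a"), ("India", "batsman_b"), ("Australia", "batsman_c")]
bowlers = [("India", "bowler_x"), ("Australia", "bowler_y")]

result = cogroup(batsmen, bowlers)
for key in sorted(result):
    print(key, result[key])
```

Each key appears once in the result, with its values from the first dataset in the first list and its values from the second dataset in the second list.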

Let us take an example and try to understand it.

We will create two datasets. The first dataset will be batsman_rank.txt and the second will be bowler_rank.txt.





We will write some simple code to apply cogroup to these datasets and check the result.



Output :-



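cogroup is conceptually similar to a full outer join: every key that appears in either RDD appears in the result. If one RDD has no element for a key that is present in the other RDD, the corresponding iterable is empty. Unlike a join, cogroup does not pair values with each other; it returns each key once, with all of its values from each RDD grouped together.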
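
The relationship to a full outer join can be illustrated with another plain-Python sketch. Again, this only models the semantics and is not Spark code: `cogroup` and `full_outer_join` are hypothetical helpers on ordinary lists, the data is invented, and `None` stands in for a missing side of the join.

```python
from collections import defaultdict
from itertools import product

def cogroup(left, right):
    """Group two lists of (key, value) pairs by key."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return dict(grouped)

def full_outer_join(left, right):
    """Derive full-outer-join rows from the cogrouped data.

    For each key, pair every left value with every right value;
    an empty side is represented by a single None.
    """
    rows = []
    for key, (vs, ws) in cogroup(left, right).items():
        for v, w in product(vs or [None], ws or [None]):
            rows.append((key, (v, w)))
    return rows

left = [("a", 1), ("a", 2), ("b", 3)]   # key "c" is missing here
right = [("a", "x"), ("c", "y")]        # key "b" is missing here

grouped = cogroup(left, right)
joined = full_outer_join(left, right)

print(grouped)  # one entry per key, with an empty list for a missing side
print(joined)   # one row per pairing of values
```

In this model, key "b" gets an empty second list and key "c" an empty first list in the cogroup result, while the join expands key "a" into two separate rows.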


The code and dataset are available in my Git repository :-

