Spark cogroup is a transformation that works with key-value pair RDDs and is primarily used to group these RDDs together by key.
Syntax :- cogroup(otherDataset, [numPartitions])
When we have datasets of type (K, V) and (K, W), the result will be a dataset of (K, (Iterable<V>, Iterable<W>)) tuples.
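To make the semantics concrete, here is a minimal sketch using two small pair RDDs; the keys and values are illustrative only and not taken from the datasets used later in this post.

import org.apache.spark.{SparkConf, SparkContext}

object CogroupSemantics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cogroup-demo").setMaster("local[*]"))

    val kv = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))   // dataset of (K, V)
    val kw = sc.parallelize(Seq(("a", "x"), ("c", "y")))         // dataset of (K, W)

    // cogroup returns (K, (Iterable[V], Iterable[W])) tuples
    val grouped = kv.cogroup(kw)
    grouped.collect().foreach(println)
    // e.g. (a,(CompactBuffer(1, 3),CompactBuffer(x)))
    //      (b,(CompactBuffer(2),CompactBuffer()))
    //      (c,(CompactBuffer(),CompactBuffer(y)))

    sc.stop()
  }
}

Notice that every key from either RDD appears exactly once in the result, with all of its values from each side collected into an iterable.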
Let us take an example and try to understand it.
We will create two datasets. The first dataset will be batsman_rank.txt, while the second dataset will be bowler_rank.txt.
We will write a simple piece of code to apply cogroup on these datasets and check the result, as sketched below.
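The following is a rough sketch of what such a program could look like. It assumes each file holds comma-separated "player,rank" lines and that the files sit in the working directory; the actual layout in the repository may differ.

import org.apache.spark.{SparkConf, SparkContext}

object RankCogroup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rank-cogroup").setMaster("local[*]"))

    // Parse each line of batsman_rank.txt into a (player, rank) pair
    val batsmen = sc.textFile("batsman_rank.txt")
      .map(_.split(","))
      .map(fields => (fields(0).trim, fields(1).trim))

    // Parse each line of bowler_rank.txt into a (player, rank) pair
    val bowlers = sc.textFile("bowler_rank.txt")
      .map(_.split(","))
      .map(fields => (fields(0).trim, fields(1).trim))

    // Group both datasets by player name
    val combined = batsmen.cogroup(bowlers)
    combined.collect().foreach(println)

    sc.stop()
  }
}

Each printed line is a (player, (Iterable[rank], Iterable[rank])) tuple, holding that player's batting ranks on the left and bowling ranks on the right.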
Output :-
The cogroup is conceptually equivalent to a full outer join. If an RDD does not have an element for a given key that is present in the other RDD, then the corresponding iterable will be empty.
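A small sketch that contrasts the two, again with illustrative values only: keys missing from one side show up with an empty iterable under cogroup, and as None under fullOuterJoin.

import org.apache.spark.{SparkConf, SparkContext}

object CogroupVsFullOuterJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cogroup-vs-foj").setMaster("local[*]"))

    val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val right = sc.parallelize(Seq(("b", "x"), ("c", "y")))

    // Key "a" is missing from right, so its right-hand iterable is empty;
    // key "c" is missing from left, so its left-hand iterable is empty.
    left.cogroup(right).collect().foreach(println)

    // fullOuterJoin expresses the same pairing with Options instead of iterables
    left.fullOuterJoin(right).collect().foreach(println)

    sc.stop()
  }
}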
The code and dataset are available in my Git repository :-