Spark cogroup is a transformation that works with key-value pair RDDs and is primarily used to group these RDDs together by key.
Syntax :- cogroup(otherDataset, [numPartitions])
When we have datasets of type (K, V) and (K, W), the result will be a dataset of (K, (Iterable<V>, Iterable<W>)) tuples.
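To make the semantics concrete, here is a minimal sketch using two small pair RDDs; the keys and values are illustrative only and not taken from the datasets used later in this post.

import org.apache.spark.{SparkConf, SparkContext}

object CogroupSemantics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cogroup-demo").setMaster("local[*]"))

    val kv = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))   // dataset of (K, V)
    val kw = sc.parallelize(Seq(("a", "x"), ("c", "y")))         // dataset of (K, W)

    // cogroup returns (K, (Iterable[V], Iterable[W])) tuples
    val grouped = kv.cogroup(kw)
    grouped.collect().foreach(println)
    // e.g. (a,(CompactBuffer(1, 3),CompactBuffer(x)))
    //      (b,(CompactBuffer(2),CompactBuffer()))
    //      (c,(CompactBuffer(),CompactBuffer(y)))

    sc.stop()
  }
}

Notice that every key from either RDD appears exactly once in the result, with all of its values from each side collected into an iterable.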
Let us take an example and try to understand it.
We will create two datasets. The first dataset will be batsman_rank.txt, while the second dataset will be bowler_rank.txt.
We will write a simple piece of code to apply cogroup on these datasets and check the result, as sketched below.
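The following is a rough sketch of what such a program could look like. It assumes each file holds comma-separated "player,rank" lines and that the files sit in the working directory; the actual layout in the repository may differ.

import org.apache.spark.{SparkConf, SparkContext}

object RankCogroup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rank-cogroup").setMaster("local[*]"))

    // Parse each line of batsman_rank.txt into a (player, rank) pair
    val batsmen = sc.textFile("batsman_rank.txt")
      .map(_.split(","))
      .map(fields => (fields(0).trim, fields(1).trim))

    // Parse each line of bowler_rank.txt into a (player, rank) pair
    val bowlers = sc.textFile("bowler_rank.txt")
      .map(_.split(","))
      .map(fields => (fields(0).trim, fields(1).trim))

    // Group both datasets by player name
    val combined = batsmen.cogroup(bowlers)
    combined.collect().foreach(println)

    sc.stop()
  }
}

Each printed line is a (player, (Iterable[rank], Iterable[rank])) tuple, holding that player's batting ranks on the left and bowling ranks on the right.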
Output :-
The cogroup is conceptually equivalent to a full outer join. If an RDD does not have an element for a given key that is present in the other RDD, then the corresponding iterable will be empty.
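A small sketch that contrasts the two, again with illustrative values only: keys missing from one side show up with an empty iterable under cogroup, and as None under fullOuterJoin.

import org.apache.spark.{SparkConf, SparkContext}

object CogroupVsFullOuterJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cogroup-vs-foj").setMaster("local[*]"))

    val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val right = sc.parallelize(Seq(("b", "x"), ("c", "y")))

    // Key "a" is missing from right, so its right-hand iterable is empty;
    // key "c" is missing from left, so its left-hand iterable is empty.
    left.cogroup(right).collect().foreach(println)

    // fullOuterJoin expresses the same pairing with Options instead of iterables
    left.fullOuterJoin(right).collect().foreach(println)

    sc.stop()
  }
}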
The code and dataset are available in my Git repository :-