Saturday, February 16, 2019

Spark - Coalesce


Coalesce is an RDD method (also available on DataFrames since Spark 1.4) whose job is to change the number of partitions, normally by reducing it. By default it avoids a full network shuffle: existing partitions are merged rather than redistributed, which minimizes data movement but can leave partitions of unequal size.

Let us take an example to understand coalesce.

We will create a local data set and look at the effect of coalesce on the number of partitions.



Here, we read a data set and keep the data in 5 different partitions. We save the data to a directory to keep a record of how it has been divided among the partitions.


Five files have been created, one per partition. Each file holds one part of the distributed data set.


Once we apply coalesce to the RDD, the data is consolidated into the specified number of partitions.


Here, we have only two partitions, as specified in our code. We can also check how the data set is divided between these partitions.
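To make the walkthrough above concrete, here is a minimal sketch in plain Python. It does not use Spark; it is a simplified model, and the helper names `split_into_partitions` and `coalesce` are our own for illustration. In Spark itself the corresponding calls would be `sc.parallelize(data, 5)` and `rdd.coalesce(2)`, and `rdd.glom().collect()` shows the contents of each partition. The model merges whole neighbouring partitions, which is the no-shuffle idea described earlier, and it assumes no data-locality preferences.

```python
from typing import List


def split_into_partitions(data: List[int], num_partitions: int) -> List[List[int]]:
    """Cut the data into num_partitions contiguous slices (hypothetical helper)."""
    n = len(data)
    return [
        data[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


def coalesce(partitions: List[List[int]], num_partitions: int) -> List[List[int]]:
    """Merge neighbouring partitions into at most num_partitions partitions.

    Whole partitions are concatenated, so no record is redistributed
    individually. This is a simplified model, not Spark's implementation.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    n = len(partitions)
    target = min(num_partitions, n)  # the count is never increased
    merged = []
    for i in range(target):
        start = i * n // target
        end = (i + 1) * n // target
        merged.append([rec for part in partitions[start:end] for rec in part])
    return merged


# Ten records kept in 5 partitions, as in the example above.
five = split_into_partitions(list(range(1, 11)), 5)
print(five)      # [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]

# Coalesce down to 2 partitions.
two = coalesce(five, 2)
print(len(two))  # 2
print(two)       # [[1, 2, 3, 4], [5, 6, 7, 8, 9, 10]]
```

Note that the two resulting partitions hold 4 and 6 records: merging whole partitions avoids moving individual records, at the cost of unequal sizes.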


We can check how many partitions have been created via the Spark UI.



When to use coalesce?

Once we start reading data with Spark, the data is loaded into different partitions. If a later transformation (a filter, for example) reduces the total amount of data, some partitions will be left holding very little data. Coalesce reduces the number of partitions and consolidates the remaining data into this smaller number of partitions.

One thing to note about coalesce is that, by default, it will either reduce the number of partitions or keep it as it is. It will not increase the number of partitions unless a shuffle is explicitly enabled on the RDD version.
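The two points above can be sketched in plain Python as well. Again this is only a simplified model and not Spark code: the `coalesce` helper is our own illustration, and in Spark the equivalent steps would be `rdd.filter(...)` followed by `rdd.coalesce(n)`. The sketch shows a filter leaving every partition nearly empty, coalesce consolidating what is left, and a request for more partitions than exist leaving the partitioning unchanged.

```python
def coalesce(partitions, num_partitions):
    """Merge neighbouring partitions; never return more partitions than given.

    Simplified model for illustration, not Spark's implementation.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    n = len(partitions)
    target = min(num_partitions, n)
    return [
        [rec for part in partitions[i * n // target:(i + 1) * n // target]
         for rec in part]
        for i in range(target)
    ]


# 20 records loaded into 4 partitions of 5 records each.
loaded = [list(range(start, start + 5)) for start in range(1, 21, 5)]

# A filter that keeps only multiples of 5 leaves one record per partition.
filtered = [[x for x in part if x % 5 == 0] for part in loaded]
print(filtered)        # [[5], [10], [15], [20]]

# Consolidate the sparse partitions into 2.
compacted = coalesce(filtered, 2)
print(compacted)       # [[5, 10], [15, 20]]

# Asking for more partitions than exist keeps the count as it is.
unchanged = coalesce(filtered, 10)
print(len(unchanged))  # 4
```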
