Coalesce is an RDD method (it has also been available on DataFrames since Spark 1.4). Its job is to change the number of partitions, and in practice it is used to reduce that number. By default it avoids a network shuffle by merging existing partitions, which minimizes data movement but can leave partitions of unequal size.
Let us take an example to understand coalesce. We will create a local data set and observe the effect of coalesce on the number of partitions.
Here, we read a data set and keep the data in 5 different partitions. We save the data to a directory so that we have a record of how it has been divided across the partitions.
Five files have been created, one per partition. Together these files hold the distributed data set.
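The original code for this step is not shown, so the following is only a plain-Python sketch of the idea, not Spark code. It assumes an even, contiguous split of a stand-in data set (the numbers 1 to 20) into 5 partitions and names one "part file" per partition in the `part-00000` style that Spark uses for its output files. The helper `split_into_partitions` is hypothetical. In PySpark the corresponding calls would be along the lines of `sc.parallelize(data, 5)` followed by `rdd.saveAsTextFile(path)`.

```python
def split_into_partitions(records, num_partitions):
    """Split records into contiguous, near-equal slices (hypothetical helper)."""
    n = len(records)
    return [
        records[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]


records = list(range(1, 21))  # stand-in data set, an assumption for illustration
partitions = split_into_partitions(records, 5)

# One "part file" per partition, named the way Spark names its output files.
part_files = {f"part-{i:05d}": part for i, part in enumerate(partitions)}

for name, rows in part_files.items():
    print(name, rows)
```

Each of the five entries holds one slice of the data, which mirrors the five files seen in the output directory.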
Once we apply coalesce to the RDD, the data is consolidated into the specified number of partitions.
Here, we have only two partitions, as specified in our code. We can also check how the data set is divided between these partitions.
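To make the 5-to-2 step concrete, here is a simplified plain-Python model of a shuffle-free coalesce. It is a sketch under assumptions, not Spark's implementation: the hypothetical `coalesce` function only concatenates whole neighbouring partitions and ignores data locality, which the real coalescer takes into account. In PySpark the equivalent checks would be `rdd.coalesce(2).getNumPartitions()` for the count and `rdd.coalesce(2).glom().collect()` for the contents of each partition.

```python
def coalesce(partitions, num_partitions):
    """Merge whole neighbouring partitions into at most num_partitions groups.

    Hypothetical model of a shuffle-free coalesce: partitions are only
    concatenated, never split, so no record is redistributed individually.
    """
    current = len(partitions)
    target = min(num_partitions, current)  # the count never increases
    merged = []
    for i in range(target):
        start = (i * current) // target
        end = ((i + 1) * current) // target
        merged.append([row for part in partitions[start:end] for row in part])
    return merged


# Five partitions of four records each (illustrative data).
five = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
    [17, 18, 19, 20],
]
two = coalesce(five, 2)

print(len(two))               # number of partitions after coalescing
print([len(p) for p in two])  # records per partition
```

Because five partitions cannot be divided evenly into two groups, one group receives two of the original partitions and the other receives three, which is why the resulting partitions differ in size.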
We can check how many partitions have been created via the Spark UI.
When to use coalesce?
Once we start reading data with Spark, the data is loaded into different partitions. After a transformation that reduces the total amount of data, such as a filter, some partitions will hold very little data. Coalesce reduces the number of partitions and consolidates the data into that smaller number of partitions.
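The situation described above can be sketched in plain Python. This is an illustration only: the data, the filter condition, and the `ROWS_PER_PARTITION` threshold used to pick a target partition count are all assumptions, and the merge step is a simplified stand-in for what `coalesce` does.

```python
import math

# Five partitions of 100 rows each (illustrative data).
partitions = [list(range(i * 100, (i + 1) * 100)) for i in range(5)]

# A selective transformation: keep only multiples of 50.
filtered = [[x for x in part if x % 50 == 0] for part in partitions]
sizes_before = [len(p) for p in filtered]
print(sizes_before)  # every partition is now nearly empty

# Hypothetical sizing rule: aim for about this many rows per partition.
ROWS_PER_PARTITION = 5
total_rows = sum(sizes_before)
target = max(1, math.ceil(total_rows / ROWS_PER_PARTITION))
target = min(target, len(filtered))  # never more partitions than we already have

# Consolidate whole partitions into the smaller number of groups.
consolidated = [[] for _ in range(target)]
for index, part in enumerate(filtered):
    consolidated[index * target // len(filtered)].extend(part)

print([len(p) for p in consolidated])
```

After the filter, five partitions hold only a handful of rows each, so keeping all five mostly adds scheduling overhead. Merging them into fewer partitions keeps the same rows with less per-partition cost.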
One thing to note about coalesce is that, without a shuffle, it either reduces the number of partitions or keeps it as it is; it cannot increase it.
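This rule can be written down as a small function. The name `partitions_after_coalesce` is hypothetical and the function only models the resulting partition count. The `shuffle` flag mirrors the optional `shuffle` argument of `RDD.coalesce`; with a shuffle enabled, the requested count is honoured even when it is larger, which is what `repartition` relies on.

```python
def partitions_after_coalesce(current, requested, shuffle=False):
    """Model the partition count that results from coalescing (hypothetical helper)."""
    if shuffle:
        # With a shuffle, the requested count is used as given.
        return requested
    # Without a shuffle, the count can shrink or stay the same, never grow.
    return min(current, requested)


print(partitions_after_coalesce(5, 2))                 # fewer partitions requested
print(partitions_after_coalesce(5, 10))                # more requested, no shuffle
print(partitions_after_coalesce(5, 10, shuffle=True))  # more requested, with shuffle
```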