Whenever an RDD is created, it is usually followed by a number of queries fired against it; in other words, a number of actions take place on that RDD.
What can be done in such a scenario?
Spark can save the partial result and reuse it a number of times. This avoids the extra work of recomputing the RDD again and again.
By default, Apache Spark does not persist data; we need to instruct Spark that we want persistence. In fact, persisting is one of the optimization techniques.
When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x).
How to Persist the RDD?
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Count").setMaster("local")
sc = SparkContext(conf=conf)

# Create an RDD from a text file and mark it for persistence.
rdd_create = sc.textFile('test.txt')
rdd_create.persist()  # note the parentheses: persist is a method call

# Both actions below reuse the persisted partitions instead of re-reading the file.
rdd_count = rdd_create.first()
rdd_count2 = rdd_create.count()
print('The first line is', rdd_count)
print('The count of lines is', rdd_count2)
Difference between Persist and Cache:-
cache() can use only the default storage level (MEMORY_ONLY for RDDs), while persist() lets us choose among different storage levels such as MEMORY_AND_DISK or DISK_ONLY, as the sketches below show.
Persist:-
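A minimal sketch of persist() with an explicit storage level, reusing the post's example file 'test.txt' (the choice of MEMORY_AND_DISK here is just an illustration):

from pyspark import StorageLevel

rdd_create = sc.textFile('test.txt')
# Keep partitions in memory and spill them to disk if they do not fit.
rdd_create.persist(StorageLevel.MEMORY_AND_DISK)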
Caching:-
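A matching sketch with cache(), which for RDDs is simply shorthand for persist() with the default MEMORY_ONLY level:

rdd_create = sc.textFile('test.txt')
# Equivalent to rdd_create.persist(StorageLevel.MEMORY_ONLY).
rdd_create.cache()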
We will learn more about memory management in a different post.
Further Reading:- https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html
Code Download:- https://github.com/sangam92/Spark_tutorials