Whenever an RDD is created, it is usually followed by a number of queries fired against it; in other words, a number of actions take place on that RDD.
What can be done in such a scenario?
Spark can save the partial result and reuse it a number of times. This avoids the extra work of recomputing the RDD again and again.
By default, Apache Spark does not persist data; we need to instruct Spark that we want persistence. In fact, persisting is one of the optimization techniques.
When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x).
How to Persist the RDD?
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Count").setMaster("local")
sc = SparkContext(conf=conf)

# Create an RDD from a text file and mark it for persistence.
rdd_create = sc.textFile('test.txt')
rdd_create.persist()  # note the parentheses: persist is a method call

# Both actions below reuse the persisted partitions instead of re-reading the file.
rdd_count = rdd_create.first()
rdd_count2 = rdd_create.count()
print('The first line is', rdd_count)
print('The count of lines is', rdd_count2)
Difference between Persist and Cache:-
cache() can use only the default storage level (MEMORY_ONLY for RDDs), while persist() lets us choose among different storage levels such as MEMORY_AND_DISK or DISK_ONLY, as the sketches below show.
Persist:-
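A minimal sketch of persist() with an explicit storage level, reusing the post's example file 'test.txt' (the choice of MEMORY_AND_DISK here is just an illustration):

from pyspark import StorageLevel

rdd_create = sc.textFile('test.txt')
# Keep partitions in memory and spill them to disk if they do not fit.
rdd_create.persist(StorageLevel.MEMORY_AND_DISK)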
Caching:-
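A matching sketch with cache(), which for RDDs is simply shorthand for persist() with the default MEMORY_ONLY level:

rdd_create = sc.textFile('test.txt')
# Equivalent to rdd_create.persist(StorageLevel.MEMORY_ONLY).
rdd_create.cache()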
We will learn more about memory management in a different post.
Further Reading:- https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html
Code Download:- https://github.com/sangam92/Spark_tutorials