Wednesday, February 28, 2018

RDD In Apache Spark

RDD is considered the core and heart of Spark. Any program that runs in Spark works with RDDs: an RDD can be created, transformed, or used in an action, and Spark automatically distributes the data set held in an RDD across the cluster.

An RDD is an immutable, distributed collection of objects; it is fault tolerant and can be operated on in parallel.
An RDD can be created in two different ways:-

1.) Parallelizing an existing collection, such as a list or a set, in the driver program.
2.) Referencing an external dataset, such as a text file or CSV file in HDFS (a short sketch of both follows below).
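
Here is a minimal sketch of both approaches in Scala (assuming a local Spark setup; the application name, master URL and file path are only illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Set up a SparkContext (local mode, illustrative settings)
val conf = new SparkConf().setAppName("RddCreation").setMaster("local[*]")
val sc = new SparkContext(conf)

// 1.) Parallelize an existing collection from the driver program
val numbers = sc.parallelize(List(1, 2, 3, 4, 5))

// 2.) Reference an external dataset, e.g. a text file
//     ("data.txt" is a hypothetical path)
val lines = sc.textFile("data.txt")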

Once an RDD is created, two kinds of operations can be applied to it:-

1.) Transformation:- generates one RDD from another RDD; filtering is a common example.
2.) Action:- computes a result on the basis of an RDD and returns it to the driver, such as count or first (a short sketch follows below).
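
A short continuation of the earlier sketch, reusing sc and lines (the "ERROR" filter is just an example condition):

// Transformation: filter builds a new RDD; nothing is computed yet
val errors = lines.filter(line => line.contains("ERROR"))

// Actions: trigger computation and return results to the driver
val numErrors = errors.count()   // number of matching lines
val firstError = errors.first()  // first matching line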

Lazy Evaluation in Spark:-

Spark works on the concept of lazy evaluation. This means that a transformed RDD is not actually computed until Spark encounters an action. Initially this seems odd, but when handling Big Data the concept is quite useful. All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently.
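
A small illustration of how far this laziness goes (the file name here is hypothetical):

// Because transformations are lazy, this line succeeds even if
// "missing.txt" does not exist; Spark only records the lineage
val lengths = sc.textFile("missing.txt").map(_.length)

// The read (or the missing-file error) only happens at an action
// lengths.count()   // uncommenting this triggers the computation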
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
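
For example, continuing with the errors RDD from the earlier sketch (the chosen storage level is just one possible option):

import org.apache.spark.storage.StorageLevel

// Keep the filtered RDD around so later actions can reuse it
// (cache() is shorthand for persist(StorageLevel.MEMORY_ONLY))
errors.persist(StorageLevel.MEMORY_AND_DISK)

errors.count()   // first action: computes the RDD and stores it
errors.first()   // second action: served from the persisted data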
