Sunday, June 17, 2018

Spark - DAG

What is a DAG?
DAG is the acronym for Directed Acyclic Graph. As per Wikipedia, it is a finite directed graph with no cycles. It consists of vertices and edges, with each edge directed from one vertex to another. A directed acyclic graph has a topological ordering: the vertices can be arranged in a sequence so that every edge goes from an earlier vertex to a later one. A DAG has a unique topological ordering if and only if it has a directed path containing all the vertices; in this case the ordering is the same as the order in which the vertices appear in the path. In computer science, directed graphs are also used as wait-for graphs for deadlock detection: an edge records a process waiting on a resource held by another process, and the system is deadlock-free exactly when that graph is a DAG (a cycle indicates a deadlock).
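To make topological ordering concrete, here is a minimal, self-contained Scala sketch of Kahn's algorithm; the four-vertex graph (a -> b, a -> c, b -> d, c -> d) is a made-up example, not anything Spark-specific:

    import scala.collection.mutable

    object TopoSortExample {
      // Returns one valid topological order of the given edge list (assumed acyclic).
      def topologicalOrder(edges: Seq[(String, String)]): Seq[String] = {
        val vertices = edges.flatMap { case (u, v) => Seq(u, v) }.distinct
        val inDegree = mutable.Map(vertices.map(_ -> 0): _*)
        edges.foreach { case (_, v) => inDegree(v) += 1 }

        // Start from the vertices that nothing points at.
        val queue = mutable.Queue(vertices.filter(inDegree(_) == 0): _*)
        val order = mutable.ArrayBuffer.empty[String]
        while (queue.nonEmpty) {
          val u = queue.dequeue()
          order += u
          // Removing u frees every vertex whose last remaining predecessor was u.
          edges.filter(_._1 == u).foreach { case (_, v) =>
            inDegree(v) -= 1
            if (inDegree(v) == 0) queue.enqueue(v)
          }
        }
        order.toSeq
      }

      def main(args: Array[String]): Unit = {
        val edges = Seq("a" -> "b", "a" -> "c", "b" -> "d", "c" -> "d")
        println(topologicalOrder(edges)) // e.g. List(a, b, c, d) or List(a, c, b, d)
      }
    }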



DAG in Apache Spark :-
The MapReduce model revolves around two predefined stages of its DAG: Map and Reduce. To overcome this limitation, Spark introduces a DAG with any number of stages. This attribute makes Spark faster than a conventional MapReduce job, since a multi-step pipeline no longer has to be split into separate jobs that write intermediate results to disk.
Once a job is submitted to the DAG scheduler, the DAG scheduler divides the job into stages. A stage is comprised of tasks based on the partitions of the input data.
The stages are then passed to the task scheduler, which launches the tasks via the cluster manager.
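As a minimal sketch of that partition-to-task relationship (the local master URL and the sample data are illustrative assumptions, not part of a real deployment):

    import org.apache.spark.{SparkConf, SparkContext}

    object StageTasksDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("stage-tasks").setMaster("local[*]"))

        // 4 partitions -> each stage of this job runs 4 tasks, one per partition.
        val rdd = sc.parallelize(1 to 100, numSlices = 4)
        println(rdd.getNumPartitions) // 4
        println(rdd.map(_ * 2).sum()) // the action triggers the job; prints 10100.0

        sc.stop()
      }
    }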
The process by which Spark creates a DAG involves the following steps: the user submits an Apache Spark application; the driver takes over the application and inspects it, identifying which transformations and actions are present; all the operations are then arranged into a logical flow of operations, and that arrangement is the DAG; finally, the DAG is converted into a physical execution plan, which contains stages. A sketch of this lineage is shown below.
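As a rough illustration, in spark-shell (where sc is predefined), toDebugString prints the lineage the driver has assembled; the input path data.txt is a hypothetical example:

    // Transformations only declare the logical plan; nothing runs yet.
    val lines  = sc.textFile("data.txt")
    val words  = lines.flatMap(_.split(" "))
    val pairs  = words.map(word => (word, 1))
    val counts = pairs.reduceByKey(_ + _)

    println(counts.toDebugString) // prints the lineage/DAG; a ShuffledRDD marks a stage boundary
    counts.collect()              // the action makes Spark turn the DAG into stages and run them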
As discussed earlier, the driver identifies transformations. It also sets stage boundaries according to the nature of each transformation. There are two types of transformations applied on an RDD:

1. Narrow transformations 2. Wide transformations. Let’s discuss each in brief:
Narrow transformations – Transformations like map() and filter() come under narrow transformations. They do not require shuffling data across partitions.
Wide transformations – Transformations like reduceByKey() come under wide transformations. They require shuffling data across partitions.
Because a wide transformation requires shuffling data, it results in a stage boundary. The contrast is sketched below.
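A minimal sketch contrasting the two, again in spark-shell (the sample data is illustrative):

    val letters = sc.parallelize(Seq("a", "b", "a", "c"), 2)

    // Narrow: each output partition depends on a single input partition, so no shuffle.
    val upper    = letters.map(_.toUpperCase)
    val filtered = upper.filter(_ != "C")

    // Wide: reduceByKey must group equal keys together, which shuffles data
    // across partitions and therefore starts a new stage.
    val counts = filtered.map(letter => (letter, 1)).reduceByKey(_ + _)
    counts.collect().foreach(println) // (A,2) and (B,1), in either order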
Finally, the DAG scheduler produces a physical execution plan, which contains tasks. Those tasks are then bundled together and sent over to the cluster.
We should also note that, in case of any issue, the DAG contains the complete execution plan. We can recover from data loss by identifying the RDD where the loss occurred and recomputing it from its lineage.
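For example, toDebugString exposes the lineage Spark would replay to rebuild a lost partition, and checkpointing can shorten that replay; the checkpoint directory below is a hypothetical path (spark-shell again):

    sc.setCheckpointDir("/tmp/spark-checkpoints")

    val base    = sc.parallelize(1 to 1000000, 8)
    val derived = base.map(_ * 2).filter(_ % 3 == 0)

    derived.checkpoint()           // ask Spark to persist this RDD's data reliably
    derived.count()                // the action materializes and checkpoints the RDD
    println(derived.toDebugString) // the lineage is now truncated at the checkpoint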
