Tuesday, June 26, 2018

HDFS - Replication Factor

Replication factor is a property in the Hadoop ecosystem that specifies the number of replicas of each block kept in the cluster. Data in HDFS is stored on commodity hardware, so to maintain high data availability we keep replicas of the data. The default replication factor in HDFS is 3.

When the number of replicas of a block is below the replication factor, the block is said to be under-replicated.
                             Replicas < Replication Factor

Example :- Let us assume that we have 10 TB of storage and 4 TB of data, with the default replication factor of 3. Keeping three replicas of 4 TB would need 12 TB of raw storage, which is more than the 10 TB available, so not every block can have three replicas. This is a scenario where we have under-replicated blocks.
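
A quick back-of-the-envelope check of the numbers above (just a sketch in Python, not tied to any real cluster):

data_tb = 4             # logical data size in TB
replication_factor = 3  # default dfs.replication
capacity_tb = 10        # total cluster capacity in TB

required_tb = data_tb * replication_factor    # 12 TB of raw storage needed
print(required_tb > capacity_tb)              # True -> some blocks stay under-replicated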

When the number of replicas of a block is more than the replication factor, the block is said to be over-replicated.
                            Replicas > Replication Factor

Over-replicated blocks normally occur when crashed DataNodes come back online; the NameNode then removes the extra replicas.
We can check file details such as the replication factor, corrupt files, under-replicated blocks and more with the hdfs fsck / command. The / here denotes the root directory.
The replication factor is configured in the file hdfs-site.xml, which is available inside the Hadoop installation directory:

<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Block Replication</description>
</property>


We can also set the replication factor on a per-file or per-directory basis with the command below:-
hadoop fs -setrep -w 2  /dir/
Here -w 2 sets the replication factor to 2 and waits for the replication to complete; when the path is a directory, the new factor is applied recursively to the files under it.


 



HDFS - Read Architecture

In the last blog post, we went through the HDFS write architecture. The read architecture is quite simple and easy to understand.
Suppose we have a file "file.txt" that needs to be read from HDFS. The following steps take place while reading the data:

  • The client sends a read request to the NameNode, which holds the metadata for the file "file.txt".
  • The NameNode replies to the client with the IP addresses of the DataNodes that hold the blocks of "file.txt".
  • The client connects to those DataNodes and starts retrieving the data.
  • Once the client has received the required data, the connection is closed.
  • If the data comes from multiple blocks, the client combines the blocks to form the file.
While serving a read request, HDFS selects the replica closest to the client, which reduces read latency and bandwidth consumption.
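
From the client side, a read is just a normal file access. Below is a minimal PySpark sketch of a client read that exercises exactly this path (the path hdfs:///file.txt is hypothetical and a running cluster is assumed):

from pyspark import SparkContext

sc = SparkContext(appName="hdfs-read-sketch")

lines = sc.textFile("hdfs:///file.txt")
# textFile() is lazy; when take() runs, Spark asks the NameNode for the block
# locations and streams the data from the DataNodes, as in the steps above.
print(lines.take(5))

sc.stop()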

Saturday, June 23, 2018

HDFS - Write Architecture

Let us understand the HDFS write architecture with an example. Suppose we have a file called file.txt of size 150 MB to be stored in HDFS, and the default block size is 128 MB. The file is therefore split into two blocks: Block A of 128 MB and Block B of 22 MB.
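
A quick sketch in Python of that block split, using the numbers above:

file_size_mb = 150
block_size_mb = 128   # default HDFS block size

full_blocks, remainder = divmod(file_size_mb, block_size_mb)
blocks = [block_size_mb] * full_blocks + ([remainder] if remainder else [])
print(blocks)   # [128, 22] -> Block A and Block B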


Below are the steps followed while writing data into HDFS:-

  • The client machine contacts the NameNode with a write request for Block A and Block B.
  • Based on availability, the replication factor and rack awareness, the NameNode provides the IP addresses of the DataNodes where the blocks need to be copied.
  • With the replication factor set to 3, the client receives 6 IP addresses: 3 for Block A and 3 for Block B.
  • Since the replication factor is 3, Block A and Block B must each be copied to three DataNodes.

  • The complete data copying process is divided into three parts:-

  •    Setting up the Pipeline
  •    Streaming the data
  •    Shutting down the Pipeline

Setting up the Pipeline :-

Let us consider that we have DataNodes 1, 4 and 6 for Block A and DataNodes 2, 5 and 7 for Block B. Before streaming the data, the client confirms whether the DataNodes are ready to receive it. To achieve this, the client performs the following steps :-
  • For Block A, the client first opens a TCP/IP connection to DataNode 1.
  • The client informs DataNode 1 that it is going to receive the data and provides the IP addresses of the next two DataNodes, 4 and 6 in this case.
  • DataNode 1 connects to DataNode 4, informs it that it will receive the data, and provides the IP address of the next DataNode, which is 6 in this case.
  • DataNode 4 connects to DataNode 6 and informs it that it is going to receive the data.
  • The acknowledgement travels in the reverse order, i.e. from DataNode 6 to 4 and then to 1.
  • Finally, DataNode 1 informs the client that it is ready to receive the data, and the pipeline setup is complete.

Streaming the data:-

Once the pipeline is set up, the client starts copying the data to the DataNodes. It first copies the block to DataNode 1; after that the replication happens sequentially: DataNode 1 copies the data to DataNode 4, and DataNode 4 then replicates it to DataNode 6.

Shutting down the Pipeline:- 

Once the block has been copied to all three DataNodes, a series of acknowledgements takes place to assure the client and the NameNode that the data has been written successfully. The client then closes the pipeline to end the TCP session.
  • The acknowledgements happen in the reverse sequence, i.e. from DataNode 6 to 4 and then to 1.
  • Finally, DataNode 1 pushes three acknowledgements (including its own) into the pipeline and sends them to the client.
  • The client informs the NameNode that the data has been written successfully.
  • The NameNode updates its metadata and the client shuts down the pipeline.
We should note that the writing of Block B happens simultaneously and follows the same steps as Block A.
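
For completeness, here is a minimal client-side PySpark sketch (hypothetical output path, running cluster assumed); every block of the output is written through a DataNode pipeline like the one described above:

from pyspark import SparkContext

sc = SparkContext(appName="hdfs-write-sketch")

data = sc.parallelize(range(1000000))
# Each output block goes through a DataNode pipeline whose length equals
# dfs.replication (3 by default).
data.saveAsTextFile("hdfs:///user/demo/write-sketch")

sc.stop()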





Further Reading :- http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
  




Sunday, June 17, 2018

Spark - DAG

What is DAG ?
DAG is the acronym for Directed Acyclic Graph. As per Wikipedia, it is a finite directed graph with no cycles. It consists of vertices and edges, with each edge directed from one vertex to another. A directed acyclic graph has a topological ordering: the vertices can be arranged in a sequence such that every edge goes from an earlier vertex to a later one. A DAG has a unique topological ordering only if it has a directed path containing all the vertices; in that case the ordering is the same as the order in which the vertices appear on the path. A closely related structure in computer science is the wait-for graph used in deadlock detection: a cycle in a wait-for graph indicates a deadlock, so a deadlock-free wait-for graph is a DAG.
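
As a small illustration of topological ordering, here is a sketch using Python's standard-library graphlib (Python 3.9+) on a made-up four-node DAG:

from graphlib import TopologicalSorter

# The dict maps each node to the set of nodes it depends on (its predecessors).
dag = {
    "b": {"a"},
    "c": {"a"},
    "d": {"b", "c"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)   # e.g. ['a', 'b', 'c', 'd'] -- every edge goes from an earlier to a later node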



DAG in Apache Spark :-
The MapReduce model revolves around two predefined stages: Map and Reduce. To overcome this limitation, Spark introduces a DAG with any number of stages, which is one of the attributes that makes Spark faster than a conventional MapReduce job.
Once the job is submitted to the DAG scheduler, the scheduler divides the job into stages. A stage is comprised of tasks, one per partition of the input data.
The stages are later passed to the task scheduler, which launches the tasks via the cluster manager.
How Spark builds the DAG can be summarised in the following steps:
  • The user submits an Apache Spark application.
  • The driver takes charge of the application and examines it to identify the transformations and actions it contains.
  • All of these operations are arranged into a logical flow of operations; that arrangement is the DAG.
  • The DAG is then converted into a physical execution plan, which contains stages.
As discussed above, the driver identifies the transformations and sets the stage boundaries according to the nature of each transformation. There are two types of transformations applied on an RDD:

1. Narrow transformations  2. Wide transformations. Let us discuss each in brief:
Narrow transformations – transformations such as map() and filter() fall into this category; they do not require shuffling data across partitions.
Wide transformations – transformations such as reduceByKey() fall into this category; they require shuffling data across partitions.
Because wide transformations require a data shuffle, they result in stage boundaries, as the sketch below shows.
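
Here is a minimal PySpark word-count sketch (the input path is hypothetical) showing narrow and wide transformations and where the stage boundary falls:

from pyspark import SparkContext

sc = SparkContext(appName="dag-sketch")

lines = sc.textFile("hdfs:///user/demo/words.txt")

# Narrow transformations: no shuffle, so they stay inside a single stage.
words = lines.flatMap(lambda line: line.split())
pairs = words.map(lambda word: (word, 1))

# Wide transformation: reduceByKey() shuffles data across partitions,
# so the DAG scheduler inserts a stage boundary here.
counts = pairs.reduceByKey(lambda a, b: a + b)

print(counts.take(10))
sc.stop()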
Finally, the DAG scheduler builds the physical execution plan, which contains tasks; those tasks are bundled and sent out over the cluster.
We should also note that the DAG holds the complete execution plan (the lineage of every RDD), so in case of a failure Spark can recover the lost data by identifying the RDD where the loss occurred and recomputing it.

Saturday, June 9, 2018

Spark - Pair RDD

Spark has a special kind of RDD called a pair RDD: an RDD of key-value pairs, where each element holds two data items that are associated with each other. It is somewhat similar to a dictionary in Python or a Map in Scala.
       
 
Creating Pair RDD :-

We can easily convert an RDD into a pair RDD using the map function; the resulting RDD contains key-value pairs.
We have created a sample data set and placed it in the HDFS home directory :-

 

Then we invoke the Spark shell using the pyspark command, since we are using Python as the programming language.

Python code :-
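
A minimal sketch of such a snippet (sample.txt is a hypothetical file with one word per line in the HDFS home directory):

# Inside the pyspark shell, the SparkContext is already available as sc.
words = sc.textFile("sample.txt")

# map() turns each word into a (key, value) tuple, giving a pair RDD.
pairs = words.map(lambda word: (word, 1))

pairs.take(5)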




Output :-  The output looks somewhat like this.



Once the pair RDD has been created, we can apply many pair-specific transformations to it, such as reduceByKey() and groupByKey(); see the sketch below.
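
A small follow-up sketch (again inside the pyspark shell) of the two transformations mentioned above, on a tiny in-line pair RDD so it is self-contained:

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

# reduceByKey() merges the values of each key with the given function.
pairs.reduceByKey(lambda x, y: x + y).collect()     # e.g. [('a', 2), ('b', 1)] (order may vary)

# groupByKey() collects all the values of each key into one iterable.
[(k, list(v)) for k, v in pairs.groupByKey().collect()]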






Hadoop - What is a Job in Hadoop?

In the field of computer science, a job just means a piece of a program, and the same rule applies to the Hadoop ecosystem as wel...