Saturday, June 23, 2018

HDFS-Write Architecture

Let us understand the HDFS write architecture with an example. Suppose we have a file called file.txt of size 150 MB to be stored in HDFS, and the default block size is 128 MB. The file is therefore split into two blocks: Block A of size 128 MB and Block B of size 22 MB.
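The block split described above can be sketched as follows (a minimal simplification working in whole megabytes; real HDFS works in bytes):

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes of the blocks a file is split into."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        # Each block is at most block_size_mb; the last one holds the remainder.
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

print(split_into_blocks(150))  # [128, 22] -> Block A and Block B
```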


Below are the steps that are followed while writing data into HDFS:-

⦁    The client machine will communicate with the NameNode with a write request for Block A and Block B.
⦁    On the basis of availability, replication factor and rack awareness, the NameNode will provide the IP addresses of the DataNodes where the blocks need to be copied.
⦁    Suppose that the replication factor is set to 3; then the client will receive 6 IP addresses: 3 for Block A and 3 for Block B.
⦁    We should remember that since the replication factor is set to 3, Block A and Block B must each be copied to three DataNodes.
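The count of addresses the client receives can be sketched as a simple calculation (a minimal illustration, assuming one address per replica per block):

```python
def addresses_for_blocks(num_blocks, replication_factor=3):
    """Total DataNode addresses the NameNode hands back to the client."""
    return num_blocks * replication_factor

# Two blocks (A and B) with replication factor 3 -> 6 addresses.
print(addresses_for_blocks(2))  # 6
```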

⦁    The complete data-copying process is divided into three parts:-

  •    Setting up the Pipeline.
  •    Streaming the data
  •    Shutting down the Pipeline

Setting up the Pipeline:-

Let us consider that we have DataNodes 1, 3 and 4 for Block A and DataNodes 2, 5 and 7 for Block B. Before streaming the data, the client will confirm whether the DataNodes are ready to receive the data or not. To achieve this, the client will perform the following steps:-
⦁    For Block A, the client will first form a TCP/IP connection with DataNode 1.
⦁    The client will inform DataNode 1 that it is going to receive the data and provide the IP addresses of the next two DataNodes, which are 3 and 4 in this case.
⦁    DataNode 1 will connect to DataNode 3, inform it that it will receive the data, and provide the IP address of the next DataNode, which is 4 in this case.
⦁    DataNode 3 will connect to DataNode 4 and inform it that it is going to receive the data.
⦁    The acknowledgement will work in the reverse order, i.e. from DataNode 4 to 3 and then to 1.
⦁    After this, DataNode 1 will inform the client that it is ready to receive the data, and the pipeline setup is complete.
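The setup steps above can be sketched as a small simulation (a minimal sketch, assuming DataNodes 1, 3 and 4 for Block A as in the example):

```python
def setup_pipeline(datanodes):
    """Simulate the forward setup requests and reverse acknowledgements."""
    setup_order = []
    # Each node is told it will receive data and is given the remaining
    # downstream nodes it must forward the request to.
    for i, node in enumerate(datanodes):
        downstream = datanodes[i + 1:]
        setup_order.append((node, downstream))
    # Acknowledgements flow back from the last node to the first.
    ack_order = list(reversed(datanodes))
    return setup_order, ack_order

setup, acks = setup_pipeline([1, 3, 4])
print(setup)  # [(1, [3, 4]), (3, [4]), (4, [])]
print(acks)   # [4, 3, 1]
```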

Streaming the data:-

Once the pipeline is set up, the client will start copying the data into the DataNodes. It will first copy the data into DataNode 1. After this, the replication is done sequentially: DataNode 1 will copy the data to DataNode 3, and later DataNode 3 will replicate the data to DataNode 4.
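This sequential hop-by-hop replication can be sketched as follows (a simplified illustration in which each hop copies the whole block in one step; real HDFS streams packets along the pipeline):

```python
def stream_block(block_data, pipeline):
    """The client writes to the first node; each node copies to the next."""
    replicas = {}
    for node in pipeline:
        replicas[node] = block_data  # node stores the block, then forwards it
    return replicas

replicas = stream_block("block-A-data", [1, 3, 4])
print(sorted(replicas))  # [1, 3, 4] -> all three DataNodes hold the block
```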

Shutting down the Pipeline:- 

Once the block has been copied to all three DataNodes, a series of acknowledgements takes place to assure the client and the NameNode that the data has been written successfully. Then the client finally closes the pipeline to end the TCP session.
⦁    The acknowledgement happens in the reverse sequence, i.e. from DataNode 4 to 3 and then to 1.
⦁    Finally, DataNode 1 will push three acknowledgements (including its own) into the pipeline and send them to the client.
⦁    The client will inform the NameNode that the data has been written successfully.
⦁    The NameNode will update its metadata and the client will shut down the pipeline.
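The shutdown acknowledgements above can be sketched as follows (a minimal sketch; the `ack-from-*` labels are made up for illustration):

```python
def collect_acks(pipeline):
    """Each DataNode, in reverse order, adds its acknowledgement; DataNode 1
    then forwards the full list to the client."""
    acks = []
    for node in reversed(pipeline):  # 4 -> 3 -> 1
        acks.append(f"ack-from-{node}")
    return acks

print(collect_acks([1, 3, 4]))  # ['ack-from-4', 'ack-from-3', 'ack-from-1']
```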
We should note that the writing of Block B happens simultaneously, following the same steps as Block A.
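The simultaneous writes of Blocks A and B can be sketched with two concurrent tasks (a minimal sketch; `write_block` here simply stands in for the full setup/streaming/shutdown sequence described above):

```python
from concurrent.futures import ThreadPoolExecutor

def write_block(name, pipeline):
    # Placeholder for pipeline setup, streaming and shutdown for one block.
    return (name, list(pipeline))

# Block A goes to DataNodes 1, 3, 4 and Block B to 2, 5, 7 in parallel.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda args: write_block(*args),
                            [("Block A", [1, 3, 4]), ("Block B", [2, 5, 7])]))
print(results)
```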





Further Reading :- http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
  



