Saturday, May 30, 2020

Hadoop - Cluster Set Up and RAID

The Hadoop cluster mainly resides in a steel frame, called a rack, that resembles a modern-day cupboard. The only difference between a rack and a cupboard is that the rack does not have a metal enclosure on all sides like a traditional metal cupboard. Instead, racks are like open cupboards, with holes in the pillar-like uprights so that you can bolt on the data nodes, which here are basically chassis-only machines.

A rack typically holds anywhere between 25 and 50 nodes, and racks do not accommodate only data nodes. They can also house the name node, the secondary name node, or your job tracker.



A rack is just an open-frame structure onto which these chassis are bolted. All the machines within a rack communicate through a switch, usually at around 10 gigabits per second, and multiple racks communicate through a multi-layer switch or uplink switch, which also performs the function of a router. Usually the data transfer speed between machines in the same rack is much higher than the data transfer speed between machines in different racks.

Most data nodes have two hexa-core processors (6 cores each) or two octa-core processors (8 cores each), with clock speeds anywhere between 2.4 and 3.5 GHz. The amount of RAM required typically varies according to the organization's needs. Usually your name node and job tracker have more RAM than your data nodes; a data node typically has anywhere between 50 and 500 GB of high-speed RAM, with a memory frequency above 2000 MHz. For storage, the rule of thumb for how much hard drive capacity each data node needs is at least 2 terabytes of hard drive for every CPU core.
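To make the storage rule of thumb concrete, here is a minimal sketch of the arithmetic for the two CPU layouts mentioned above. The function name `min_storage_tb` and its parameters are made up for illustration; only the 2 TB per core figure comes from the rule of thumb itself.

```python
def min_storage_tb(cpus_per_node: int, cores_per_cpu: int, tb_per_core: float = 2.0) -> float:
    """Minimum disk capacity (in TB) for one data node.

    Applies the rule of thumb of at least 2 TB of hard drive per CPU core.
    The function and parameter names are hypothetical, for illustration only.
    """
    total_cores = cpus_per_node * cores_per_cpu
    return total_cores * tb_per_core


# Two hexa-core CPUs: 12 cores in total
print(min_storage_tb(cpus_per_node=2, cores_per_cpu=6))  # 24.0

# Two octa-core CPUs: 16 cores in total
print(min_storage_tb(cpus_per_node=2, cores_per_cpu=8))  # 32.0
```

So a data node with two octa-core processors would need at least 32 TB of hard drive under this rule.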


The question is: why don't we use RAID (Redundant Array of Inexpensive Disks), which has been around for more than a decade? Why can't it be used in place of a distributed storage system such as HDFS in Hadoop?

The reason is that an HDFS cluster does not benefit from using RAID for data nodes, because HDFS already replicates data across several nodes as a built-in feature. RAID can still be used for the name node and secondary name node disks, as a backup against failure or corruption of the metadata. Also, if a disk fails in your Hadoop cluster, HDFS can continue to operate without the failed disk, whereas a striped RAID array (RAID 0) becomes unavailable when a single disk fails. For these reasons, RAID is not the preferred choice for data nodes in HDFS.
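The effect of replication across nodes can be illustrated with a small toy model. This is only a sketch under stated assumptions: the round-robin placement below is a simplification and not the actual HDFS placement policy, the node and block names are made up, and the replication factor of 3 is the usual HDFS default rather than something every cluster uses.

```python
def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified stand-in for HDFS block placement (an assumption for
    illustration, not the real policy).
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed number of nodes")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = {nodes[(i + k) % len(nodes)] for k in range(replication)}
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a healthy node."""
    return {block for block, holders in placement.items() if holders - failed_nodes}


nodes = ["dn1", "dn2", "dn3", "dn4"]      # hypothetical data nodes
blocks = ["b0", "b1", "b2", "b3"]         # hypothetical block ids

# With 3 replicas, losing one node leaves every block readable.
replicated = place_replicas(blocks, nodes, replication=3)
print(readable_blocks(replicated, {"dn1"}) == set(blocks))  # True

# With a single copy, losing the same node loses the block stored on it.
single = place_replicas(blocks, nodes, replication=1)
print(sorted(set(blocks) - readable_blocks(single, {"dn1"})))  # ['b0']
```

Because the redundancy already lives at the cluster level, adding RAID redundancy on each data node would mostly duplicate protection that replication already provides.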
