Saturday, May 30, 2020

Hadoop - Cluster Setup and RAID

A Hadoop cluster mainly resides in steel frames that resemble modern-day cupboards. The difference between these racks and a cupboard is that a rack does not have metal enclosures on all sides like a traditional metal cupboard. Instead, it is like an open cupboard, with holes along its pillar-like structures so that the data nodes, which are essentially chassis-only machines, can be bolted on.

A rack typically holds anywhere between 25 and 50 nodes, and it is not only data nodes that these racks accommodate. They can also house the name node, the secondary name node, and the job tracker.



It is simply an open frame onto which these chassis can be bolted. All of the machines within a rack communicate through a switch, usually at around 10 gigabits per second, while multiple racks communicate through a multi-layer switch or an uplink switch, which also performs the function of a router. As a result, the data transfer speed between machines in the same rack is much higher than the data transfer speed between machines across different racks.
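Hadoop can be made aware of this rack layout so that it favours the cheaper intra-rack transfers when placing block replicas. A minimal sketch, assuming the classic Hadoop 1.x rack-awareness property (topology.script.file.name is the real key; the script path is a hypothetical example):

import org.apache.hadoop.conf.Configuration;

public class RackTopology {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Points Hadoop at an admin-supplied script that maps a node's
        // address to a rack ID, letting the framework favour intra-rack
        // data transfers over slower inter-rack ones.
        conf.set("topology.script.file.name", "/etc/hadoop/rack-topology.sh");
        System.out.println(conf.get("topology.script.file.name"));
    }
}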

Most data nodes have two hex-core (6-core) or two octa-core (8-core) processors, i.e. two CPUs with up to 8 cores each, running at anywhere between 2.4 and 3.5 GHz. The exact processing speed and the amount of RAM required typically vary according to organizational needs. Usually the name node and the job tracker carry more RAM than the data nodes; a data node typically has anywhere between 50 and 500 GB of high-speed RAM, with a frequency above 2000 MHz. For storage, the rule of thumb for sizing each data node's hard drives is this: for every single core of CPU, you would require at least 2 terabytes of hard drive.
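As a quick illustration of that sizing rule, here is a minimal sketch in Java (the 2 TB-per-core figure and the core counts come from the text above; the class and method names are hypothetical):

// Storage rule of thumb described above: at least 2 TB of hard drive
// for every CPU core on a data node.
public class DataNodeSizing {
    static final int TB_PER_CORE = 2; // rule of thumb from the text

    static int minStorageTb(int cpus, int coresPerCpu) {
        return cpus * coresPerCpu * TB_PER_CORE;
    }

    public static void main(String[] args) {
        // Two octa-core CPUs => 16 cores => at least 32 TB of disk.
        System.out.println(minStorageTb(2, 8) + " TB");
        // Two hex-core CPUs => 12 cores => at least 24 TB of disk.
        System.out.println(minStorageTb(2, 6) + " TB");
    }
}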


The question is: why don't we use RAID (Redundant Array of Inexpensive Disks), which has been around for more than a decade? Why can't it be used in place of a distributed storage system like HDFS in Hadoop?

The reason is that HDFS clusters do not benefit from using RAID on the data nodes, because the data nodes already handle replication across several nodes as a built-in feature of HDFS. RAID can still be used on the name node and secondary name node disks, simply as a safeguard against failover or corruption of the metadata. Moreover, if a disk fails in a Hadoop cluster, HDFS can continue to operate without the failed disk, which is not the case with RAID. For these reasons, RAID is not the preferred choice in HDFS.
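To see that this replication really is built-in configuration rather than extra hardware, here is a minimal sketch using the standard Hadoop client API (dfs.replication is the real HDFS property key; 3 is its customary default, and the class name is hypothetical):

import org.apache.hadoop.conf.Configuration;

public class ReplicationDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // dfs.replication controls how many data nodes keep a copy of
        // each HDFS block; this redundancy is what RAID would otherwise
        // have provided at the disk level.
        conf.set("dfs.replication", "3");
        System.out.println("Replication factor: " + conf.get("dfs.replication"));
    }
}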

Saturday, May 23, 2020

Hadoop - Master-Slave Architecture


Suppose we have a Hadoop cluster of 1000 nodes. Out of these, 3 nodes would predominantly operate in master mode: the name node, the secondary name node, and the job tracker. The remaining 1000 minus 3, i.e. 997 machines, work in slave mode, and they are the data nodes. In short, the machines working in slave mode are the data nodes, and the machines working in master mode are the name node, the secondary name node, and the job tracker.


However, they vary slightly in terms of their hardware configurations. For example, the name node, the secondary name node, and the job tracker need not have very high hard drive storage, whereas the data nodes must, because they are the ones that bear the entire data load: all of your big data storage, and the bulk of the data operations, happens on these data nodes.

It is also important to understand that a host, or native, operating system is installed on all of these machines, both on the machines working as data nodes in slave mode and on the machines in master mode. Ninety percent of the time this native operating system is Linux-based, though in some cases a Windows-based operating system is installed on these machines.


It is on this native operating system that Hadoop, as a software framework, is installed. Note that Hadoop is installed on all of these machines: on the name node, the secondary name node, the job tracker, and each one of the data nodes. The differentiating factor between the machines in master mode and the machines in slave mode is the software configuration applied after installing Hadoop, which enables each machine to perform the responsibilities of a name node, a secondary name node, or a job tracker.
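As an illustration of that point, the master roles are designated purely through configuration. A minimal sketch, assuming the classic Hadoop 1.x property names (fs.default.name and mapred.job.tracker are the real keys; the hostnames are hypothetical):

import org.apache.hadoop.conf.Configuration;

public class MasterRoles {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // core-site.xml: URI of the name node, the HDFS master.
        conf.set("fs.default.name", "hdfs://namenode.example.com:8020");
        // mapred-site.xml: address of the job tracker, the MapReduce master.
        conf.set("mapred.job.tracker", "jobtracker.example.com:8021");
        System.out.println("Name node:   " + conf.get("fs.default.name"));
        System.out.println("Job tracker: " + conf.get("mapred.job.tracker"));
    }
}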

Hadoop - What is a Job in Hadoop?

In the field of computer science, a job simply means a piece of program, and the same rule applies to the Hadoop ecosystem as wel...