Friday, August 17, 2018

HDFS vs NFS

We often wonder why we need HDFS when we already have native filesystem concepts. What made us create a new file system? Does the native file system (NFS, as this post calls it) have any drawbacks?

We will try to look at all these things one by one. Let us first take a quick glance at the native filesystem.
A filesystem acts as a "guardian" for your whole data and metadata. It holds all the necessary information about the files and folders kept within it.

1.) The filesystem plays a major role when we read or write files on the hard disk; every such request is controlled by the filesystem.
2.) The filesystem controls permissions and security.
3.) The filesystem holds the metadata about files and folders, such as size, owner, timestamps, and file type (see the sketch just after this list).
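
As a quick illustration of point 3, here is a minimal Java sketch that reads this kind of metadata from the local filesystem; the file name example.txt is just a placeholder:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;

public class FileMetadata {
    public static void main(String[] args) throws Exception {
        Path p = Paths.get("example.txt");          // placeholder file name
        BasicFileAttributes attrs = Files.readAttributes(p, BasicFileAttributes.class);
        System.out.println("size:     " + attrs.size());
        System.out.println("modified: " + attrs.lastModifiedTime());
        System.out.println("type:     " + (attrs.isDirectory() ? "directory" : "file"));
        System.out.println("owner:    " + Files.getOwner(p));
        System.out.println("readable: " + Files.isReadable(p));   // a permission check
    }
}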


Examples: FAT32, NTFS, ext3, ext4, XFS, HFS, HFS+

Since I have a Windows machine, I can check its filesystem.



I have an NTFS (New Technology File System) filesystem, which was developed by Microsoft. It came after FAT32, which has a very small maximum file size of around 4 GB.

When we look at filesystems like ext4, we find that they are not distributed by nature. Precisely speaking, node 1 does not know what is available on node 2, and vice versa. They are local file systems.

The second thing to remember is that since node 1 has no idea what is available on node 2, replicating data is a cumbersome job here. In practice, we are exposed to data loss in such a scenario.

When we upload a file to HDFS, it is automatically split into fixed-size blocks of 128 MB in recent versions of Hadoop. HDFS takes care of placing the blocks on different nodes and also takes care of replicating each block on more than one node. By default, HDFS replicates a block to 3 nodes.
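
As a minimal sketch of this, the Hadoop Java API lets you set the block size and replication factor explicitly when creating a file; the path below is hypothetical, and in practice you would usually just rely on the cluster defaults:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/big.dat");   // hypothetical path
        long blockSize = 128L * 1024 * 1024;          // 128 MB blocks
        short replication = 3;                        // 3 replicas per block
        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(file, true, 4096, replication, blockSize)) {
            out.writeUTF("hello");
        }
    }
}
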
HDFS is by no means a replacement for your local filesystem. The operating system still relies on the local filesystem; in fact, the operating system does not even care about the presence of HDFS. One more interesting thing: HDFS still has to go through a local filesystem such as ext4 to save its blocks on the underlying storage.

The true power of HDFS is that it is spread across all the nodes in your cluster and has a distributed view of the cluster, so it knows how to reconstruct the bigger data set from its blocks. ext4, on the other hand, has no distributed view; it only has a local view and knows only about the blocks on the storage it manages.
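
You can see this distributed view directly by asking HDFS for the block locations of a file through the standard FileSystem API. A minimal sketch, again with a hypothetical path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/big.dat");   // hypothetical path
        FileStatus status = fs.getFileStatus(file);
        // Ask HDFS which blocks make up the file and where each replica lives
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + loc.getOffset()
                    + " length " + loc.getLength()
                    + " hosts " + String.join(",", loc.getHosts()));
        }
    }
}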





Wednesday, August 15, 2018

MapReduce - Speculative Execution

In Hadoop MapReduce, a job is broken down into smaller tasks, and these tasks are set to run in parallel. This execution plan increases the efficiency of the job to a large extent compared to a sequential execution model.

The problem occurs when we encounter slow tasks, as they can hold back the overall execution plan. This scenario is very common in a full-fledged production environment where thousands of jobs can be running in parallel.

In such scenarios, speculative execution works as a boon for the job as a whole. Hadoop tries to detect slow-running tasks and launches duplicate tasks alongside the slower ones. This process is called speculative execution.

Speculative execution in Hadoop does not mean launching duplicate tasks at the same time and letting them race; that would waste cluster resources. Rather, a speculative task is launched only after a task has run for a significant amount of time and the framework detects that it is slow compared to the other tasks running for the same job. (A toy sketch of this idea follows.)
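
The sketch below is only a toy illustration of the detection idea, not Hadoop's actual scheduler code. It assumes we track a progress rate per task attempt and flags attempts that fall well below the average; the task names, rates, and the 0.5 threshold are all made up:

import java.util.Map;

public class SpeculationSketch {
    /** True if this attempt's progress rate is far below the average rate. */
    static boolean isSlow(double taskRate, double avgRate, double threshold) {
        return taskRate < avgRate * threshold;
    }

    public static void main(String[] args) {
        Map<String, Double> progressRates = Map.of(
                "task_01", 0.010,   // progress per second (made-up numbers)
                "task_02", 0.011,
                "task_03", 0.002);  // the straggler
        double avg = progressRates.values().stream()
                .mapToDouble(Double::doubleValue).average().orElse(0.0);
        progressRates.forEach((id, rate) -> {
            if (isSlow(rate, avg, 0.5)) {
                System.out.println("Would launch a speculative duplicate of " + id);
            }
        });
    }
}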

Once one of the two attempts completes successfully, the Hadoop framework kills the attempt that is still running. In other words, whichever of the two tasks finishes first wins, and the slower one is terminated by the framework.

Speculative execution is a MapReduce job optimization technique in Hadoop, and it is enabled by default. We can disable speculative execution for mappers and reducers in mapred-site.xml as shown below. (These are the classic property names; in Hadoop 2 and later they were renamed to mapreduce.map.speculative and mapreduce.reduce.speculative.)
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
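
Speculation can also be toggled per job from the Java API instead of cluster-wide in mapred-site.xml. A minimal sketch, assuming an otherwise standard Hadoop 2 job setup:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DisableSpeculation {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "my-job");
        // Turn speculation off for this job only
        job.setMapSpeculativeExecution(false);
        job.setReduceSpeculativeExecution(false);
        // ... set mapper, reducer, and paths, then job.waitForCompletion(true)
    }
}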


We should note that speculative execution can also lead to cluster inefficiency and hurt the overall throughput. There is a good case for turning off speculative execution for reduce tasks, since any duplicate reduce task has to fetch the same map outputs as the original task, and this can significantly increase network traffic on the cluster.

Tuesday, August 14, 2018

HDFS - File Permission

The file permission system in HDFS follows the POSIX model almost exactly. POSIX supports three types of permission:

READ   (r)
WRITE (w)
EXECUTE (x)


The read permission allows us to read a file and list the contents of a directory. The write permission is required to write to a file or, for a directory, to create or delete files or directories in it. The execute permission has no meaning for a file, since you cannot execute a file in HDFS. However, you do need the execute permission on a directory to access its children.
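
As a small illustration, these permissions can also be inspected and changed programmatically through the Hadoop FileSystem API. This is only a minimal sketch; the path /user/demo/reports is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class HdfsPermissionExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/user/demo/reports");   // hypothetical path
        // rwx for the owner, r-x for the group, no access for others
        fs.setPermission(dir,
                new FsPermission(FsAction.ALL, FsAction.READ_EXECUTE, FsAction.NONE));
        System.out.println(fs.getFileStatus(dir).getPermission());  // prints rwxr-x---
    }
}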

Hadoop - What is a Job in Hadoop?

In the field of computer science, a job just means a piece of program, and the same rule applies to the Hadoop ecosystem as wel...