Distributed file systems: Introduction to HDFS.

HDFS

HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold and manage very large files (terabytes or even petabytes). Files are stored redundantly across the machines spread over the network to ensure high availability and durability in the face of failures.

What is a Distributed File System?
A distributed file system is designed to hold large amounts of data and to provide access to the many clients spread across a network.
The Network File System (NFS) is the oldest of these systems and is still in heavy use. NFS is also the most straightforward, but it has its limitations. NFS gives a client remote access to a single logical volume on a single remote machine. The client can see this volume and can also mount it on her own machine.
Once it is mounted locally, the client can treat that volume as part of her own file system and run Linux and other file-related commands directly on it.
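For instance, here is a minimal Java sketch (the mount point /mnt/nfs_share and the file name are hypothetical) showing that once an NFS export is mounted, ordinary local file APIs work on it unchanged:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class NfsLocalRead {
    public static void main(String[] args) throws IOException {
        // Hypothetical local mount point of a remote NFS export.
        // After mounting, it is read exactly like any local directory.
        Path remoteFile = Paths.get("/mnt/nfs_share/report.txt");
        for (String line : Files.readAllLines(remoteFile)) {
            System.out.println(line);
        }
    }
}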
But when it comes to storage space we are again limited to a single machine, whose computational power and capacity put a hard ceiling on what we can do. Because a single machine has its own limits on power and storage, NFS is of little use for Big Data implementations.
HDFS, on the other hand, is designed specifically for the Big Data environment, and this gives it a clear edge over NFS and other distributed file systems. With HDFS, a client never needs to keep a local copy of the data before processing it (a short sketch after the list below illustrates this). More specifically:

1. HDFS is designed to store very large amounts of information (terabytes or petabytes). This requires spreading the data across a large number of machines. It also supports much larger individual files than NFS.
2. HDFS files are spread across many machines on the network, so we can use ordinary commodity hardware instead of high-end machines. This arrangement cuts costs to a great extent.
3. HDFS stores data reliably across the machines: even in the case of a complete hardware failure, the data remains available in at least two other locations.
4. HDFS provides fast access to the data, and if a large number of clients need to be served we can simply grow the cluster by adding more commodity machines to our systems.
5. HDFS has a dedicated framework for distributing work among the various machines and for gathering the results after processing: MapReduce, which is tightly integrated with Hadoop.
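To illustrate point 1 and the idea of processing data in place, here is a minimal sketch using the HDFS Java FileSystem API; the NameNode address hdfs://namenode:9000 and the file path are assumptions made for this example only:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in practice this usually comes
        // from core-site.xml (fs.defaultFS).
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Stream the file directly from the cluster; no local copy is made first.
        Path file = new Path("/data/logs/sample.log");  // hypothetical path
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}

The file is streamed from whichever machines hold its blocks, so the client works on the data where it lives rather than downloading a full copy beforehand.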

Even though Hadoop is very good for large-scale environments, it is not a general-purpose file system in the way that NFS is. Hadoop is optimized for highly scaled-out jobs where thousands of cores work together; it performs best on clusters of a thousand or more machines. Hence, if we run Hadoop on a single machine or on a small cluster, it cannot deliver competitive results.
Hadoop's design and implementation are based on the Google File System (GFS), which Google described in a white paper explaining the internal workings of its file storage.
We will discuss the block-based file structure of HDFS in the next article. You can get a better grasp by reading Google's white paper on GFS (linked above).

Problem Statement: Large-scale distributed environments and Hadoop.

Hadoop Challenges

Hadoop has been a buzzword for some years now, and nowadays people actually work with the Hadoop ecosystem at the enterprise level.

Even with Hadoop in use at the enterprise level, newcomers, and especially students, tend to face a lot of difficulty in getting an understanding of Hadoop.

“Hadoop is a large-scale distributed batch processing infrastructure. While it can be used on a single machine, its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores. Hadoop is also designed to efficiently distribute large amounts of work across a set of machines.” (Yahoo Developer Network)

Large-scale files do not mean a few gigabytes; they mean terabytes or even petabytes of data. Big Data is usually described by three V's: Volume (of data, some petabytes), Velocity (of data) and Veracity (of data, i.e. its uncertainty). Some people also add a fourth V for Variety (of data).

Now let’s point out the main challenges of large-scale distributed systems. We only list those challenges here and may cover them in more detail in other articles.

1. The network can suffer partial or total failure at any time; a router or switch may break down.

2. Data may not arrive at a given point when it is needed, due to unexpected network congestion.

3. Individual compute nodes may fail due to overheating or other issues, or may simply run out of memory or storage space.

4. Different nodes may run different software versions, or clients may even use different data-processing software. This can be a big issue in large-scale distributed systems.

5. There are security concerns between nodes, as Hadoop does not mandate any specific security protocol for them, so detecting and rectifying a man-in-the-middle attack is next to impossible.

6. Keeping the clocks of the various machines properly synchronized is another big concern.

So, while setting up a large-scale Big Data environment, one must take care of the critical challenges above. The Hadoop ecosystem addresses these challenges, and individual design choices sometimes address the others.
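As a small example of how the ecosystem deals with challenge 3 (node failure), replication is largely a matter of configuration. Here is a minimal sketch, assuming the standard Hadoop client libraries and a hypothetical file path, of setting the replication factor from Java:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication for files created by this client. In practice
        // dfs.replication is usually set cluster-wide in hdfs-site.xml;
        // the programmatic form here is only for illustration.
        conf.setInt("dfs.replication", 3);

        // Assumes fs.defaultFS points at the cluster (e.g. via core-site.xml).
        FileSystem fs = FileSystem.get(conf);

        // Replication can also be changed for an existing (hypothetical) file.
        fs.setReplication(new Path("/data/logs/sample.log"), (short) 2);
        fs.close();
    }
}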

Please share your doubts and suggestions; I will share more of my learnings and experience with you soon.