Distributed file systems: Introduction to HDFS.


The HDFS, Hadoop Distributed File System, is a distributed file designed to hold and manage very large files (some terabytes or petabytes). Files are stored in a redundant manner across the various machine, spread over the network, to ensure the high availability and to ensure durability to failure.

What is Distributed File System?
A distributed file system is designed to hold the large amount of data across the network and destined to provide access to various clients distributed across the network.
Network File System (NFS) is oldest among DFS’s and still in heavy use. NFS is the most straightforward system but has the limitation as well. NFS is designed in a way to provide remote access to a client to a single logical unit on a single remote machine. The client can see this unit and even can also mount that unit on her own machine as well.
Once mounted on a local machine, the client can use that unit as a part of her own machine and can use Linux and other file related commands directly on that unit.
But when we talk about the storage space we are again limited to the single machine and we can be choked after a top computational power. The single machine has its own limitations in terms of power and storage, hence NFS might not be useful in terms of Big Data implementations.
On the other side, HDFS is designed to perform the job specifically for Big Data environment. It has the clear edge on NFS and other DFS systems. While using HDFS, a client never needs to have a local copy of data before processing that data. If we move specifically then:

1. HDFS is designed to store a very large amount of information (terabytes or petabytes). This requires spreading the data across a large number of machines. It also supports much larger file sizes than NFS.
2. HDFS files are available on various machines across the network so we can use normal commodity hardware in place of high-end machines. This arrangement uses to cut cost up to great extent.
3. HDFS stores data reliably on various machines and in a case of even complete hardware failure data is available with at least two more locations.
4. HDFS provides fast access to information and if a large number of clients want to access machine we can simply increase the power of machine by adding more clusters and by adding more commodity hardware to our systems.
5. HDFS has dedicated tool for distributing data among various machines and for getting it back after processing. MapReduce performs this job and it has proper sync with Hadoop.

Even if Hadoop is very good for large scaled environments, it is not good for general purpose systems like NFS is. Hadoop is optimized to perform highly scaled up jobs where thousands of cores needed to work, it is best optimized for thousand or even more clusters. Hence if we are using Hadoop on the single machine or on a small number of clusters then Hadoop can’t provide competitive results.
Hadoop is designed and implemented on Google File System (GFS). It was properly described by a white paper published by Google to explain it internal working for file storage.
We would discuss the block based file structure of HDFS in next article. You can get a better grasp by accessing the Google’s white paper (given in the link above) over GFS.