What Hadoop is not?

 

Hadoop and MapReduce

Hadoop is a buzzword nowadays, and people tend to treat it like magic. But Hadoop is not magic, and it is not a general-purpose application either. Hadoop is a framework suited to specific problems where

  1. Data volume is very large.
  2. Data velocity is comparatively high.
  3. The organization needs a highly distributed framework for its business needs.

So, if you are thinking of Hadoop as a replacement for your regular database or regular filesystem, you will be highly disappointed. Here we focus on what Hadoop cannot do instead of what Hadoop actually can.

Apache Hadoop is not a replacement for a regular database: Databases are great; they run SELECT queries against indexes built over the stored data. Databases are organized to serve end users' queries in the best possible way and with time efficiency. Now, if you replace your database with HDFS, Hadoop stores data in the form of files, and you can't directly access your data with regular SQL-like commands. You need to write MapReduce jobs to access your data, and that is not an easy task. It takes effort and also takes time to execute.
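
To get a sense of that effort, here is a minimal sketch, in Java, of what a simple "give me every record containing ERROR" query becomes when written as a map-only MapReduce job instead of a one-line SQL SELECT. The class name, the filter condition, and the input/output paths are hypothetical, purely for illustration.

// Hypothetical map-only MapReduce job: keep only the input lines that
// contain "ERROR", roughly a SELECT * ... WHERE line LIKE '%ERROR%'.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GrepJob {

  public static class GrepMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit only the lines that match our "WHERE" condition.
      if (value.toString().contains("ERROR")) {
        context.write(NullWritable.get(), value);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "grep-like select");
    job.setJarByClass(GrepJob.class);
    job.setMapperClass(GrepMapper.class);
    job.setNumReduceTasks(0);                  // map-only job, no reducers
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

You still have to package this as a jar, submit it to the cluster, and wait for the job to finish before seeing any output, which is exactly the extra effort and execution time mentioned above.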

Hadoop is a solution for cases where the data is very large; there is no exact size threshold that qualifies data as "large", but large here means that regular databases are unable to run efficient queries against it. Hadoop is also a solution where the data format is not regularized and can't be fed to a regular database directly. HBase is pretty useful if you actually want to use Hadoop and also want to fetch data via SQL-like commands.
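
As a small taste of that, the sketch below does a point lookup with the HBase Java client. The table name "users", the column family "info", and the row key "user42" are all made up for illustration. It is not full SQL, but a keyed read like this is much closer to "SELECT name FROM users WHERE id = 'user42'" than writing and running a MapReduce job.

// Hypothetical HBase point lookup by row key (table and columns are made up).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {
      // Fetch one row by key, roughly: SELECT name FROM users WHERE id = 'user42'
      Result result = table.get(new Get(Bytes.toBytes("user42")));
      byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(name == null ? "not found" : Bytes.toString(name));
    }
  }
}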

Hadoop and MapReduce are not the place to learn core Java: If you are just introducing yourself to programming, and especially to Java programming, then Hadoop is not the right place to start. The Hadoop documentation says that you only need the basics of core Java to use the Hadoop and MapReduce APIs efficiently, but you should have knowledge of Java errors and exceptions, file paths, and Java debugging before starting with Hadoop.

Hadoop is not the ideal place to learn networking error messages and Linux system programming: The Hadoop path is a lot easier if you are already familiar with "Connection refused" and "No route to host" error messages; Hadoop itself will not teach you what such networking errors mean. So, ideally, you should already know TCP/IP errors, LAN handling, and other common network protocols. Hadoop expects the cluster machines to be well connected, and networking knowledge is needed to ensure that. On the same ground, you should also know your way around Linux/Unix systems. You should have basic prior knowledge of how to install Unix/Linux. The Hadoop framework expects users to know how to handle DNS errors, how to keep logs on disks other than the root disk, and what files live in the /etc directory.

Apart from this, you should also brush up your skills on:

  • SSH: what it is, how to set up authorized_keys, and how to use ssh and scp
  • ifconfig, nslookup, and other network configuration/diagnostic tools
  • How your platform keeps itself up to date
  • Which log files your machine generates, and what they mean
  • How to set up native filesystems and mount them

 

 

Distributed file systems: Introduction to HDFS.

HDFS

HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold and manage very large files (some terabytes or petabytes). Files are stored redundantly across various machines spread over the network, to ensure high availability and durability against failure.

What is a Distributed File System?
A distributed file system is designed to hold a large amount of data across a network and to provide access to the various clients distributed across that network.
Network File System (NFS) is the oldest among DFSs and is still in heavy use. NFS is the most straightforward such system but has its limitations as well. NFS is designed to give a client remote access to a single logical volume on a single remote machine. The client can see this volume and can even mount it on her own machine.
Once it is mounted locally, the client can use that volume as if it were part of her own machine and can run Linux and other file-related commands directly against it.
But when we talk about storage space, we are again limited to a single machine, and we choke once we hit that machine's peak computational power. A single machine has its own limits in terms of power and storage, hence NFS might not be useful for Big Data implementations.
On the other side, HDFS is designed specifically to do this job in a Big Data environment, and it has a clear edge over NFS and other DFS systems. While using HDFS, a client never needs to keep a local copy of the data before processing it (a small Java sketch of this follows the list below). More specifically:

1. HDFS is designed to store a very large amount of information (terabytes or petabytes). This requires spreading the data across a large number of machines. It also supports much larger file sizes than NFS.
2. HDFS files are spread over various machines across the network, so we can use normal commodity hardware in place of high-end machines. This arrangement cuts cost to a great extent.
3. HDFS stores data reliably on various machines; even in case of a complete hardware failure on one node, the data is still available from at least two more locations (by default, each block is replicated to three nodes).
4. HDFS provides fast access to information, and if a large number of clients want to access the data, we can simply increase capacity by adding more commodity nodes to the cluster.
5. Hadoop has a dedicated framework for distributing work among the various machines and for collecting the results after processing. MapReduce performs this job and is tightly integrated with HDFS.
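
Here is a minimal sketch, assuming a running HDFS cluster reachable through the default configuration on the classpath, of a client reading a file directly from HDFS with the Java FileSystem API; the path /user/demo/sample.txt is made up. The data is streamed from the datanodes as it is read, so the client never needs a full local copy of the file first.

// Hypothetical example: stream a text file straight out of HDFS.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/sample.txt");
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);               // data is streamed from the datanodes
      }
    }
  }
}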

Even though Hadoop is very good for large-scale environments, it is not good as a general-purpose system in the way NFS is. Hadoop is optimized for highly scaled-out jobs where thousands of cores need to work together; it performs best on clusters of a thousand or more nodes. Hence, if we run Hadoop on a single machine or on a small number of nodes, it can't provide competitive results.
Hadoop's design follows the Google File System (GFS), which was described in a white paper published by Google explaining its internal workings for file storage.
We will discuss the block-based file structure of HDFS in the next article. You can get a better grasp by reading Google's white paper on GFS.

Problem Statement: Large-scale distributed environments and Hadoop.

Hadoop Challenges

Hadoop has been a buzzword for the past few years, and nowadays people are actually working with the Hadoop ecosystem at the enterprise level.

Even with Hadoop in use at the enterprise level, newcomers, and especially students, face a lot of difficulty in building an understanding of Hadoop.

“Hadoop is a large-scale distributed batch processing infrastructure. While it can be used on a single machine, its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores. Hadoop is also designed to efficiently distribute large amounts of work across a set of machines.” (Yahoo Developer Network)

"Large scale" never means a few gigabytes of files; it actually means terabytes or even petabytes. Big Data is commonly described by three V's: Volume (of data, some petabytes), Velocity (of data), and Veracity (of data, the uncertainty in the data). Some people also include a fourth V for Variety (of data).

Now let's point out the main challenges of large-scale distributed systems. We are only listing those challenges here; we may cover them in more detail in other articles.

1. The network can suffer partial or total failure at any time; a router or switch might break down.

2. Data may not arrive at a specific point when it is needed, due to unexpected network congestion.

3. Individual compute nodes may fail due to overheating or some other issue, or may even run out of memory or storage space.

4. Different nodes might run different software versions, or the client might even be using different data-processing software on different nodes. This can be a big issue in large-scale distributed systems.

5. There are security concerns among the various nodes, as Hadoop does not specify any particular security protocol between them. So, it is next to impossible to detect and rectify a man-in-the-middle attack.

6. The clocks of the various machines are again a big concern, as it is essential to keep them properly synchronized.

So, while setting up a large-scale Big Data environment, one must take care of the critical challenges above. The Hadoop ecosystem addresses these challenges, and individual design choices sometimes help address the remaining ones.

Please share your doubts and suggestions; I will share more of my learnings and experience with you soon.