What Hadoop Is Not


Hadoop and MapReduce

Hadoop is a buzzword these days, and people tend to treat it like magic. But Hadoop is not magic, and it is not a general-purpose application either. Hadoop is a framework suited to specific problems where:

  1. Data volume is very large.
  2. Data arrives at high velocity.
  3. The organization needs a highly distributed framework for its business needs.

So if you are thinking of Hadoop as a replacement for your regular database or regular filesystem, you will be disappointed. Here we focus on what Hadoop cannot do rather than on what it can.

Apache Hadoop is not a replacement for a regular database: Databases are great at what they do. They run SELECT queries against indexes over the stored data, and they are organized to answer the end user's queries as quickly and efficiently as possible. If you replace your database with HDFS, Hadoop stores the data as plain files, and you can no longer reach it directly with regular SQL-like commands. Instead you have to write MapReduce jobs to access your data, which is not an easy task: the jobs take effort to write and time to execute.
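To make the contrast concrete, here is a minimal sketch in plain Python. It does not use the Hadoop API at all; it only imitates the map, shuffle, and reduce steps on a handful of records so you can compare them with a single SELECT. The sample data, the table name hits, and all function names are made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: one "user,bytes" record per line, as in a flat file.
LINES = ["alice,120", "bob,300", "alice,80", "carol,50", "bob,25"]


def total_with_sql(lines):
    """Database route: load the rows, index them, ask with one SELECT."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hits (user TEXT, bytes INTEGER)")
    conn.execute("CREATE INDEX idx_user ON hits (user)")
    rows = [(user, int(size)) for user, size in (line.split(",") for line in lines)]
    conn.executemany("INSERT INTO hits VALUES (?, ?)", rows)
    result = dict(conn.execute("SELECT user, SUM(bytes) FROM hits GROUP BY user"))
    conn.close()
    return result


def map_phase(line):
    """Map: parse one raw line and emit a (key, value) pair."""
    user, size = line.split(",")
    yield user, int(size)


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def total_with_mapreduce(lines):
    """File route: every question means scanning all lines through map and reduce."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(total_with_sql(LINES))
    print(total_with_mapreduce(LINES))
```

Both functions return the same totals, but the second one has to read and parse every record for every question. On a real cluster that full scan is spread over many machines, which is why it only pays off once the data is very large.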

Hadoop is a solution for very large data. "Very large" does not mean the point at which a database license becomes too expensive; it means the point at which regular databases can no longer run queries efficiently. Hadoop is also a solution where the data format is irregular and the data cannot be fed directly into a regular database. If you want to use Hadoop and still fetch data with SQL-like commands, Hive and its HiveQL query language are useful, and HBase adds table-style random access to data stored on HDFS.

Hadoop and MapReduce are not a place to learn core Java: If you are just getting started with programming, and with Java programming in particular, Hadoop is not the right place to begin. The Hadoop documentation says you need only the basics of core Java to use the Hadoop and MapReduce APIs efficiently. In practice, you should understand Java errors, file paths, and Java debugging before you start with Hadoop.

Hadoop is not the ideal place to learn networking error messages and Linux system programming: The Hadoop path is a lot easier if you are already familiar with error messages such as "Connection refused" and "No route to host". These errors come from the network, not from Hadoop, and Hadoop will not explain them to you. Ideally, you should know TCP/IP errors, LAN setup, and other common network protocols. Hadoop expects the cluster to be well connected, and network knowledge is what ensures that. In the same way, you should know your way around Linux/Unix systems, starting with how to install them. The Hadoop framework expects users to know how to handle DNS errors, how to keep logs on disks separate from the root disk, and which files live in the /etc directory.
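As a small illustration of how these messages arise, the sketch below tries a plain TCP connection and names the failure in the same terms. It uses only Python's standard socket module and has nothing to do with Hadoop's own code; the function name and the returned labels are made up for this example.

```python
import errno
import socket


def diagnose_connection(host, port, timeout=3.0):
    """Try a TCP connection and describe the outcome in plain words."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.gaierror:
        # The hostname could not be resolved: a DNS or /etc/hosts problem.
        return "dns failure"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "connection refused"
    except socket.timeout:
        # No answer at all, often a firewall silently dropping packets.
        return "timed out"
    except OSError as exc:
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            # Routing problem between this machine and the target.
            return "no route to host"
        return "other error: %s" % exc


if __name__ == "__main__":
    # Hypothetical host and port; substitute a node from your own cluster.
    print(diagnose_connection("namenode.example.com", 8020))
```

Running a check like this from every node against every other node tells you whether the cluster is well connected before you start debugging Hadoop itself.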

Apart from this, you should also brush up your skills on:

  • SSH: what it is, how to set up authorized_keys, and how to use ssh and scp
  • ifconfig, nslookup, and other network configuration and diagnostics tools
  • How your platform keeps itself up to date
  • Which log files your machine generates, and what they mean
  • How to set up native filesystems and mount them
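One common stumbling block from the SSH item above is file permissions: with its default strict mode checks, OpenSSH ignores an authorized_keys file if the file or the .ssh directory is writable by group or others. The sketch below checks for that. The function names are made up for illustration, and the check covers only this one condition, not a full SSH setup.

```python
import os
import stat


def loose_permissions(path):
    """Return True if group or others have write permission on path."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IWGRP | stat.S_IWOTH))


def check_ssh_setup(home):
    """List problems with ~/.ssh and ~/.ssh/authorized_keys under home."""
    ssh_dir = os.path.join(home, ".ssh")
    keys = os.path.join(ssh_dir, "authorized_keys")
    problems = []
    for path in (ssh_dir, keys):
        if not os.path.exists(path):
            problems.append("missing: " + path)
        elif loose_permissions(path):
            problems.append("group/world writable: " + path)
    return problems


if __name__ == "__main__":
    for problem in check_ssh_setup(os.path.expanduser("~")):
        print(problem)
```

An empty result means neither path is missing or too open. The usual fix for a reported problem is chmod 700 on the .ssh directory and chmod 600 on authorized_keys.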