Hadoop is a buzzword since some past years and nowadays people actually are working with Hadoop ecosystem at an enterprise level.
Even after the use of Hadoop at an enterprise level, newcomer and especially students use to face a lot of difficulty in getting an understanding of Hadoop.
“Hadoop is a large-scale distributed batch processing infrastructure. While it can be used on a single machine, its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores. Hadoop is also designed to efficiently distribute large amounts of work across a set of machines.” ** Yahoo Developer Network
Large scale files never meant for some gigabytes, it actually means some terabytes or even petabytes of files. Big Data actually consists of three V’s. Volume (of Data, some petabytes), Velocity (of Data), Veracity (of data, an uncertainty of data). Some people also include 4th V for Variety (of data).
Now let’s point out the main challenges of large-scale distributed systems. We are just pointing out those challenges and possibly we would share them in more details in other articles.
1. The network can expect partial or total failure anytime. Router or switch might face break down.
2. Data, when needed, may not arrive at some specific point due to unexpected network congestion.
3. Individual computing nodes may get failed due to overheating or some other issue. Or even can face issue due to run of out of memory space.
4. The client might have multiple software versions at different nodes or even client might be using different data processing software. This might be a big issue in large-scale distributed systems.
5. Security concerns among various nodes as Hadoop don’t specify any specific security protocol for various nodes. So, it is next to impossible to detect and rectify a man in middle attack.
6. The clock speed of various processors is again a big concern as it is essential to make them properly synchronized.
So, while setting up a large scale big data environment, one must take care of above critical challenges. Hadoop ecosystem addresses these challenges and personal preferences also work sometimes to address other challenges.
Please share your doubts and suggestions, would share more of my learnings and experience with you soon.