What's Hadoop ?
Apache Hadoop is a open-source software library which allows distributed processing of large data sets across clusters of computers using simple programming models. It's scaleable from single machine to thousands of cluster computers each capable of computing and storing data. Therefore Hadoop can be use to develop high-available services on top of cluster computers.
Hadoop framework is set of open source projects which provides common set of services. Key Attribute of Hadoop is that it's Redundant and reliable there's zero data loss even if several machine in a cluster fails.
Following are main modules available in latest hadoop release.
- Hadoop Common: Utility framework which support other Hadoop modules.
- Hadoop Distributed File System (HDFS™): Distributed file system that provides high-throughput access to application data.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Hadoop framework mainly targets on addressing following problems which are in traditional single disk based file systems.
- Slow read/write speeds in disks -
There are performance limits in most of storage devices as most of them are still based on moving mechanical components. Still it's not practical to use SSD's for mass storage due to high costs. Therefore solution needed as a workaround to increase read / write speeds as a workaround.
Hadoop presented a simple solution for this issue. By replicating same data in multiple disks it can achieve high read speeds by parallel reading from multiple disks. - Hardware failure -
Since hard drives are based on mechanical equipments. it tends to fail more compared to other components in a computer.
This issue also handled in Hadoop by replicating data in several hard drives and storages. Incase of one hard drive fails another hard drive can act as back up and restore data. - How to merge data from different sources and communicate effectively -
Since requested data can be in several hard drives how to combine them together and transmit to requested parties with most efficient way was another main issue to be addressed.
HDFS - Map reduced method comes into picture to address this issue. Only complete final result will be transmitted through the network. It will transmitted using compressed format (Map reduced) in order to save data transmission time and network bandwidth.
Technical Overview
Basically Hadoop Machine consist of two parts.
- Task Tracker : This is the processing part in Hadoop machine (Map Reduce Server)
- Data Node : Data part in Hadoop machine (HDFS Server)
Hadoop Cluster consists of 3 main parts.
- Name Node : Keeps directory tree of all files in the file system. (index of files in the HDFS)
- Job Tracker : The service which allocate map reduce tasks to each task tracker in the pool.
- Pool of Hadoop Machines