Map Reduce is generally considered the most complicated part of Hadoop. For that reason, it helps to divide it into the following two parts:
Map Reduce Framework
Map Reduce Programming Technique
The Map Reduce Framework provides many Java APIs (classes and interfaces) for creating MR (map-reduce) programs. The framework also takes care of many internal details, which is why it is also known as the Map Reduce Execution Engine. The Map Reduce Programming Technique is a way of thinking about and solving many Big Data problems. It is an approach popularised by Hadoop. Keep in mind that it is not a universal technique for solving all Big Data problems, but it can address many of them.
Normally, more attention is given to the Map Reduce Programming Technique, because working directly with the Map Reduce Framework is considered tough, and that is one of the reasons Apache Spark was introduced. The Map Reduce Execution Engine has largely been left behind by a modern execution engine, Apache Spark (I will explain this in a future post). So this blog also focuses on helping you understand the technique, i.e. the Map Reduce Programming Technique.
Map Reduce Technique
Let's understand it with an example. Suppose you have a 40TB (terabyte) file and the job is to count the number of lines in it. You can't store a 40TB file on a single machine, so it will be saved on a Hadoop cluster (i.e. in HDFS). You may say that counting the lines of a 40TB file is not a tough task: you write a piece of code that reads the file line by line, increments a counter by 1 on every read, and displays the total once it reaches EOF (End-of-File). This solution is good under normal conditions (a local file system) but is the worst choice on a Hadoop cluster.
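To make the single-machine approach concrete, here is a minimal sketch of it in plain Java. The class name `NaiveLineCounter` and the tiny temporary file are only illustrative assumptions; the point is the read-a-line, increment-the-counter loop.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical class name, used only for this illustration.
class NaiveLineCounter {

    // Reads the file line by line and increments a counter on every read.
    static long countLines(Path file) throws IOException {
        long counter = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            while (reader.readLine() != null) {
                counter++;
            }
        }
        return counter; // reached EOF: this is the total
    }

    public static void main(String[] args) throws IOException {
        // A tiny temporary file stands in for the real input (assumption).
        Path sample = Files.createTempFile("sample", ".txt");
        Files.write(sample, List.of("first line", "second line", "third line"));
        System.out.println("Total lines: " + countLines(sample));
        Files.delete(sample);
    }
}
```

On a local file this works fine. The trouble only starts when every one of those lines has to travel over the network to reach the machine running the loop.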
The reason this solution is the worst choice on a Hadoop cluster is as follows. Assume you have written a piece of Java code on a single machine that reads the 40TB file from the cluster and increments the counter by 1 for every line it reads. While doing all this, you may not have realised that you are moving 40TB of data to your single machine. The job will take hours to complete because you are moving a huge amount of data across the cluster's network.
The creators of Hadoop recognised this problem and addressed it with an exceptionally good technique, i.e. the Map Reduce Programming Technique. The following section explains how this technique applies to the line-counting problem on the 40TB file.
How Does Map Reduce Work?
You have several machines (computers) in the cluster, and each machine holds some blocks of the data. Each machine has its own memory, CPU and disk capacity. Now recall the line-counting problem on the 40TB file. The Map Reduce technique lets every machine in the cluster count lines locally, i.e. in the blocks of data stored on that particular machine. The counting job can finish in just a few minutes because you are no longer moving 40TB of data to a single machine; instead, the machines count their local lines in parallel.
You might be wondering: if every computer counts its lines locally, you get a separate count from each machine, so what about the total number of lines? Once the machines have their individual counts, those counts are moved to a single machine, where the final addition is done and the total number of lines in the 40TB file is displayed. You might object that earlier we were moving data and now we are moving data again. Yes, but earlier the data size was 40TB, whereas now it is just a few numbers, which move far faster than the whole 40TB file. That's how the technique works and makes many Big Data problems easier to handle.
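To get a feel for how small "just a few numbers" is, here is a rough back-of-the-envelope calculation in Java. It rests on assumptions that are not part of the example above: a block size of 128MB, one 8-byte count sent per block, and 1TB taken as 1024^4 bytes. It also ignores any serialization or network overhead, so treat the result as an order of magnitude only.

```java
// Hypothetical class name; the numbers below are assumptions for illustration.
class DataMovementEstimate {
    static final long MB = 1024L * 1024L;
    static final long TB = MB * 1024L * 1024L;

    // Number of blocks needed to hold fileBytes, rounding up.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Bytes moved if each block sends a single 8-byte count.
    static long bytesOfCounts(long blocks) {
        return blocks * Long.BYTES;
    }

    public static void main(String[] args) {
        long fileBytes = 40 * TB;
        long blockBytes = 128 * MB; // assumed block size
        long blocks = blockCount(fileBytes, blockBytes);
        System.out.println("Blocks: " + blocks);
        System.out.println("Bytes moved as counts: " + bytesOfCounts(blocks));
        System.out.println("Bytes moved by the single-machine approach: " + fileBytes);
    }
}
```

Under these assumptions the file splits into 327,680 blocks, so the counts add up to 2,621,440 bytes (about 2.5MB), compared with 40TB for the single-machine approach.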
Our line-counting job is basically divided into the following two parts:
The Map Function runs over every block of data stored on the machines in the Hadoop cluster. Keep in mind that one map function runs per block. Its job is to count the number of lines locally, i.e. in each block stored on the different machines in the cluster.
The Reduce Function, in this example, runs on a single machine. It adds up the individual counts coming from the different machines in the cluster and finally displays the total number of lines in the file.
Together, the Map Function and the Reduce Function are what you write when you apply the Map Reduce Programming Technique; the Map Reduce Framework (MR Framework) is what runs them across the cluster.
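The two functions described above can be sketched in plain Java. This is only a simulation inside a single JVM, not the Hadoop API: each block is assumed to be a small in-memory list of lines, a thread pool stands in for the machines of the cluster, and the names `LineCountSimulation`, `map`, `reduce` and `run` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical simulation; threads play the role of cluster machines.
class LineCountSimulation {

    // "Map": count the lines of one block, locally.
    static long map(List<String> block) {
        return block.size();
    }

    // "Reduce": add up the per-block counts on a single "machine".
    static long reduce(List<Long> partialCounts) {
        long total = 0;
        for (long count : partialCounts) {
            total += count;
        }
        return total;
    }

    // Runs one map per block in parallel, then a single reduce.
    static long run(List<List<String>> blocks) throws InterruptedException, ExecutionException {
        int threads = Math.max(1, Math.min(blocks.size(), 4));
        ExecutorService workers = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> pending = new ArrayList<>();
            for (List<String> block : blocks) {
                Callable<Long> task = () -> map(block);
                pending.add(workers.submit(task));
            }
            // Only the small counts are collected, never the lines themselves.
            List<Long> partialCounts = new ArrayList<>();
            for (Future<Long> result : pending) {
                partialCounts.add(result.get());
            }
            return reduce(partialCounts);
        } finally {
            workers.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> blocks = List.of(
                List.of("a", "b", "c"),
                List.of("d", "e"),
                List.of("f", "g", "h", "i"));
        System.out.println("Total lines: " + run(blocks)); // prints "Total lines: 9"
    }
}
```

Notice that `reduce` never sees a single line of the file, only one number per block. That is the whole idea of the technique: move the computation to the data, and move only the small results.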
Hopefully you now have a fair idea of how Map Reduce works. I will cover Map Reduce in more detail, with examples, in upcoming parts. For the time being, you may want to look over the HDFS Architecture.