Before going into this blog you should have some knowledge of Fault Tolerance concept of Hadoop. The following blog will make you understand about another beautiful concept of Hadoop i.e. High Availability.

High Availability (HA)

If you see the general meaning of term High Availability i.e. something which is present for a long time. If you talked about Enterprise Applications then the High Availability means the maximum up-time for which the services are running and operational. Data stored in hadoop cluster is highly available because we have seen earlier in part-3 if we lost any data node then the Fault Tolerance through Replication Factor help us to get the data stored in that failed data node. The faults which can occur in a Hadoop cluster are-:

  1. Rack Failure

  2. Data Node Failure

  3. Disk Failure

Have you ever think that what will happen if your Name Node fails? If due to any reason your Name Node gets failed then your entire hadoop cluster will stop working because you know that the client interacts with cluster only with the help of Name Node. So if there is no Name Node in the Cluster then the client will not be able to interact with any data node present in cluster. Hence, there will be no significance of Hadoop cluster.

But we know that Hadoop is highly available which means that hadoop cluster will be available for long time in-spite of having any kind of failure. Lets see how it hadoop cluster is available in-spite of having failure of Name Node.

The Simple solution of above problem is Backup. The backup of following two things must be taken:-

  • HDFS Namespace Information

  • Stand-By Name Node

The following paragraphs will let you how the backup of above things will be taken in Hadoop Cluster.

  • HDFS Namespace Information

We have already seen that the Name Node maintains the entire file system in memory as a image and we called it as In-memory FSImage or FSImage. Name Node also maintains an edit-log in its local-disk. Any time, the Name Node made changes in file system then the changes are recorded in the edit-log. Edit-log is so much powerful that even if have only this file with us then we can reconstruct the updated FSImage by using edit-log. Let see how and where the backup of edit-log is taken. The following picture will give you a glimpse of edit-log.

The backup of edit-log file is taken in QJM (Quorum Journal Manager). The QJM is set of at least three machines. Each of these three machines is configured to execute a Journal-Node Daemon. Journal-Node is very light-weight software that even any of three computers from hadoop cluster can be used in OJM and we do not require any dedicated machine for QJM.

Once we are done with QJM then the next step is to configure our Name Node in such a way that it will start writing the edit-log entries to QJM instead of writing it to the local-disk. You might be wondering that why we have only 3 Nodes in QJM not 4 or 5. The answer is as per requirement. The edit-log file is so much critical that we do not want to store backup of it on single place. That’s the reason generally at-least 3 machines are used, you can used 5 to 7 machines as it depends upon the criticality of edit-log as per your Data. The following image gives you a rough idea of how the edit-log backup is taken.

  • Role of Stand-By Name Node

Stand-By Name Node is nothing, it’s just a machine (Computer) added by us in Hadoop Cluster and named it as Stand-By Name Node. The Stand-By Name Node is also configured in such a way that it will keep reading the edit-log from QJM and keep itself updated. This configuration will make the Stand-By Name Node ready to take the role of Active Name Node (Main Name Node) in just few seconds. It is interesting to know that fact that all the Data Nodes are also configured to send the Block Report (Heartbeat Information) to both the Name Nodes i.e. Stand-By and Active Name Node.

You might be wondering that how the Stand-By Name Node will come to know that the Active Name Node gets failed. Actually what happens is that the Active Node and Stand-By Node contain their one Failure Controllers (FC). The Failure Controller of Active Name Node maintains a lock on Zookeeper which is placed between Active Name Node and Stand-By Node. The Stand-By Node keep on trying to get that lock but initially it doesn’t get because initially the lock is acquired by Active Name Node. Once the Active Name Node gets failed then it releases that lock on Zookeeper and the Stand-By Node gets that lock and will start playing the role of Active Name Node. The following picture shows you how things would have worked on background.


I will be discussing the Zookeeper in Detail in Upcoming Part.

Hope you have got an idea of High Availability. I will cover Secondary Name Node concept along with Zookeeper in upcoming parts, till then keep reading and keep learning. Please feel free to share your views and doubts.

For Basic Knowledge of Big Data and Hadoop please Click Here

Thank you for Reading.



About the author

Dixit Khurana