What is ZooKeeper?
ZooKeeper is a coordination service that distributed applications use to work together as a single unit. A distributed application is software that runs on several computers connected over a network. You can think of ZooKeeper as a centralised store where distributed applications save and retrieve data. For a better mental model, picture ZooKeeper as a file system in which znodes store data much like files and directories do in an ordinary file system.
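To make the file-system analogy concrete, here is a minimal, hypothetical in-memory sketch of ZooKeeper's data model: a tree of znodes addressed by slash-separated paths, where every znode can both hold a small payload and have children (unlike a plain file system, where only directories have children). The class and method names are my own illustration, not ZooKeeper's API; real clients use a library such as the official Java client or Python's kazoo.

```python
# Toy sketch of a znode tree (hypothetical names, not the real API).
class ZNode:
    def __init__(self, data=b""):
        self.data = data       # every znode can hold a small payload
        self.children = {}     # child name -> ZNode, like a directory

class ZNodeTree:
    def __init__(self):
        self.root = ZNode()

    def create(self, path, data=b""):
        """Create a znode at `path`; parent znodes must already exist."""
        parts = [p for p in path.split("/") if p]
        node = self.root
        for name in parts[:-1]:
            node = node.children[name]
        node.children[parts[-1]] = ZNode(data)

    def get(self, path):
        """Return the data stored at the znode `path`."""
        node = self.root
        for name in (p for p in path.split("/") if p):
            node = node.children[name]
        return node.data

tree = ZNodeTree()
tree.create("/app")
tree.create("/app/config", b"retries=3")
print(tree.get("/app/config"))   # b'retries=3'
```

Note that `/app` itself could also carry data, which is the main way znodes differ from directories.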
The Need for ZooKeeper in Hadoop
In Hadoop, ZooKeeper coordinates the distributed applications running on a cluster. Because many applications run on the same cluster at once, race conditions and deadlocks are quite likely. A race condition arises when two or more machines try to perform the same task, or update the same piece of data, at the same time; ZooKeeper's serialization property handles this by applying updates in a single agreed order. A deadlock occurs when two or more nodes try to hold shared resources at the same time and end up waiting on each other; ZooKeeper's synchronization primitives help avoid this. Partial failure of a process is also very common in distributed applications and leads to inconsistent data. ZooKeeper handles this with atomicity: either the whole operation completes, or, in case of any failure, nothing is stored. This is why ZooKeeper is a vital part of Hadoop.
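The serialization idea above can be illustrated with a toy sketch (my own simplification, not ZooKeeper's actual protocol): instead of two clients mutating shared state concurrently, all updates are funneled through one queue and applied one at a time, so the outcome is deterministic regardless of submission timing.

```python
# Toy illustration of serializing updates into a single total order.
from queue import Queue

updates = Queue()
state = {"counter": 0}

# Two "clients" submit increments; neither touches state directly.
for client in ("A", "B"):
    updates.put(("counter", 1, client))

# A single applier (think: leader) applies updates one at a time,
# so there is no interleaving and no lost update.
while not updates.empty():
    key, delta, _client = updates.get()
    state[key] += delta

print(state["counter"])   # 2
```

With direct concurrent read-modify-write access, one of the two increments could be lost; the single ordered stream rules that out.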
How Does ZooKeeper Work?
You may be surprised to learn that ZooKeeper follows a simple client-server architecture. Clients are the nodes that use the ZooKeeper service, and servers are the nodes that provide that service to the clients. The group of server nodes in ZooKeeper is known as an ensemble. Each server maintains a ZooKeeper transaction log on its local disk containing information about every write request. This log is a vital part of ZooKeeper, because before ZooKeeper returns a successful response it must sync the transaction to disk. A single znode can store at most 1 MB of data. So although, as discussed above, ZooKeeper resembles a file system, it cannot be used as a general-purpose one. Instead, ZooKeeper stores small amounts of data, such as configuration, and provides reliability, availability, and coordination to a distributed application.
When a client node issues a write request, it is handled through the leader, one of the ensemble's server nodes elected for that role. The server the client is connected to forwards the write request to the leader, and the leader then passes the request to all the servers in the ensemble. The write is considered successful only if a majority of the servers acknowledge it; finally, a success code is returned to the client.
This majority of nodes is known as a quorum in ZooKeeper, and without a quorum the ensemble cannot serve writes, making ZooKeeper non-functional.
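The quorum rule described above can be sketched in a few lines (a toy model, not ZooKeeper's actual atomic-broadcast protocol): a write counts as committed only when the set of acknowledging servers is a strict majority of the ensemble.

```python
# Toy quorum check: a write commits only with a strict majority of acks.
def write_succeeds(acks, ensemble_size):
    """Return True if the acknowledging servers form a quorum."""
    quorum = ensemble_size // 2 + 1   # strict majority
    return len(acks) >= quorum

# In a 5-server ensemble, 3 acks form a quorum; 2 do not.
print(write_succeeds({"s1", "s2", "s3"}, 5))   # True
print(write_succeeds({"s1", "s2"}, 5))         # False
```

Because any two majorities of the same ensemble must overlap in at least one server, every committed write is seen by at least one member of the next quorum, which is what keeps the ensemble consistent.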
Read operations are quick and scalable compared to writes, because only one server from the ensemble is involved: when a client asks to read the contents of a znode, the read is served directly by the server the client is connected to.
An ensemble can contain any number of server nodes, but it is generally advised to use an odd number. With a single server there is no high availability or reliability at all. With three servers, if any one fails you still have two servers, enough for a majority. You might ask: what about four? With four servers, as soon as two nodes go down the ZooKeeper service goes down too, so four servers tolerate no more failures than three. That is why an odd number of servers, such as 3, 5, or 7, is recommended for your ensemble.
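The odd-number advice above follows directly from the quorum rule: an ensemble of n servers survives f failures only while the remaining n - f servers still form a majority, i.e. f = (n - 1) // 2. A quick calculation (my own illustration) makes the pattern visible:

```python
# Failures an n-server ensemble tolerates while a majority stays up.
def failures_tolerated(n):
    return (n - 1) // 2   # largest f such that n - f > n / 2

for n in range(1, 8):
    print(f"{n} servers -> tolerates {failures_tolerated(n)} failure(s)")
```

Running this shows that 3 and 4 servers both tolerate one failure, and 5 and 6 both tolerate two, so the extra even server adds cost without adding fault tolerance.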
I hope you now have a fair idea of what ZooKeeper is and how it works. The upcoming parts will cover Hadoop installation on single-node and multi-node clusters. For a basic understanding of Big Data and Hadoop, please click here.
Thank you for reading.