Before reading this article, go through Big Data and Hadoop, which introduces Big Data and the Hadoop HDFS architecture. This article covers the Big Data and Hadoop HDFS architecture in more detail.
HDFS is a distributed file system. A distributed system means a group of computers linked over a network, here forming a Hadoop cluster. The purpose of this article is to help you understand what a Hadoop cluster looks like, how a file is created in HDFS, and how data is stored in that file. Let’s look at the Hadoop cluster along with some important terminology.
The above picture shows the structure of a typical Hadoop cluster. A Hadoop cluster consists of multiple racks. Each rack has its own switch. Every rack switch is connected to a single core switch, which in turn is connected to the client machine. Together they form a network, which is simply called a Hadoop cluster. Below you will find some terms associated with a Hadoop cluster.
Big-Data and Hadoop HDFS Architecture
The highlighted portion in the following image represents a rack. A rack is basically a box with multiple computers mounted in it. Typically, each rack has its own power supply and a dedicated network switch. If the power supply has a problem or the rack’s network switch fails, all the computers in that rack drop off the network. In other words, an entire rack can fail at once.
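To make the rack failure scenario concrete, here is a minimal Python sketch. It is only a toy model, not Hadoop code: the `Rack` and `Cluster` classes, the rack names, and the node names are all hypothetical and exist purely for illustration. It shows that when a rack's switch goes down, every node in that rack becomes unreachable together.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rack:
    name: str
    nodes: List[str]
    switch_up: bool = True  # the rack's dedicated network switch
    power_up: bool = True   # the rack's own power supply

    def reachable_nodes(self) -> List[str]:
        # Nodes are reachable only while both the switch and the power work.
        return list(self.nodes) if self.switch_up and self.power_up else []


@dataclass
class Cluster:
    racks: List[Rack] = field(default_factory=list)

    def reachable_nodes(self) -> List[str]:
        return [node for rack in self.racks for node in rack.reachable_nodes()]


# Hypothetical two-rack cluster.
cluster = Cluster([
    Rack("rack1", ["node1", "node2", "node3"]),
    Rack("rack2", ["node4", "node5", "node6"]),
])

before = cluster.reachable_nodes()
cluster.racks[0].switch_up = False  # simulate a failed rack switch
after = cluster.reachable_nodes()

print(before)
print(after)
```

After the simulated switch failure, only the nodes of the second rack remain in the list, which is the "entire rack fails" case described above.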
HDFS is designed around a master-slave architecture. In this architecture, one computer from any rack is treated as the master and the other computers are treated as slaves. The Hadoop master is known as the Name Node, while the other computers are known as Data Nodes. You may wonder why the Hadoop master is called the Name Node rather than Hadoop node, master node, or super node. It is called the Name Node because it stores and manages names, i.e. the names of files and directories. Data Nodes manage the data stored in those files, which is why they are called Data Nodes.
Let’s look at how HDFS stores a file in a Hadoop cluster and what happens at the back end when we create a file in HDFS.
There are three main actors responsible for creating a file in HDFS: the Hadoop client, the Name Node, and the Data Node. They are also responsible for reading and writing data inside the file. The following steps walk through what actually happens at the back end.
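The split of responsibilities can be pictured with a small Python sketch. This is a deliberate simplification and not the real Hadoop implementation: the `NameNode` and `DataNode` classes, their methods, and the paths used here are hypothetical. The only point is that one side keeps names while the other keeps the bytes.

```python
class NameNode:
    """Master: keeps only names (directories and files), never file contents."""

    def __init__(self):
        self.names = {"/": "dir"}  # path -> "dir" or "file"

    def register(self, path, kind):
        self.names[path] = kind


class DataNode:
    """Slave: keeps only the data that belongs to files."""

    def __init__(self, name):
        self.name = name
        self.stored = {}  # path -> bytes held on this node

    def store(self, path, data):
        self.stored[path] = data


name_node = NameNode()
data_node = DataNode("datanode-1")  # hypothetical node name

name_node.register("/logs", "dir")
name_node.register("/logs/app.log", "file")
data_node.store("/logs/app.log", b"first log line\n")

print(sorted(name_node.names))            # names only
print(data_node.stored["/logs/app.log"])  # the actual bytes
```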
Back End Process Involved
- First, the Hadoop client sends a request to the Name Node to create a file. The client also supplies the target directory name and the file name.
- On receiving the request, the Name Node performs some validations: whether the requested directory exists, whether a file with the same name already exists, and whether the client has permission to create the file.
- The Name Node can perform all these checks because it keeps an image of the whole HDFS namespace in memory, known as the in-memory FS image (fs-image, or file system image).
- If all checks pass, the Name Node creates an entry for the new file in HDFS and returns success to the client. File creation is now complete, but the file is still empty because no data has been written to it yet.
- To write data into the file, the client creates an FSDataOutputStream and starts writing data to this stream. FSDataOutputStream is a Hadoop streamer class responsible for writing data into HDFS files.
- The client first writes data into the FSDataOutputStream buffer, which holds one block of data. A block has a default size of 128 MB. Once a block fills up, further data goes into another block.
- Once a block fills up, the streamer object reaches out to the Name Node to request a block allocation. In other words, the streamer asks the Name Node where the block should be stored.
- The Name Node does not store data, but it knows how much free space is available on each Data Node. With this information, the Name Node can easily assign a Data Node to the streamer.
- Now the streamer knows where to send the data, so it starts sending data to the Data Node allocated by the Name Node.
- If your data is larger than the block size, the streamer creates another block of the default size (128 MB) and asks the Name Node to allocate it.
- This time you may get a different Data Node, in which case the data is stored on that other Data Node.
- Once you have written all the data to the file, the Name Node commits all the changes and your data is finally stored in the created file in HDFS.
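The steps above can be sketched as a small Python simulation. This is a toy model under stated assumptions, not the Hadoop API: the class and function names are hypothetical, the permission check is reduced to a set of allowed users, and the "most free space" placement rule is an assumption made for this example (the steps only say that the Name Node knows the free space on each Data Node). The demo uses a 4-byte block size in place of 128 MB so that it runs instantly.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # default block size from the steps above


class NameNode:
    def __init__(self, free_space, writers):
        self.namespace = {"/": "dir"}       # in-memory image: path -> kind
        self.free_space = dict(free_space)  # Data Node name -> free bytes
        self.writers = set(writers)         # hypothetical permission model

    def create_file(self, user, directory, filename):
        # The three validations from the steps above.
        if self.namespace.get(directory) != "dir":
            raise FileNotFoundError(directory)
        path = directory.rstrip("/") + "/" + filename
        if path in self.namespace:
            raise FileExistsError(path)
        if user not in self.writers:
            raise PermissionError(user)
        self.namespace[path] = "file"  # the file exists now, but is empty
        return path

    def allocate_block(self, size):
        # Assumed policy: pick the Data Node with the most free space.
        node = max(self.free_space, key=self.free_space.get)
        if self.free_space[node] < size:
            raise OSError("no Data Node has enough free space")
        self.free_space[node] -= size
        return node


def write_file(name_node, data_nodes, data, block_size=BLOCK_SIZE):
    """Split data into blocks and send each one to its allocated Data Node."""
    placements = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        target = name_node.allocate_block(len(block))  # ask where it goes
        data_nodes[target].append(block)               # send the block there
        placements.append(target)
    return placements


# Tiny demo: a 4-byte block size stands in for 128 MB.
name_node = NameNode(free_space={"dn1": 10, "dn2": 8}, writers={"alice"})
data_nodes = {"dn1": [], "dn2": []}

path = name_node.create_file("alice", "/", "report.txt")
placements = write_file(name_node, data_nodes, b"hello hdfs!", block_size=4)

print(path)
print(placements)
```

In this run the 11 bytes are split into three blocks, and the blocks do not all land on the same Data Node, which mirrors the point that a later block may be assigned to a different Data Node than the first one.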
The following image gives you a glimpse of how the above steps work.
That’s all about Big Data Hadoop and HDFS architecture. I hope you now have a clear view of how a file gets created in HDFS and how data gets stored in those files. In upcoming parts I will explain the remaining components of Hadoop along with some of its features.
Thank you for reading.