3 V's of Big Data
Before we dive into Big Data and Hadoop, let me give you a brief introduction to Big Data. Big Data is characterized by the 3 V's: Variety, Volume and Velocity. The image below illustrates the same.
How much data does it take before we call it Big Data? Don't worry, I will explain everything related to Big Data and Hadoop in detail. Let's begin with an example:
Should we call 200 GB of data Big Data? The answer is no, because you have only volume; what about velocity and variety? Refer to the three V's of Big Data mentioned above.
If you are receiving 200 GB of data per minute, and you need to store and process it at that same speed, then that would be considered Big Data. So whenever you face a data problem, think of the three V's: if all three apply, you are dealing with a Big Data problem.
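To put that velocity in perspective, here is a quick back-of-the-envelope calculation (the 200 GB per minute figure is just the example rate from above):

```python
# Rough arithmetic: how much data piles up at 200 GB per minute?
rate_gb_per_min = 200

per_hour_tb = rate_gb_per_min * 60 / 1024        # ~11.7 TB every hour
per_day_tb = rate_gb_per_min * 60 * 24 / 1024    # ~281 TB every day

print(f"Per hour: {per_hour_tb:.1f} TB")
print(f"Per day:  {per_day_tb:.1f} TB")
```

Hundreds of terabytes per day is clearly beyond what a single ordinary machine can store and process, which is exactly why volume and velocity together create a Big Data problem.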
Hadoop
Hadoop is basically a tool-set that allows us to deal with Big Data. You may find a lot of tools available, but Hadoop is one of the most successful, powerful and widely accepted tools for dealing with Big Data. I will not go into much detail about Hadoop's history; instead, let's move straight to its core components.
Hadoop Core Components
1) HDFS
2) MapReduce
3) YARN
I will be dealing with all three of them in detail in upcoming parts. For the time being, let's start with HDFS.
1) HDFS (Hadoop Distributed File System)
In order to understand HDFS, you must first know what a file system is. On your personal computer you create folders, and inside those folders you create text files, Word documents and so on. You then perform operations on them such as copy, rename, delete or edit. This creation of directories and files, together with the operations on them, constitutes a file system, as shown below.
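Just to make those everyday file operations concrete, here is a minimal sketch using Python's standard library; the folder and file names are made up purely for illustration:

```python
import os
import shutil

# Create a folder and a text file inside it.
os.makedirs("documents", exist_ok=True)
with open("documents/notes.txt", "w") as f:
    f.write("hello file system")

# Copy, rename and finally delete the file -- the basic file operations.
shutil.copy("documents/notes.txt", "documents/notes_copy.txt")
os.rename("documents/notes_copy.txt", "documents/notes_backup.txt")
os.remove("documents/notes_backup.txt")
```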
HDFS works the same way, but it is a distributed file system with a lot of beautiful features, which I will cover in detail later. The following picture gives you a glimpse of HDFS.
The main purpose of HDFS is to provide you with file management services. You might have noticed that one word keeps repeating: 'Distributed'. This word is what makes Hadoop special, and it is also one of the defining features of HDFS. The following features make HDFS a unique and powerful file system.
- Distributed
- Scalable
- Cost Effective
- Fault Tolerant
- High Throughput
1. Distributed
Let's understand it with an example. If the data you want to store is larger than the storage capacity of your computer, what will you do? Most likely you will buy a new disk of higher capacity, or, if you have a spare slot, add another drive to your machine. This approach is known as vertical scaling.
Let me give you another example: suppose you want to store data streaming in from the internet. Storing it would require an enormous machine, and this is where vertical scaling fails. The solution is the network approach: instead of relying on a single machine, you use several computers connected over a network, which together form a cluster (a group of computers). Your data is then distributed among those computers. This approach is known as horizontal scaling.
To implement horizontal scaling, you need software that combines the storage capacity of the entire network into a single unit. This calls for a network-based, distributed file system, and HDFS is exactly that. With HDFS you can create a file of, say, 50 TB without caring about how the data is laid out on the individual computers in the network. The combined storage of the whole network is treated as one large disk on one large computer, so it feels like you are working with a simple single machine, as sketched below. That's all about the Distributed feature of HDFS.
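To make the idea concrete, here is a minimal sketch using the standard Hadoop shell commands (hdfs dfs -mkdir, -put, -ls) driven from Python. It assumes a running Hadoop cluster and the hdfs command on your PATH, and the paths and file name are made up for illustration:

```python
import subprocess

# Copy a large local file into HDFS. To us it looks like one file at one path;
# behind the scenes HDFS splits it into blocks and spreads them across the cluster.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "huge_dataset.csv", "/user/demo/"], check=True)

# Listing the directory shows a single logical file, just like a local file system.
subprocess.run(["hdfs", "dfs", "-ls", "/user/demo"], check=True)
```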
2. Scalable
If you were paying attention, I have already touched on scalability: if you need more storage, you simply add more computers to the network. So HDFS is horizontally scalable, as shown above.
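If you are curious how much total space your cluster currently exposes, the standard hdfs dfsadmin -report command prints the combined capacity of all the machines (DataNodes). Here it is driven from Python, again assuming a running cluster and the hdfs command on your PATH:

```python
import subprocess

# Prints the combined configured capacity and remaining storage across all
# DataNodes -- adding another machine to the cluster grows these numbers.
subprocess.run(["hdfs", "dfsadmin", "-report"], check=True)
```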
3. Cost Effective
There is no need to buy big, expensive servers for storage. You can purchase simple commodity machines and add more of them as your storage requirements grow. This collection of cheap, simple machines is what makes HDFS cost effective.
4. Fault Tolerance
This is the most beautiful feature of HDFS. Later on I will cover its internal workings in detail, but for the time being let's get a brief understanding of it with an example.
You have multiple computers running in the network. There is always a chance that a particular computer fails, crashes, or a network switch goes down. Ordinarily the whole system would stop working, but HDFS is not like that. Its fault tolerance means that if any fault occurs, the system as a whole is not affected and keeps working as normal.
Now you are probably wondering how HDFS does that. I will discuss it in more detail in upcoming parts, but the short answer is replication: HDFS keeps multiple copies of every block of a file on different machines.
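As a tiny teaser (the details are for a later part), the standard hdfs fsck and hdfs dfs -setrep commands let you see where a file's blocks and replicas live and change how many copies are kept. The file path below is just an illustration, and a running cluster is assumed:

```python
import subprocess

# Show which DataNodes hold the blocks (and replicas) of a file.
subprocess.run(
    ["hdfs", "fsck", "/user/demo/huge_dataset.csv", "-files", "-blocks", "-locations"],
    check=True,
)

# Ask HDFS to keep 3 copies of each block of this file (-w waits until done),
# so losing any single machine does not lose the data.
subprocess.run(
    ["hdfs", "dfs", "-setrep", "-w", "3", "/user/demo/huge_dataset.csv"],
    check=True,
)
```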
5. High Throughput
Let me explain what throughput is. Throughput is basically the number of records (or the amount of data) processed per unit of time. HDFS provides excellent throughput and minimizes the total time needed to process a large dataset.
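For instance, the little calculation below (with made-up numbers, purely for illustration) shows how throughput is computed from records processed and elapsed time:

```python
# Throughput = records processed / time taken.
records_processed = 12_000_000   # hypothetical number of records
elapsed_seconds = 300            # hypothetical processing time (5 minutes)

throughput = records_processed / elapsed_seconds
print(f"Throughput: {throughput:,.0f} records/second")  # 40,000 records/second
```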
I hope you now have an idea of Big Data, along with HDFS and its features. I will cover the remaining two core components of Hadoop, along with some interesting concepts, in upcoming parts. Till then, keep reading and keep learning.
Comment below if you have any query.