Apache HBase Tutorial for Beginners

In this HBase tutorial, you will explore one of the most important components of the Hadoop ecosystem: Apache HBase. The article covers the basics as well as some advanced concepts related to Apache HBase.

After reading this article, you will know how to use HBase, which features it offers, how it came about, and more.

 

Why do we use HBase?

Since the 1970s, Relational Database Management Systems (RDBMS) have been the standard solution for storing and maintaining large volumes of structured data.

After the advent of big data, organizations realized the advantages of processing big data, and they all started opting for solutions like Hadoop for dealing with big data problems. Apache Hadoop uses a distributed file system that stores big data in a distributed manner, and the MapReduce framework for processing it.

Hadoop is proficient at storing and processing huge volumes of data in different formats: structured, semi-structured, or even unstructured. But Apache Hadoop can perform only batch processing, and we can access the data only in a sequential manner.

So even for the simplest jobs, we have to scan the whole dataset. Processing a huge dataset in Hadoop also tends to produce another huge dataset, which in turn has to be processed sequentially.

For this reason, Hadoop is not well suited to record lookups, updates, and incremental addition of small batches. Thus, there is a need for a solution that lets us access any single piece of data quickly, without scanning the whole dataset.

Hence, databases such as HBase, CouchDB, Cassandra, MongoDB, and Dynamo came into the picture. These databases can store huge volumes of data and allow random access to it.

Let us now see an introduction to HBase.

What is HBase?

HBase is an open-source, distributed, scalable NoSQL database written in Java. HBase runs on top of the Hadoop Distributed File System (HDFS) and provides random read and write access.

Its data model is very similar to that of Google’s Bigtable, which was designed to provide quick random access to huge volumes of structured data. HBase leverages the fault tolerance provided by HDFS and is designed to store large collections of sparse data sets in a fault-tolerant way.

Apache HBase is a good choice for applications that require fast, random access to huge amounts of data, because it offers high-throughput, low-latency reads and writes on large data sets.

We can store data in HDFS either directly or through HBase. By using HBase, we can read or access the data in HDFS randomly.

It’s time to explore the history of HBase.

History of HBase

Apache HBase is modeled on Google’s Bigtable, which Google uses for collecting data and serving requests for several of its services like Maps, Earth, Finance, etc. Google released the Bigtable research paper in Nov 2006. After this release, the initial HBase prototype was created in Feb 2007.

Later, in January 2008, HBase became a sub-project of Apache Hadoop. HBase 0.18.1 was released in Oct 2008, HBase 0.19.0 in Jan 2009, and HBase 0.20.0 in Sept 2009. In 2010, HBase became an Apache top-level project.

Want to know where we can use HBase? Read below to know the situations where we can use Apache HBase.

Where can we use HBase?

Apache HBase is not suitable for every problem. If we have hundreds of millions or billions of rows and want to read data quickly, then HBase is a good fit. It is mainly used for random, real-time read/write access to big data.

We can use HBase when we want to store huge volumes of data and want high scalability. We can use it only if we can live without all the extra features of traditional database systems like typed columns, transactions, advanced query languages, secondary indexes, etc.

It is a good choice if we have a lot of versioned data and want to store all the versions. You can also opt for HBase if you want column-oriented data.

Let us now see the storage mechanism in HBase.

HBase Data model

HBase is a column-oriented database. A column-oriented database stores data in cells grouped by column rather than by row.

Let us see the whole conceptual view of the organization of data in HBase.

1. Table

The table in HBase consists of multiple rows.

2. Row

A row in HBase contains a row key and one or more columns. These columns have values associated with them. HBase stores rows sorted lexicographically by row key.

The main goal is to store the data in such a way that related rows are near each other. A website domain is a common row key pattern.

For example, if our row keys are domains, then we should store them in reverse, that is, org.apache.www, org.apache.mail, or org.apache.jira. In this way, all the Apache domains are near each other in the HBase table.
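
The reversed-domain pattern can be sketched in a few lines of Python. This is only an illustration of the idea: the helper name reversed_domain_key and the sample domains are assumptions made for this example, and Python's sorted() merely stands in for the sorted order in which HBase keeps row keys.

```python
def reversed_domain_key(domain: str) -> str:
    """Reverse the dot-separated labels: www.apache.org -> org.apache.www."""
    return ".".join(reversed(domain.split(".")))


# Hypothetical sample domains, deliberately interleaved.
domains = [
    "www.apache.org",
    "mail.google.com",
    "mail.apache.org",
    "jira.apache.org",
    "www.google.com",
]

# sorted() on plain ASCII strings mimics the sorted row-key order.
row_keys = sorted(reversed_domain_key(d) for d in domains)

for key in row_keys:
    print(key)
```

After sorting, all the org.apache.* keys sit next to each other, and so do the com.google.* keys, which is exactly the locality the reversed key is meant to achieve.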

3. Column

A column in HBase consists of a column family and a column qualifier, delimited by the : (colon) character.

a. Column Family

Column families physically colocate a set of columns and their values. Each column family has a set of storage properties, such as how its data is compressed, whether its values should be cached in memory, how its row keys are encoded, and others. Every row in an HBase table has the same column families, though a given row may not store anything in a given column family.

b. Column Qualifier

A column qualifier is added to the column family in order to provide an index for the given piece of data.
For example, if the column family is content, then the column qualifiers might be html and pdf, giving the columns content:html and content:pdf.
Column families are fixed during table creation, but column qualifiers are mutable and can differ greatly between rows.
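
To make the difference between fixed column families and free-form qualifiers concrete, here is a small in-memory sketch. It is not the HBase API: the class SketchTable, its methods, and the table and row names are all hypothetical and exist only for this illustration.

```python
class SketchTable:
    """Toy model: families are fixed at creation, qualifiers are free-form."""

    def __init__(self, name, families):
        self.name = name
        self.families = frozenset(families)  # cannot change after creation here
        self.rows = {}  # row key -> {(family, qualifier): value}

    def put(self, row_key, column, value):
        # A column name has the form "family:qualifier".
        family, _, qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {})[(family, qualifier)] = value

    def qualifiers(self, row_key, family):
        """Return the qualifiers a given row actually uses within a family."""
        cells = self.rows.get(row_key, {})
        return sorted(q for (f, q) in cells if f == family)


table = SketchTable("pages", ["content", "meta"])
table.put("org.apache.www", "content:html", "<html>...</html>")
table.put("org.apache.jira", "content:pdf", b"%PDF")
table.put("org.apache.jira", "meta:lang", "en")

print(table.qualifiers("org.apache.www", "content"))
print(table.qualifiers("org.apache.jira", "content"))
```

The two rows share the same families but use different qualifiers, and writing to a family that was not declared at creation time is rejected.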

4. Cell

A cell is basically a combination of row, column family, and the column qualifier. It contains a value and a timestamp, which represents the value’s version.

5. Timestamp

A timestamp is an identifier for the given value version and is written alongside each value. The timestamp, by default, represents the time on the RegionServer when the data was written. But we can specify a different timestamp value while putting data into the cell.
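
The relationship between cells, values, and timestamps can be sketched as a nested mapping. The class below is a toy model written for this article, not HBase code: the name VersionedCells and its methods are assumptions, and the default timestamp simply uses the local clock in milliseconds to mimic a server-assigned write time.

```python
import time


class VersionedCells:
    """Toy model: each (row, column) cell keeps one value per timestamp."""

    def __init__(self):
        self._cells = {}  # (row, column) -> {timestamp: value}

    def put(self, row, column, value, timestamp=None):
        # Without an explicit timestamp, fall back to the current time (ms).
        if timestamp is None:
            timestamp = int(time.time() * 1000)
        self._cells.setdefault((row, column), {})[timestamp] = value
        return timestamp

    def get(self, row, column, timestamp=None):
        versions = self._cells.get((row, column), {})
        if not versions:
            return None
        # Without an explicit timestamp, return the newest version.
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)


store = VersionedCells()
store.put("org.apache.www", "content:html", "v1", timestamp=100)
store.put("org.apache.www", "content:html", "v2", timestamp=200)

latest = store.get("org.apache.www", "content:html")
older = store.get("org.apache.www", "content:html", timestamp=100)
print(latest, older)
```

Reading without a timestamp returns the newest version, while passing the older timestamp retrieves the earlier value that is still stored alongside it.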

HBase Architecture

HBase consists of three major components, namely the HBase Region Servers (which serve Regions), the HMaster Server, and ZooKeeper. Let us study each of these components in detail.

1. HBase Region

A Region is basically the sorted range of rows that store data between the start key and the end key. A table in HBase is divided into a number of regions.

The default Region size depends on the HBase version (256MB in older releases), and we can configure it according to our needs. A Region Server serves a group of regions to the clients. A Region Server can serve approximately 1000 regions.
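
The idea of a region as a sorted range of row keys can be sketched as follows. This is an illustrative model only: the Region dataclass, the split points g and n, and the helper owning_regions are hypothetical, and the sketch assumes the usual convention that the start key is inclusive, the end key is exclusive, and an empty key means the range is open-ended.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    start_key: str  # inclusive; "" means "from the beginning of the table"
    end_key: str    # exclusive; "" means "to the end of the table"

    def contains(self, row_key: str) -> bool:
        after_start = row_key >= self.start_key
        before_end = self.end_key == "" or row_key < self.end_key
        return after_start and before_end


# A hypothetical table split into three regions at the keys "g" and "n".
regions = [Region("", "g"), Region("g", "n"), Region("n", "")]


def owning_regions(row_key):
    """Return every region whose key range covers the row key."""
    return [r for r in regions if r.contains(row_key)]


for key in ["apple", "g", "hadoop", "zebra"]:
    print(key, owning_regions(key))
```

Because the ranges are contiguous and non-overlapping, every row key falls into exactly one region.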

2. HBase HMaster

HMaster in HBase manages the collection of Region Servers, which reside on the DataNodes. HBase HMaster performs DDL operations and assigns regions to the Region Servers. It coordinates and manages the Region Servers.

HMaster assigns regions to the Region Servers during startup and re-assigns the regions to the Region Servers at the time of recovery and load balancing. It is responsible for monitoring all the Region Server instances in a cluster.

It does so with the help of ZooKeeper and performs recovery if any Region Server goes down. HMaster also provides the interface for creating, updating, and deleting tables.
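
The recovery step described above, re-assigning the regions of a failed Region Server, can be sketched with a simple least-loaded strategy. This is not HBase's actual balancer or assignment logic: the function reassign_regions, the server names, and the region names are assumptions chosen only to illustrate the idea.

```python
def reassign_regions(assignments, failed_server):
    """Move the failed server's regions to the least-loaded surviving servers.

    assignments maps a server name to the list of regions it serves.
    Returns a new mapping and leaves the input untouched.
    """
    survivors = sorted(s for s in assignments if s != failed_server)
    if not survivors:
        raise RuntimeError("no surviving Region Server to take over regions")

    new_assignments = {s: list(assignments[s]) for s in survivors}
    for region in assignments.get(failed_server, []):
        # Pick the survivor currently serving the fewest regions.
        target = min(survivors, key=lambda s: len(new_assignments[s]))
        new_assignments[target].append(region)
    return new_assignments


# Hypothetical cluster state before rs3 goes down.
before = {
    "rs1": ["r1", "r2"],
    "rs2": ["r3"],
    "rs3": ["r4", "r5", "r6"],
}
after = reassign_regions(before, "rs3")
print(after)
```

Every region that was on the failed server ends up on a surviving server, and the load stays roughly even across the survivors.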

3. HBase ZooKeeper – The Coordinator

ZooKeeper acts as a coordinator inside the HBase distributed environment. It helps in maintaining the server state inside a cluster by communicating through sessions.

Every Region Server, along with the HMaster Server, sends a heartbeat at regular intervals to ZooKeeper. ZooKeeper uses these heartbeats to check which servers are alive and available.

ZooKeeper provides server failure notifications so that the HMaster can execute recovery measures. ZooKeeper also maintains the path of the server hosting the META table, which helps clients that are searching for a region.
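
Heartbeat-based failure detection can be sketched with a small tracker that remembers when each server was last heard from. This is a simplified model, not the ZooKeeper API: the class LivenessTracker, the 30-second timeout, and the server names are assumptions, and time is passed in explicitly so the example is deterministic.

```python
class LivenessTracker:
    """Toy model: a server is considered dead once its heartbeats stop."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}  # server name -> time of last heartbeat

    def heartbeat(self, server, now):
        self.last_seen[server] = now

    def expired(self, now):
        """Return servers whose last heartbeat is older than the timeout."""
        return sorted(
            server
            for server, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


tracker = LivenessTracker(timeout_seconds=30)
tracker.heartbeat("rs1", now=0)
tracker.heartbeat("rs2", now=0)
tracker.heartbeat("rs1", now=25)  # rs1 keeps reporting, rs2 goes silent

failed = tracker.expired(now=40)
print(failed)
```

In this run only rs2 is reported as failed, which is the kind of notification the HMaster would react to by starting recovery.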

4. HBase Meta Table

The META table is a special HBase catalog table that maintains the list of all regions in the HBase storage system and the Region Servers that host them. The META table stores this information in the form of keys and values.

The key represents the start key of the HBase region and its id. The value contains the path of the Region Server.

Refer to the HBase architecture article to study it in detail.
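
A client-side lookup against such a catalog can be sketched as a search over sorted start keys. The catalog contents, host names, port numbers, and the function locate are all hypothetical; the sketch only illustrates how a key of (start key, region id) and a value holding the server location are enough to route a row key to a Region Server.

```python
from bisect import bisect_right

# Hypothetical catalog: (region start key, region id) -> Region Server location.
meta_rows = {
    ("", 1): "rs1.example.com:16020",
    ("g", 2): "rs2.example.com:16020",
    ("n", 3): "rs1.example.com:16020",
}

# Keep the catalog keys sorted by start key so a binary search can be used.
sorted_keys = sorted(meta_rows)
start_keys = [start for start, _region_id in sorted_keys]


def locate(row_key):
    """Return (region id, server) for the region whose range covers row_key."""
    # The owning region is the one with the greatest start key <= row_key.
    index = bisect_right(start_keys, row_key) - 1
    key = sorted_keys[index]
    return key[1], meta_rows[key]


print(locate("apple"))
print(locate("hadoop"))
print(locate("zebra"))
```

Because the first region starts at the empty key, every row key resolves to some region, and the lookup cost grows only logarithmically with the number of regions.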

Let us now compare RDBMS and HBase.

Difference between RDBMS and HBase

1. Definition

RDBMS: RDBMS is a row-oriented database. Each row is a contiguous unit of a page.
HBase: HBase is a column-oriented database. Each column is a contiguous unit of a page.

2. Schema-type

RDBMS: RDBMS is governed by its schema. RDBMS schema describes the table structure.
HBase: It is schema-less. HBase does not have any concept of fixed columns schema. It defines only the column families.

3. Transaction support

RDBMS: RDBMS is transactional.
HBase: HBase does not support multi-row transactions; operations are atomic only at the level of a single row.

4. Scalability

RDBMS: Built for small tables and is hard to scale.
HBase: Built for wide tables and is horizontally scalable.

5. Normalization

RDBMS: It typically has normalized data.
HBase: It typically has denormalized data.

6. Data type

RDBMS: It is good for structured data.
HBase: It can handle semi-structured as well as structured data.

Difference between HBase and HDFS

1. Definition

HDFS: Hadoop Distributed File System is a Java-based distributed file system that stores huge data across multiple nodes in the Hadoop cluster.
HBase: HBase is a NoSQL database that runs on the top of the Hadoop Distributed File system.

2. Latency

HDFS: HDFS provides high latency operations.
HBase: HBase provides low latency access to small amounts of data within large data sets.

3. Random access

HDFS: HDFS provides sequential data access.
HBase: HBase supports random reads and writes.

4. Access Through

HDFS: We can access HDFS primarily through MapReduce jobs.
HBase: We can access HBase through the shell commands, Java API, REST API, Avro, or Thrift API.

5. Processing

HDFS: HDFS stores data in a distributed manner and leverages batch processing on that data.
HBase: HBase stores data in a column-oriented format, which enables faster, real-time processing.

Features of HBase

Applications of HBase

We can use HBase in many sectors. Some of the applications of HBase in different sectors are:

Limitations of HBase

Summary

In short, we can say that HBase is a NoSQL database that runs on top of the Hadoop Distributed File System. It provides Bigtable-like capabilities to the Hadoop framework. HBase consists of Regions served by HBase Region Servers, the HMaster Server, and ZooKeeper.

The article listed the differences between HBase and RDBMS, as well as the differences between HBase and HDFS. HBase provides consistent reads and writes.

It is open-source and scalable. We can use HBase in many sectors, including medical, sports, e-commerce, and many more.

Follow this HBase tutorial guide to master HBase in every aspect. If you still have any doubts, do share them in the comment box.
