Apache HBase Tutorial for Beginners

Tech Vidvan

4 years ago

In this HBase tutorial, you will explore one of the most important components of the Hadoop ecosystem: Apache HBase. The article covers the basics as well as some advanced concepts related to Apache HBase.

After reading this article, you will know what HBase is, where you can use it, the features it offers, its history, and more.

Why do we use HBase?

Since the 1970s, Relational Database Management Systems (RDBMS) have been the standard solution for storing and maintaining huge volumes of structured data.

After the advent of big data, organizations realized the advantages of processing big data, and they all started opting for solutions like Hadoop for dealing with big data problems. Apache Hadoop uses a distributed file system that stores big data in a distributed manner, and the MapReduce framework for processing it.

Hadoop is proficient in storing and processing huge volumes of data in different formats: structured, semi-structured, or even unstructured. But Apache Hadoop can perform only batch processing, and we can access the data only in a sequential manner.

So even for the simplest jobs, we have to scan the whole dataset. In Hadoop, processing a huge dataset often produces another huge dataset, which we again have to process sequentially.

For this reason, Hadoop is not well suited to record lookups, updates, and incremental additions of small batches. Thus, there is a need for a solution that lets us reach any single piece of data directly, without scanning everything before it (random access).

Hence, databases such as HBase, CouchDB, Cassandra, MongoDB, and Dynamo came into the picture. These databases can store huge volumes of data and allow data access in a random manner.

Let us now see an introduction to HBase.

What is HBase?

HBase is an open-source, distributed, scalable NoSQL database written in Java. HBase runs on top of the Hadoop Distributed File System (HDFS) and provides random read and write access.

Its data model is very similar to that of Google's Bigtable, which was designed to provide quick random access to huge volumes of structured data. HBase leverages the fault tolerance provided by HDFS, and it is designed to store large collections of sparse data sets in a fault-tolerant way.

Apache HBase is a good choice for applications that require fast, random access to huge amounts of data. This is because HBase achieves high throughput and low latency by providing fast read and write access to huge data sets.

We can store data in HDFS either directly or through HBase. By using HBase, we can read or access the data in HDFS randomly.

It is time to explore the history of HBase.

History of HBase

Apache HBase is modeled on Google's Bigtable, a storage system used for collecting data and serving requests for several Google services such as Maps, Earth, and Finance. Google released the Bigtable research paper in November 2006, and the initial HBase prototype was created in February 2007.

In January 2008, HBase became a sub-project of Apache Hadoop. HBase 0.18.1 was released in October 2008, HBase 0.19.0 in January 2009, and HBase 0.20.0 in September 2009. In 2010, HBase became an Apache top-level project.

Want to know where we can use HBase? Read below to know the situations where we can use Apache HBase.

Where can we use HBase?

Apache HBase is not suitable for every problem. If we have hundreds of millions or billions of rows and want to read data quickly, then HBase is a good fit. It is mainly used for random, real-time read/write access to big data.

We can use HBase when we want to store huge volumes of data and want high scalability. We can use it only if we can live without all the extra features of traditional database systems like typed columns, transactions, advanced query languages, secondary indexes, etc.

It is a good choice if we have a lot of versioned data and want to keep all the versions. You can also opt for HBase if you want column-oriented storage.

Let us now see the storage mechanism in HBase.

HBase Data model

HBase is a column-oriented database. A column-oriented database stores data in cells grouped by column rather than by row.

Let us see the whole conceptual view of how data is organized in HBase.

1. Table

The table in HBase consists of multiple rows.

2. Row

A row in HBase contains a row key and one or more columns with values associated with them. HBase stores rows sorted lexicographically by row key.

The main goal of row key design is to store data in such a way that related rows are near each other. A website domain is a common row key pattern.

For example, if our row keys are domains, then we should store them reversed, that is, org.apache.www, org.apache.mail, or org.apache.jira. In this way, all the Apache domains are near each other in the HBase table.
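
The reversed-domain pattern above can be sketched in a few lines of plain Python. This is only an illustration of the sorting idea, not HBase client code: the helper name reversed_domain_key and the sample domains (including news.example.com) are assumptions made for the example.

```python
def reversed_domain_key(domain: str) -> bytes:
    """Reverse the dot-separated labels of a domain and encode it as bytes."""
    return ".".join(reversed(domain.lower().split("."))).encode("utf-8")


# Hypothetical sample data for the illustration.
domains = ["www.apache.org", "mail.apache.org", "news.example.com", "jira.apache.org"]

# Sorting the raw domains puts news.example.com between the Apache hosts.
plain_order = sorted(d.encode("utf-8") for d in domains)

# Sorting the reversed keys keeps every *.apache.org host next to the others.
row_keys = sorted(reversed_domain_key(d) for d in domains)

for key in row_keys:
    print(key.decode("utf-8"))
```

Because the keys are compared byte by byte, every key that starts with org.apache. forms one contiguous block, which is what makes a scan over "all Apache domains" cheap.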

3. Column

HBase column consists of a column family and a column qualifier delimited by the : (colon) character.

a. Column Family

Column families physically colocate a set of columns and their values. Each column family has a set of storage properties, such as how its data is compressed, whether its values should be cached in memory, and how its row keys are encoded. Every row in an HBase table has the same column families.

b. Column Qualifier

A column qualifier is added to the column family in order to provide an index for a given piece of data.
For example: If the column family is content, then the columns can be content:html or content:pdf, where html and pdf are the qualifiers.
Column families are fixed at table creation, but column qualifiers are mutable and can differ greatly between rows.

4. Cell

A cell is basically a combination of row, column family, and the column qualifier. It contains a value and a timestamp, which represents the value’s version.

5. Timestamp

A timestamp is an identifier for the given value version and is written alongside each value. The timestamp, by default, represents the time on the RegionServer when the data was written. But we can specify a different timestamp value while putting data into the cell.
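
To make the table, row, column, cell, and timestamp concepts more concrete, here is a minimal in-memory sketch in Python. It is a toy model and not the HBase API: the class ToyTable, its put and get methods, and the sample values are all hypothetical names chosen for this illustration.

```python
import time
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for an HBase table (hypothetical, not the HBase API)."""

    def __init__(self, families):
        # Column families are fixed when the table is created.
        self.families = set(families)
        # row key -> "family:qualifier" -> {timestamp: value}
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp=None):
        family, _, _qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if timestamp is None:
            # Default: the current time in milliseconds at write time.
            timestamp = int(time.time() * 1000)
        self.rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest version, or the newest one at or before `timestamp`."""
        versions = self.rows.get(row, {}).get(column, {})
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]


table = ToyTable(["content"])
table.put("org.apache.www", "content:html", "<html>v1</html>", timestamp=100)
table.put("org.apache.www", "content:html", "<html>v2</html>", timestamp=200)
table.put("org.apache.www", "content:pdf", "%PDF-1.4", timestamp=150)

latest = table.get("org.apache.www", "content:html")
older = table.get("org.apache.www", "content:html", timestamp=150)
print(latest, older)
```

Here a cell is addressed by row key plus family:qualifier, and each cell keeps several values keyed by timestamp, so an older version can still be read after a newer one is written.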

HBase Architecture

HBase consists of three major components, namely the Region Servers (which serve Regions), the HMaster server, and ZooKeeper. Let us study each of these components in detail.

1. HBase Region

A Region is a sorted range of rows that stores the data between a start key and an end key. A table in HBase is divided into a number of Regions.

The default maximum Region size was 256 MB in early HBase releases (newer releases use a much larger default), and we can configure it according to our needs through the hbase.hregion.max.filesize property. A Region Server serves a group of Regions to the clients. A Region Server can serve approximately 1,000 Regions.
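
Since each Region covers a sorted key range, finding the Region that holds a given row key is a search over the Regions' start keys. The Python sketch below illustrates that idea with a binary search. The region names and split points are made up for the example, and the code is a simplification under those assumptions, not how the HBase client is implemented.

```python
import bisect

# Hypothetical Regions of one table as (start_key, region_name).
# The first Region starts at the empty key; each Region ends where the next begins.
regions = [
    (b"", "region-1"),
    (b"g", "region-2"),
    (b"n", "region-3"),
    (b"t", "region-4"),
]
start_keys = [start for start, _ in regions]


def find_region(row_key: bytes) -> str:
    """Return the Region whose [start_key, next_start_key) range contains row_key."""
    index = bisect.bisect_right(start_keys, row_key) - 1
    return regions[index][1]


print(find_region(b"apple"))   # falls before b"g"
print(find_region(b"g"))       # a start key belongs to its own Region
print(find_region(b"zebra"))   # falls after the last start key
```

The start key is inclusive and the end key is exclusive, so every possible row key maps to exactly one Region.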

2. HBase HMaster

The HMaster in HBase manages the collection of Region Servers, which reside on the DataNodes. The HMaster performs DDL operations and assigns Regions to the Region Servers. It coordinates and manages the Region Servers.

The HMaster assigns Regions to the Region Servers during startup and re-assigns Regions at the time of recovery and load balancing. It is responsible for monitoring all the Region Server instances in a cluster.

It does so with the help of ZooKeeper and starts a recovery procedure if any Region Server goes down. The HMaster also provides the interface for creating, updating, and deleting tables.

3. HBase ZooKeeper – The Coordinator

ZooKeeper acts as a coordinator inside the HBase distributed environment. It helps in maintaining the server state inside a cluster by communicating through sessions.

Every Region Server, along with the HMaster server, sends a heartbeat at regular intervals to ZooKeeper. ZooKeeper uses these heartbeats to track which servers are alive and available.

ZooKeeper provides server failure notifications so that the HMaster can execute recovery measures. ZooKeeper also maintains the location of the META table, which helps a client that is searching for a Region.
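
The heartbeat and recovery flow described above can be sketched as a small simulation. The Python code below is a toy model under stated assumptions: ToyCoordinator stands in for ZooKeeper's session tracking, reassign stands in for the HMaster's recovery step, and the server names, timeout, and round-robin policy are invented for the example. Real HBase is considerably more involved.

```python
class ToyCoordinator:
    """Tracks the last heartbeat of each server (a stand-in for ZooKeeper sessions)."""

    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self.last_seen = {}

    def heartbeat(self, server, now):
        self.last_seen[server] = now

    def live_servers(self, now):
        # A server is alive if its last heartbeat is within the session timeout.
        return sorted(
            server
            for server, seen in self.last_seen.items()
            if now - seen <= self.session_timeout
        )


def reassign(assignments, live):
    """Move Regions of dead servers to live ones, round-robin (a stand-in for HMaster recovery)."""
    if not live:
        raise RuntimeError("no live Region Servers")
    orphaned = sorted(r for r, server in assignments.items() if server not in live)
    updated = dict(assignments)
    for i, region in enumerate(orphaned):
        updated[region] = live[i % len(live)]
    return updated


zk = ToyCoordinator(session_timeout=30)
for server in ("rs1", "rs2", "rs3"):
    zk.heartbeat(server, now=0)
# rs2 stops sending heartbeats; rs1 and rs3 keep going.
zk.heartbeat("rs1", now=40)
zk.heartbeat("rs3", now=40)

live = zk.live_servers(now=50)
assignments = {"region-1": "rs1", "region-2": "rs2", "region-3": "rs2", "region-4": "rs3"}
after = reassign(assignments, live)
print(live, after)
```

Time is passed in explicitly as `now` so that the example is deterministic; a real system would rely on session expiry rather than on a caller-supplied clock.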

4. HBase Meta Table

The META table in HBase is a special catalog table that maintains the list of all Regions in the HBase storage system, together with the Region Servers that host them. The META table stores this information in the form of keys and values.

The key represents the start key of the HBase Region and its ID. The value contains the location of the Region Server that serves it.

Refer to the HBase architecture article to study it in detail.

Let us now compare RDBMS and HBase.

Difference between RDBMS and HBase

1. Definition

RDBMS: RDBMS is a row-oriented database. Each row is a contiguous unit on a page.
HBase: HBase is a column-oriented database. Each column is a contiguous unit on a page.

2. Schema-type

RDBMS: RDBMS is governed by its schema. The RDBMS schema describes the table structure.
HBase: It is schema-less. HBase does not have any concept of fixed columns schema. It defines only the column families.

3. Transaction support

RDBMS: RDBMS is transactional.
HBase: There are no multi-row transactions in HBase. Atomicity is provided only at the row level.

4. Scalability

RDBMS: Built for small tables and is hard to scale.
HBase: Built for wide tables and is horizontally scalable.

5. Normalization

RDBMS: It typically has normalized data.
HBase: It typically has denormalized data.

6. Data type

RDBMS: It is good for structured data.
HBase: It can handle semi-structured as well as structured data.

Difference between HBase and HDFS

1. Definition

HDFS: Hadoop Distributed File System is a Java-based distributed file system that stores huge data across multiple nodes in the Hadoop cluster.
HBase: HBase is a NoSQL database that runs on the top of the Hadoop Distributed File system.

2. Latency

HDFS: HDFS provides high latency operations.
HBase: HBase provides low latency access to the small amounts of data within the large data sets.

3. Random access

HDFS: HDFS provides sequential data access.
HBase: HBase supports random reads and writes.

4. Access Through

HDFS: We can access HDFS primarily through MapReduce jobs.
HBase: We can access HBase through the shell commands, Java API, REST API, Avro, or Thrift API.

5. Processing

HDFS: HDFS stores data in a distributed manner and leverages batch processing on that data.
HBase: HBase stores data in a column-oriented format, which makes lookups faster and supports real-time processing.

Features of HBase

Atomic read and write: HBase provides atomic reads and writes at the row level. A write to a single row either applies completely or not at all, and concurrent operations on that row never see a partially written update.
Consistent reads and writes: Apache HBase provides consistent reads and writes due to the above feature.
Linear and modular scalability: HBase runs on top of HDFS, so its data sets are distributed over HDFS. Due to this, HBase scales linearly across nodes. It is also modularly scalable.
Easy to use Java API for the client access: HBase provides easy to use Java API for the programmatic access.
Thrift gateway and the REST-ful Web services: HBase also provides the support for the Thrift and REST API for the non-Java front-ends.
Block Cache and Bloom Filters: It supports the Block Cache and the Bloom Filters for the high volume query optimization.
Automatic failure support: Apache HBase with Hadoop Distributed File System provides WAL (Write Ahead Log) across the clusters, which provides automatic failure support.
Sorted rowkeys: HBase stores rowkeys in the lexicographical order. By using the sorted rowkeys and the timestamps, we can build an optimized request.
Data replication: HBase provides data replication across the clusters.
Sharding: HBase offers automatic as well as manual splitting of regions into smaller subregions in order to reduce the I/O time and overhead.
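
Among the features above, Bloom filters are perhaps the least self-explanatory. A Bloom filter is a compact structure that can answer "this key is definitely not here" without reading the data, which lets a store skip files that cannot contain a requested row. The Python sketch below shows the general technique only. The class, its parameters, and the hashing scheme are assumptions made for this illustration and do not reflect HBase's internal implementation.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


bf = BloomFilter()
keys = [b"org.apache.jira", b"org.apache.mail", b"org.apache.www"]
for key in keys:
    bf.add(key)

# Every added key is reported as possibly present.
print(all(bf.might_contain(k) for k in keys))
# For a key that was never added, a False result means the lookup can be skipped.
print(bf.might_contain(b"com.example.news"))
```

A False answer is always correct, so a read can safely skip that file. A True answer only means the file has to be checked.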

Applications of HBase

We can use HBase in many sectors. Some of the applications of HBase in different sectors are:

Medical: Medical sectors use Apache HBase in order to store genome sequences and run MapReduce on it. They use it for storing the disease history of people and many others.
Sports: We can use HBase for storing match histories in order to perform better analysis and prediction.
Oil and petroleum: Apache HBase is also used in the oil and petroleum industry for storing the exploration data. This is done in order to perform analysis and predict the probable places where oil can be found.
E-commerce: E-commerce sector uses HBase for recording and storing the logs about the customer search history. They do so for performing data analysis so as to target the interested audience for profit gain.
Other fields: We can use HBase in various other fields where we want to store petabytes of data and perform analysis on it for which RDBMS may take months.
Companies such as Facebook, Yahoo, Twitter, Infolinks, and Adobe use Apache HBase internally.

Limitations of HBase

HBase does not provide support for some of the features of traditional database systems.
HBase does not natively support SQL, and consequently it has no query optimizer.
HBase is CPU and memory intensive.
Integrating HBase with Hive and Pig jobs can sometimes cause memory issues in the cluster.
In a shared cluster environment, an HBase setup requires fewer task slots per node so that CPU capacity is left for HBase.
The cost of running and maintaining HBase is high.
HBase supports only one default sort order per table, which is by row key.
Joins and normalization are very difficult in HBase tables.

Summary

In short, we can say that HBase is a NoSQL database that runs on top of the Hadoop Distributed File System. It provides BigTable capabilities to the Hadoop framework. HBase consists of Region Servers (which serve Regions), the HMaster server, and ZooKeeper.

The article listed the differences between HBase and RDBMS, as well as the differences between HBase and HDFS. HBase provides consistent reads and writes.

It is open-source and scalable. We can use HBase in many sectors, including medical, sports, e-commerce, and many more.

Follow this HBase tutorial guide to master HBase in every aspect. If you still have any doubts, then do share them in the comment box.