Hive Tutorial – Introduction to Apache Hive
Apache Hive is an open-source tool built on top of Hadoop. It facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. In this Hive tutorial article, we will study the introduction to Apache Hive, its history, architecture, features, and limitations.
The Hive Tutorial covers the following points:
- Hive Introduction
- Hive History
- Why Use Apache Hive
- Hive Architecture
- Data models
- Hive Features
- Hive Limitations
- Career with Hive
Hive Introduction
Apache Hive is an open-source data warehousing infrastructure based on Apache Hadoop. It is designed for summarizing, querying, and analyzing large volumes of data, and it abstracts away the complexity of MapReduce jobs.
Initially, developers had to write complex MapReduce jobs, but with the help of Hive, we just need to submit SQL-like queries (HQL), which are then converted into MapReduce jobs. Thus, with Hive, developers are not required to write complex Java programs.
Hive uses the Hive Query Language (HQL), similar to SQL, for querying and analyzing structured or semi-structured data. It targets users who are comfortable with SQL and is best suited for traditional data warehousing tasks.
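For instance, an aggregation that would once require a hand-written MapReduce job reduces to a single HQL statement. As a sketch, assuming a hypothetical `page_views` table:

```sql
-- Hypothetical table: page_views(user_id STRING, duration_sec INT, country STRING)
-- Average visit duration per country; Hive compiles this into MapReduce jobs.
SELECT country,
       COUNT(*)          AS visits,
       AVG(duration_sec) AS avg_duration
FROM   page_views
GROUP  BY country;
```

Writing the equivalent counts and averages directly as MapReduce would take a full Java program per query.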
Hive History
Before 2008, all of the data processing infrastructure at Facebook was built around a data warehouse using an RDBMS. The data at Facebook was growing at a rate of 15 TB/day in 2007, and within a few years it grew to around 2 PB/day. The infrastructure at that time took about a day to complete the daily data processing jobs, so Facebook was searching for an infrastructure that would scale along with its data.
Due to Hadoop being open-source and scalable, Facebook started using Hadoop. With Hadoop, they were able to process the daily data processing jobs within a few hours.
However, using Hadoop was not easy for end-users, especially for those not familiar with MapReduce. End-users had to write complex MapReduce jobs for simple tasks like counts and averages. So Facebook tried to bring query capabilities to Hadoop while still maintaining its extensibility and flexibility. This led the data infrastructure team at Facebook to develop Hive, which then became very popular with users inside Facebook.
Later on, in August 2008, Hive was open-sourced. Now many other companies like Amazon, Netflix, etc. use Hive. The latest version of Hive is Hive 3.1.2, which was released on 26 August 2019.
Why use Apache Hive
Apache Hive combines the advantages of both SQL database systems and the Hadoop MapReduce framework.
With Hive, one can smoothly perform data analysis without writing complex MapReduce jobs. We can use Hive for analyzing, querying, and summarizing large volumes of data. It is best suited for traditional data warehousing tasks, where users perform data analytics and data mining that do not require real-time processing.
It is easy to couple Hive queries with different Hadoop packages like RHive, RHIPE, and even Apache Mahout.
For example, we can integrate Tableau with Hive for data visualization, and running Hive on Apache Tez provides faster, more interactive query processing.
Hive also helps developers working with complex analytical processing and challenging data formats. Hive SQL can be extended with user-defined functions (UDFs).
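As a sketch, a custom UDF packaged in a JAR can be registered and called from HQL. The JAR path, class name, and table below are hypothetical:

```sql
-- Hypothetical JAR containing a custom string-masking function.
ADD JAR /tmp/my_udfs.jar;
CREATE TEMPORARY FUNCTION mask_email AS 'com.example.hive.MaskEmailUDF';

-- Use it like any built-in function.
SELECT mask_email(email) FROM users;
```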
Due to these reasons, large numbers of companies, including Amazon, Netflix, etc. use Apache Hive.
Hive Architecture
The major components of Apache Hive are:
- Hive Client
- Hive Services
- Processing and Resource Management
- Distributed Storage
Hive Client
Hive supports applications written in many programming languages, such as C++, Python, and Java, by using the JDBC, ODBC, and Thrift drivers for running queries against Hive. Hence, one can easily write a Hive application in the language of one's choice.
We can categorize Hive clients into three types:
- Thrift clients: As the Hive server is based on Apache Thrift, it can serve requests from Thrift clients.
- JDBC clients: A JDBC client is a Java application that uses the JDBC protocol. Hive allows such applications to connect to it via the Hive JDBC driver.
- ODBC clients: An ODBC client is an application that uses the ODBC protocol. The Hive ODBC driver allows such applications to connect to Hive.
Hive Services
To perform queries, Hive provides various services, such as the Beeline shell and HiveServer2.
1. Beeline
Beeline is a command shell provided by HiveServer2 that allows users to submit Hive queries and commands.
2. HiveServer2
HiveServer2 is the successor of HiveServer1. It enables clients to execute queries against Apache Hive, and it permits multiple clients to submit requests and retrieve the final results.
3. Hive Driver
The Hive driver accepts the HQL statements submitted by a user via the Beeline shell and creates session handles for the query. It then sends the query to the Hive compiler.
4. Hive Compiler
The Hive compiler parses the query and performs semantic analysis and type-checking. It generates the execution plan in the form of a DAG (Directed Acyclic Graph).
5. Optimizer
The optimizer performs transformation operations on the execution plan. It splits tasks to improve efficiency and scalability.
6. Execution Engine
The execution engine runs the tasks produced by the compilation and optimization steps, in the order of their dependencies, using Hadoop.
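The plan that the compiler and optimizer produce can be inspected with HQL's `EXPLAIN` statement. The table name below is hypothetical:

```sql
-- Show the stages Hive will run for this query,
-- without actually executing it.
EXPLAIN
SELECT country, COUNT(*)
FROM   page_views
GROUP  BY country;
```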
7. Metastore
The metastore is a central repository that stores metadata about the structure of tables and partitions, including column and column-type information. It also stores the serializer and deserializer information required for read/write operations. The metastore is normally backed by a relational database.
Processing Framework and Resource Management
Internally, Hive uses the MapReduce framework as the de facto engine to execute queries.
Distributed Storage
Hive uses the Hadoop Distributed File System (HDFS) for distributed storage.
Go through the HDFS Introduction article to study HDFS.
Hive Data Models
Hive organizes data into:
- Tables
- Partitions
- Buckets
Tables in Hive are similar to tables in relational databases. We can perform filter, join, project, and union operations on Hive tables. All the data of a table is stored in a directory in HDFS. Thanks to the external-table notion supported by Hive, you can also create a table on pre-existing files or directories in HDFS by giving the proper location in the table-creation DDL.
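For example, an external table can be declared over files that already sit in HDFS; the path and columns below are hypothetical:

```sql
-- Point a Hive table at pre-existing data; dropping the table
-- later removes only the metadata, not the underlying files.
CREATE EXTERNAL TABLE page_views (
  user_id      STRING,
  duration_sec INT,
  country      STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/page_views';
```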
Hive organizes tables into partitions based on partition keys, grouping similar data together. These partitions are simply subdirectories inside the table directory, which allows fast queries over slices of the data. A Hive table can have one or more partition keys to identify its partitions.
In Hive, each partition can be further divided into buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory. An unpartitioned table can also be divided into buckets, in which case each bucket is stored as a file in the table directory.
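Combining the two, a table can be declared as both partitioned and bucketed at creation time; the table and column names below are hypothetical:

```sql
-- Partition by date (one subdirectory per day) and hash-bucket
-- each partition by user_id into 8 files.
CREATE TABLE page_views_part (
  user_id      STRING,
  duration_sec INT
)
PARTITIONED BY (view_date STRING)
CLUSTERED BY (user_id) INTO 8 BUCKETS
STORED AS ORC;
```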
Go through the Data Model article to study the Hive Data Model in detail.
Hive Features
- Apache Hive provides easy access to data through SQL-like queries, so we can use it for data warehousing tasks such as ETL (extract/transform/load), reporting, and data analysis.
- Using Apache Hive, we can impose structure on a variety of data formats.
- Apache Hive facilitates access to the files stored either directly in HDFS or in other data storage systems such as HBase.
- Queries can be executed via Apache Spark, Apache Tez, or MapReduce.
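As a sketch of the HBase access mentioned above, Hive can map a table onto an existing HBase table via the HBase storage handler. The table names and column mapping below are hypothetical:

```sql
-- Map a Hive table onto an (assumed existing) HBase table 'users'.
CREATE EXTERNAL TABLE hbase_users (
  rowkey STRING,
  name   STRING,
  email  STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,info:name,info:email"
)
TBLPROPERTIES ("hbase.table.name" = "users");
```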
Hive Limitations
There are some limitations of Apache Hive:
- Hive is designed for batch jobs, so we cannot use it for Online Transaction Processing (OLTP). We can, however, use it for Online Analytical Processing (OLAP).
- Hive does not offer real-time queries or row-level updates.
- The latency of Apache Hive queries is generally high.
Career with Hive
Apache Hive is a much-in-demand skill to master if you are looking to work in the Big Data Hadoop world.
Currently, most of the top MNCs and enterprises are looking for people with the right set of skills for analyzing and querying vast volumes of data.
Thus, learning Apache Hive is a milestone to command top salaries in some of the top organizations in the world.
Summary
In this article, you learned about Apache Hive, which sits on top of Hadoop for Online Analytical Processing. It uses a SQL-like query language called Hive Query Language (HQL) and translates HQL queries into the corresponding MapReduce jobs, which increases developer productivity by abstracting away the complexity of writing MapReduce jobs. You also saw the architecture of Apache Hive, along with its features and limitations.