
Impala Tutorial for Beginners

Apache Impala Tutorial

Impala is an open-source, native analytic database for Hadoop. Vendors such as Cloudera, Oracle, MapR, and Amazon have shipped Impala. If you want to learn the essentials of Impala, you have landed in the right place.

This Impala tutorial explains the main concepts related to Impala. Impala is used to process huge amounts of data at high speed using traditional SQL knowledge.

To learn Impala, you should be a little familiar with the basics of Apache Hadoop and the HDFS commands. Basic knowledge of SQL is a plus.

 

What is Impala?

Impala is an open-source, native analytic database for Hadoop. It is a Massively Parallel Processing (MPP) SQL query engine that processes vast amounts of data stored in a Hadoop cluster.

Compared to other SQL engines for Apache Hadoop, such as Hive, Impala provides high performance and low latency.

In simpler words, Impala is a high-performance SQL engine that offers one of the fastest ways to access data stored in HDFS (Hadoop Distributed File System). Impala is written in C++ and Java.

Apache Impala raises the bar for SQL query performance on Hadoop while retaining a familiar user experience. With Apache Impala, we can query data stored either in HDFS or in Apache HBase. We can perform operations such as SELECT, JOIN, and aggregate functions in real time with Impala.

Apache Impala uses the same SQL syntax (Hive Query Language), metadata, user interface, and ODBC drivers as Apache Hive, and thus provides a familiar and unified platform for batch-oriented and real-time queries.

This allows Hive users to adopt Apache Impala with little setup overhead. However, not all Hive queries are supported by Impala, and a few may need syntactic changes. The Impala query language is a subset of the Hive Query Language, with some functional limitations such as transforms.

Reasons for using Apache Impala

1. Apache Impala combines the flexibility and scalability of Hadoop with the SQL support and multi-user performance of a traditional analytic database, by using components such as HDFS, Metastore, HBase, Sentry, and YARN.

2. With Apache Impala, users can communicate with HDFS or HBase by using SQL-like queries, and do so faster than with other SQL engines such as Apache Hive.

3. Apache Impala can read almost every file format used by Apache Hadoop, such as Parquet, RCFile, and Avro.

4. Moreover, it uses the same SQL syntax (Hive SQL), metadata, user interface, and ODBC driver as Apache Hive, thus providing a familiar and unified platform for batch-oriented and real-time queries.

5. Also, unlike Apache Hive, Impala is not based on MapReduce. It implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution. This avoids the latency of MapReduce jobs and makes Apache Impala faster than Apache Hive for such queries.

Where to use Apache Impala

Architecture of Apache Impala

The above image depicts the Impala architecture. Apache Impala runs on a number of systems in the Apache Hadoop cluster. Unlike traditional storage systems, Apache Impala is decoupled from its storage engine.

Impala has three core components: the Impala daemon (impalad), the Impala Statestore, and the Impala Catalog Service.

Let us see each one in detail.

1. Impala Daemon

The Impala daemon is the core component of Apache Impala. It is physically represented by an impalad process, which runs on each machine where Impala is installed. Impala daemons accept queries, distribute the work across the cluster, and return the results. They also receive broadcast messages from the catalog service whenever:

  1. Any Impala daemon in the cluster creates, drops, or alters any type of object.
  2. Impala processes an INSERT or LOAD DATA statement.

We can deploy Impala daemons in one of the following ways:

2. Impala Statestore

The Impala Statestore checks on the health of all the Impala daemons in the cluster and continuously relays its findings to each of them. It is physically represented by a daemon process named statestored.

We only need a statestored process on one host in the cluster. If any Impala daemon goes offline because of a network error, hardware failure, software issue, or any other reason, the Statestore informs all the other Impala daemons.

This ensures that future queries do not make requests to the failed Impala daemon.

The Impala Statestore is not always critical for the normal operation of the Impala cluster. If the Statestore is not running, the Impala daemons keep running and distribute the work among themselves as usual.

In this case, however, the cluster becomes less robust if other Impala daemons fail, and the metadata becomes less consistent. When the Statestore comes back, it re-establishes communication with all the Impala daemons and resumes its monitoring and broadcasting functions.
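The monitoring-and-broadcast role described above can be pictured with a small toy model. This is only a sketch: the class names, the `alive` flag, and the node names are hypothetical, and the real statestored relies on heartbeats over the network rather than a boolean.

```python
from dataclasses import dataclass, field


@dataclass
class ImpalaDaemon:
    """Toy stand-in for an impalad process (not the real implementation)."""
    name: str
    alive: bool = True
    known_live_peers: set = field(default_factory=set)

    def receive_membership(self, live_names):
        # Each daemon remembers which peers it may send work to.
        self.known_live_peers = set(live_names) - {self.name}


class StateStore:
    """Toy stand-in for statestored: check health, then broadcast."""

    def __init__(self, daemons):
        self.daemons = list(daemons)

    def check_and_broadcast(self):
        # Step 1: health check (a flag here; heartbeats in a real cluster).
        live = [d.name for d in self.daemons if d.alive]
        # Step 2: tell every live daemon which daemons are available.
        for d in self.daemons:
            if d.alive:
                d.receive_membership(live)
        return live


daemons = [ImpalaDaemon("node1"), ImpalaDaemon("node2"), ImpalaDaemon("node3")]
statestore = StateStore(daemons)
statestore.check_and_broadcast()

daemons[2].alive = False          # node3 goes offline
statestore.check_and_broadcast()  # the survivors learn about the failure
print(sorted(daemons[0].known_live_peers))  # ['node2']
```

After the second broadcast, node1 no longer lists node3 as a peer, so it would not send work to the failed daemon.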

3. Impala Catalog Service

The Catalog Service is the Impala component that relays metadata changes made by Impala SQL statements to all the Impala daemons in the cluster. It is physically represented by a daemon process named catalogd.

We only need a catalogd process on one host in the cluster. Since the requests are passed via the Statestore daemon, it is best to run the statestored and catalogd processes on the same host.

The Impala Catalog Service removes the need to issue REFRESH and INVALIDATE METADATA statements when the metadata changes are made by statements issued through Apache Impala.

When we create a table or load data through Apache Hive, however, we have to issue REFRESH or INVALIDATE METADATA on an Impala node before executing any query there.
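The rule above can be written down as a small decision helper. This is a sketch under assumptions: the function name and the `change` labels are made up, and the mapping (INVALIDATE METADATA for a table Impala has not seen yet, REFRESH for new data in a table it already knows) reflects common Impala usage rather than anything spelled out in this tutorial.

```python
def metadata_sync_statement(table, change, made_through_impala):
    """Return the statement to run in Impala after a change, or None.

    change: "new_table" (a table was created) or
            "new_data" (data was loaded into an existing table).
    """
    if made_through_impala:
        # The catalog service already relays these changes to every daemon.
        return None
    if change == "new_table":
        return f"INVALIDATE METADATA {table};"
    if change == "new_data":
        return f"REFRESH {table};"
    raise ValueError(f"unknown change type: {change!r}")


# A table created in Hive must be made visible to Impala first.
print(metadata_sync_statement("sales", "new_table", made_through_impala=False))
# Data loaded through Impala itself needs no manual step.
print(metadata_sync_statement("sales", "new_data", made_through_impala=True))
```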

Note:

Features like load balancing and high availability apply to the impalad daemon. The statestored and catalogd processes have no special requirements for high availability, because a failure in either of them does not result in data loss.

If either of these two daemons becomes unavailable because of an outage on a particular host, then we can stop the Impala service by:

Impala Query Processing Interfaces

For processing queries, Apache Impala provides three interfaces:

1. Impala-shell: After setting up Apache Impala by using the Cloudera VM, we can start this shell by typing the command impala-shell in a terminal.

2. Hue interface: We can also process Impala queries by using the Hue browser. Hue provides an Impala query editor where we can type and execute Impala queries. To access the editor, we first have to log in to Hue.

3. ODBC/JDBC drivers: Like other databases, Apache Impala provides ODBC/JDBC drivers. With these drivers, we can connect to Impala from any programming language that supports them and build applications that run queries in Impala.
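As an illustration of the first interface, here is a minimal sketch that assembles an impala-shell command line for a one-off query. It only builds the argument list and does not run anything. The host, database, and query are placeholders, and the helper name is hypothetical; `-i`, `-d`, and `-q` are impala-shell options for the daemon address, the database, and the query.

```python
import shlex


def impala_shell_command(query, host="localhost", port=21000, database=None):
    """Build the argument list for a non-interactive impala-shell call."""
    argv = ["impala-shell", "-i", f"{host}:{port}"]
    if database:
        argv += ["-d", database]  # start in this database
    argv += ["-q", query]         # run one query, then exit
    return argv


cmd = impala_shell_command("SELECT COUNT(*) FROM employees", database="demo_db")
print(shlex.join(cmd))  # a copy-pasteable shell command
```

The resulting list could be passed to `subprocess.run` on a machine where impala-shell is installed. Keeping the arguments as a list avoids quoting problems with the SQL text.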

How is a query executed in Impala?

Whenever we submit a query through any of the Impala query processing interfaces, an Impala daemon (impalad) in the cluster accepts it. This daemon then acts as the coordinator node for that specific query.

After receiving the query, the coordinator verifies whether the query is valid, using the table schema from the Hive metastore.

It then collects the information about the location of the data required for executing the query from the HDFS NameNode, and sends this information to the other Impala daemons.

The other Impala daemons in the cluster read the specified data blocks and execute their part of the query. When all of them have completed their tasks, the coordinator collects the results and delivers them to the user.
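The flow above (validate, locate the data, fan out, collect) can be sketched as a toy simulation of an average over one column. Everything here is hypothetical: the schema, block placement, and values stand in for what the Hive metastore and the HDFS NameNode would report in a real cluster.

```python
from collections import defaultdict

# Hypothetical schema, standing in for the Hive metastore.
TABLE_SCHEMA = {"employees": {"id", "name", "salary"}}

# Hypothetical block placement, standing in for the NameNode's answer.
BLOCK_LOCATIONS = {
    "block-0": "node1",
    "block-1": "node2",
    "block-2": "node3",
    "block-3": "node1",
}

# Hypothetical block contents: values of the salary column.
BLOCK_DATA = {
    "block-0": [100, 200],
    "block-1": [300],
    "block-2": [400, 500],
    "block-3": [600],
}


def validate(table, column):
    """Coordinator step 1: check the query against the table schema."""
    if column not in TABLE_SCHEMA.get(table, set()):
        raise ValueError(f"unknown column {table}.{column}")


def plan_fragments(block_locations):
    """Coordinator step 2: group blocks by the daemon that holds them."""
    plan = defaultdict(list)
    for block, node in block_locations.items():
        plan[node].append(block)
    return dict(plan)


def run_fragment(blocks):
    """Worker step: read the assigned blocks, return a partial (sum, count)."""
    values = [v for b in blocks for v in BLOCK_DATA[b]]
    return sum(values), len(values)


def coordinate_avg(table, column):
    """Coordinator: validate, fan out, then merge the partial results."""
    validate(table, column)
    plan = plan_fragments(BLOCK_LOCATIONS)
    partials = [run_fragment(blocks) for blocks in plan.values()]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


print(coordinate_avg("employees", "salary"))  # 350.0
```

Each daemon returns only a small partial result, so the coordinator merges three pairs of numbers instead of reading every row itself.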

Features of Apache Impala

The key features of Impala are:

Impala Shell Commands

The Impala shell commands are classified into three categories:

  1. General Commands
  2. Query specific options
  3. Table and Database-specific options

Let us now explore each in detail.

1. General Commands

a. help: The help command lists all the commands available in the Impala shell.

b. version: The version command shows the current version of Impala.

c. history: The history command displays the last 10 commands executed in the Impala shell.

d. connect: The connect command connects to a given instance of Impala. If we do not specify an instance, it connects to the default port 21000.

e. exit | quit: The exit or quit command leaves the Impala shell.

2. Query specific options

a. profile: The profile command displays low-level information about the most recent query. We use this command for diagnosis and performance tuning of a query.

b. explain: The explain command returns the execution plan for a given query.
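To show how these two commands fit around a query, the sketch below produces the lines one might type in an interactive impala-shell session. The helper name and the sample query are hypothetical. No output is shown because it depends on the cluster.

```python
def diagnostic_session(query):
    """Return the lines one might type to inspect a single query."""
    q = query.strip().rstrip(";")
    return [
        f"EXPLAIN {q};",  # show the execution plan without running the query
        f"{q};",          # run the query itself
        "PROFILE;",       # low-level details of the query that just ran
    ]


for line in diagnostic_session("SELECT COUNT(*) FROM employees"):
    print(line)
```

The order matters: profile reports on the most recent query, so it comes after the query has run.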

3. Table and Database specific options

a. alter: The alter command changes the structure or the name of a table in Impala.

b. describe: The describe command gives the metadata of a table, such as its columns and their data types.

c. drop: The drop command removes a construct from Impala. The construct can be a table, a view, or a database function.

d. insert: The insert command appends data (rows) to a table or overwrites the existing data of a table.

e. select: The select command retrieves the dataset on which we want to perform some operation.

f. show: The show command lists the constructs recorded in the metastore, such as tables and databases.

g. use: The use command changes the current context to the desired database.
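The commands above can be seen together in a short script. This sketch writes out one statement per command. The database, table, and column names are invented for illustration, and the two "setup" statements are added only so the rest has something to work on.

```python
# Each entry: (command being illustrated, SQL statement). Names are made up.
STATEMENTS = [
    ("setup", "CREATE DATABASE IF NOT EXISTS demo_db"),
    ("use", "USE demo_db"),
    ("setup", "CREATE TABLE IF NOT EXISTS employees (id INT, name STRING, salary INT)"),
    ("describe", "DESCRIBE employees"),
    ("insert", "INSERT INTO employees VALUES (1, 'Asha', 52000), (2, 'Ravi', 48000)"),
    ("select", "SELECT name, salary FROM employees WHERE salary > 50000"),
    ("insert", "INSERT OVERWRITE employees VALUES (3, 'Meera', 61000)"),
    ("alter", "ALTER TABLE employees ADD COLUMNS (dept STRING)"),
    ("alter", "ALTER TABLE employees RENAME TO staff"),
    ("show", "SHOW TABLES"),
    ("drop", "DROP TABLE IF EXISTS staff"),
]


def to_script(statements):
    """Render the statements as a SQL script with one comment per command."""
    return "\n".join(f"-- {label}\n{sql};" for label, sql in statements) + "\n"


print(to_script(STATEMENTS))
```

The printed text could be saved to a file and passed to impala-shell with its `-f` option, or the statements could be typed one by one in the shell.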

Similarities between Apache Impala, Hive, and HBase

The similarities between Apache Impala, Hive, and HBase are:

Let us now explore the differences between Apache Impala, Hive, and HBase.

Difference between Apache Impala, Hive, and HBase

Although Apache Impala uses the same metastore, query language, and user interface as Hive, it differs from Hive and HBase in certain aspects. The table below lists the differences among Impala, Hive, and HBase.

| Impala | Hive | HBase |
| --- | --- | --- |
| Apache Impala is a tool designed for managing and analyzing the data stored in Hadoop. | Apache Hive is a data warehouse tool that can be used for accessing and managing large distributed datasets in Hadoop. | HBase is a wide-column store database based on Apache Hadoop. It uses the concepts of Google BigTable. |
| It follows the relational model. | It also follows the relational model. | HBase follows the wide-column store model. |
| Written in C++. | Written in Java. | Written in Java. |
| Impala has a schema-based data model. | Hive also has a schema-based data model. | HBase has a schema-free data model. |
| Apache Impala provides the JDBC and ODBC APIs. | Apache Hive provides the JDBC, ODBC, and Thrift APIs. | HBase provides the Java, RESTful, and Thrift APIs. |
| Apache Impala supports all the languages supporting JDBC and ODBC. | Apache Hive supports programming languages such as Java, C++, PHP, and Python. | HBase supports programming languages such as C, C++, C#, PHP, Python, Groovy, Java, and Scala. |
| Impala doesn't support triggers. | Hive also doesn't support triggers. | HBase provides support for triggers. |
| Released in 2013. | Released in 2012. | Released in 2008. |
| Current release was 3.3.0. | Current release was 3.1.2. | Current release was 2.2.0. |

Advantages of Apache Impala

The main advantages of Apache Impala are:

Disadvantages of Impala

The significant limitations of Apache Impala are:

Summary

In short, Impala is an open-source, native analytic database for Hadoop. It is a high-performance SQL engine that offers one of the fastest ways to access data stored in HDFS (Hadoop Distributed File System).

Impala uses the same SQL syntax (Hive Query Language), metadata, user interface, and ODBC drivers as Apache Hive. Unlike traditional storage systems, Apache Impala is not coupled with its storage engine.

Apache Impala consists of three core components: the Impala daemon, the Impala Statestore, and the Impala Catalog Service. The Impala shell, the Hue browser, and the JDBC/ODBC drivers are the three query processing interfaces that we can use to interact with Apache Impala.

The article also described Impala's various shell commands.
