Sqoop HCatalog Integration

In this article, you will explore everything related to the Sqoop HCatalog Integration. The article will first provide an introduction to Sqoop HCatalog Integration. Then you will see the new command-line options and the various Hive options.

The article will also list some unsupported and ignored Hive and Sqoop options. At last, you will see examples that show how you can create an HCatalog table and invoke Sqoop import and export.

Let us first start with an introduction to Sqoop HCatalog Integration.


Introduction to Sqoop HCatalog Integration

HCatalog is a table and storage management service for Apache Hadoop that enables users with different data processing tools, such as Hive, Pig, and MapReduce, to read and write data on the grid with ease.

The HCatalog table abstraction provides users with a relational view of data stored in HDFS. Users do not need to worry about where or in what format their data is stored. Data can be stored in any format, such as RCFile, text files, or SequenceFiles.

HCatalog provides support for reading and writing files in any format for which a Hive SerDe (serializer-deserializer) has been written.

By default, HCatalog supports the RCFile, CSV, JSON, and SequenceFile formats. To use a custom format, you have to provide the InputFormat and OutputFormat as well as the SerDe, as in the sketch below.
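
For instance, a table backed by a custom format could be declared with Hive DDL along these lines; the SerDe, InputFormat, and OutputFormat class names used here are purely hypothetical placeholders:

hcat -e "create table custom_tbl(id string, payload string) row format serde 'com.example.CustomSerDe' stored as inputformat 'com.example.CustomInputFormat' outputformat 'com.example.CustomOutputFormat';"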

Exposing HCatalog tables to Apache Sqoop

The integration of HCatalog with Apache Sqoop is patterned on the existing feature set, which supports Hive and Avro tables.

To expose HCatalog tables to Apache Sqoop, five new command-line options were introduced. The integration also re-uses some existing Apache Hive command-line options.

Let us now explore the new command-line options.

New Command-Line Options

Command-line Option Description
--hcatalog-database This option specifies the database name for the HCatalog table. If we do not specify this option, the default database name, which is 'default', is used. Providing the --hcatalog-database option without the --hcatalog-table option is an error. This is not a required option.

--hcatalog-table This argument specifies the HCatalog table name. The presence of this option signifies that the import or export job is done using HCatalog tables, and it is a required option for HCatalog jobs.

--hcatalog-home This option specifies the home directory of the HCatalog installation. The home directory must have the lib subdirectory and the share/hcatalog subdirectory with the required HCatalog libraries. If we do not specify this option, the system environment variable HCAT_HOME is checked, and failing that, the system property hcatalog.home is checked.

If neither of these is set, the default value is used. The default value is set to /usr/lib/hcatalog.

It is also not a required option.

--create-hcatalog-table This option specifies whether the HCatalog table should be created automatically while importing the data.

HCatalog tables, by default, are assumed to exist.

The name of the created table will be the same as the database table name, translated to lower case.

--hcatalog-storage-stanza This option specifies the storage stanza that is appended to the table definition.
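
For example, a minimal import that combines several of these options might look like the following sketch; the JDBC URL, credentials, and the database and table names are placeholders:

$SQOOP_HOME/bin/sqoop import --connect <jdbc-url> --username <user> --password <password> --table <table-name> --hcatalog-database mydb --hcatalog-table mytable --create-hcatalog-table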

Supported Sqoop Hive Options

Some of the Sqoop Hive options can also be used with the --hcatalog-table option to provide additional input to the HCatalog jobs.

Command-line Option Description
--map-column-hive This option overrides the default mapping from the database column to the HCatalog type; the named database column is mapped to the specified HCatalog type.
--hive-home It specifies the Hive home location.
--hive-partition-key This option specifies the key used for static partitioning filters. The type of the partitioning key should be STRING.

Note: There can be only one static partitioning key.

--hive-partition-value This option specifies the value associated with the static partition.
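
As an illustration, a static partition and an explicit column mapping might be supplied together as shown below; the column names, partition value, and HCatalog table name are only placeholders:

$SQOOP_HOME/bin/sqoop import --connect <jdbc-url> --table <table-name> --hcatalog-table <hcatalog-table-name> --hive-partition-key region --hive-partition-value "EMEA" --map-column-hive amount=decimal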

Direct Mode Support

HCatalog integration in Apache Sqoop has been enhanced to support direct mode connectors. Direct mode connectors are high-performance connectors specific to a particular database.
The Netezza direct mode connector is enhanced to use this feature for HCatalog jobs.

Let us now look at the unsupported options.

Unsupported Options

1. Some of the Sqoop Hive options are not supported with HCatalog jobs. They are:

  • --hive-import
  • --hive-overwrite

2. Some of the Sqoop export and import options are also not supported with HCatalog jobs. They are:

  • --direct
  • --export-dir
  • --target-dir
  • --warehouse-dir
  • --append
  • --as-sequencefile
  • --as-avrodatafile

3. Ignored Sqoop Options: The Sqoop options which are ignored are:

  • All input delimiter options.
  • Output delimiter options, unless --hive-drop-import-delims or --hive-delims-replacement is used.

Automatic Table Creation

Sqoop creates and manages the table metadata while importing tables into Apache Hadoop. The HCatalog import jobs also support this feature through the option --create-hcatalog-table.

One of the major advantages of HCatalog integration is that it provides storage agnosticism to Apache Sqoop data movement jobs. To provide this feature, HCatalog import jobs offer an option that lets the user specify the storage format for the created table.

We can use the option --hcatalog-storage-stanza to specify the storage format of the newly created table.
By default, the value of this option is set to "stored as rcfile".

If we specify the option --hive-partition-key, then the value of this argument is used as the partitioning key for a newly created table. We can specify only one partitioning key with this option.
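
For example, to have Sqoop create the table and store it in a non-default format, a storage stanza can be passed along with --create-hcatalog-table; the stanza shown here ("stored as orcfile") is just one possible value and assumes the installed Hive version supports that format:

$SQOOP_HOME/bin/sqoop import --connect <jdbc-url> --table <table-name> --hcatalog-table <hcatalog-table-name> --create-hcatalog-table --hcatalog-storage-stanza "stored as orcfile"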

HCatalog Table Requirements

If the default table creation options (with the optional storage stanza) are not sufficient, then we should create the HCatalog table before using it as part of the Sqoop job. All the storage formats supported by HCatalog can be used when creating these HCatalog tables.
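
As a small sketch (the table and column names are illustrative), such a table can be pre-created with hcat in any HCatalog-supported format and then referenced from the Sqoop job with --hcatalog-table:

hcat -e "create table sales(item string, amount float) partitioned by (sale_date string) stored as orc;"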

Support for Partitioning

The Sqoop HCatalog integration feature provides support for the following table types:

  • Partitioned tables with the static partitioning key specified
  • Unpartitioned tables
  • Partitioned tables with the dynamic partition keys from the database result set
  • Partitioned tables with the combination of the static key and the additional dynamic partitioning keys
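
As a rough sketch of the last case in the list above (all names are placeholders, and the database table is assumed to contain a matching order_date column), one partition key is supplied statically on the command line while the other is filled dynamically from the result set:

hcat -e "create table orders(order_id string, amount float) partitioned by (country string, order_date string) stored as rcfile;"

$SQOOP_HOME/bin/sqoop import --connect <jdbc-url> --table orders --hcatalog-table orders --hive-partition-key country --hive-partition-value "US"

Here, country is the static partitioning key, and order_date becomes a dynamic partitioning key taken from the database result set.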

Support for HCatalog Data Types

The Sqoop HCatalog integration provides support for all the primitive HCatalog types. At present, the complex HCatalog types are not supported.

The BLOB and CLOB database types are supported only for imports.

Providing Hive and HCatalog Libraries for the Sqoop Job

In Sqoop HCatalog integration, any HCatalog job depends on a set of jar files being available both on the Sqoop client host and on the nodes where the Map/Reduce tasks run.

To run HCatalog jobs, the HADOOP_CLASSPATH environment variable must be set up before launching them:

HADOOP_CLASSPATH=$(hcat -classpath)
export HADOOP_CLASSPATH

Sqoop will automatically copy the required HCatalog dependencies to the distributed cache.

Example Invocations

1: We can create an HCatalog table as:

hcat -e "create table emp(emp_name string, emp_id string, emp_salary float, dept_id int) partitioned by (emp_id string) stored as rcfile;"

2: We can invoke the Sqoop import of the “emp” HCatalog table as follows:

$SQOOP_HOME/bin/sqoop import --connect <jdbc-url> --table <table-name> --hcatalog-table emp <other sqoop options>

3: We can invoke the Sqoop export of the “emp” HCatalog table as follows:

$SQOOP_HOME/bin/sqoop export --connect <jdbc-url> --table <table-name> --hcatalog-table emp <other sqoop options>
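
To make the placeholders concrete, a purely illustrative MySQL import could look like this; the host, database, and user names are hypothetical, and -P simply prompts for the password on the console:

$SQOOP_HOME/bin/sqoop import --connect jdbc:mysql://dbhost/payrolldb --username dbuser -P --table emp --hcatalog-table emp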

Summary

I hope that after reading this article, you clearly understand Sqoop HCatalog Integration. You can now create HCatalog tables and invoke Sqoop import and export jobs on them.

If you have any queries regarding Sqoop HCatalog Integration, then do share in the comment section.