Significance of Apache Sqoop Import Mainframe Tool

Sqoop Import Mainframe is a tool for importing all the sequential datasets in a partitioned dataset (PDS) on a mainframe into HDFS. This Sqoop tutorial first explains what Sqoop Import Mainframe is, along with its purpose and syntax.

Later on, you will explore the arguments this tool accepts. The article also walks through examples of using the tool so that you can understand the topic better.


What is Sqoop Import Mainframe?

  • Sqoop Import Mainframe is a tool that imports all the sequential datasets in a partitioned dataset (PDS) on a mainframe into HDFS.
  • A partitioned dataset (PDS) is similar to a directory on open systems.
  • The records in the dataset contain only character data.
  • The records are stored with the whole record as a single text field.

Syntax of Sqoop Import Mainframe Tool

The syntax for the Sqoop Import Mainframe tool is:

$ sqoop import-mainframe (generic-args) (import-args)
$ sqoop-import-mainframe (generic-args) (import-args)

We can enter the import arguments in any order with respect to one another, but the Hadoop generic arguments must come before any import arguments, as shown in the example after the table below.

The common arguments are:

--connect <hostname>: Specifies the mainframe host to connect to.
--connection-manager <class-name>: Specifies the connection manager class to use.
--hadoop-mapred-home <dir>: Overrides $HADOOP_MAPRED_HOME.
--help: Prints the usage instructions.
--password-file: Sets the path to a file containing the authentication password.
-P: Reads the password from the console.
--password <password>: Sets the authentication password.
--username <username>: Sets the authentication username.
--verbose: Prints more information while working.
--connection-param-file <filename>: Specifies an optional properties file that provides connection parameters.
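
For instance, a generic Hadoop argument such as -D must appear right after the tool name, before any of the import arguments. A minimal sketch, where the host z390, username, and queue name are only placeholders:

$ sqoop import-mainframe \
    -D mapreduce.job.queuename=default \
    --connect z390 --username dataflair --password-file ${user.home}/.password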

Let us now explore different steps that one can go through while using the Sqoop Import Mainframe tool.

Connecting to a Mainframe

We can use Sqoop to import mainframe datasets into HDFS. To do so, we have to specify the mainframe hostname in the --connect argument of the Sqoop command.

$ sqoop import-mainframe --connect z390

On entering the above command, Sqoop connects to the mainframe host z390 via FTP.
Sometimes we have to authenticate against the mainframe host to access it. To do so, we supply the username via the --username option.

Sqoop provides several different ways of supplying a password to the mainframe. The secure ways of supplying the password are:

1. We can save the password in a file stored in the user's home directory with 400 permissions. After storing the file, we specify the path to it using the --password-file argument. This is the preferred method for entering credentials.

Apache Sqoop then reads the password from this file and passes it to the MapReduce cluster using secure means, without exposing the password in the job configuration.

The file containing the password can be stored either on the local filesystem or on HDFS.
Example:

$ sqoop import-mainframe --connect z390 \
    --username dataflair --password-file ${user.home}/.password

2. We can also supply the password by using the -P argument, which reads the password from the console prompt.
Example:

$ sqoop import-mainframe --connect z390 --username dataflair -P

The --password <password> argument can also set the password directly on the command line, but this is less secure, since other users may be able to see the password in the command-line arguments.

Selecting the Files to Import

The Import control arguments are:

--as-avrodatafile: Imports data to Avro data files.
--as-sequencefile: Imports data to SequenceFiles.
--as-textfile: Imports data as plain text (default).
--as-parquetfile: Imports data to Parquet files.
--delete-target-dir: Deletes the import target directory if it already exists.
-m,--num-mappers <n>: Uses n map tasks for importing in parallel.
--target-dir <dir>: HDFS destination directory.
--warehouse-dir <dir>: HDFS parent directory for the table destination.
-z,--compress: Enables compression.
--compression-codec <c>: Uses the given Hadoop codec (default gzip).

We can use the --dataset argument to specify the partitioned dataset name. All the sequential datasets present in that partitioned dataset will be imported.
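
For example, the following command imports every sequential dataset stored in a PDS on the host z390; the dataset name below is a placeholder:

# MYDATA.EMP is a hypothetical PDS name
$ sqoop import-mainframe --connect z390 --dataset MYDATA.EMP \
    --username dataflair --password-file ${user.home}/.password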

Controlling Parallelism

Apache Sqoop imports data in parallel by opening multiple FTP connections to the mainframe and transferring multiple files simultaneously.
Using the -m or --num-mappers argument, we can specify the number of mappers (parallel processes) used to perform the import.

Both arguments take an integer value corresponding to the degree of parallelism.
By default, Sqoop uses four map tasks. We can adjust this value to maximize the data transfer rate from the mainframe.

Controlling Distributed Cache

Apache Sqoop copies the jars in the $SQOOP_HOME/lib folder to the job cache every time it starts a job. When Sqoop is launched by Oozie, this is unnecessary because Oozie uses its own Sqoop share lib, which keeps the Sqoop dependencies in the distributed cache.

Specifying the --skip-dist-cache argument in the Sqoop command when it is launched by Oozie skips the step in which Sqoop copies its dependencies to the job cache, saving massive I/O.
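
A hedged sketch of such an invocation when the job is launched from an Oozie workflow; the host, dataset, and password-file path are taken from the earlier examples:

$ sqoop import-mainframe --connect z390 --dataset EMP \
    --username dataflair --password-file ${user.home}/.password \
    --skip-dist-cache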

Controlling the Import Process

By default, Sqoop imports all the sequential files present in a partitioned dataset named pds to a directory also named pds inside our home directory in HDFS.
For example, if our username is someuser, then the import tool will write to /user/someuser/pds/(files).
We can adjust the parent directory of the import with the --warehouse-dir argument.

For example:

$ sqoop import-mainframe --connect <host> --dataset foo --warehouse-dir /shared \
    ...

The above command will write to a set of files in the /shared/foo/ directory.

Alternatively, we can explicitly choose the target directory, like so:

$ sqoop import-mainframe --connect <host> --dataset foo --target-dir /dest \
    ...

The above command will import the files into the /dest directory. The --target-dir argument is incompatible with the --warehouse-dir argument.
By default, the import goes to a new target location. If the destination directory already exists in HDFS, Sqoop will refuse to run the import rather than overwrite the contents of that directory.
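
If we do want to replace an existing directory, the --delete-target-dir argument from the import control table removes the target directory before the import starts. A minimal sketch, reusing the host and dataset names from the earlier examples:

$ sqoop import-mainframe --connect z390 --dataset EMP \
    --username dataflair --password-file ${user.home}/.password \
    --target-dir /dest --delete-target-dir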

Sqoop File Formats

Each record in the dataset is stored as a text record with a newline at the end. Each record is assumed to contain a single text field named DEFAULT_COLUMN. When Apache Sqoop imports the data to HDFS, it generates a Java class that can reinterpret the text files it creates.

We can also import the mainframe records into SequenceFiles, Parquet files, or Avro files. By default, the imported data is not compressed.

We can compress the data either with the deflate (gzip) algorithm by using the --compress or -z argument, or by specifying any Hadoop compression codec with the --compression-codec argument.
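
For example, the following command stores the imported records as SequenceFiles and compresses them with Snappy instead of the default gzip; the host and dataset names are the same placeholders used earlier:

$ sqoop import-mainframe --connect z390 --dataset EMP \
    --username dataflair --password-file ${user.home}/.password \
    --as-sequencefile --compress \
    --compression-codec org.apache.hadoop.io.compress.SnappyCodec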

The Output line formatting arguments:

--enclosed-by <char>: Sets a required field-enclosing character.
--escaped-by <char>: Sets the escape character.
--fields-terminated-by <char>: Sets the field separator character.
--lines-terminated-by <char>: Sets the end-of-line character.
--mysql-delimiters: Uses MySQL's default delimiter set: fields: , ; lines: \n ; escaped-by: \ ; optionally-enclosed-by: '
--optionally-enclosed-by <char>: Sets a field-enclosing character.
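
Since each mainframe record is imported as a single text field, these delimiter options are rarely needed for this tool, but they can still be applied. A minimal sketch, reusing the placeholder host and dataset names:

$ sqoop import-mainframe --connect z390 --dataset EMP \
    --username dataflair --password-file ${user.home}/.password \
    --fields-terminated-by '\t' --lines-terminated-by '\n'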

The Input parsing arguments are:

--input-enclosed-by <char>: Sets a required field encloser.
--input-escaped-by <char>: Sets the input escape character.
--input-fields-terminated-by <char>: Sets the input field separator.
--input-lines-terminated-by <char>: Sets the input end-of-line character.
--input-optionally-enclosed-by <char>: Sets a field-enclosing character.

Importing Data Into Hive

We use Sqoop mainly for uploading data into HDFS. But if we have a Hive metastore associated with our HDFS cluster, Apache Sqoop can also import the data directly into Hive.

Sqoop imports the data into Hive by generating and executing a CREATE TABLE statement that defines the data's layout in Hive. We can import data into Hive simply by adding the --hive-import option to our Sqoop command line.

If the Hive table already exists, we can specify the --hive-overwrite option to indicate that the existing table in Hive should be replaced.

Once the data has been imported into HDFS, Sqoop generates a Hive script containing a CREATE TABLE operation that defines our columns using Hive's types, and a LOAD DATA INPATH statement to move the data files into Hive's warehouse directory.

The script is executed by calling the installed copy of Hive on the machine where Sqoop is running. If we have multiple Hive installations, or hive is not on our $PATH, we use the --hive-home option to identify the Hive installation directory. Apache Sqoop will then use $HIVE_HOME/bin/hive.

We can import data into a particular Hive partition by specifying the --hive-partition-key and --hive-partition-value arguments.

By using the --compress and --compression-codec options, we can import compressed tables into Hive.
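
A possible invocation combining these Hive options; the Hive table name, partition key, and partition value below are only placeholders:

# emp_records, load_date, and 2021-01-01 are hypothetical names/values
$ sqoop import-mainframe --connect z390 --dataset EMP \
    --username dataflair --password-file ${user.home}/.password \
    --hive-import --hive-table emp_records \
    --hive-partition-key load_date --hive-partition-value 2021-01-01 \
    --compress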

The Hive arguments are:

--hive-home <dir>: Overrides $HIVE_HOME.
--hive-import: Imports tables into Hive (uses Hive's default delimiters if none are set).
--hive-overwrite: Overwrites the existing data in the Hive table.
--create-hive-table: If set, the job fails if the target Hive table already exists. By default, this property is false.
--hive-table <table-name>: Sets the table name to use when importing into Hive.
--hive-drop-import-delims: Drops \n, \r, and \01 from string fields when importing into Hive.
--hive-delims-replacement: Replaces \n, \r, and \01 in string fields with a user-defined string when importing into Hive.
--hive-partition-key: Specifies the name of the Hive field on which the partitions are sharded.
--hive-partition-value <v>: Specifies the string value that serves as the partition key for data imported into Hive in this job.
--map-column-hive <map>: Overrides the default mapping from SQL type to Hive type for the configured columns.

Importing Data Into HBase

Apache Sqoop can also import records into a table in HBase. To import into HBase instead of a directory in HDFS, we specify the --hbase-table option in the Sqoop command.

Apache Sqoop imports the data into the table specified as the argument to the --hbase-table option. Each row of the input is transformed into an HBase Put operation on a row of the output table. The key for each row is taken from a column of the input.

By default, Sqoop uses the split-by column as the row key column.

If the split-by column is not specified, it tries to identify the primary key column. We can also manually specify the row key column with --hbase-row-key. Every output column is placed in the same column family, which is specified with --column-family.

While importing data into HBase, if the target table and column family do not exist, the Sqoop job exits with an error.

So, when importing with the --hbase-table option, you have to create the target table and column family before running the import.
Alternatively, if we specify the --hbase-create-table option, Sqoop will itself create the target table and column family if they do not exist.
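
A minimal sketch of an HBase import using these options; the table name emp, the column family cf, and the row key column EMPNO are hypothetical:

# emp, cf, and EMPNO are placeholder names
$ sqoop import-mainframe --connect z390 --dataset EMP \
    --username dataflair --password-file ${user.home}/.password \
    --hbase-table emp --column-family cf \
    --hbase-row-key EMPNO --hbase-create-table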

The HBase arguments are:

--column-family <family>: Sets the target column family for the import.
--hbase-create-table: If specified, creates missing HBase tables.
--hbase-row-key <col>: Specifies which input column to use as the row key. If the input table contains a composite key, <col> must be a comma-separated list of the composite key attributes.
--hbase-table <table-name>: Specifies an HBase table to use as the target instead of HDFS.
--hbase-bulkload: Enables bulk loading.

Importing Data Into Accumulo

Apache Sqoop also supports importing records into a table in Accumulo. We can import data into Accumulo by specifying the --accumulo-table option. Apache Sqoop imports the data into the table specified as the argument to --accumulo-table.

Each row of the input is transformed into an Accumulo Mutation operation on a row of the output table. The key for each row is taken from a column of the input. By default, Sqoop uses the split-by column as the row key column.

If the split-by column is not specified, it tries to identify the primary key column. We can also manually specify the row key column via --accumulo-row-key.

Each output column is placed in the same column family, which is specified with --accumulo-column-family.
Specify the --accumulo-create-table parameter if you have not created the target table.

The Accumulo arguments are:

--accumulo-table <table-name>: Specifies the Accumulo table to use as the target instead of HDFS.
--accumulo-column-family <family>: Sets the target column family for the import.
--accumulo-create-table: If specified, creates missing Accumulo tables.
--accumulo-row-key <col>: Specifies which input column to use as the row key.
--accumulo-visibility <vis>: Optional. Specifies the visibility token to apply to all rows inserted into Accumulo. The default is the empty string.
--accumulo-batch-size <size>: Optional. Sets the size of Accumulo's write buffer in bytes. The default is 4 MB.
--accumulo-max-latency <ms>: Optional. Sets the maximum latency in milliseconds for the Accumulo batch writer. The default is 0.
--accumulo-zookeepers <host:port>: Specifies a comma-separated list of ZooKeeper servers used by the Accumulo instance.
--accumulo-instance <instance-name>: Specifies the name of the target Accumulo instance.
--accumulo-user <username>: Specifies the name of the Accumulo user to import as.
--accumulo-password <password>: Specifies the password for the Accumulo user.

For performance tuning, we can use the optional --accumulo-batch-size and --accumulo-max-latency parameters.
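
A hedged sketch of an Accumulo import; the instance name, ZooKeeper quorum, table name, column family, and Accumulo credentials below are all placeholders:

# emp, cf, the zk hosts, instance name, and Accumulo credentials are hypothetical
$ sqoop import-mainframe --connect z390 --dataset EMP \
    --username dataflair --password-file ${user.home}/.password \
    --accumulo-table emp --accumulo-column-family cf \
    --accumulo-create-table \
    --accumulo-zookeepers zk1.example.com:2181,zk2.example.com:2181 \
    --accumulo-instance accumulo \
    --accumulo-user root --accumulo-password secret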

Additional Import Configuration Properties

There are some additional properties that we can configure by modifying the conf/sqoop-site.xml file. We specify these properties in the same manner as in the Hadoop configuration files.

For example:

<property>
    <name>property.name</name>
    <value>property.value</value>
</property>

We can also specify it on the command line in the generic arguments.

For example:

sqoop import-mainframe -D property.name=property.value ...

Example Invocations of Sqoop Import Mainframe

The examples below illustrate how we can use the Sqoop Import Mainframe tool in a variety of situations.

1: In this example, we import all the sequential files present in a partitioned dataset named EMP on the mainframe host z390:

$ sqoop import-mainframe --connect z390 --dataset EMP \
    --username SomeUser -P
Enter password: (hidden)

2: Here, we control the import parallelism by setting the number of mappers to 8:

$ sqoop import-mainframe --connect z390 --dataset EMP \
    --username SomeUser --password-file mypassword -m 8

3: In this example, we import the data into Hive:

$ sqoop import-mainframe --connect z390 --dataset EMP \
    --hive-import

Summary

I hope that after reading this Sqoop Import Mainframe article, you clearly understand how we can import all the sequential datasets in a partitioned dataset (PDS) on a mainframe into HDFS.

The article also explained how we can import data into Hive, HBase, and Accumulo, and discussed the Sqoop Import Mainframe tool along with its purpose and syntax.