Sqoop Export Files from HDFS to RDBMS

Just as Sqoop Import brings data into Hadoop, Sqoop provides a complementary tool, Sqoop Export, which is used for exporting a set of files from HDFS back to an RDBMS. In this Sqoop Export article, you will explore the entire concept of Sqoop Export.

The article explains what Sqoop Export is, the modes in which it works, its syntax, and its arguments. It also covers the difference between insert mode and update mode.

Moreover, we will walk through the Sqoop Export syntax with example invocations to understand it better.

Let us first start with an introduction to Sqoop Export.

What is Sqoop Export?

The Sqoop export tool is used for exporting a set of files from the Hadoop Distributed File System back to an RDBMS. To perform an export, the target table must already exist in the target database.

The files given as input to Apache Sqoop contain records, which become rows in the table. These files are read and parsed into a set of records according to the user-specified delimiters.
The export command works in two modes: insert mode and update mode.

1. Insert mode: This is the default mode. In this mode, the records from the input files are inserted into the database table using INSERT statements.
2. Update mode: In update mode, Sqoop generates UPDATE statements that replace existing records in the database. Both modes are sketched below.
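
As a rough sketch, the two modes differ only in the arguments passed on the command line. The connection string, table name, and HDFS path below are placeholders, not values from a real cluster:

$ # Insert mode (default): every record becomes an INSERT statement
$ sqoop export --connect jdbc:mysql://db.example.com/demo_db \
    --table emp --export-dir /results/emp_data

$ # Update mode: records become UPDATE statements keyed on emp_id
$ sqoop export --connect jdbc:mysql://db.example.com/demo_db \
    --table emp --export-dir /results/emp_data --update-key emp_id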

Syntax for Sqoop Export

The syntax for Sqoop Export is:

$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)

The Hadoop generic arguments must be passed before any export arguments; the export arguments themselves can appear in any order with respect to each other.
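
For instance, generic options such as -D properties go immediately after the export keyword, before any tool-specific arguments. The property value and connection details here are only illustrative:

$ sqoop export -D mapreduce.job.queuename=etl \
    --connect jdbc:mysql://db.example.com/demo_db \
    --table emp --export-dir /results/emp_data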

The common arguments are:

Argument Description
--connect <jdbc-uri> Specifies the JDBC connect string
--connection-manager <class-name> Specifies the connection manager class to use
--driver <class-name> Manually specifies the JDBC driver class to use
--hadoop-home <dir> Overrides $HADOOP_HOME
--help Prints usage instructions
-P Reads the password from the console
--password <password> Sets the authentication password
--username <username> Sets the authentication username
--verbose Prints more information while working
--connection-param-file <filename> Optional properties file that provides connection parameters
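
A minimal sketch of how the common arguments are combined in practice; the host, database, and username are placeholders:

$ sqoop export --connect jdbc:mysql://db.example.com/demo_db \
    --username sqoop_user -P \
    --table emp --export-dir /results/emp_data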

The Export control arguments are:

Argument Description
--direct Uses the direct export fast path
--export-dir <dir> Specifies the HDFS source path for the export
-m,--num-mappers <n> Uses n map tasks to export in parallel
--table <table-name> Table to populate
--update-key <col-name> Specifies the anchor column(s) to use for updates. If there is more than one column, use a comma-separated list of columns.
--update-mode <mode> Specifies how updates are performed when new rows are found with non-matching keys in the database. The legal values for mode are updateonly (the default) and allowinsert.
--input-null-string <null-string> The string to be interpreted as null for string columns
--input-null-non-string <null-string> The string to be interpreted as null for non-string columns
--staging-table <staging-table-name> The table in which data is staged before being inserted into the destination table
--clear-staging-table Indicates that any data present in the staging table can be deleted
--batch Uses batch mode for the underlying statement execution
  • The --table and --export-dir arguments are required. They specify the table to populate in the database and the HDFS directory that contains the source data, respectively.
  • We can control the number of mappers with the --num-mappers or -m argument.
  • MySQL provides a direct mode for exports using the mysqlimport tool. When exporting to MySQL, we can use the --direct argument to select this code path, which offers higher performance than the standard JDBC code path.
  • Apache Sqoop breaks an export into multiple transactions. If an export job fails, it may leave partial data committed to the database. In some cases this leads to subsequent jobs failing due to insert collisions; in other cases it leads to duplicated data. We can avoid this problem by specifying a staging table with the --staging-table argument. The staging table acts as an auxiliary table that holds the exported data, and the staged data is then moved to the destination table in a single transaction.
  • To use the staging facility, we must create the staging table before running the export job, and it must be structurally identical to the target table. The staging table must either be empty before the export job runs, or we must specify the --clear-staging-table option. If the staging table contains data and --clear-staging-table is specified, Sqoop deletes all of the staging table's data before running the export job, as shown in the sketch after this list.
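
A minimal sketch of a staged export, assuming a pre-created staging table named emp_stage with the same structure as emp (all names here are illustrative):

$ sqoop export --connect jdbc:mysql://db.example.com/demo_db \
    --table emp --export-dir /results/emp_data \
    --staging-table emp_stage --clear-staging-table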

Sqoop Inserts vs. Updates

By default, sqoop-export appends new rows to the table: each input record is transformed into an INSERT statement that adds a row to the target database table.

If the table has constraints (for example, a primary key column) and already contains data, you must take care to avoid inserting records that would violate these constraints.

If an INSERT statement fails, the export process fails. Insert mode is therefore primarily intended for exporting records to a new, empty table that is meant to receive these results.
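
For illustration, with the emp table and sample records used later in this article, insert mode would generate statements along these lines (a sketch only; in practice Sqoop batches multiple rows per statement, as discussed later):

INSERT INTO emp (emp_id, emp_name, salary) VALUES (0, 'Ajay', 40000);
INSERT INTO emp (emp_id, emp_name, salary) VALUES (1, 'Raj', 10000);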

If we specify the --update-key argument, Sqoop will instead modify an existing dataset in the database. Each input record is then treated as an UPDATE statement that modifies an existing row.

The row to modify is identified using the column name(s) specified with --update-key.
For example, suppose we have the following table definition:

CREATE TABLE emp(
  emp_id INT NOT NULL PRIMARY KEY,
  emp_name VARCHAR(32),
  salary INT);

Also, consider a dataset in HDFS that contains records like these:
0,Ajay,40000
1,Raj,10000

Running the command

sqoop-export --table emp --update-key emp_id --export-dir /path/to/data --connect …

will launch an export job that executes SQL statements based on the data, like so:

UPDATE emp SET emp_name='Ajay', salary=40000 WHERE emp_id=0;
UPDATE emp SET emp_name='Raj', salary=10000 WHERE emp_id=1;
...

If an UPDATE statement modifies no rows, this is not considered an error; the export silently continues. This also means that an update-based export will not insert new rows into the database.

Likewise, if the column specified via --update-key does not uniquely identify rows and multiple rows are updated by a single statement, this condition goes undetected.
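
If you also need rows that do not match any existing key to be inserted rather than skipped, the --update-mode allowinsert value listed in the export control arguments can be combined with --update-key. A sketch, with placeholder connection details:

$ sqoop export --connect jdbc:mysql://db.example.com/demo_db \
    --table emp --export-dir /results/emp_data \
    --update-key emp_id --update-mode allowinsert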

Other Arguments

The Input parsing arguments are:

Argument Description
--input-enclosed-by <char> Sets a required field encloser
--input-escaped-by <char> Sets the input escape character
--input-fields-terminated-by <char> Sets the input field separator
--input-lines-terminated-by <char> Sets the input end-of-line character
--input-optionally-enclosed-by <char> Sets a field-enclosing character

The Output line formatting arguments are:

Argument Description
--enclosed-by <char> Sets a required field-enclosing character
--escaped-by <char> Sets the escape character
--fields-terminated-by <char> Sets the field separator character
--lines-terminated-by <char> Sets the end-of-line character
--mysql-delimiters Uses MySQL's default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: '
--optionally-enclosed-by <char> Sets a field-enclosing character

The Code generation arguments are:

Argument Description
--bindir <dir> Specifies the output directory for compiled objects
--class-name <name> Sets the generated class name
--jar-file <file> Disables code generation; uses the specified jar
--outdir <dir> Specifies the output directory for generated code
--package-name <name> Puts auto-generated classes in the specified package
--map-column-java <m> Overrides the default mapping from SQL type to Java type for the configured columns
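
As a sketch of the code generation arguments in use, an export can reuse a class and jar produced by an earlier Sqoop codegen or import run instead of regenerating them; the jar path and class name here are placeholders:

$ sqoop export --connect jdbc:mysql://db.example.com/demo_db \
    --table emp --export-dir /results/emp_data \
    --jar-file /tmp/sqoop-gen/emp.jar --class-name emp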

Sqoop automatically generates code for parsing and interpreting the records in the files that contain the data to be exported back to the database.

If these files were created with non-default delimiters, you must specify the same delimiters again so that Sqoop can parse your files.

If you specify incorrect delimiters, Sqoop fails to find enough columns per line, which causes the export map tasks to fail by throwing ParseExceptions.
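
For example, if the HDFS files were written with tab-separated fields, the same delimiters must be passed to the export. A sketch, assuming tab-delimited input produced by an earlier job:

$ sqoop export --connect jdbc:mysql://db.example.com/demo_db \
    --table emp --export-dir /results/emp_data \
    --input-fields-terminated-by '\t' --input-lines-terminated-by '\n'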

How does Sqoop Export work?

In Sqoop, exports are performed by multiple writers in parallel. Each writer uses a separate connection to the database, and these connections have separate transactions from one another.

Apache Sqoop uses the multi-row INSERT syntax to insert up to 100 records per statement. This ensures that the transaction buffers do not grow without bound and cause out-of-memory conditions.

Hence, a Sqoop export is not an atomic process; partial results from the export may become visible before the export is complete.
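
If your database handles wide multi-row inserts poorly, the batching behavior can be tuned through Hadoop -D properties such as sqoop.export.records.per.statement and sqoop.export.statements.per.transaction; the values below are illustrative only, and defaults may differ between Sqoop versions:

$ sqoop export \
    -D sqoop.export.records.per.statement=50 \
    -D sqoop.export.statements.per.transaction=50 \
    --connect jdbc:mysql://db.example.com/demo_db \
    --table emp --export-dir /results/emp_data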

Sqoop Failed Exports

A Sqoop export may fail for the following reasons:

  • Loss of connectivity between the Hadoop cluster and the database, caused by either a server software crash or a hardware fault.
  • Attempting to INSERT a row that violates a consistency constraint, for example inserting a duplicate primary key value.
  • Attempting to parse an incomplete or malformed record from the HDFS source data.
  • Attempting to parse records using incorrect delimiters.
  • Capacity issues, such as insufficient RAM or disk space.

If an export map task fails for any of these reasons, the export job fails. The results of a failed export job are undefined. Each Sqoop export map task operates in a separate transaction.

Individual map tasks commit their current transaction periodically. If a task fails, its current transaction is rolled back, but any previously committed transactions remain durable in the database, which leads to a partially complete export.

Sqoop Export Example Invocations

Example 1: In this example, we perform a basic export to populate the table emp:

$ sqoop export --connect jdbc:mysql://localhost/demo_db --table emp  \
    --export-dir /results/emp_data

This example takes the files in /results/emp_data and inserts their contents into the emp table in the demo_db database on localhost. Make sure that the target table already exists in the database.

Example 2: In this example, we perform a basic export to populate the table emp with validation enabled:

$ sqoop export --connect jdbc:mysql://localhost/demo --table emp  \
    --export-dir /results/emp_data --validate

Summary

I hope that after reading this article you clearly understand Sqoop Export. The Sqoop export tool is useful for exporting a set of files from the Hadoop Distributed File System back to an RDBMS. The article explained its syntax and arguments, along with example invocations.