Sqoop Merge Tool to Combine Datasets

Tech Vidvan

4 years ago

The Sqoop Merge tool combines two datasets into one. This article explains what Sqoop Merge is, its syntax, its arguments, and how it works.

So, let’s start.

What is Sqoop Merge?

Sqoop Merge is a tool that allows us to combine two datasets. Entries in the newer dataset override the matching entries in the older dataset.

Sqoop itself is designed for efficiently transferring large volumes of data between Hadoop and structured data stores such as relational databases, and the merge tool fits into that workflow by reconciling successive imports. After performing the merge operation, we can import the merged data into Apache Hive or HBase.

In short, the Sqoop merge tool “flattens” the two datasets into one by taking the newest available record for each primary key.
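The "newest record per key wins" rule can be illustrated locally without Sqoop. The following is only a sketch of the rule using awk on two small CSV files. It does not invoke Sqoop or HDFS, and the file names, sample rows, and the assumption that the key is the first comma-separated column are all made up for illustration.

```shell
#!/bin/sh
# Local illustration only: mimics the merge rule with awk on small CSV
# files. It does not call Sqoop and does not touch HDFS.
workdir=$(mktemp -d)

# Hypothetical "older" dataset (id,name).
printf '%s\n' '1,alice' '2,bob' '3,carol' > "$workdir/older.csv"
# Hypothetical "newer" dataset: id 2 was updated, id 4 is new.
printf '%s\n' '2,robert' '4,dave' > "$workdir/newer.csv"

# Read the newer file first and remember its rows by key (column 1).
# Then print an older row only if no newer row has the same key, and
# finally print every newer row. Sorting makes the output order stable.
merged=$(awk -F, '
    NR == FNR { newer[$1] = $0; next }
    !($1 in newer) { print }
    END { for (k in newer) print newer[k] }
' "$workdir/newer.csv" "$workdir/older.csv" | sort -t, -k1,1n)

printf '%s\n' "$merged"
rm -rf "$workdir"
```

With these sample rows the result contains four records: ids 1 and 3 come from the older file, while ids 2 and 4 come from the newer file, so `2,bob` is replaced by `2,robert`.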

Sqoop Merge Syntax

$ sqoop merge (generic-args) (merge-args)
$ sqoop-merge (generic-args) (merge-args)

The Hadoop generic arguments must be passed before any merge arguments. The merge arguments themselves can be entered in any order with respect to each other.
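As a sketch of this ordering rule, the command below places one Hadoop generic argument (-D property=value) directly after the tool name and deliberately shuffles the merge arguments. The run helper is a hypothetical dry-run wrapper, not part of Sqoop, that prints the command unless RUN=1 is set. The queue name is an illustrative assumption, and the paths and class name are placeholders.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not part of Sqoop): prints the command
# unless RUN=1 is set, so the sketch can be inspected without a cluster.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# The generic argument (-D ...) comes first, right after the tool name.
# The queue name "default" is only an example value.
# The merge arguments that follow may appear in any order.
run sqoop merge \
    -D mapreduce.job.queuename=default \
    --merge-key id --target-dir merged \
    --onto older --new-data newer \
    --jar-file datatypes.jar --class-name Foo
```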

Merge Arguments and How Sqoop Merge Works

The merge arguments are:

Argument	Description
--class-name <class>	Specifies the name of the record-specific class to use during the merge job.
--jar-file <file>	Specifies the name of the jar from which the record class is loaded.
--merge-key <col>	Specifies the name of the column to use as the merge key.
--new-data <path>	Specifies the path of the newer dataset.
--onto <path>	Specifies the path of the older dataset.
--target-dir <path>	Specifies the target path for the output of the merge job.

The Sqoop merge tool runs a MapReduce job that takes two directories as input: a newer dataset and an older dataset. These two directories are specified with --new-data and --onto, respectively.
The output of this MapReduce job is placed in the HDFS directory specified by --target-dir.
When merging the datasets, Sqoop assumes that each record has a unique primary key value.
We specify the primary key column with the --merge-key argument. Within a single dataset, no two rows may share the same primary key; otherwise, data may be lost.
To parse the dataset and extract the key column, the merge job must use the class that was auto-generated by the previous import.
We specify the class name and the jar file with the --class-name and --jar-file arguments. If the class is no longer available, we can recreate it with the Sqoop Codegen tool.
The Sqoop merge tool is typically run after an incremental import in date-last-modified mode, that is, (sqoop import --incremental lastmodified …).
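The points above can be sketched as a three-step workflow: an incremental import in lastmodified mode, an optional codegen run to recreate the record class, and the merge itself. Everything specific here is an assumption made for illustration: the JDBC URL, the table name, the check column updated_at, the last value, and the jar path. Authentication options are omitted. The run helper is a hypothetical dry-run wrapper that only prints each command unless RUN=1 is set.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not part of Sqoop): prints the command
# unless RUN=1 is set. Note that printing does not preserve quoting.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# Assumed connection details and names; replace with real values.
CONNECT='jdbc:mysql://db.example.com/shop'
TABLE='orders'
# Assumed jar location; codegen logs the path of the jar it writes.
JAR='./codegen-out/Foo.jar'

# 1. Import rows modified since the last run into the "newer" directory.
run sqoop import --connect "$CONNECT" --table "$TABLE" \
    --incremental lastmodified --check-column updated_at \
    --last-value '2024-01-01 00:00:00' --target-dir newer

# 2. Only if the class from the earlier import is gone: regenerate it.
run sqoop codegen --connect "$CONNECT" --table "$TABLE" \
    --class-name Foo --bindir ./codegen-out

# 3. Merge the newer rows onto the older dataset, keyed on "id".
run sqoop merge --new-data newer --onto older --target-dir merged \
    --jar-file "$JAR" --class-name Foo --merge-key id
```

Keeping the jar path in a variable makes it easy to swap in the path reported by codegen, which may differ from the one assumed here.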

Example Invocations

Suppose we have performed two incremental imports, so that the older data is in an HDFS directory named older and the newer data is in an HDFS directory named newer. We can then merge them as follows:

$ sqoop merge --new-data newer --onto older --target-dir merged \
    --jar-file datatypes.jar --class-name Foo --merge-key id

This command runs a MapReduce job in which the value in the id column of each row is used to join rows. Rows in the newer dataset are used in preference to rows in the older dataset.
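Because rows are joined on the id column, it can be worth checking that the key is unique within each input before merging. The sketch below assumes a sample of the data has been copied to a local comma-delimited text file (for example with hadoop fs -get) and that the key is the first column. The function name duplicate_keys and the sample rows are hypothetical.

```shell
#!/bin/sh
# duplicate_keys FILE: print each value of column 1 that occurs more
# than once in FILE (printed once, on its second occurrence).
duplicate_keys() {
    awk -F, 'seen[$1]++ == 1 { print $1 }' "$1"
}

# Hypothetical local sample in which id 2 appears twice.
workdir=$(mktemp -d)
printf '%s\n' '1,alice' '2,bob' '2,robert' '3,carol' > "$workdir/sample.csv"

dups=$(duplicate_keys "$workdir/sample.csv")
if [ -n "$dups" ]; then
    printf 'duplicate merge keys: %s\n' "$dups"
fi
rm -rf "$workdir"
```

An empty result means every key in the sample is unique. Any key that is printed should be resolved before running the merge.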

Summary

I hope that after reading this article, you clearly understand the Sqoop Merge tool. This tool is useful for combining two datasets.

Entries in the newer dataset override the entries in the older dataset. The article explained the Sqoop Merge tool's syntax and arguments, as well as how it works.