
Sqoop Merge Tool to Combine Datasets

The Sqoop Merge tool combines two datasets. This article explains what Sqoop Merge is, its syntax, its arguments, and how it works.

So, let’s start.

What is Sqoop Merge?

Sqoop Merge is a tool that allows us to combine two datasets, where the entries of the newer dataset override the entries of the older dataset.

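Sqoop itself is built for efficiently transferring large volumes of data between Hadoop and structured data stores such as relational databases. The merge tool is typically run after an incremental import in last-modified mode, which leaves successively newer data in separate HDFS datasets. After performing the merge operation, we can load the merged data into Apache Hive or HBase.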

In short, the Sqoop merge tool “flattens” two datasets into one by taking the newest available record for each primary key.
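
To make the "newest record per key wins" idea concrete, here is a small local sketch that applies the same rule to two comma-separated files with awk. It only illustrates the semantics; it is not how Sqoop performs the merge (Sqoop runs a MapReduce job over HDFS directories). The function name merge_by_key, the file names, and the sample rows are all made up for this example.

```shell
# Local illustration only: the first CSV field plays the role of the merge key.
# merge_by_key OLDER NEWER prints one row per key, preferring rows from NEWER
# whenever the same key appears in both files.
merge_by_key() {
  older=$1
  newer=$2
  # Files are read in order, so rows from the newer file overwrite
  # rows with the same key that were stored from the older file.
  awk -F',' '
    { row[$1] = $0 }
    END { for (k in row) print row[k] }
  ' "$older" "$newer" | sort -t',' -k1,1n
}

# Hypothetical sample data: key 2 exists in both files.
workdir=$(mktemp -d)
printf '1,alice,2019-01-01\n2,bob,2019-01-01\n' > "$workdir/older.csv"
printf '2,bob,2019-06-15\n3,carol,2019-06-15\n' > "$workdir/newer.csv"

merge_by_key "$workdir/older.csv" "$workdir/newer.csv"
rm -rf "$workdir"
```

With this sample data the result keeps row 1 from the older file, takes row 2 from the newer file, and adds row 3, which exists only in the newer file.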

Sqoop Merge Syntax

$ sqoop merge (generic-args) (merge-args)
$ sqoop-merge (generic-args) (merge-args)

The Hadoop generic arguments must be passed before any merge arguments. The merge arguments themselves can be entered in any order with respect to one another.
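
As a rough illustration of this ordering rule, the sketch below assembles a merge command with one generic Hadoop argument placed in front of the merge arguments. It only prints the command and does not run it. The -D option is the standard Hadoop way to set a property; the property value (queue name "default") and the helper name print_merge_command are assumptions chosen for this example, and the paths, jar, and class are placeholders.

```shell
# Dry-run sketch: prints the command line instead of executing it.
print_merge_command() {
  # Generic Hadoop argument (hypothetical queue name).
  generic_args='-D mapreduce.job.queuename=default'
  # Merge arguments; their order relative to each other does not matter.
  merge_args='--merge-key id --new-data newer --onto older --target-dir merged --jar-file datatypes.jar --class-name Foo'
  # Generic arguments come first, merge arguments after them.
  echo "sqoop merge $generic_args $merge_args"
}

print_merge_command
```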

Merge Arguments and How Sqoop Merge Works

The merge arguments are:

Argument               Description
--class-name <class>   Specifies the name of the record-specific class to use during the merge job.
--jar-file <file>      Specifies the name of the jar from which the record class is loaded.
--merge-key <col>      Specifies the name of the column to use as the merge key.
--new-data <path>      Specifies the path of the newer dataset.
--onto <path>          Specifies the path of the older dataset.
--target-dir <path>    Specifies the target path for the output of the merge job.

Example Invocations

Suppose we have performed two incremental imports, so that the older data is in an HDFS directory named older and the newer data is in an HDFS directory named newer. We can then merge them as:

$ sqoop merge --new-data newer --onto older --target-dir merged \
    --jar-file datatypes.jar --class-name Foo --merge-key id

This command runs a MapReduce job in which the value in the id column of each row is used for joining rows. Where both datasets contain a row with the same id, the row from the newer dataset is used in preference to the row from the older dataset.
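
For context, the following sketch shows one way the older and newer directories might come about before the merge: a first import, a later import in last-modified mode, and then the merge from the example above. Everything specific here is a hypothetical placeholder (the JDBC URL, the EMPLOYEES table, the last_modified column, the timestamp), connection credentials are left out, and the jar and class names are simply carried over from the example; in practice they would be the ones generated by the import. The script prints the commands by default and only executes them when DRY_RUN=0 is set.

```shell
# Print a command in dry-run mode (the default), or execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

import_then_merge() {
  connect='jdbc:mysql://db.example.com/corp'   # hypothetical database URL

  # 1. Initial import into the HDFS directory named "older".
  run sqoop import --connect "$connect" --table EMPLOYEES --target-dir older

  # 2. Later import of rows modified since the given timestamp, into "newer".
  run sqoop import --connect "$connect" --table EMPLOYEES \
      --incremental lastmodified --check-column last_modified \
      --last-value '2019-01-01 00:00:00' --target-dir newer

  # 3. Merge the two directories, using the id column as the merge key.
  run sqoop merge --new-data newer --onto older --target-dir merged \
      --jar-file datatypes.jar --class-name Foo --merge-key id
}

import_then_merge
```

Keeping the dry-run default makes it easy to review the three command lines before pointing the script at a real cluster.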

Summary

I hope that after reading this article you clearly understand the Sqoop Merge tool, which is useful for combining two datasets.

The entries of the newer dataset override the entries of the older dataset. The article explained the Sqoop Merge tool's syntax and arguments, as well as how it works.
