Relational Databases Supported by Sqoop

Apache Sqoop transfers data between relational databases (those that offer JDBC connectivity) and the Hadoop ecosystem (HDFS, HBase, Hive). In this article, you will explore the different relational databases supported by Apache Sqoop.

You will also learn about the specific database versions that Sqoop supports. The article first gives a short introduction to Apache Sqoop and then lists the Sqoop-supported databases.

What is Apache Sqoop?

Data analysis often requires moving a large amount of data from an RDBMS into Hadoop. This created a need for a tool that can perform the transfer quickly and efficiently.

This is where Apache Sqoop comes into the picture. Apache Sqoop is a tool designed mainly for transferring bulk data efficiently between Apache Hadoop and structured datastores such as RDBMSs (MySQL, Oracle, and others).

It is part of the Hadoop ecosystem. Sqoop is now used extensively to transfer data between RDBMSs and the Hadoop ecosystem for data processing and analysis.

Sqoop Supported Databases

Apache Sqoop uses JDBC to connect to databases and adheres to published standards as much as possible. For databases that do not support standards-compliant SQL, Sqoop uses alternate code paths to provide the functionality.

JDBC is a compatibility layer that lets a program access many different databases through a common API. Even so, minor differences in each database's SQL dialect mean that Sqoop cannot be used with every database out of the box.

In general, Apache Sqoop is believed to be compatible with a large number of databases, but it has been tested with only a few.

When we provide a connect string to Apache Sqoop, Sqoop inspects the protocol scheme in order to determine the appropriate vendor-specific logic to use.

If Apache Sqoop knows about the given database, then it will work automatically.

If Apache Sqoop doesn't know about the given database, then we have to specify the driver class to load through the --driver argument. Sqoop then takes the generic code path, which uses standard SQL to access the database.
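The lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Sqoop's actual implementation (Sqoop itself is written in Java). The function names `resolve_vendor` and `needs_driver_flag` are hypothetical, the host names in the usage note are placeholders, and the patterns are the connect-string prefixes from the table later in this article, with a trailing `*` added so they match a full connect string.

```python
from fnmatch import fnmatchcase

# Connect-string patterns taken from the article's supported-databases table.
# The vendor labels and this lookup are illustrative, not Sqoop's internals.
VENDOR_PATTERNS = [
    ("HSQLDB", "jdbc:hsqldb:*//*"),
    ("MySQL", "jdbc:mysql://*"),
    ("Oracle", "jdbc:oracle:*//*"),
    ("PostgreSQL", "jdbc:postgresql://*"),
    ("CUBRID", "jdbc:cubrid:*"),
]


def resolve_vendor(connect_string):
    """Return the vendor whose pattern matches, or None for the generic path."""
    for vendor, pattern in VENDOR_PATTERNS:
        if fnmatchcase(connect_string, pattern):
            return vendor
    return None


def needs_driver_flag(connect_string):
    """True when no vendor-specific logic applies, so --driver must be given."""
    return resolve_vendor(connect_string) is None
```

For example, `resolve_vendor("jdbc:mysql://db.example.com/sales")` returns `"MySQL"`, while a connect string such as `jdbc:sqlserver://db.example.com:1433` matches none of the patterns, so `needs_driver_flag` returns `True` for it.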

For some databases, Apache Sqoop provides faster, non-JDBC-based access mechanisms. We can enable them by specifying the --direct parameter.
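To show where the two arguments sit on the command line, here is a minimal Python sketch that assembles a `sqoop import` argument list. The helper name `build_import_args` is hypothetical, and the connect strings and table names in the usage note are placeholders. Treating --driver and --direct as mutually exclusive is an assumption of this sketch, based on the fact that --driver selects the generic JDBC path while --direct selects a non-JDBC one.

```python
def build_import_args(connect, table, driver=None, direct=False):
    """Assemble an argument list for a `sqoop import` invocation.

    `driver` is a JDBC driver class name for databases Sqoop does not
    recognise; `direct` requests the faster non-JDBC access mechanism.
    """
    # Assumption of this sketch: the generic path (--driver) and the
    # direct path (--direct) are not combined in one command.
    if driver and direct:
        raise ValueError("choose either driver (generic path) or direct, not both")

    args = ["sqoop", "import", "--connect", connect, "--table", table]
    if driver:
        args += ["--driver", driver]
    if direct:
        args.append("--direct")
    return args
```

Calling `build_import_args("jdbc:mysql://db.example.com/sales", "orders", direct=True)` yields a list ending in `--direct`, which could then be passed to `subprocess.run` on a machine where Sqoop is installed.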

Databases Supported by Apache Sqoop

Apache Sqoop includes vendor-specific support for the following databases:

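| Database   | Version | --direct support? | Connect string matches |
|------------|---------|-------------------|------------------------|
| HSQLDB     | 1.8.0+  | No                | jdbc:hsqldb:*//        |
| MySQL      | 5.0+    | Yes               | jdbc:mysql://          |
| Oracle     | 10.2.0+ | No                | jdbc:oracle:*//        |
| PostgreSQL | 8.3+    | Yes (import only) | jdbc:postgresql://     |
| CUBRID     | 9.2+    | No                | jdbc:cubrid:*          |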

Apache Sqoop may work with older versions of the databases listed above, but it has officially been tested only with the versions specified.
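As a small illustration, the minimum tested versions from the table can be encoded as data and compared against the version of a database we plan to use. The names `TESTED_MINIMUMS`, `parse_version`, and `is_tested_version` are hypothetical, and the sketch assumes purely numeric, dot-separated version strings (a suffix such as `-log` would raise a `ValueError`).

```python
# Minimum versions Sqoop was tested with, copied from the table above.
TESTED_MINIMUMS = {
    "HSQLDB": (1, 8, 0),
    "MySQL": (5, 0),
    "Oracle": (10, 2, 0),
    "PostgreSQL": (8, 3),
    "CUBRID": (9, 2),
}


def parse_version(text):
    """Turn a dotted version such as '10.2.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_tested_version(database, version):
    """True if `version` is at or above the minimum tested version."""
    minimum = TESTED_MINIMUMS[database]
    found = parse_version(version)
    # Pad the shorter tuple with zeros so that '1.8' compares equal to '1.8.0'.
    width = max(len(minimum), len(found))
    pad = lambda v: v + (0,) * (width - len(v))
    return pad(found) >= pad(minimum)
```

With this sketch, `is_tested_version("MySQL", "5.7.30")` returns `True` and `is_tested_version("PostgreSQL", "8.2.23")` returns `False`. A `False` result does not mean Sqoop will fail, only that the version falls outside the tested range.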

Note:

  • We have to install the database vendor's JDBC driver in the $SQOOP_HOME/lib path on our client, even when Sqoop supports the database internally.
  • Sqoop can load classes from any jars in $SQOOP_HOME/lib on the client and use them as part of any MapReduce job it runs.
  • Unlike in older versions, we don't have to install the JDBC jars in the Hadoop library path on our servers.
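Before running a job, it can be handy to confirm that a driver jar is actually present in $SQOOP_HOME/lib. The following Python sketch does that check. The function name `find_driver_jars` is hypothetical, and matching on a fragment of the jar's file name is an assumption of this sketch, since driver jar names differ between vendors and releases.

```python
import os
from pathlib import Path


def find_driver_jars(name_fragment, sqoop_home=None):
    """List jars in $SQOOP_HOME/lib whose file name contains `name_fragment`.

    `sqoop_home` overrides the SQOOP_HOME environment variable when given.
    """
    home = sqoop_home or os.environ.get("SQOOP_HOME")
    if not home:
        raise RuntimeError("SQOOP_HOME is not set and no sqoop_home was given")

    lib_dir = Path(home) / "lib"
    if not lib_dir.is_dir():
        return []

    fragment = name_fragment.lower()
    # Case-insensitive substring match on the jar file name only.
    return sorted(
        jar.name for jar in lib_dir.glob("*.jar") if fragment in jar.name.lower()
    )
```

For instance, `find_driver_jars("mysql")` returns the names of any jars in the lib directory containing "mysql", and an empty list suggests the vendor driver still needs to be copied there.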

Summary

This article listed the databases supported by Apache Sqoop. I hope that after reading it you know the specific database versions that have been tested with Apache Sqoop.

If you have any questions about Sqoop database support, please share them with us in the comment box. We will get back to you as soon as possible.