Sqoop Validation – How Sqoop Validates Copied Data

Sqoop validation checks that the data Sqoop copied during an import or export actually arrived at the target. This article explains the concept in detail, starting with a short introduction to Sqoop validation.

It then covers the validation interfaces, syntax, and configuration, and finishes with the current limitations and some example invocations.

 

What is Sqoop Validation?

Sqoop validation means validating the data copied through either an import or an export. Its primary purpose is to confirm that the copy is complete by comparing the row count of the source with the row count of the target after the copy finishes.

Interfaces of Sqoop Validation

There are three interfaces of Sqoop Validation. They are:

a. ValidationThreshold

This interface determines whether the error margin between the source and the target is acceptable, for example absolute or percentage tolerant. The default implementation is AbsoluteValidationThreshold, which ensures that the row counts from the source and the target are the same.

b. ValidationFailureHandler

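This interface is responsible for failure handling, such as logging an error or warning, or aborting. Sqoop ships LogOnFailureHandler, which logs a warning message to the configured logger, and AbortOnFailureHandler, which aborts on failure and is the default value of the validation-failurehandler property.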

c. Validator

This interface drives the validation logic by delegating the decision to the ValidationThreshold and the failure handling to the ValidationFailureHandler. Its default implementation is RowCountValidator, which validates the row counts from the source and the target.
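To make the division of responsibilities concrete, the three roles can be modeled as plain shell functions. This is only an illustrative sketch: the real framework consists of Java classes inside Sqoop, and every name used here (absolute_threshold, percentage_threshold, log_on_failure, abort_on_failure, row_count_validator, and the TOLERANCE_PERCENT variable) is hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative model only; Sqoop implements these roles as Java classes.

# ValidationThreshold role: the two counts must be identical.
absolute_threshold() {
  local source_count=$1 target_count=$2
  [ "$source_count" -eq "$target_count" ]
}

# ValidationThreshold role (hypothetical variant): accept a difference of
# up to TOLERANCE_PERCENT percent of the source count (default 5).
percentage_threshold() {
  local source_count=$1 target_count=$2
  local tolerance_percent=${TOLERANCE_PERCENT:-5}
  local diff=$(( source_count - target_count ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( -diff ))
  fi
  [ $(( diff * 100 )) -le $(( source_count * tolerance_percent )) ]
}

# ValidationFailureHandler role: log a warning and carry on.
log_on_failure() {
  echo "WARN: validation failed: source=$1 target=$2" >&2
  return 0
}

# ValidationFailureHandler role: report the failure and signal an abort.
abort_on_failure() {
  echo "ERROR: validation failed: source=$1 target=$2" >&2
  return 1
}

# Validator role: delegate the decision to the threshold function ($3)
# and the failure handling to the handler function ($4).
row_count_validator() {
  local source_count=$1 target_count=$2 threshold=$3 handler=$4
  if "$threshold" "$source_count" "$target_count"; then
    echo "validation passed"
  else
    "$handler" "$source_count" "$target_count"
  fi
}

# Identical counts pass the absolute threshold.
row_count_validator 100 100 absolute_threshold abort_on_failure
# A mismatch is only logged when the logging handler is plugged in.
row_count_validator 100 98 absolute_threshold log_on_failure
# The same mismatch passes under a 5 percent tolerance.
row_count_validator 100 98 percentage_threshold abort_on_failure
```

Swapping the third or fourth argument changes the behavior without touching row_count_validator, which is the same idea as plugging a different class into the framework.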

Syntax of Sqoop Validation

The Syntax of Sqoop Validation is:

$ sqoop import (generic-args) (import-args)
$ sqoop export (generic-args) (export-args)

The validation arguments are part of the import and export arguments.

Sqoop Validation Configuration

The Sqoop validation framework is pluggable and extensible. It ships with default implementations, but the interfaces can be extended with custom implementations, which are passed as command-line arguments as described below.

Validator

Property:          validator
Description:       Driver for validation; must implement org.apache.sqoop.validation.Validator
Supported values:  The value must be a fully qualified class name.
Default value:     org.apache.sqoop.validation.RowCountValidator

Validation Threshold

Property:          validation-threshold
Description:       Drives the decision on whether the validation meets the threshold; must implement org.apache.sqoop.validation.ValidationThreshold
Supported values:  The value must be a fully qualified class name.
Default value:     org.apache.sqoop.validation.AbsoluteValidationThreshold

Validation Failure Handler

Property:          validation-failurehandler
Description:       Responsible for handling failures; must implement org.apache.sqoop.validation.ValidationFailureHandler
Supported values:  The value must be a fully qualified class name.
Default value:     org.apache.sqoop.validation.AbortOnFailureHandler

Limitations of Sqoop Validation

Currently, Sqoop validation only validates data copied from a single table into HDFS. The current implementation does not support validation for:

  • the all-tables option
  • the free-form query option
  • data imported into Hive, Accumulo, or HBase
  • table imports with the --where argument
  • incremental imports
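For the unsupported cases, a row-count check can still be done by hand. The sketch below is one possible approach, not a Sqoop feature: it counts the source rows with sqoop eval and the imported records with hdfs dfs -cat. It assumes the import wrote text files with one record per line, and that sqoop eval prints the count in a table-style line such as "| 42 |". The function names, the JDBC_URL, TARGET_DIR, and WHERE_CLAUSE variables, the target path, and the id column are all hypothetical.

```shell
#!/usr/bin/env bash
# Manual row-count check for an import that --validate does not cover.
# All defaults below are placeholders for illustration.
JDBC_URL="${JDBC_URL:-jdbc:mysql://localhost/demo_db}"
TARGET_DIR="${TARGET_DIR:-/user/hypothetical/emp_info}"
WHERE_CLAUSE="${WHERE_CLAUSE:-id > 100}"

# Pull the number out of a "| 42 |" style line (output format is assumed).
extract_count() {
  sed -n 's/^|[[:space:]]*\([0-9][0-9]*\)[[:space:]]*|.*$/\1/p' | head -n 1
}

# Count the matching rows in the source table.
source_count() {
  sqoop eval --connect "$JDBC_URL" \
    --query "SELECT COUNT(*) FROM emp_info WHERE $WHERE_CLAUSE" | extract_count
}

# Count the imported records, assuming one record per line in part files.
target_count() {
  hdfs dfs -cat "$TARGET_DIR/part-*" | wc -l | tr -d '[:space:]'
}

# Compare the two counts and return non-zero on a mismatch.
manual_validate() {
  local src tgt
  src=$(source_count)
  tgt=$(target_count)
  if [ "$src" = "$tgt" ]; then
    echo "row counts match: $src"
  else
    echo "row count mismatch: source=$src target=$tgt" >&2
    return 1
  fi
}

# Call manual_validate after the import has finished.
```

Nothing runs until manual_validate is called, so the file can be sourced safely and the variables adjusted first.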

Example Invocations of the Sqoop Validation

1: In this example, we import a table named emp_info from the demo_db database and use Sqoop validation to validate the row counts.

$ sqoop import --connect jdbc:mysql://localhost/demo_db \
--table emp_info --validate

2: In this example, we export the data in /results/com_info_data to a table named com_info with Sqoop validation enabled:

$ sqoop export --connect jdbc:mysql://localhost/demo_db --table com_info \
--export-dir /results/com_info_data --validate

3: In this example, we override the Sqoop validation arguments by naming each class explicitly:

$ sqoop import --connect jdbc:mysql://localhost/demo_db --table emp_info \
--validate --validator org.apache.sqoop.validation.RowCountValidator \
--validation-threshold \
org.apache.sqoop.validation.AbsoluteValidationThreshold \
--validation-failurehandler \
org.apache.sqoop.validation.AbortOnFailureHandler
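In a script, the result of a validated import is usually checked through the exit status of the command. The wrapper below is a minimal sketch that assumes a failed validation with AbortOnFailureHandler makes sqoop exit with a non-zero status. The function name run_validated_import is hypothetical.

```shell
#!/usr/bin/env bash
# Run a validated import of one table and report the outcome.
# Assumption: sqoop exits non-zero when the abort handler rejects the copy.
run_validated_import() {
  local table=$1
  if sqoop import --connect jdbc:mysql://localhost/demo_db --table "$table" \
       --validate \
       --validation-failurehandler \
       org.apache.sqoop.validation.AbortOnFailureHandler
  then
    echo "import of $table validated"
  else
    echo "import of $table failed or did not validate" >&2
    return 1
  fi
}

# Example call: run_validated_import emp_info
```

A non-zero status can also come from connection or job errors, so the Sqoop log is still the place to confirm the cause.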

Summary

I hope that after reading this article you understand how Sqoop validation works. You have seen its interfaces, syntax, and configuration, along with its current limitations and several example invocations.