Sqoop Validation – How Sqoop Validates Copied Data
Sqoop Validation checks the data that Sqoop has copied. This article covers the concept in detail, starting with a short introduction to Sqoop Validation.
It then explains its purpose, syntax, and configuration, and finally covers the Sqoop validation interfaces, limitations, and examples.
What is Sqoop Validation?
Sqoop validation means validating the data copied through either import or export.
Its primary purpose is to compare the row counts from the source and the target after the copy.
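The row-count comparison can be pictured with a small shell sketch. The function name row_counts_match is hypothetical and not part of Sqoop; it only illustrates the idea of comparing the two counts after a copy, assuming both counts are already known.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Sqoop): compare the number of rows read
# from the source with the number of rows written to the target.
row_counts_match() {
  local source_count="$1"
  local target_count="$2"

  if [ "$source_count" -eq "$target_count" ]; then
    echo "PASS: source=$source_count target=$target_count"
    return 0
  fi

  echo "FAIL: source=$source_count target=$target_count"
  return 1
}
```

Calling row_counts_match 100 100 reports a pass, while row_counts_match 100 99 reports a failure and returns a non-zero status.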
Interfaces of Sqoop Validation
There are three interfaces of Sqoop Validation. They are:
a. ValidationThreshold
This interface determines whether the error margin between the source and the target is acceptable, for example Absolute, Percentage Tolerant, etc. The default implementation is AbsoluteValidationThreshold, which ensures that the row counts from the source and the target are the same.
b. ValidationFailureHandler
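This interface is responsible for handling failures, for example by logging an error or warning, or by aborting. The available implementations include LogOnFailureHandler, which logs a warning message to the configured logger, and AbortOnFailureHandler, which the configuration section below lists as the default value.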
c. Validator
This interface drives the validation logic by delegating the decision to the ValidationThreshold and the failure handling to the ValidationFailureHandler. Its default implementation is RowCountValidator, which validates the row counts from the source and the target.
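The way the three roles cooperate can be sketched in shell. This is only an illustration of the delegation pattern described above: the real interfaces are Java types in org.apache.sqoop.validation, and every function name below is hypothetical.

```shell
#!/usr/bin/env bash
# Threshold role: decide whether the two counts are acceptable.
# "Absolute" here means the counts must be exactly equal.
absolute_threshold() {
  [ "$1" -eq "$2" ]
}

# Failure-handler role, variant 1: log a warning and carry on.
log_on_failure() {
  echo "WARN: validation failed: source=$1 target=$2" >&2
  return 0
}

# Failure-handler role, variant 2: report an error and signal failure.
abort_on_failure() {
  echo "ERROR: validation failed: source=$1 target=$2" >&2
  return 1
}

# Validator role: delegate the decision to the threshold function and,
# if the threshold is not met, delegate to the failure handler.
validate_row_counts() {
  local source_count="$1"
  local target_count="$2"
  local threshold="$3"
  local on_failure="$4"

  if "$threshold" "$source_count" "$target_count"; then
    return 0
  fi
  "$on_failure" "$source_count" "$target_count"
}
```

For example, validate_row_counts 10 9 absolute_threshold abort_on_failure returns a non-zero status, whereas the same call with log_on_failure only prints a warning. Swapping the function names passed in mirrors how the framework lets each role be replaced independently.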
Syntax of Sqoop Validation
The syntax of Sqoop Validation is:
$ sqoop import (generic-args) (import-args)
$ sqoop export (generic-args) (export-args)
The validation arguments are part of the import and export arguments.
Sqoop Validation Configuration
The Sqoop validation framework is pluggable and extensible. It comes with default implementations, but we can extend the interfaces with custom implementations and pass them as command-line arguments, as explained below.
Validator
Property: validator
Description: Driver for validation. It must implement org.apache.sqoop.validation.Validator
Supported values: The value must be a fully qualified class name.
Default value: org.apache.sqoop.validation.RowCountValidator
Validation Threshold
Property: validation-threshold
Description: It drives the decision on whether the validation meets the threshold or not. It must implement org.apache.sqoop.validation.ValidationThreshold
Supported values: The value must be a fully qualified class name.
Default value: org.apache.sqoop.validation.AbsoluteValidationThreshold
Validation Failure Handler
Property: validation-failurehandler
Description: It is responsible for handling failures. It must implement org.apache.sqoop.validation.ValidationFailureHandler
Supported values: The value must be a fully qualified class name.
Default value: org.apache.sqoop.validation.AbortOnFailureHandler
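To make the validation-threshold property more concrete, here is a sketch of the decision a percentage-tolerant threshold might take. A real custom threshold would have to be a Java class implementing org.apache.sqoop.validation.ValidationThreshold; the shell function below, including its name and its tolerance parameter, is a hypothetical illustration of the logic only.

```shell
#!/usr/bin/env bash
# Hypothetical percentage-tolerant check (not part of Sqoop).
# Succeeds when the absolute difference between the two counts is at most
# tolerance_percent percent of the source count.
percentage_tolerant_threshold() {
  local source_count="$1"
  local target_count="$2"
  local tolerance_percent="$3"

  # Absolute difference between the two counts.
  local diff=$(( source_count - target_count ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( -diff ))
  fi

  # diff / source <= tolerance / 100, rearranged to stay in integer math.
  [ $(( diff * 100 )) -le $(( tolerance_percent * source_count )) ]
}
```

With a tolerance of 1 percent, percentage_tolerant_threshold 1000 995 1 succeeds, while percentage_tolerant_threshold 1000 980 1 fails.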
Limitations of Sqoop Validation
Currently, Sqoop Validation validates only data copied from a single table into HDFS. The current implementation does not support:
- the all-tables option
- the free-form query option
- data imported into Hive, Accumulo, or HBase
- table imports with the --where argument
- incremental imports
Example Invocations of the Sqoop Validation
1: In this example, we import a table named emp_info from the demo_db database and use Sqoop Validation to validate the row counts.
$ sqoop import --connect jdbc:mysql://localhost/demo_db \
    --table emp_info --validate
2: In this example, we export a table named com_info with Sqoop validation enabled:
$ sqoop export --connect jdbc:mysql://localhost/demo_db --table com_info \
    --export-dir /results/com_info_data --validate
3: In this example, we override the Sqoop validation arguments:
$ sqoop import --connect jdbc:mysql://localhost/demo_db --table emp_info \
    --validate --validator org.apache.sqoop.validation.RowCountValidator \
    --validation-threshold \
    org.apache.sqoop.validation.AbsoluteValidationThreshold \
    --validation-failurehandler \
    org.apache.sqoop.validation.AbortOnFailureHandler
Summary
I hope that after reading this Sqoop Validation article, you understand the concept of Sqoop Validation. The article listed several examples to make it easier to follow. You are now familiar with the syntax, the interfaces, and the configuration of Sqoop Validation.