This page describes some basic use cases of the tool.
PLEASE NOTE: In the commands below, my_bq_conn refers to the connection name for your BigQuery project. The tables being validated come from the BigQuery public datasets, and each table is validated against itself for demonstration purposes.
Also note that if no aggregation flag is provided, the tool runs a COUNT(*) as the default aggregation.
data-validation validate column \
-sc my_bq_conn \
-tc my_bq_conn \
-tbls bigquery-public-data.new_york_citibike.citibike_trips
data-validation validate column \
-sc my_bq_conn \
-tc my_bq_conn \
-tbls bigquery-public-data.new_york_citibike.citibike_trips,bigquery-public-data.new_york_citibike.citibike_stations
data-validation validate column \
-sc my_bq_conn \
-tc my_bq_conn \
-tbls bigquery-public-data.new_york_citibike.citibike_trips \
-c citibike.yaml
The above command creates a YAML file named citibike.yaml that can be used to run the validation again in the future.
Please note: when the config-file (-c) option is provided, only the YAML file is created; the validation does not execute.
data-validation configs run \
-c citibike.yaml
The above command executes the validations stored in the config file named citibike.yaml.
data-validation generate-table-partitions \
-sc my_bq_conn \
-tc my_bq_conn \
-tbls bigquery-public-data.new_york_trees.tree_census_2015 \
--primary-keys tree_id \
--hash '*' \
--filters 'tree_id>3000' \
--config-dir partitions_dir \
--partition-num 200
The above command creates multiple partitioned validation configs based on the primary key. The number of configuration files generated is controlled by --partition-num.
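Once the partition configs are written, it can be handy to inspect what was generated. A minimal sketch (not part of DVT; the helper name and the assumption that the configs are written as .yaml files are mine):

```python
# Sketch: enumerate the partition configs written to --config-dir.
# "partitions_dir" matches the directory used in the command above.
from pathlib import Path


def list_partition_configs(config_dir: str) -> list[str]:
    """Return the generated partition config paths in filename order."""
    return sorted(str(p) for p in Path(config_dir).glob("*.yaml"))
```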
data-validation validate column \
-sc my_bq_conn \
-tc my_bq_conn \
-tbls bigquery-public-data.new_york_citibike.citibike_trips \
--count '*'
data-validation validate column \
-sc my_bq_conn \
-tc my_bq_conn \
-tbls bigquery-public-data.new_york_citibike.citibike_trips \
--count bikeid,gender
data-validation validate row \
-sc my_bq_conn \
-tc my_bq_conn \
-tbls bigquery-public-data.new_york_citibike.citibike_stations \
--primary-keys station_id \
--hash '*'
For Oracle and Postgres connections, numeric primary keys may not match between source and target due to precision or scale discrepancies in the data types. You may need to cast the primary key to a string for a valid match, which can be done by updating cast: null to cast: string in the generated YAML file:
primary_keys:
- cast: string
field_alias: dept_no
source_column: dept_no
target_column: dept_no
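If you have many generated configs to fix up, the edit can be scripted. A minimal sketch, assuming you pass in the path of a generated config; note that it rewrites every cast: null occurrence in the file, not only those under primary_keys:

```python
# Sketch: switch "cast: null" to "cast: string" in a generated DVT config
# so numeric primary keys are compared as strings. The function name and
# the idea of patching the file in place are assumptions, not DVT features.
from pathlib import Path


def cast_keys_to_string(config_path: str) -> None:
    """Rewrite every 'cast: null' entry in the config to 'cast: string'."""
    path = Path(config_path)
    text = path.read_text()
    # A plain text substitution is used for brevity; it affects all
    # cast entries in the file, so review the result before running it.
    path.write_text(text.replace("cast: null", "cast: string"))
```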
data-validation validate row \
-sc my_bq_conn \
-tc my_bq_conn \
-tbls bigquery-public-data.new_york_citibike.citibike_stations \
--filter-status fail \
--primary-keys station_id \
--hash '*'
data-validation validate row \
-sc my_bq_conn \
-tc my_bq_conn \
-tbls bigquery-public-data.new_york_citibike.citibike_stations \
--primary-keys station_id \
-comp-fields name \
-rr \
-rbs 100
data-validation validate column \
-sc my_bq_conn \
-tc my_bq_conn \
-tbls bigquery-public-data.new_york_citibike.citibike_trips \
--count tripduration,start_station_name \
-bqrh $YOUR_PROJECT_ID.pso_data_validator.results
Please replace $YOUR_PROJECT_ID with the project ID where you created your results dataset, as described in the installation section.
SELECT
run_id,
validation_name,
validation_type,
group_by_columns,
source_table_name,
source_agg_value,
target_agg_value,
pct_difference,
pct_threshold,
difference,
start_time
FROM
`pso_data_validator.results`
ORDER BY
start_time DESC
data-validation validate column \
-sc my_bq_conn \
-tc my_bq_conn \
-tbls bigquery-public-data.new_york_citibike.citibike_trips \
--grouped-columns bikeid
data-validation validate column \
-sc my_bq_conn \
-tc my_bq_conn \
-tbls bigquery-public-data.new_york_citibike.citibike_trips \
--grouped-columns bikeid,usertype
data-validation validate column \
-sc my_bq_conn \
-tc my_bq_conn \
-tbls bigquery-public-data.new_york_citibike.citibike_stations \
--sum num_bikes_available
data-validation validate column \
-sc my_bq_conn \
-tc my_bq_conn \
-tbls bigquery-public-data.new_york_citibike.citibike_stations \
--sum num_bikes_available,num_docks_available
data-validation validate column \
-sc my_bq_conn \
-tc my_bq_conn \
-tbls bigquery-public-data.new_york_citibike.citibike_stations \
--sum num_bikes_available,num_docks_available \
--avg num_bikes_disabled,num_docks_disabled
data-validation validate column \
-sc my_bq_conn \
-tc my_bq_conn \
-tbls bigquery-public-data.new_york_citibike.citibike_stations \
--grouped-columns region_id \
--sum num_bikes_available,num_docks_available \
--avg num_bikes_disabled,num_docks_disabled
data-validation validate column \
-sc my_bq_conn \
-tc my_bq_conn \
-tbls bigquery-public-data.new_york_citibike.citibike_stations \
--grouped-columns region_id \
--sum num_bikes_available,num_docks_available \
--filters 'region_id=71' \
-bqrh $YOUR_PROJECT_ID.pso_data_validator.results
data-validation validate column \
-sc my_bq_conn \
-tc my_bq_conn \
-tbls bigquery-public-data.new_york_citibike.citibike_trips \
--count tripduration,start_station_name \
-l tag=test-run,owner=name
data-validation validate schema \
-sc my_bq_conn \
-tc my_bq_conn \
-tbls bigquery-public-data.new_york_citibike.citibike_trips \
-bqrh $YOUR_PROJECT_ID.pso_data_validator.results
# Additional dependency needed for GCS files
pip install gcsfs
data-validation connections add \
--connection-name file_conn FileSystem \
--table-name $FILE_NAME \
--file-path gs://path/to/file \
--file-type csv
data-validation connections add \
--connection-name my_bq_conn BigQuery \
--project-id $YOUR_PROJECT_ID
# Validate GCS CSV file with BigQuery table
data-validation validate column \
-sc file_conn \
-tc my_bq_conn \
-tbls $FILE_NAME=$YOUR_PROJECT_ID.dataset.table \
--count $COLUMN
data-validation query
--conn CONNECTION_NAME The named connection to be queried.
--query, -q QUERY The raw query to run against the supplied connection.
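A hedged example of driving this command from Python via subprocess (the SQL text is illustrative; uncomment the run call once the CLI is installed and the connection exists):

```python
# Sketch: build the argv list for the query command shown above.
import subprocess

cmd = [
    "data-validation", "query",
    "--conn", "my_bq_conn",
    "--query",
    "SELECT COUNT(*) FROM `bigquery-public-data.new_york_citibike.citibike_stations`",
]
# subprocess.run(cmd, check=True)  # uncomment to execute against your connection
```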
result_handler:
project_id: my-project-id
table_id: pso_data_validator.results
type: BigQuery
google_service_account_key_path: path/to/sa.json
source: my_bq_conn
target: my_bq_conn
validations:
- aggregates:
- field_alias: count
source_column: null
target_column: null
type: count
calculated_fields: []
filters:
- source: region_id=71
target: region_id=71
type: custom
format: table
grouped_columns:
- cast: null
field_alias: region_id
source_column: region_id
target_column: region_id
labels:
- !!python/tuple
- user
- test
random_row_batch_size: null
schema_name: bigquery-public-data.new_york_citibike
table_name: citibike_stations
target_schema_name: bigquery-public-data.new_york_citibike
target_table_name: citibike_stations
threshold: 0.0
type: Column
use_random_rows: false
The NUMERIC data type in BigQuery is equivalent to DECIMAL(38,9). This configuration runs a SUM(CAST(column AS NUMERIC)) to avoid integer overflow.
result_handler: {}
source: my_bq_conn
target: my_bq_conn
validations:
- aggregates:
- field_alias: count
source_column: null
target_column: null
type: count
- field_alias: sum__int
source_column: cast__int
target_column: cast__int
type: sum
calculated_fields:
- depth: 0
field_alias: cast__int
source_calculated_columns:
- int
target_calculated_columns:
- int
type: cast
default_cast: decimal(38,9)
filters: []
format: table
labels: []
random_row_batch_size: null
schema_name: project.dataset
table_name: my_table
target_schema_name: project.dataset
target_table_name: my_table
threshold: 0.0
type: Column
use_random_rows: false
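A quick arithmetic check of why the cast helps, using the ranges involved (INT64 tops out at 19 significant digits, while NUMERIC keeps 29 integer digits alongside its 9 fractional digits):

```python
# Worked check: a SUM over INT64 values can exceed the INT64 range,
# while BigQuery NUMERIC (DECIMAL(38,9)) holds 38 digits in total.
INT64_MAX = 2**63 - 1            # 9,223,372,036,854,775,807 (19 digits)
NUMERIC_MAX_INTEGER_DIGITS = 29  # 38 total digits minus 9 fractional digits

# Even ~10 rows near INT64_MAX overflow a 64-bit accumulator...
big_sum = 10 * INT64_MAX
assert big_sum > INT64_MAX
# ...but still fit comfortably within NUMERIC's integer range (< 10**29).
assert big_sum < 10**NUMERIC_MAX_INTEGER_DIGITS
```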
This is a comparison field validation where DVT first applies the calculated field and then compares the resulting values.
result_handler: {}
source: my_bq_conn
target: my_bq_conn
validations:
- calculated_fields:
- depth: 0
field_alias: replace_name
source_calculated_columns:
- name
target_calculated_columns:
- name
type: custom
ibis_expr: ibis.expr.types.StringValue.replace
params:
- pattern: '/'
- replacement: '-'
comparison_fields:
- cast: null
field_alias: replace_name
source_column: replace_name
target_column: replace_name
filter_status: null
filters: []
format: table
labels: []
primary_keys:
- cast: null
field_alias: station_id
source_column: station_id
target_column: station_id
random_row_batch_size: '5'
schema_name: bigquery-public-data.new_york_citibike
table_name: citibike_stations
target_schema_name: bigquery-public-data.new_york_citibike
target_table_name: citibike_stations
threshold: 0.0
type: Row
use_random_rows: true
data-validation validate custom-query column \
--source-query-file source_query.sql \
--target-query-file target_query.sql \
-sc my_bq_conn \
-tc my_bq_conn
data-validation validate custom-query column \
--source-query-file source_query.sql \
--target-query-file target_query.sql \
-sc my_bq_conn \
-tc my_bq_conn \
--sum num_bikes_available
data-validation validate custom-query column \
--source-query-file source_query.sql \
--target-query-file target_query.sql \
-sc my_bq_conn \
-tc my_bq_conn \
--max num_bikes_available
data-validation validate custom-query column \
-sc my_bq_conn \
-tc my_bq_conn \
--source-query 'select bikeid, gender from bigquery-public-data.new_york_citibike.citibike_trips' \
--target-query 'select bikeid, gender from bigquery-public-data.new_york_citibike.citibike_trips' \
--count bikeid,gender
data-validation validate custom-query row \
--source-query-file source_query.sql \
--target-query-file target_query.sql \
-sc my_bq_conn \
-tc my_bq_conn \
--hash '*' \
--primary-keys station_id
Please replace source_query.sql and target_query.sql with files containing the SQL queries for the source and target databases respectively. The primary key columns must be included in the SELECT statement of both source_query.sql and target_query.sql.
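As a hedged convenience (not a DVT feature), a pre-flight check like the following can catch a missing primary-key column in either query file before the validation runs:

```python
# Sketch of a pre-flight check: confirm the primary-key column name
# appears in both custom query files before invoking the CLI.
# The helper name is an assumption; a simple substring match is used,
# so it cannot verify the column is actually in the SELECT list.
from pathlib import Path


def queries_select_column(column: str, *query_files: str) -> bool:
    """Return True only if every query file mentions the column name."""
    return all(
        column.lower() in Path(f).read_text().lower() for f in query_files
    )
```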