| 1 | +--- |
| 2 | +title: Data Validation and Automation |
| 3 | +jupyter: python3 |
| 4 | +toc-expand: 2 |
| 5 | +--- |
| 6 | + |
| 7 | +Pointblank’s CLI makes it easy to validate your data directly from the terminal. This is ideal for |
| 8 | +quick checks, CI/CD pipelines, and automation workflows. The `pb run` command serves as a runner for |
| 9 | +validation scripts written with the Pointblank Python API, allowing you to execute more complex |
| 10 | +validation logic from the command line. |
| 11 | + |
| 12 | +## Supported Data Sources |
| 13 | + |
| 14 | +You can validate a wide variety of data sources using the CLI: |
| 15 | + |
| 16 | +- CSV files: single files, glob patterns |
| 17 | +- Parquet files: single files, directories, partitioned datasets |
| 18 | +- GitHub URLs: for CSV/Parquet files (standard or raw URLs) |
| 19 | +- database tables: via connection strings |
| 20 | +- built-in datasets: provided by Pointblank |
| 21 | + |
| 22 | +::: {.callout-note} |
| 23 | + |
| 24 | +## Quick Reference for the Data Validation Commands |
| 25 | + |
| 26 | +| Command | Purpose | |
| 27 | +|-----------------|------------------------------------------------------| |
| 28 | +| `pb validate` | Run validation checks on your data | |
| 29 | +| `pb run` | Run a Python validation script from the CLI | |
| 30 | +| `pb make-template` | Generate a validation script template | |
| 31 | + |
| 32 | +::: |
| 33 | + |
| 34 | +## `pb validate`: Run Validation Checks |
| 35 | + |
| 36 | +Use `pb validate` to perform one or more validation checks on your data source. Here's the basic |
| 37 | +usage pattern: |
| 38 | + |
| 39 | +```bash |
| 40 | +pb validate [DATA_SOURCE] --check [CHECK_TYPE] [OPTIONS] |
| 41 | +``` |
| 42 | + |
| 43 | +Here are the supported checks and the required options in parentheses: |
| 44 | + |
| 45 | +- `rows-distinct`: check for duplicate rows (the default check; no options required) |
| 46 | +- `rows-complete`: check for missing values in any column |
| 47 | +- `col-exists`: check if a column exists (`--column`) |
| 48 | +- `col-vals-not-null`: check if a column has no null values (`--column`) |
| 49 | +- `col-vals-gt`: column values greater than a value (`--column`, `--value`) |
| 50 | +- `col-vals-ge`: column values greater than or equal to a value (`--column`, `--value`) |
| 51 | +- `col-vals-lt`: column values less than a value (`--column`, `--value`) |
| 52 | +- `col-vals-le`: column values less than or equal to a value (`--column`, `--value`) |
| 53 | +- `col-vals-in-set`: column values must be in a set (`--column`, `--set`) |
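
To make the comparison checks concrete, here is a minimal sketch in plain Python of what `col-vals-gt` and `col-vals-in-set` test for. It is a conceptual illustration only, not Pointblank's implementation: the sample data is made up, and null handling is ignored.

```python
import csv
import io

# Tiny in-memory CSV standing in for data.csv (hypothetical sample data).
CSV_TEXT = """id,score,status
1,72,active
2,41,inactive
3,88,archived
4,50,pending
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# col-vals-gt with --column score --value 50:
# a row fails unless its value is strictly greater than 50.
failing_gt = [r["id"] for r in rows if not float(r["score"]) > 50]

# col-vals-in-set with --column status --set "active,inactive,pending":
# a row fails if its value is not one of the listed members.
allowed = set("active,inactive,pending".split(","))
failing_set = [r["id"] for r in rows if r["status"] not in allowed]

print(failing_gt)   # ['2', '4']  (50 is not greater than 50)
print(failing_set)  # ['3']
```

The boundary row (`score` of exactly 50) is where `col-vals-gt` and `col-vals-ge` would differ.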
| 54 | + |
| 55 | +Here are a few examples: |
| 56 | + |
| 57 | +```bash |
| 58 | +# Check for duplicate rows (default) |
| 59 | +pb validate data.csv |
| 60 | + |
| 61 | +# Check if all values in 'age' are not null |
| 62 | +pb validate data.csv --check col-vals-not-null --column age |
| 63 | + |
| 64 | +# Check if all values in 'score' are greater than 50 |
| 65 | +pb validate data.csv --check col-vals-gt --column score --value 50 |
| 66 | + |
| 67 | +# Check if 'status' values are in a set |
| 68 | +pb validate data.csv --check col-vals-in-set --column status --set "active,inactive,pending" |
| 69 | +``` |
| 70 | + |
| 71 | +Several consecutive checks can be performed in one command! To do this, use `--check` multiple times |
| 72 | +(along with the check type and its required options). |
| 73 | + |
| 74 | +```bash |
| 75 | +# Perform multiple checks in one command |
| 76 | +pb validate data.csv --check rows-distinct --check col-vals-not-null --column id |
| 77 | +``` |
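
When commands like this are generated from another tool, it can help to assemble the argument list programmatically. The following is a small sketch using only the Python standard library; `build_validate_command` is a hypothetical helper (not part of Pointblank), and it only uses the options described above.

```python
import shlex


def build_validate_command(data_source, checks, exit_code=False):
    """Assemble a `pb validate` argument list (hypothetical helper).

    `checks` is a list of (check_type, options) pairs, where `options`
    maps an option name such as "column" or "value" to its value.
    """
    args = ["pb", "validate", data_source]
    for check_type, options in checks:
        # Each check gets its own --check, followed by its options.
        args += ["--check", check_type]
        for name, value in options.items():
            args += [f"--{name}", str(value)]
    if exit_code:
        args.append("--exit-code")
    return args


command = build_validate_command(
    "data.csv",
    [("rows-distinct", {}), ("col-vals-not-null", {"column": "id"})],
)
# Prints the same command line as the bash example above.
print(shlex.join(command))
```

Keeping the command as a list (rather than a single string) means it can be passed directly to `subprocess.run()` without shell quoting concerns.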
| 78 | + |
| 79 | +There are several useful options: |
| 80 | + |
| 81 | +- `--show-extract`: show failing rows if validation fails |
| 82 | +- `--write-extract TEXT`: save failing rows to a directory as CSV |
| 83 | +- `--limit INTEGER`: limit the number of failing rows shown/saved (default: 10) |
| 84 | +- `--exit-code`: exit with non-zero code if validation fails (for CI/CD) |
| 85 | +- `--list-checks`: list all available validation checks |
| 86 | + |
| 87 | +## `pb run`: Run Python Validation Scripts |
| 88 | + |
| 89 | +Use `pb run` to execute a Python script containing Pointblank validation logic. This is useful for |
| 90 | +more complex validations or automation. |
| 91 | + |
| 92 | +```bash |
| 93 | +pb run validation_script.py |
| 94 | +``` |
| 95 | + |
| 96 | +Here are the options: |
| 97 | + |
| 98 | +- `--data [DATA_SOURCE]`: override the data source in your script (available as `cli_data`) |
| 99 | +- `--output-html file.html`: save validation report as HTML |
| 100 | +- `--output-json file.json`: save validation summary as JSON |
| 101 | +- `--show-extract`: show failing rows for failed steps |
| 102 | +- `--write-extract TEXT`: save failing rows for each step as CSVs in a directory |
| 103 | +- `--limit INTEGER`: limit the number of failing rows shown/saved |
| 104 | +- `--fail-on [critical|error|warning|any]`: exit with error if any step meets/exceeds this severity |
| 105 | + |
| 106 | +Here's an example where we: |
| 107 | + |
| 108 | +- override the input data in the script |
| 109 | +- output the validation report table to a file |
| 110 | +- signal failure (i.e., exit with non-zero code) if any 'error' threshold is met |
| 111 | + |
| 112 | +```bash |
| 113 | +pb run my_validation.py --data data.csv --output-html report.html --fail-on error |
| 114 | +``` |
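
One way to picture the "meets/exceeds" rule behind `--fail-on` is as a comparison over ordered severity levels. The sketch below is an assumption-laden illustration, not Pointblank's code: it assumes the ordering warning < error < critical, and it assumes `any` means "fail if any step reached any level". The function name `should_fail` is hypothetical.

```python
# Assumed ordering of severity levels, lowest to highest. This mirrors the
# option's listed choices, not Pointblank's internal implementation.
SEVERITY_ORDER = ["warning", "error", "critical"]


def should_fail(step_severities, fail_on):
    """Return True if any step meets or exceeds the `fail_on` level.

    `step_severities` holds the highest level each step reached, or None
    for a step that stayed under every threshold.
    """
    reached = [s for s in step_severities if s is not None]
    if fail_on == "any":
        # Assumption: 'any' fails as soon as any level was reached.
        return bool(reached)
    threshold = SEVERITY_ORDER.index(fail_on)
    return any(SEVERITY_ORDER.index(s) >= threshold for s in reached)


steps = [None, "warning", None]
print(should_fail(steps, "error"))    # False: a warning is below 'error'
print(should_fail(steps, "warning"))  # True
```

Under these assumptions, `--fail-on error` would let a run with only warnings pass, while `--fail-on warning` would not.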
| 115 | + |
| 116 | +To scaffold a .py file for this, use `pb make-template`. |
| 117 | + |
| 118 | +## `pb make-template`: Generate a Validation Script Template |
| 119 | + |
| 120 | +Use this command to create a starter Python script for Pointblank validation: |
| 121 | + |
| 122 | +```bash |
| 123 | +pb make-template my_validation.py |
| 124 | +``` |
| 125 | + |
| 126 | +Edit the generated script to add your own data loading and validation rules, then run it with |
| 127 | +`pb run`. |
| 128 | + |
| 129 | +## Integration with CI/CD |
| 130 | + |
| 131 | +Validation through the CLI provides opportunities for automation. The following features lend |
| 132 | +themselves well to automated processes: |
| 133 | + |
| 134 | +- validation results are shown in a clear, color-coded table |
| 135 | +- `--show-extract` or `--write-extract` can be used to debug failing rows |
| 136 | +- the `--exit-code` or `--fail-on` options are ideal for CI/CD integration |
| 137 | + |
| 138 | + |
| 139 | +Here's an example of how one might integrate data validation through the Pointblank CLI into a CI/CD |
| 140 | +pipeline: |
| 141 | + |
| 142 | +```yaml |
| 143 | +# Example GitHub Actions workflow |
| 144 | +name: Data Validation |
| 145 | +on: [push, pull_request] |
| 146 | + |
| 147 | +jobs: |
| 148 | +  validate: |
| 149 | +    runs-on: ubuntu-latest |
| 150 | +    steps: |
| 151 | +      - uses: actions/checkout@v2 |
| 152 | +      - name: Set up Python |
| 153 | +        uses: actions/setup-python@v2 |
| 154 | +        with: |
| 155 | +          python-version: '3.9' |
| 156 | +      - name: Install dependencies |
| 157 | +        run: pip install pointblank |
| 158 | +      - name: Validate data quality |
| 159 | +        run: | |
| 160 | +          pb validate data/sales.csv --check rows-distinct --exit-code |
| 161 | +          pb validate data/sales.csv --check col-vals-not-null --column customer_id --exit-code |
| 162 | +          pb validate data/sales.csv --check col-vals-gt --column amount --value 0 --exit-code |
| 163 | +``` |
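
For CI systems without a workflow format like this, the same three checks can be driven from a short Python script. This is a sketch under stated assumptions: `COMMANDS` and `run_all` are hypothetical names, the script relies only on the `--exit-code` behavior described above, and the `runner` parameter exists so the logic can be exercised without `pb` installed.

```python
import subprocess

# The same three checks as in the workflow above, as argument lists.
COMMANDS = [
    ["pb", "validate", "data/sales.csv", "--check", "rows-distinct", "--exit-code"],
    ["pb", "validate", "data/sales.csv", "--check", "col-vals-not-null",
     "--column", "customer_id", "--exit-code"],
    ["pb", "validate", "data/sales.csv", "--check", "col-vals-gt",
     "--column", "amount", "--value", "0", "--exit-code"],
]


def run_all(commands, runner=subprocess.run):
    """Run every command and return the ones that exited non-zero."""
    failed = []
    for command in commands:
        result = runner(command)
        if result.returncode != 0:
            failed.append(command)
    return failed
```

A CI entry point could then call `sys.exit(1 if run_all(COMMANDS) else 0)`. Unlike a multi-line `run:` step, which typically stops at the first failing command, this sketch runs every check and reports all failures together.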
| 164 | + |
| 165 | +## Some Useful Tips |
| 166 | + |
| 167 | +- use `pb validate --list-checks` to see all available checks and usage examples. |
| 168 | +- use `pb run` for advanced validation logic or when you need to chain multiple steps. |
| 169 | +- use `pb make-template` to quickly scaffold new validation scripts. |
| 170 | + |
| 171 | +For more CLI usage examples and real terminal output, see the |
| 172 | +[CLI Demos](../demos/cli-interactive/index.qmd). |