
Commit a96b468

Merge pull request #235 from posit-dev/docs-cli-user-guide-articles
docs: create two API utility articles
2 parents: 483d8ab + 367dfe1

4 files changed (+297, −306 lines)

docs/_quarto.yml

Lines changed: 3 additions & 2 deletions
```diff
@@ -89,9 +89,10 @@ website:
           - user-guide/preview.qmd
           - user-guide/col-summary-tbl.qmd
           - user-guide/missing-vals-tbl.qmd
-      - section: "CLI Utility"
+      - section: "The Pointblank CLI"
         contents:
-          - user-guide/cli.qmd
+          - user-guide/cli-data-inspection.qmd
+          - user-guide/cli-data-validation.qmd
 
   page-footer:
     left: 'Proudly supported by <a href="https://www.posit.co/" class="no-icon"><img src="/assets/posit-logo-black.svg" alt="Posit" width="80" style="padding-left: 3px;vertical-align:text-top;"></a>'
```

docs/user-guide/cli-data-inspection.qmd

Lines changed: 122 additions & 0 deletions

---
title: Data Inspection and Exploration
jupyter: python3
toc-expand: 2
---

Pointblank's CLI makes it easy to inspect, preview, and explore your data before running
validations. This is useful for understanding your data's structure, checking for obvious issues,
and confirming that your data source is being read correctly.

## Supported Data Sources

You can inspect a wide variety of data sources using the CLI:

- CSV files: single files, glob patterns
- Parquet files: single files, directories, or partitioned datasets
- GitHub URLs: for CSV/Parquet files as standard or raw URLs
- database tables: via connection strings
- built-in datasets: provided by Pointblank

::: {.callout-note}

## Quick Reference for the Data Inspection Commands

| Command      | Purpose                                       |
|--------------|-----------------------------------------------|
| `pb info`    | Show table type, dimensions, columns, types   |
| `pb preview` | Preview head/tail rows, select columns        |
| `pb scan`    | Full column summary/profile (stats, NA, etc.) |
| `pb missing` | Visualize missing value patterns              |

:::

## `pb info`: Inspecting Data Structure

Use `pb info` to display basic information about your data source:

```bash
pb info data.csv
pb info "data/*.parquet"
pb info "duckdb:///warehouse/analytics.ddb::customer_metrics"
pb info small_table
```

This command shows the:

- table type (e.g., pandas, Polars, etc.)
- number of rows and columns
- data source path or identifier

## `pb preview`: Previewing Data

Use `pb preview` to view the first and last rows of your data, with flexible column selection:

```bash
pb preview data.csv
pb preview "data/*.parquet"
pb preview "https://github.com/user/repo/blob/main/data.csv"
pb preview "duckdb:///path/to/db.ddb::table_name"
pb preview small_table
```

Here are some useful options:

- `--rows N`: show N rows from the top (default: 5)
- `--columns "col1,col2"`: show only the specified columns
- `--col-range "1:10"`: show columns by position
- `--col-first N`: show the first N columns
- `--col-last N`: show the last N columns
- `--no-row-numbers`: hide row numbers
- `--output-html file.html`: save the preview as an HTML file

Here's an example where only the `name`, `age`, and `email` columns from `data.csv` are shown (and
we limit this to the top 10 rows):

```bash
pb preview data.csv --columns "name,age,email" --rows 10
```

## `pb scan`: Column Summary and Profiling

Use `pb scan` for a comprehensive column summary, including:

- data types
- missing value counts
- unique value counts
- summary statistics (mean, standard deviation, min, max, quartiles)

```bash
pb scan data.csv
pb scan "data/*.parquet"
pb scan "duckdb:///warehouse/analytics.ddb::customer_metrics"
pb scan small_table
```

Here are the options:

- `--columns "col1,col2"`: scan only the specified columns
- `--output-html file.html`: save the scan as an HTML report

## `pb missing`: Missing Value Patterns

Use `pb missing` to generate a missing values report, visualizing missingness across columns and row
sectors:

```bash
pb missing data.csv
pb missing "data/*.parquet"
pb missing "duckdb:///warehouse/analytics.ddb::customer_metrics"
pb missing small_table
```

There's an option here as well:

- `--output-html file.html`: save the missing values report as HTML

## Some Useful Tips on When and How to Use These Commands

- use `pb info` before running validations to confirm your data source can be loaded.
- use `pb preview` to quickly understand what the data looks like.
- use `pb missing` to visualize and diagnose missing data patterns.
- use `pb scan` for a quick data profile and to spot outliers or data quality issues.

docs/user-guide/cli-data-validation.qmd

Lines changed: 172 additions & 0 deletions

---
title: Data Validation and Automation
jupyter: python3
toc-expand: 2
---

Pointblank's CLI makes it easy to validate your data directly from the terminal. This is ideal for
quick checks, CI/CD pipelines, and automation workflows. The `pb run` command serves as a runner for
validation scripts written with the Pointblank Python API, allowing you to execute more complex
validation logic from the command line.

## Supported Data Sources

You can validate a wide variety of data sources using the CLI:

- CSV files: single files, glob patterns
- Parquet files: single files, directories, partitioned datasets
- GitHub URLs: for CSV/Parquet files (standard or raw URLs)
- database tables: via connection strings
- built-in datasets: provided by Pointblank

::: {.callout-note}

## Quick Reference for the Data Validation Commands

| Command            | Purpose                                     |
|--------------------|---------------------------------------------|
| `pb validate`      | Run validation checks on your data          |
| `pb run`           | Run a Python validation script from the CLI |
| `pb make-template` | Generate a validation script template       |

:::

## `pb validate`: Run Validation Checks

Use `pb validate` to perform one or more validation checks on your data source. Here's the basic
usage pattern:

```bash
pb validate [DATA_SOURCE] --check [CHECK_TYPE] [OPTIONS]
```

Here are the supported checks, with their required options in parentheses:

- `rows-distinct`: check for duplicate rows (the default check)
- `rows-complete`: check for missing values in any column
- `col-exists`: check that a column exists (`--column`)
- `col-vals-not-null`: check that a column has no null values (`--column`)
- `col-vals-gt`: column values greater than a value (`--column`, `--value`)
- `col-vals-ge`: column values greater than or equal to a value (`--column`, `--value`)
- `col-vals-lt`: column values less than a value (`--column`, `--value`)
- `col-vals-le`: column values less than or equal to a value (`--column`, `--value`)
- `col-vals-in-set`: column values must be in a set (`--column`, `--set`)

Here are a few examples:

```bash
# Check for duplicate rows (default)
pb validate data.csv

# Check if all values in 'age' are not null
pb validate data.csv --check col-vals-not-null --column age

# Check if all values in 'score' are greater than 50
pb validate data.csv --check col-vals-gt --column score --value 50

# Check if 'status' values are in a set
pb validate data.csv --check col-vals-in-set --column status --set "active,inactive,pending"
```

Several consecutive checks can be performed in one command! To do this, use `--check` multiple times
(along with each check type and its required options).

```bash
# Perform multiple checks in one command
pb validate data.csv --check rows-distinct --check col-vals-not-null --column id
```

There are several useful options:

- `--show-extract`: show failing rows if validation fails
- `--write-extract TEXT`: save failing rows to a directory as CSV
- `--limit INTEGER`: limit the number of failing rows shown/saved (default: 10)
- `--exit-code`: exit with a non-zero code if validation fails (for CI/CD)
- `--list-checks`: list all available validation checks

## `pb run`: Run Python Validation Scripts

Use `pb run` to execute a Python script containing Pointblank validation logic. This is useful for
more complex validations or automation.

```bash
pb run validation_script.py
```

Here are the options:

- `--data [DATA_SOURCE]`: override the data source in your script (available as `cli_data`)
- `--output-html file.html`: save the validation report as HTML
- `--output-json file.json`: save the validation summary as JSON
- `--show-extract`: show failing rows for failed steps
- `--write-extract TEXT`: save failing rows for each step as CSVs in a directory
- `--limit INTEGER`: limit the number of failing rows shown/saved
- `--fail-on [critical|error|warning|any]`: exit with an error if any step meets or exceeds this severity

Here's an example where we:

- override the input data in the script
- output the validation report table to a file
- signal failure (i.e., exit with a non-zero code) if any 'error' threshold is met

```bash
pb run my_validation.py --data data.csv --output-html report.html --fail-on error
```

To scaffold a `.py` file for this, use `pb make-template`.

## `pb make-template`: Generate a Validation Script Template

Use this command to create a starter Python script for Pointblank validation:

```bash
pb make-template my_validation.py
```

Edit the generated script to add your own data loading and validation rules, then run it with
`pb run`.

## Integration with CI/CD

Validation through the CLI provides opportunities for automation. The following features lend
themselves well to automated processes:

- validation results are shown in a clear, color-coded table
- `--show-extract` or `--write-extract` can be used to debug failing rows
- the `--exit-code` or `--fail-on` options are ideal for CI/CD integration

Here's an example of how one might integrate data validation through the Pointblank CLI into a CI/CD
pipeline:

```yaml
# Example GitHub Actions workflow
name: Data Validation
on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: pip install pointblank
      - name: Validate data quality
        run: |
          pb validate data/sales.csv --check rows-distinct --exit-code
          pb validate data/sales.csv --check col-vals-not-null --column customer_id --exit-code
          pb validate data/sales.csv --check col-vals-gt --column amount --value 0 --exit-code
```

## Some Useful Tips

- use `pb validate --list-checks` to see all available checks and usage examples.
- use `pb run` for advanced validation logic or when you need to chain multiple steps.
- use `pb make-template` to quickly scaffold new validation scripts.

For more CLI usage examples and real terminal output, see the
[CLI Demos](../demos/cli-interactive/index.qmd).
