
Commit a96b468

Merge pull request #235 from posit-dev/docs-cli-user-guide-articles
docs: create two API utility articles
2 parents: 483d8ab + 367dfe1

4 files changed (+297, −306 lines)

docs/_quarto.yml

Lines changed: 3 additions & 2 deletions
```diff
@@ -89,9 +89,10 @@ website:
           - user-guide/preview.qmd
           - user-guide/col-summary-tbl.qmd
           - user-guide/missing-vals-tbl.qmd
-      - section: "CLI Utility"
+      - section: "The Pointblank CLI"
         contents:
-          - user-guide/cli.qmd
+          - user-guide/cli-data-inspection.qmd
+          - user-guide/cli-data-validation.qmd
 
   page-footer:
     left: 'Proudly supported by <a href="https://www.posit.co/" class="no-icon"><img src="/assets/posit-logo-black.svg" alt="Posit" width="80" style="padding-left: 3px;vertical-align:text-top;"></a>'
```

docs/user-guide/cli-data-inspection.qmd

Lines changed: 122 additions & 0 deletions

---
title: Data Inspection and Exploration
jupyter: python3
toc-expand: 2
---

Pointblank's CLI makes it easy to inspect, preview, and explore your data before running
validations. This is useful for understanding your data's structure, checking for obvious issues,
and confirming that your data source is being read correctly.

## Supported Data Sources

You can inspect a wide variety of data sources using the CLI:

- CSV files: single files, glob patterns
- Parquet files: single files, directories, or partitioned datasets
- GitHub URLs: for CSV/Parquet files as standard or raw URLs
- database tables: via connection strings
- built-in datasets: provided by Pointblank

::: {.callout-note}

## Quick Reference for the Data Inspection Commands

| Command      | Purpose                                       |
|--------------|-----------------------------------------------|
| `pb info`    | Show table type, dimensions, columns, types   |
| `pb preview` | Preview head/tail rows, select columns        |
| `pb scan`    | Full column summary/profile (stats, NA, etc.) |
| `pb missing` | Visualize missing value patterns              |

:::

## `pb info`: Inspecting Data Structure

Use `pb info` to display basic information about your data source:

```bash
pb info data.csv
pb info "data/*.parquet"
pb info "duckdb:///warehouse/analytics.ddb::customer_metrics"
pb info small_table
```

This command shows the:

- table type (e.g., pandas, Polars, etc.)
- number of rows and columns
- data source path or identifier

## `pb preview`: Previewing Data

Use `pb preview` to view the first and last rows of your data, with flexible column selection:

```bash
pb preview data.csv
pb preview "data/*.parquet"
pb preview "https://github.com/user/repo/blob/main/data.csv"
pb preview "duckdb:///path/to/db.ddb::table_name"
pb preview small_table
```

Here are some useful options:

- `--rows N`: show N rows from the top (default: 5)
- `--columns "col1,col2"`: show only the specified columns
- `--col-range "1:10"`: show columns by position
- `--col-first N`: show the first N columns
- `--col-last N`: show the last N columns
- `--no-row-numbers`: hide row numbers
- `--output-html file.html`: save the preview as an HTML file

Here's an example where only the `name`, `age`, and `email` columns from `data.csv` are shown (and
we limit this to the top 10 rows):

```bash
pb preview data.csv --columns "name,age,email" --rows 10
```

## `pb scan`: Column Summary and Profiling

Use `pb scan` for a comprehensive column summary, including:

- data types
- missing value counts
- unique value counts
- summary statistics (mean, standard deviation, min, max, quartiles)

```bash
pb scan data.csv
pb scan "data/*.parquet"
pb scan "duckdb:///warehouse/analytics.ddb::customer_metrics"
pb scan small_table
```

Here are the options:

- `--columns "col1,col2"`: scan only the specified columns
- `--output-html file.html`: save the scan as an HTML report

## `pb missing`: Missing Value Patterns

Use `pb missing` to generate a missing values report, visualizing missingness across columns and row
sectors:

```bash
pb missing data.csv
pb missing "data/*.parquet"
pb missing "duckdb:///warehouse/analytics.ddb::customer_metrics"
pb missing small_table
```

There's an option here as well:

- `--output-html file.html`: save the missing values report as HTML

## Some Useful Tips on When and How to Use These Commands

- use `pb info` before running validations to confirm your data source can be loaded.
- use `pb preview` to quickly understand what the data looks like.
- use `pb missing` to visualize and diagnose missing data patterns.
- use `pb scan` for a quick data profile and to spot outliers or data quality issues.

docs/user-guide/cli-data-validation.qmd

Lines changed: 172 additions & 0 deletions

---
title: Data Validation and Automation
jupyter: python3
toc-expand: 2
---

Pointblank's CLI makes it easy to validate your data directly from the terminal. This is ideal for
quick checks, CI/CD pipelines, and automation workflows. The `pb run` command serves as a runner for
validation scripts written with the Pointblank Python API, allowing you to execute more complex
validation logic from the command line.

## Supported Data Sources

You can validate a wide variety of data sources using the CLI:

- CSV files: single files, glob patterns
- Parquet files: single files, directories, partitioned datasets
- GitHub URLs: for CSV/Parquet files (standard or raw URLs)
- database tables: via connection strings
- built-in datasets: provided by Pointblank

::: {.callout-note}

## Quick Reference for the Data Validation Commands

| Command            | Purpose                                     |
|--------------------|---------------------------------------------|
| `pb validate`      | Run validation checks on your data          |
| `pb run`           | Run a Python validation script from the CLI |
| `pb make-template` | Generate a validation script template       |

:::

## `pb validate`: Run Validation Checks

Use `pb validate` to perform one or more validation checks on your data source. Here's the basic
usage pattern:

```bash
pb validate [DATA_SOURCE] --check [CHECK_TYPE] [OPTIONS]
```

Here are the supported checks, with their required options in parentheses:

- `rows-distinct`: check for duplicate rows (the default check)
- `rows-complete`: check for missing values in any column
- `col-exists`: check that a column exists (`--column`)
- `col-vals-not-null`: check that a column has no null values (`--column`)
- `col-vals-gt`: column values greater than a value (`--column`, `--value`)
- `col-vals-ge`: column values greater than or equal to a value (`--column`, `--value`)
- `col-vals-lt`: column values less than a value (`--column`, `--value`)
- `col-vals-le`: column values less than or equal to a value (`--column`, `--value`)
- `col-vals-in-set`: column values must be in a set (`--column`, `--set`)

Here are a few examples:

```bash
# Check for duplicate rows (default)
pb validate data.csv

# Check if all values in 'age' are not null
pb validate data.csv --check col-vals-not-null --column age

# Check if all values in 'score' are greater than 50
pb validate data.csv --check col-vals-gt --column score --value 50

# Check if 'status' values are in a set
pb validate data.csv --check col-vals-in-set --column status --set "active,inactive,pending"
```

Several consecutive checks can be performed in one command! To do this, use `--check` multiple times
(along with each check type and its required options).

```bash
# Perform multiple checks in one command
pb validate data.csv --check rows-distinct --check col-vals-not-null --column id
```

There are several useful options:

- `--show-extract`: show failing rows if validation fails
- `--write-extract TEXT`: save failing rows to a directory as CSV
- `--limit INTEGER`: limit the number of failing rows shown/saved (default: 10)
- `--exit-code`: exit with a non-zero code if validation fails (for CI/CD)
- `--list-checks`: list all available validation checks

## `pb run`: Run Python Validation Scripts

Use `pb run` to execute a Python script containing Pointblank validation logic. This is useful for
more complex validations or automation.

```bash
pb run validation_script.py
```

Here are the options:

- `--data [DATA_SOURCE]`: override the data source in your script (available as `cli_data`)
- `--output-html file.html`: save the validation report as HTML
- `--output-json file.json`: save the validation summary as JSON
- `--show-extract`: show failing rows for failed steps
- `--write-extract TEXT`: save failing rows for each step as CSVs in a directory
- `--limit INTEGER`: limit the number of failing rows shown/saved
- `--fail-on [critical|error|warning|any]`: exit with an error if any step meets or exceeds this severity

Here's an example where we:

- override the input data in the script
- output the validation report table to a file
- signal failure (i.e., exit with a non-zero code) if any 'error' threshold is met

```bash
pb run my_validation.py --data data.csv --output-html report.html --fail-on error
```

To scaffold a `.py` file for this, use `pb make-template`.

## `pb make-template`: Generate a Validation Script Template

Use this command to create a starter Python script for Pointblank validation:

```bash
pb make-template my_validation.py
```

Edit the generated script to add your own data loading and validation rules, then run it with
`pb run`.

## Integration with CI/CD

Validation through the CLI provides opportunities for automation. The following features lend
themselves well to automated processes:

- validation results are shown in a clear, color-coded table
- `--show-extract` or `--write-extract` can be used to debug failing rows
- the `--exit-code` or `--fail-on` options are ideal for CI/CD integration

Here's an example of how one might integrate data validation through the Pointblank CLI into a CI/CD
pipeline:

```yaml
# Example GitHub Actions workflow
name: Data Validation
on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: pip install pointblank
      - name: Validate data quality
        run: |
          pb validate data/sales.csv --check rows-distinct --exit-code
          pb validate data/sales.csv --check col-vals-not-null --column customer_id --exit-code
          pb validate data/sales.csv --check col-vals-gt --column amount --value 0 --exit-code
```

## Some Useful Tips

- use `pb validate --list-checks` to see all available checks and usage examples.
- use `pb run` for advanced validation logic or when you need to chain multiple steps.
- use `pb make-template` to quickly scaffold new validation scripts.

For more CLI usage examples and real terminal output, see the
[CLI Demos](../demos/cli-interactive/index.qmd).
