
fix: ensure that the CLI uses the centralized ingest functionality #226


Merged: 10 commits (Jun 22, 2025)
README.md (18 changes: 12 additions & 6 deletions)
@@ -163,11 +163,14 @@ Pointblank includes a powerful CLI utility called `pb` that lets you run data va
# Get a quick preview of your data
pb preview small_table

# Check for missing values
pb missing small_table
# Preview data from GitHub URLs
pb preview "https://github.com/user/repo/blob/main/data.csv"

# Generate column summaries
pb scan small_table
# Check for missing values in Parquet files
pb missing data.parquet

# Generate column summaries from database connections
pb scan "duckdb:///data/sales.ddb::customers"
```

**Run Essential Validations**
@@ -176,8 +179,11 @@ pb scan small_table
# Check for duplicate rows
pb validate-simple small_table --check rows-distinct

# Verify no null values
pb validate-simple small_table --check col-vals-not-null --column a
# Validate data directly from GitHub
pb validate-simple "https://github.com/user/repo/blob/main/sales.csv" --check col-vals-not-null --column customer_id

# Verify no null values in Parquet datasets
pb validate-simple "data/*.parquet" --check col-vals-not-null --column a

# Extract failing data for debugging
pb validate-simple small_table --check col-vals-gt --column a --value 5 --show-extract
docs/demos/cli-interactive/index.qmd (37 changes: 18 additions & 19 deletions)
@@ -23,8 +23,6 @@ format:
}
---

# Command Line Interface: Interactive Demos

These CLI demos showcase practical data quality workflows that you can use!

::: {.callout-tip}
@@ -93,49 +91,50 @@ Follow an end-to-end data quality pipeline combining exploration, validation, an

Ready to implement data quality workflows? Here's how to get started:

### 1. Install and Verify
#### 1. Install and Verify

```bash
pip install pointblank
pb --help
```

### 2. Explore Your Data
#### 2. Explore Various Data Sources

```bash
# Get a quick preview of your data
# Built-in datasets
pb preview small_table

# Check for missing values
pb missing small_table
# Local files with patterns
pb preview "data/*.parquet"
pb scan sales_data.csv

# GitHub repositories (no download required)
pb preview "https://github.com/user/repo/blob/main/data.csv"
pb missing "https://raw.githubusercontent.com/user/repo/main/sales.parquet"

# Generate column summaries
pb scan small_table
# Database connections
pb info "duckdb:///warehouse/analytics.ddb::customers"
```

### 3. Run Essential Validations
#### 3. Run Essential Validations

```bash
# Check for duplicate rows
pb validate-simple small_table --check rows-distinct

# Verify no null values
pb validate-simple small_table --check col-vals-not-null --column a
# Validate data from multiple sources
pb validate-simple "data/*.parquet" --check col-vals-not-null --column customer_id
pb validate-simple "https://github.com/user/repo/blob/main/sales.csv" --check rows-distinct

# Extract failing data for debugging
pb validate-simple small_table --check col-vals-gt --column a --value 5 --show-extract
```

### 4. Integrate with CI/CD
#### 4. Integrate with CI/CD

```bash
# Use exit codes for automation (0 = pass, 1 = fail)
pb validate-simple small_table --check rows-distinct && echo "✅ Quality checks passed"
```
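
Building on the exit-code behavior above, here is a minimal sketch of a CI gate script; the checks and dataset names are illustrative, not prescribed:

```bash
#!/usr/bin/env bash
set -e  # abort on the first command that exits non-zero

# Each validation exits 0 on pass and 1 on fail, so `set -e`
# turns any failing check into a failed CI job.
pb validate-simple small_table --check rows-distinct
pb validate-simple small_table --check col-vals-not-null --column a

echo "✅ All quality checks passed"  # reached only if every check passed
```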

::: {.callout-tip}
## Next Steps
- Visit the [CLI User Guide](../../user-guide/cli.qmd) for detailed documentation
- Use `pb validate-simple --help` for validation command options
- Combine exploration, validation, and profiling for robust data quality pipelines
:::

docs/user-guide/cli.qmd (73 changes: 70 additions & 3 deletions)
@@ -44,24 +44,91 @@ You can validate various types of data sources:
pb validate-simple data.csv --check rows-distinct
```

*Parquet files*
*Parquet files (including glob patterns and directories)*

```bash
pb validate-simple data.parquet --check col-vals-not-null --column age
pb validate-simple "data/*.parquet" --check rows-distinct
pb validate-simple data/ --check rows-complete # directory of parquet files
```

*Database tables*
*GitHub URLs (direct links to CSV or Parquet files)*

```bash
pb validate-simple "https://github.com/user/repo/blob/main/data.csv" --check rows-distinct
pb validate-simple "https://raw.githubusercontent.com/user/repo/main/data.parquet" --check col-exists --column id
```

*Database tables (connection strings)*

```bash
pb validate-simple "duckdb:///path/to/db.ddb::table_name" --check rows-complete
```

*built-in datasets*
*Built-in datasets*

```bash
pb validate-simple small_table --check col-exists --column a
```

## Enhanced Data Source Support

The CLI leverages Pointblank's centralized data processing pipeline, providing comprehensive support for various data sources:

### GitHub Integration

Validate data directly from GitHub repositories without downloading files:

```bash
# Standard GitHub URLs (automatically converted to raw URLs)
pb preview "https://github.com/user/repo/blob/main/data.csv"
pb validate-simple "https://github.com/user/repo/blob/main/sales.csv" --check rows-distinct

# Raw GitHub URLs (used directly)
pb scan "https://raw.githubusercontent.com/user/repo/main/data.parquet"
```
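
For intuition, the blob-to-raw conversion mentioned above amounts to the following URL rewrite; this `sed` one-liner only illustrates the mapping and is not the CLI's actual implementation:

```bash
echo "https://github.com/user/repo/blob/main/data.csv" \
  | sed -E 's|github\.com/([^/]+)/([^/]+)/blob/|raw.githubusercontent.com/\1/\2/|'
# prints: https://raw.githubusercontent.com/user/repo/main/data.csv
```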

### Advanced File Patterns

Support for complex file patterns and directory structures:

```bash
# Glob patterns for multiple files
pb validate-simple "data/*.parquet" --check col-vals-not-null --column id
pb preview "sales_data_*.csv"

# Entire directories of Parquet files
pb scan data/partitioned_dataset/
pb missing warehouse/daily_reports/

# Partitioned datasets (automatically detects partition columns)
pb validate-simple partitioned_sales/ --check rows-distinct
```
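
To make partition detection concrete, a hypothetical Hive-style layout such as the one sketched below would be read as a single dataset, with `region` presumably surfaced as a partition column:

```bash
# Illustrative directory layout (hypothetical):
#   partitioned_sales/
#   ├── region=east/part-0.parquet
#   └── region=west/part-0.parquet

# Pointing a command at the directory covers every file beneath it
pb validate-simple partitioned_sales/ --check rows-distinct
```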

### Database Connections

Enhanced support for database connection strings:

```bash
# DuckDB databases with table specification
pb validate-simple "duckdb:///warehouse/analytics.ddb::customer_metrics" --check col-exists --column customer_id

# Preview database tables
pb preview "duckdb:///data/sales.ddb::transactions"
```
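
As a reading aid, the connection string used above breaks down as follows (this annotates the example; it introduces no new syntax):

```bash
# duckdb://                 -> backend scheme
# /warehouse/analytics.ddb  -> path to the database file
# ::customer_metrics        -> table to read, after the `::` separator
pb preview "duckdb:///warehouse/analytics.ddb::customer_metrics"
```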

### Automatic Data Type Detection

The CLI automatically detects and handles:

- CSV files: single files or glob patterns
- Parquet files: files, patterns, directories, and partitioned datasets
- GitHub URLs: both standard and raw URLs for CSV/Parquet files
- Database connections: connection strings with table specifications
- Built-in datasets: Pointblank's included sample datasets

This unified approach means you can use the same CLI commands regardless of where your data is stored.
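
As a quick illustration of that uniformity, the same check can be pointed at each source type without changing anything but the source argument (paths and URLs below are placeholders):

```bash
pb validate-simple data.csv --check rows-distinct
pb validate-simple "data/*.parquet" --check rows-distinct
pb validate-simple "https://github.com/user/repo/blob/main/data.csv" --check rows-distinct
pb validate-simple "duckdb:///path/to/db.ddb::table_name" --check rows-distinct
pb validate-simple small_table --check rows-distinct
```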

## Available Validation Checks

### Data Completeness