docs(extract-v2): complete walkthrough cookbook #61

Merged
eli-stewart merged 23 commits into main from eli/extract-v2-cookbook
Apr 1, 2026

@eli-stewart eli-stewart commented Mar 28, 2026

Summary

Complete Extract V2 cookbook as a runnable Jupyter notebook. Works locally and on Google Colab.

Sections (11)

  1. Setup - SDK install, API key prompt, file upload with auto-download for Colab
  2. Schema Generation & Validation - generate_schema from prompt/file, validate_schema
  3. Quick Start - Two-step flow (create + wait_for_completion) and run() shortcut
  4. Complex Document - 16-page Transformer patent, 15 fields with arrays
  5. Citations & Bounding Boxes - cite_sources with PDF overlay visualization (applicant, grant_date, num_claims)
  6. Confidence Scores - Per-field confidence with bar chart and threshold filtering
  7. Agentic vs Cost Effective - Side-by-side tier comparison on SaaS slide
  8. Per-Page Extraction - per_page target on patent
  9. Advanced Options - target_pages, system_prompt, saved extract/parse configurations via API
  10. Typed Results - ExtractedData.from_extract_job for Pydantic model parsing
  11. Job Management - list, get with expand, delete
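The threshold filtering listed under section 6 can be pictured with a minimal sketch. The field names come from this PR, but the scores, the 0.8 cutoff, and the helper name `split_by_confidence` are hypothetical, and the confidence metadata returned by the SDK may be shaped differently.

```python
def split_by_confidence(scores: dict[str, float], threshold: float = 0.8):
    """Partition fields into (accepted, needs_review) by confidence score."""
    accepted = {name: score for name, score in scores.items() if score >= threshold}
    needs_review = {name: score for name, score in scores.items() if score < threshold}
    return accepted, needs_review


# Hypothetical scores, for illustration only (not real extraction output).
example_scores = {"applicant": 0.97, "grant_date": 0.91, "num_claims": 0.62}
accepted, needs_review = split_by_confidence(example_scores)
```

Fields that land in `needs_review` are the ones a reader would check against the source document by hand.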

Test files

  • Receipt (Noisebridge hackerspace, public)
  • Patent (US10452978 Transformer, public domain)
  • SaaS slide (CloudFlow Analytics, synthetic)

Notes

Test plan

  • 124/124 automated tests passed against staging API
  • All 11 sections run end-to-end
  • Colab compatibility verified (poppler install, file download, getpass)
  • Manual review of outputs

@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Runnable Jupyter notebook covering all Extract V2 features:
- Basic extraction, complex multi-page document
- Citations with bounding box visualization
- Confidence scores with charts
- Agentic vs cost_effective comparison
- Per-page extraction, advanced options, job management

Uses SDK 2.0. Test files included (all public/synthetic).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@eli-stewart eli-stewart force-pushed the eli/extract-v2-cookbook branch from 99c932c to df36a3a March 28, 2026 06:40
eli-stewart and others added 9 commits March 31, 2026 12:56
…pport

- Switch from create()+wait_for_completion() to run() as primary pattern
- Add Section 10: ExtractedData.from_extract_job for typed Pydantic results
- Install from PyPI (llama-cloud 2.0.0) instead of git branch
- Add poppler install for Colab compatibility
- Add getpass() fallback for API key input
- Fix config list endpoint (items not data)
- Add project_id note for multi-project setups
- Remove duplicate cells and stale references
- 11 sections, 53 cells, all tested against live API
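A rough sketch of the getpass() fallback for API key input follows. The environment variable name `LLAMA_CLOUD_API_KEY` and the helper `resolve_api_key` are assumptions made for illustration; the notebook's actual cell may differ.

```python
import os
from getpass import getpass


def resolve_api_key(env_var="LLAMA_CLOUD_API_KEY", environ=None, prompt=getpass):
    """Return the API key from the environment, prompting only when it is missing.

    env_var is an assumed variable name; environ and prompt are injectable
    so the function can be exercised without touching real credentials.
    """
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        # getpass hides the typed value, which keeps it out of notebook output.
        key = prompt(f"Enter {env_var}: ").strip()
    if not key:
        raise ValueError("An API key is required to run the notebook.")
    return key
```

Prompting only when the variable is unset lets the same cell run unattended locally and interactively on Colab.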

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Colab only gets the notebook, not the repo file tree.
Added urllib download step that fetches test PDFs and schemas
from raw GitHub URLs on first run. Skips if files already exist locally.
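The download step described here could look roughly like the sketch below. The helper name `ensure_local_file` is hypothetical and the real notebook cell may be structured differently; only `urllib.request.urlretrieve` and `pathlib` from the standard library are used.

```python
import urllib.request
from pathlib import Path


def ensure_local_file(url: str, destination: Path) -> bool:
    """Download url to destination unless the file already exists.

    Returns True when a download happened and False when it was skipped.
    """
    if destination.exists():
        # Local clones already have the test files, so nothing is fetched.
        return False
    destination.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(destination))
    return True
```

Returning a boolean makes it easy for the calling cell to print whether each file was fetched or reused.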

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
So the cookbook doesn't need updating after the branch merges.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tries the feature branch URL first, falls back to main.
Works during testing and after merge without changes.
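One way to express the try-first-then-fall-back behaviour is an ordered list of candidate URLs, as in this sketch. The helper name `fetch_first_available` is hypothetical, and the actual branch URLs are deliberately left to the caller.

```python
import urllib.error
import urllib.request


def fetch_first_available(candidate_urls, opener=urllib.request.urlopen):
    """Return (url, content) for the first candidate that can be fetched."""
    last_error = None
    for url in candidate_urls:
        try:
            with opener(url) as response:
                return url, response.read()
        except (urllib.error.URLError, OSError) as error:
            # A missing file on one branch is expected; try the next URL.
            last_error = error
    raise RuntimeError(
        f"None of the candidate URLs could be fetched: {list(candidate_urls)}"
    ) from last_error
```

Because the caller controls the order of `candidate_urls`, the same helper covers either branch being tried first.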

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Spell out abbreviations, name variables after what they represent.
schema_from_prompt, cost_effective_job, confidence_metadata, etc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The receipt has $10 for total/subtotal/amountPaid/unitPrice, so all
bounding boxes highlight the same area. The patent has title, assignee,
filing_date, grant_date, etc. in different parts of the cover page.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Abstract bbox covers half the page. Use patent_number, title, filing_date,
grant_date, assignee, num_claims, primary_examiner for distinct bounding boxes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
patent_number, applicant, grant_date, num_claims have clean bounding boxes.
Dropped title (was matching a cited reference) and assignee (mis-tagged).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
applicant, grant_date, num_claims all produce tight, accurate bboxes.
Dropped fields that return no citation or oversized bounding boxes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@eli-stewart eli-stewart requested a review from Georgehe4 April 1, 2026 06:40
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@eli-stewart eli-stewart marked this pull request as ready for review April 1, 2026 06:47
eli-stewart and others added 10 commits March 31, 2026 23:47
- Remove unused pydantic.Field import
- Add from __future__ import annotations for list[] syntax

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
stdlib -> third-party -> local, alphabetical within groups,
blank lines between groups.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Colab already has poppler. The apt-get call was hanging indefinitely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Files aren't on main yet. Try main first, fall back to feature branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Colab doesn't have poppler. Previous hang was likely apt-get without
update first. Now runs apt-get update then install in one line.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…c pattern

- Rename document_input_value -> file_input throughout (per platform PR #16633)
- Remove Section 10 (ExtractedData) per Logan's feedback - internal API
- Add Pydantic model_validate pattern in Section 3 (simpler, user-facing)
- Renumber sections (now 1-10)
- Update summary

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
eli-stewart and others added 2 commits April 1, 2026 14:24
Remove feature branch fallback. After merge, files live on main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test file URLs now use an immutable commit SHA instead of branch names.
Notebook code evolves with main, but data files are pinned so existing
Colab links never break if files are reorganized.
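Pinning can be sketched as building the raw-file URL from a full commit SHA instead of a branch name. The helper `pinned_raw_url` and any owner, repository, or path passed to it are hypothetical; the PR does not state the exact URLs used.

```python
import re

RAW_HOST = "https://raw.githubusercontent.com"


def pinned_raw_url(owner: str, repo: str, commit_sha: str, path: str) -> str:
    """Build a raw GitHub URL pinned to a full 40-character commit SHA."""
    if not re.fullmatch(r"[0-9a-f]{40}", commit_sha):
        # Branch names and short SHAs are rejected because they are not
        # guaranteed to keep pointing at the same file contents.
        raise ValueError("commit_sha must be a full 40-character hex SHA")
    return f"{RAW_HOST}/{owner}/{repo}/{commit_sha}/{path.lstrip('/')}"
```

Rejecting anything that is not a full SHA keeps a branch name from slipping back into the data URLs by accident.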

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@eli-stewart eli-stewart merged commit e45ac66 into main Apr 1, 2026
7 checks passed
@stainless-app stainless-app bot mentioned this pull request Apr 1, 2026
2 participants