🧭 Quick Return to Map
You are in a sub-page of OCR_Parsing.
To reorient, go back here:
- OCR_Parsing — text recognition and document structure parsing
- WFGY Global Fix Map — main Emergency Room, 300+ structured fixes
- WFGY Problem Map 1.0 — 16 reproducible failure modes
Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.
Stabilize OCR extraction on noisy scans, low-resolution images, and multi-generation photocopies. Ensure text is auditable, retrievable, and bound by schema despite quality issues.
- OCR parsing checklist: ocr-parsing-checklist.md
- Data contracts: data-contracts.md
- Hallucination control: hallucination.md
- Chunking guide: chunking-checklist.md
- OCR character error rate (CER) ≤ 2% after cleanup
- ΔS(question, retrieved) ≤ 0.45 even when scan quality < 300 dpi
- λ remains convergent across paraphrases
- All extracted text auditable against source image hash
- **Broken characters and merged glyphs**: Apply normalization and Unicode repair before indexing. Validate against a whitelist of expected character ranges.
- **Multi-generation photocopy blur**: Route through an OCR engine that supports adaptive binarization. Anchor outputs with the image hash to avoid ghost drift.
- **Double-encoded PDFs (text + image overlay)**: Deduplicate the layers. Choose the higher-confidence text layer and tag the source.
- **Skewed pages or rotated scans**: Run a deskew filter before OCR. Capture the skew angle as metadata for audit.
- **Mixed-language or font variants**: Force language models per region. Split by script. Store a per-block language code.
- **Noise artifacts (staple marks, stamps, watermarks)**: Strip bounding boxes below the token threshold. Mark them as `noise_block` instead of narrative text.
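The normalization and whitelist repair above can be sketched in a few lines. This is a minimal sketch, assuming Latin script plus common punctuation; the `ALLOWED` ranges and the `clean_ocr_text` helper are illustrative names, not part of any vendor API.

```python
import re
import unicodedata

# Whitelist of expected Unicode ranges (assumption: Latin text plus
# common punctuation; extend the ranges for other scripts).
ALLOWED = re.compile(r"[^\u0020-\u007E\u00A0-\u017F\u2010-\u2027]")

def clean_ocr_text(raw: str) -> tuple[str, list[str]]:
    """Normalize OCR output and report out-of-range characters."""
    # NFKC folds broken ligatures (e.g. "ﬁ" -> "fi") and width variants.
    text = unicodedata.normalize("NFKC", raw)
    # Collapse whitespace runs introduced by merged or split glyphs.
    text = re.sub(r"\s+", " ", text).strip()
    violations = ALLOWED.findall(text)
    return text, violations
```

Blocks that return violations should be routed to re-OCR or flagged, not silently indexed.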
- **Hash the source image**: Store `scan_id` and `image_hash` for every page. Tie all extracted text back to this anchor.
- **Normalize text**: Apply Unicode NFKC. Collapse broken ligatures and fix spacing errors.
- **De-layer double PDFs**: Choose the OCR text layer with confidence ≥ 0.90. Drop the shadow text.
- **Audit with ΔS**: Probe the scanned text with 3 paraphrases. If ΔS ≥ 0.60, re-run OCR with stricter binarization.
- **Chunk and contract**: Split by page. Enforce the data contract fields `page_no`, `scan_id`, `text_clean`, `bbox`.
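The hash-and-contract steps above can be sketched as a single per-page pass. A minimal sketch: `page_record` is a hypothetical helper, and the short `scan_id` format is an assumption, not a fixed convention of this repo.

```python
import hashlib

def page_record(page_no: int, image_bytes: bytes,
                text_clean: str, confidence: float) -> dict:
    """Build one per-page contract record, anchored to the image hash."""
    image_hash = "sha256:" + hashlib.sha256(image_bytes).hexdigest()
    return {
        # Short, auditable id derived from the page number and hash prefix.
        "scan_id": f"p{page_no}_{image_hash[7:19]}",
        "page_no": page_no,
        "image_hash": image_hash,
        "text_clean": text_clean,
        "confidence": confidence,
    }
```

Every downstream chunk should carry `scan_id` so answers remain traceable to the source image.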
- **Google Document AI**: Use the `qualityScores.confidence` field. Reject blocks with confidence < 0.7.
- **AWS Textract**: Hash `BlockType=PAGE`. Keep the page-level confidence. Store it as `scan_id`.
- **Azure OCR**: Normalize `boundingRegions`. Add the `language` code explicitly if detected.
- **ABBYY**: Use `<charParams>` confidence. Flag low-confidence segments for secondary OCR.
- **PaddleOCR**: Use angle classification for deskew. Split multilingual pages into per-line language tags.
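The vendor notes above share one pattern: gate every block on its confidence score before indexing. A minimal, vendor-agnostic sketch, assuming blocks have already been flattened into dicts with `text` and `confidence` keys (this dict layout is an assumption, not any SDK's response shape):

```python
MIN_CONF = 0.7  # rejection threshold from the vendor notes above

def keep_blocks(blocks: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split OCR blocks into indexable text and flagged low-confidence blocks."""
    kept, flagged = [], []
    for block in blocks:
        if block.get("confidence", 0.0) >= MIN_CONF:
            kept.append(block)
        else:
            flagged.append(block)  # route to secondary OCR, do not index
    return kept, flagged
```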
```json
{
  "scan_id": "p12_imghash",
  "page_no": 12,
  "image_hash": "sha256:...",
  "text_clean": "...",
  "language": "en",
  "confidence": 0.92,
  "noise_blocks": [...],
  "source_url": "..."
}
```
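A record in this shape can be checked mechanically before indexing. A minimal sketch of contract validation; `validate_record` and its error strings are illustrative, not part of data-contracts.md.

```python
REQUIRED = {"scan_id", "page_no", "image_hash",
            "text_clean", "language", "confidence"}

def validate_record(rec: dict) -> list[str]:
    """Return a list of contract violations for one page record."""
    errors = [f"missing field: {k}" for k in sorted(REQUIRED - rec.keys())]
    if not str(rec.get("image_hash", "")).startswith("sha256:"):
        errors.append("image_hash must be a sha256 anchor")
    if not 0.0 <= float(rec.get("confidence", -1)) <= 1.0:
        errors.append("confidence out of range [0, 1]")
    return errors
```

Records with a non-empty error list should fail fast rather than enter the index.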
- Leak check: ensure no shadow/duplicate text.
- Quality probe: CER ≤ 2% on 1k sample chars.
- Stability probe: ΔS stable across paraphrases.
- Auditability: all text traceable to image hash.
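The CER probe above can be computed with a plain edit-distance pass, so no OCR toolkit is required. A minimal sketch; selecting the 1k-character sample is left to the caller.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution
        prev = cur
    return prev[n] / max(m, 1)
```

Pages with `cer(...) > 0.02` against a ground-truth sample fail the quality probe and should be re-OCRed.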
You have TXTOS and the WFGY Problem Map.
My scan:
* page_no: {n}
* text_clean: "..."
* confidence: 0.xx
* image_hash: "..."
Tasks:
1. If the text looks corrupted, fail fast and cite the fix page.
2. Validate the schema (ocr-parsing-checklist, data-contracts).
3. Return JSON: { "answer": "...", "citations": [...], "ΔS": 0.xx, "λ_state": "..." }
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
| Layer | Page | What it’s for |
|---|---|---|
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT-based Singularity tension engine (131 S-class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16-problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text to image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |
If this repository helped, starring it improves discovery so more builders can find the docs and tools.