Commit c6308dc

[MS] Add OCR layer service for embedded images and PDF scans (#1541)
* Add OCR test data and implement tests for various document formats
  - Created HTML file with multiple images for testing OCR extraction.
  - Added several PDF files with different layouts and image placements to validate OCR functionality.
  - Introduced PPTX files with complex layouts and images at various positions for comprehensive testing.
  - Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction.
  - Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy.
* Enhance OCR functionality and validation in document converters
  - Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency.
  - Implement detailed validation for OCR text positioning relative to surrounding text in test cases.
  - Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present.
  - Improve error handling and logging for better debugging during OCR extraction.
* Add support for scanned PDFs with full-page OCR fallback and implement tests
* Bump version to 0.1.6b1 in `__about__.py`
* Refactor OCR services to support LLM Vision, update README and tests accordingly
* Add OCR-enabled converters and ensure consistent OCR format across document types
* Refactor converters to improve import organization and enhance OCR functionality across DOCX, PDF, PPTX, and XLSX converters
* Refactor exception imports for consistency across converters and tests
* Fix OCR tests to match MockOCRService output and fix cross-platform file URI handling
* Bump version to 0.1.6b1 in `__about__.py`
* Skip DOCX/XLSX/PPTX OCR tests when optional dependencies are missing
* Add comprehensive OCR test suite for various document formats
  - Introduced multiple test documents for PDF, DOCX, XLSX, and PPTX formats, covering scenarios with images at the start, middle, and end.
  - Implemented tests for complex layouts, multi-page documents, and documents with multiple images.
  - Created a new test script `test_ocr.py` to validate OCR functionality, ensuring context preservation and accurate text extraction.
  - Added expected OCR results for validation against ground truth.
  - Included tests for scanned documents to verify OCR fallback mechanisms.
* Remove obsolete HTML test files and refactor test cases for file URIs and OCR format consistency
  - Deleted `html_image_start.html` and `html_multiple_images.html` as they are no longer needed.
  - Updated `test_file_uris` in `test_module_misc.py` to simplify assertions by removing unnecessary `url2pathname` usage.
  - Removed `test_ocr_format_consistency.py` as it is no longer relevant to the current testing framework.
* Refactor OCR processing in PdfConverterWithOCR and enhance unit tests for multipage PDFs
* Revert
* Revert
* Update READMEs
* Refactor import statements for consistency and improve formatting in converter and test files
1 parent 4a5340f commit c6308dc

45 files changed (+2464, -2 lines changed)

README.md

Lines changed: 33 additions & 1 deletion
@@ -9,7 +9,7 @@

> [!IMPORTANT]
> Breaking changes between 0.0.1 to 0.1.0:
- > * Dependencies are now organized into optional feature-groups (further details below). Use `pip install 'markitdown[all]'` to have backward-compatible behavior.
+ > * Dependencies are now organized into optional feature-groups (further details below). Use `pip install 'markitdown[all]'` to have backward-compatible behavior.
> * convert\_stream() now requires a binary file-like object (e.g., a file opened in binary mode, or an io.BytesIO object). This is a breaking change from the previous version, where it previously also accepted text file-like objects, like io.StringIO.
> * The DocumentConverter class interface has changed to read from file-like streams rather than file paths. *No temporary files are created anymore*. If you are the maintainer of a plugin, or custom DocumentConverter, you likely need to update your code. Otherwise, if only using the MarkItDown class or CLI (as in these examples), you should not need to change anything.

@@ -132,6 +132,38 @@ markitdown --use-plugins path-to-file.pdf

To find available plugins, search GitHub for the hashtag `#markitdown-plugin`. To develop a plugin, see `packages/markitdown-sample-plugin`.

#### markitdown-ocr Plugin

The `markitdown-ocr` plugin adds OCR support to PDF, DOCX, PPTX, and XLSX converters, extracting text from embedded images using LLM Vision, the same `llm_client` / `llm_model` pattern that MarkItDown already uses for image descriptions. No new ML libraries or binary dependencies required.

**Installation:**

```bash
pip install markitdown-ocr
pip install openai  # or any OpenAI-compatible client
```

**Usage:**

Pass the same `llm_client` and `llm_model` you would use for image descriptions:

```python
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)
result = md.convert("document_with_images.pdf")
print(result.text_content)
```

If no `llm_client` is provided, the plugin still loads, but OCR is silently skipped and the standard built-in converter is used instead.

See [`packages/markitdown-ocr/README.md`](packages/markitdown-ocr/README.md) for detailed documentation.

### Azure Document Intelligence

To use Microsoft Document Intelligence for conversion:

packages/markitdown-ocr/LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) Microsoft Corporation.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE

packages/markitdown-ocr/README.md

Lines changed: 200 additions & 0 deletions
@@ -0,0 +1,200 @@
# MarkItDown OCR Plugin

LLM Vision plugin for MarkItDown that extracts text from images embedded in PDF, DOCX, PPTX, and XLSX files.

Uses the same `llm_client` / `llm_model` pattern that MarkItDown already supports for image descriptions, so no new ML libraries or binary dependencies are required.

## Features

- **Enhanced PDF Converter**: Extracts text from images within PDFs, with full-page OCR fallback for scanned documents
- **Enhanced DOCX Converter**: OCR for images in Word documents
- **Enhanced PPTX Converter**: OCR for images in PowerPoint presentations
- **Enhanced XLSX Converter**: OCR for images in Excel spreadsheets
- **Context Preservation**: Maintains document structure and flow when inserting extracted text

## Installation

```bash
pip install markitdown-ocr
```

The plugin uses whatever OpenAI-compatible client you already have. Install one if you don't have it yet:

```bash
pip install openai
```

## Usage

### Command Line

```bash
markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
```

### Python API

Pass `llm_client` and `llm_model` to `MarkItDown()` exactly as you would for image descriptions:

```python
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)

result = md.convert("document_with_images.pdf")
print(result.text_content)
```

If no `llm_client` is provided, the plugin still loads, but OCR is silently skipped and the standard built-in converter is used instead.

### Custom Prompt

Override the default extraction prompt for specialized documents:

```python
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
    llm_prompt="Extract all text from this image, preserving table structure.",
)
```

### Any OpenAI-Compatible Client

Works with any client that follows the OpenAI API:

```python
from markitdown import MarkItDown
from openai import AzureOpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=AzureOpenAI(
        api_key="...",
        azure_endpoint="https://your-resource.openai.azure.com/",
        api_version="2024-02-01",
    ),
    llm_model="gpt-4o",
)
```

## How It Works

When `MarkItDown(enable_plugins=True, llm_client=..., llm_model=...)` is called:

1. MarkItDown discovers the plugin via the `markitdown.plugin` entry point group
2. It calls `register_converters()`, forwarding all kwargs including `llm_client` and `llm_model`
3. The plugin creates an `LLMVisionOCRService` from those kwargs
4. Four OCR-enhanced converters are registered at **priority -1.0**, ahead of the built-in converters at priority 0.0
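
The effect of registering at priority -1.0 can be illustrated with a small, self-contained sketch. It does not use MarkItDown's real registry: `Registration` and `pick_converter` are hypothetical names, and the only assumption carried over from the steps above is that converters with lower priority values are tried first.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Registration:
    """Hypothetical registry entry: converter name, priority, and an accept check."""
    name: str
    priority: float
    accepts: Callable[[str], bool]


def pick_converter(registry: List[Registration], filename: str) -> Optional[str]:
    """Return the first converter, in ascending priority order, that accepts the file."""
    for reg in sorted(registry, key=lambda r: r.priority):
        if reg.accepts(filename):
            return reg.name
    return None


# The built-in converter is registered first, the OCR variant second,
# yet the OCR variant wins because -1.0 sorts ahead of 0.0.
registry = [
    Registration("PdfConverter", 0.0, lambda f: f.endswith(".pdf")),
    Registration("PdfConverterWithOCR", -1.0, lambda f: f.endswith(".pdf")),
]

print(pick_converter(registry, "scan.pdf"))  # PdfConverterWithOCR
```

Under this model the plugin never has to remove or patch a built-in converter; it only needs to sort ahead of it.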

When a file is converted:

1. The OCR converter accepts the file
2. It extracts embedded images from the document
3. Each image is sent to the LLM with an extraction prompt
4. The returned text is inserted inline, preserving document structure
5. If the LLM call fails, conversion continues without that image's text
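
Steps 3 to 5 can be sketched as a loop that tolerates per-image failures. This is an illustration under stated assumptions, not the plugin's actual code: `ocr_images` and `fake_llm` are hypothetical, and the extractor is passed in as a plain callable so the example runs without an API key.

```python
import warnings
from typing import Callable, Iterable, List


def ocr_images(images: Iterable[bytes], extract_text: Callable[[bytes], str]) -> List[str]:
    """Run OCR on each image; a failed call produces a warning, not an aborted conversion."""
    blocks: List[str] = []
    for index, image in enumerate(images):
        try:
            text = extract_text(image).strip()
        except Exception as exc:  # stand-in for an LLM or network error
            warnings.warn(f"OCR failed for image {index}: {exc}")
            continue
        if text:
            blocks.append(f"*[Image OCR]\n{text}\n[End OCR]*")
    return blocks


def fake_llm(image: bytes) -> str:
    """Hypothetical extractor used only for this demo."""
    if image == b"bad":
        raise RuntimeError("simulated API error")
    return image.decode("ascii").upper()


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = ocr_images([b"invoice 42", b"bad", b"total due"], fake_llm)

print(len(result), "blocks,", len(caught), "warning")  # 2 blocks, 1 warning
```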

## Supported File Formats

### PDF

- Embedded images are extracted by position (via `page.images` / page XObjects) and OCR'd inline, interleaved with the surrounding text in vertical reading order.
- **Scanned PDFs** (pages with no extractable text) are detected automatically: each page is rendered at 300 DPI and sent to the LLM as a full-page image.
- **Malformed PDFs** that pdfplumber/pdfminer cannot open (e.g. truncated EOF) are retried with PyMuPDF page rendering, so content is still recovered.
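
A minimal sketch of the first two bullets, using plain dictionaries in place of real PDF objects. The `top` key (distance from the top of the page) and the helper names `interleave_by_top` and `looks_scanned` are assumptions made for illustration; the plugin's own data structures may differ.

```python
from typing import Dict, List


def interleave_by_top(text_lines: List[Dict], image_blocks: List[Dict]) -> str:
    """Merge text lines and OCR blocks by vertical position (smaller `top` = higher on the page)."""
    items = [(line["top"], line["text"]) for line in text_lines]
    items += [
        (img["top"], f"*[Image OCR]\n{img['ocr_text']}\n[End OCR]*")
        for img in image_blocks
    ]
    items.sort(key=lambda item: item[0])
    return "\n".join(content for _, content in items)


def looks_scanned(page_text: str) -> bool:
    """Treat a page with no extractable text as a scan that needs full-page OCR."""
    return not page_text.strip()


page_text_lines = [
    {"top": 50.0, "text": "Quarterly report"},
    {"top": 400.0, "text": "See the chart above."},
]
page_images = [{"top": 120.0, "ocr_text": "Revenue: 1.2M"}]

merged = interleave_by_top(page_text_lines, page_images)
print(merged)
```

The OCR block lands between the two text lines because its `top` value falls between theirs.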

### DOCX

- Images are extracted via document part relationships (`doc.part.rels`).
- OCR runs before the DOCX→HTML→Markdown pipeline: placeholder tokens are injected into the HTML so that the Markdown converter does not escape the OCR markers, and after conversion the placeholders are replaced with the formatted `*[Image OCR]...[End OCR]*` blocks.
- Document flow (headings, paragraphs, tables) is fully preserved around the OCR blocks.
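
The placeholder technique can be shown with a toy converter. Everything here is a simplified assumption: `escape_markdown` stands in for the real HTML-to-Markdown step, and the token format `OCRPLACEHOLDER<n>X` is hypothetical. The point is that an alphanumeric token survives escaping, while the literal `*[Image OCR]` marker would not.

```python
import re
from typing import Dict, List, Tuple


def escape_markdown(text: str) -> str:
    """Toy stand-in for a Markdown converter that escapes special characters."""
    return re.sub(r"([*\[\]_])", r"\\\1", text)


def make_placeholders(ocr_texts: List[str]) -> Tuple[List[str], Dict[str, str]]:
    """Create one alphanumeric token per OCR result and map it to the final block."""
    tokens: List[str] = []
    replacements: Dict[str, str] = {}
    for i, text in enumerate(ocr_texts):
        token = f"OCRPLACEHOLDER{i}X"  # trailing X keeps token 1 from matching inside token 10
        tokens.append(token)
        replacements[token] = f"*[Image OCR]\n{text}\n[End OCR]*"
    return tokens, replacements


tokens, replacements = make_placeholders(["Total: 42"])
body = f"Intro with *emphasis*\n\n{tokens[0]}\n\nClosing text"

converted = escape_markdown(body)          # tokens pass through untouched
for token, block in replacements.items():  # swap in the formatted blocks last
    converted = converted.replace(token, block)

print(converted)
```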

### PPTX

- Picture shapes, placeholder shapes with images, and images inside groups are all supported.
- Shapes are processed in reading order per slide: top to bottom, then left to right.
- If an `llm_client` is configured, the LLM is asked for a description first; OCR is used as the fallback when no description is returned.
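
One plausible reading of the shape ordering is a sort on vertical offset with horizontal offset as the tie-breaker. The sketch below assumes exactly that; `Shape` is a hypothetical stand-in for a slide shape, not a python-pptx class.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Shape:
    """Hypothetical stand-in for a slide shape; offsets are in arbitrary units."""
    name: str
    top: int
    left: int


def reading_order(shapes: List[Shape]) -> List[Shape]:
    """Order shapes top-to-bottom, breaking ties left-to-right."""
    return sorted(shapes, key=lambda s: (s.top, s.left))


shapes = [
    Shape("footer", top=900, left=100),
    Shape("right image", top=200, left=600),
    Shape("title", top=50, left=100),
    Shape("left image", top=200, left=100),
]

ordered = [s.name for s in reading_order(shapes)]
print(ordered)  # ['title', 'left image', 'right image', 'footer']
```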

### XLSX

- Images embedded in worksheets (`sheet._images`) are extracted per sheet.
- Cell position is calculated from the image anchor coordinates (column/row → Excel letter notation).
- Images are listed under a `### Images in this sheet:` section after the sheet's data table; they are not interleaved into the table rows.
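
The column/row to letter-notation step is a bijective base-26 conversion. The sketch below assumes 0-based anchor indices (treat that as an assumption about the anchor data, not a documented guarantee); `column_letter` and `cell_reference` are illustrative names.

```python
def column_letter(col_index: int) -> str:
    """Convert a 0-based column index to Excel letters (0 -> A, 25 -> Z, 26 -> AA)."""
    if col_index < 0:
        raise ValueError("column index must be non-negative")
    letters = ""
    n = col_index + 1  # switch to 1-based for bijective base-26
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def cell_reference(col_index: int, row_index: int) -> str:
    """Build a reference such as 'C5' from 0-based column and row indices."""
    return f"{column_letter(col_index)}{row_index + 1}"


print(cell_reference(2, 4))   # C5
print(cell_reference(26, 0))  # AA1
```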

### Output format

Every extracted OCR block is wrapped as:

```text
*[Image OCR]
<extracted text>
[End OCR]*
```
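
Because the wrapper is fixed, downstream code can locate OCR blocks in the converted Markdown with a regular expression. This is a sketch for consumers of the output; `wrap_ocr` and `extract_ocr_blocks` are hypothetical helpers, not functions exported by the plugin.

```python
import re
from typing import List

# Non-greedy match between the opening and closing markers; DOTALL lets
# the extracted text span several lines.
OCR_BLOCK = re.compile(r"\*\[Image OCR\]\n(.*?)\n\[End OCR\]\*", re.DOTALL)


def wrap_ocr(text: str) -> str:
    """Wrap extracted text in the block format shown above."""
    return f"*[Image OCR]\n{text.strip()}\n[End OCR]*"


def extract_ocr_blocks(markdown: str) -> List[str]:
    """Return the text inside every OCR block, in document order."""
    return OCR_BLOCK.findall(markdown)


doc = "Intro\n\n" + wrap_ocr("Line one\nLine two") + "\n\nOutro\n\n" + wrap_ocr("Second image")
blocks = extract_ocr_blocks(doc)
print(blocks)  # ['Line one\nLine two', 'Second image']
```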

## Troubleshooting

### OCR text missing from output

The most likely cause is a missing `llm_client` or `llm_model`. Verify:

```python
from openai import OpenAI
from markitdown import MarkItDown

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),  # required
    llm_model="gpt-4o",  # required
)
```

### Plugin not loading

Confirm the plugin is installed and discovered:

```bash
markitdown --list-plugins  # should show: ocr
```

### API errors

The plugin reports LLM API errors as warnings and continues the conversion. Check your API key, quota, and that the chosen model supports vision inputs.

## Development

### Running Tests

```bash
cd packages/markitdown-ocr
pytest tests/ -v
```

### Building from Source

```bash
git clone https://github.com/microsoft/markitdown.git
cd markitdown/packages/markitdown-ocr
pip install -e .
```

## Contributing

Contributions are welcome! See the [MarkItDown repository](https://github.com/microsoft/markitdown) for guidelines.

## License

MIT. See [LICENSE](LICENSE).

## Changelog

### 0.1.0 (Initial Release)

- LLM Vision OCR for PDF, DOCX, PPTX, XLSX
- Full-page OCR fallback for scanned PDFs
- Context-aware inline text insertion
- Priority-based converter replacement (no code changes required)
Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "markitdown-ocr"
dynamic = ["version"]
description = 'OCR plugin for MarkItDown - Extracts text from images in PDF, DOCX, PPTX, and XLSX via LLM Vision'
readme = "README.md"
requires-python = ">=3.10"
license = "MIT"
keywords = ["markitdown", "ocr", "pdf", "docx", "xlsx", "pptx", "llm", "vision"]
authors = [
  { name = "Contributors", email = "noreply@github.com" },
]
classifiers = [
  "Development Status :: 4 - Beta",
  "Programming Language :: Python",
  "Programming Language :: Python :: 3.10",
  "Programming Language :: Python :: 3.11",
  "Programming Language :: Python :: 3.12",
  "Programming Language :: Python :: 3.13",
  "Programming Language :: Python :: Implementation :: CPython",
]

# Core dependencies: these match the file-format libraries markitdown already uses
dependencies = [
  "markitdown>=0.1.0",
  "pdfminer.six>=20251230",
  "pdfplumber>=0.11.9",
  "PyMuPDF>=1.24.0",
  "mammoth~=1.11.0",
  "python-docx",
  "python-pptx",
  "pandas",
  "openpyxl",
  "Pillow>=9.0.0",
]

# llm_client is passed in by the user (same as for markitdown image descriptions);
# install openai or any OpenAI-compatible SDK separately.
[project.optional-dependencies]
llm = [
  "openai>=1.0.0",
]

[project.urls]
Documentation = "https://github.com/microsoft/markitdown#readme"
Issues = "https://github.com/microsoft/markitdown/issues"
Source = "https://github.com/microsoft/markitdown"

[tool.hatch.version]
path = "src/markitdown_ocr/__about__.py"

# CRITICAL: Plugin entry point - MarkItDown will discover this plugin through this entry point
[project.entry-points."markitdown.plugin"]
ocr = "markitdown_ocr"
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
# SPDX-FileCopyrightText: 2025-present Contributors
# SPDX-License-Identifier: MIT

__version__ = "0.1.0"
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
# SPDX-FileCopyrightText: 2025-present Contributors
# SPDX-License-Identifier: MIT

"""
markitdown-ocr: OCR plugin for MarkItDown

Adds LLM Vision-based text extraction from images embedded in PDF, DOCX, PPTX, and XLSX files.
"""

from ._plugin import __plugin_interface_version__, register_converters
from .__about__ import __version__
from ._ocr_service import (
    OCRResult,
    LLMVisionOCRService,
)
from ._pdf_converter_with_ocr import PdfConverterWithOCR
from ._docx_converter_with_ocr import DocxConverterWithOCR
from ._pptx_converter_with_ocr import PptxConverterWithOCR
from ._xlsx_converter_with_ocr import XlsxConverterWithOCR

__all__ = [
    "__version__",
    "__plugin_interface_version__",
    "register_converters",
    "OCRResult",
    "LLMVisionOCRService",
    "PdfConverterWithOCR",
    "DocxConverterWithOCR",
    "PptxConverterWithOCR",
    "XlsxConverterWithOCR",
]
