diff --git a/README.md b/README.md
Lines changed: 33 additions & 1 deletion
diff --git a/packages/markitdown-ocr/LICENSE b/packages/markitdown-ocr/LICENSE
Lines changed: 21 additions & 0 deletions
diff --git a/packages/markitdown-ocr/README.md b/packages/markitdown-ocr/README.md
Lines changed: 200 additions & 0 deletions
diff --git a/packages/markitdown-ocr/pyproject.toml b/packages/markitdown-ocr/pyproject.toml
Lines changed: 57 additions & 0 deletions
diff --git a/packages/markitdown-ocr/src/markitdown_ocr/__about__.py b/packages/markitdown-ocr/src/markitdown_ocr/__about__.py
Lines changed: 4 additions & 0 deletions
diff --git a/packages/markitdown-ocr/src/markitdown_ocr/__init__.py b/packages/markitdown-ocr/src/markitdown_ocr/__init__.py
Lines changed: 31 additions & 0 deletions
@@ -9,7 +9,7 @@
 
 > [!IMPORTANT]
 > Breaking changes between 0.0.1 to 0.1.0:
-> * Dependencies are now organized into optional feature-groups (further details below). Use `pip install 'markitdown[all]'` to have backward-compatible behavior. 
+> * Dependencies are now organized into optional feature-groups (further details below). Use `pip install 'markitdown[all]'` to have backward-compatible behavior.
 > * convert\_stream() now requires a binary file-like object (e.g., a file opened in binary mode, or an io.BytesIO object). This is a breaking change from the previous version, where it previously also accepted text file-like objects, like io.StringIO.
 > * The DocumentConverter class interface has changed to read from file-like streams rather than file paths. *No temporary files are created anymore*. If you are the maintainer of a plugin, or custom DocumentConverter, you likely need to update your code. Otherwise, if only using the MarkItDown class or CLI (as in these examples), you should not need to change anything.
 
@@ -132,6 +132,38 @@ markitdown --use-plugins path-to-file.pdf
 
 To find available plugins, search GitHub for the hashtag `#markitdown-plugin`. To develop a plugin, see `packages/markitdown-sample-plugin`.
 
+#### markitdown-ocr Plugin
+
+The `markitdown-ocr` plugin adds OCR support to PDF, DOCX, PPTX, and XLSX converters, extracting text from embedded images using LLM Vision — the same `llm_client` / `llm_model` pattern that MarkItDown already uses for image descriptions. No new ML libraries or binary dependencies required.
+
+**Installation:**
+
+```bash
+pip install markitdown-ocr
+pip install openai  # or any OpenAI-compatible client
+```
+
+**Usage:**
+
+Pass the same `llm_client` and `llm_model` you would use for image descriptions:
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+md = MarkItDown(
+    enable_plugins=True,
+    llm_client=OpenAI(),
+    llm_model="gpt-4o",
+)
+result = md.convert("document_with_images.pdf")
+print(result.text_content)
+```
+
+If no `llm_client` is provided, the plugin still loads, but OCR is silently skipped and the standard built-in converter is used instead.
+
+See [`packages/markitdown-ocr/README.md`](packages/markitdown-ocr/README.md) for detailed documentation.
+
 ### Azure Document Intelligence
 
 To use Microsoft Document Intelligence for conversion:
 
@@ -0,0 +1,21 @@
+    MIT License
+
+    Copyright (c) Microsoft Corporation.
+
+    Permission is hereby granted, free of charge, to any person obtaining a copy
+    of this software and associated documentation files (the "Software"), to deal
+    in the Software without restriction, including without limitation the rights
+    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+    copies of the Software, and to permit persons to whom the Software is
+    furnished to do so, subject to the following conditions:
+
+    The above copyright notice and this permission notice shall be included in all
+    copies or substantial portions of the Software.
+
+    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+    SOFTWARE
@@ -0,0 +1,200 @@
+# MarkItDown OCR Plugin
+
+LLM Vision plugin for MarkItDown that extracts text from images embedded in PDF, DOCX, PPTX, and XLSX files.
+
+Uses the same `llm_client` / `llm_model` pattern that MarkItDown already supports for image descriptions — no new ML libraries or binary dependencies required.
+
+## Features
+
+- **Enhanced PDF Converter**: Extracts text from images within PDFs, with full-page OCR fallback for scanned documents
+- **Enhanced DOCX Converter**: OCR for images in Word documents
+- **Enhanced PPTX Converter**: OCR for images in PowerPoint presentations
+- **Enhanced XLSX Converter**: OCR for images in Excel spreadsheets
+- **Context Preservation**: Maintains document structure and flow when inserting extracted text
+
+## Installation
+
+```bash
+pip install markitdown-ocr
+```
+
+The plugin uses whatever OpenAI-compatible client you already have. Install one if you don't have it yet:
+
+```bash
+pip install openai
+```
+
+## Usage
+
+### Command Line
+
+```bash
+markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
+```
+
+### Python API
+
+Pass `llm_client` and `llm_model` to `MarkItDown()` exactly as you would for image descriptions:
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+md = MarkItDown(
+    enable_plugins=True,
+    llm_client=OpenAI(),
+    llm_model="gpt-4o",
+)
+
+result = md.convert("document_with_images.pdf")
+print(result.text_content)
+```
+
+If no `llm_client` is provided, the plugin still loads, but OCR is silently skipped and the standard built-in converter is used instead.
+
+### Custom Prompt
+
+Override the default extraction prompt for specialized documents:
+
+```python
+md = MarkItDown(
+    enable_plugins=True,
+    llm_client=OpenAI(),
+    llm_model="gpt-4o",
+    llm_prompt="Extract all text from this image, preserving table structure.",
+)
+```
+
+### Any OpenAI-Compatible Client
+
+Works with any client that follows the OpenAI API:
+
+```python
+from openai import AzureOpenAI
+
+md = MarkItDown(
+    enable_plugins=True,
+    llm_client=AzureOpenAI(
+        api_key="...",
+        azure_endpoint="https://your-resource.openai.azure.com/",
+        api_version="2024-02-01",
+    ),
+    llm_model="gpt-4o",
+)
+```
+
+## How It Works
+
+When `MarkItDown(enable_plugins=True, llm_client=..., llm_model=...)` is called:
+
+1. MarkItDown discovers the plugin via the `markitdown.plugin` entry point group
+2. It calls `register_converters()`, forwarding all kwargs including `llm_client` and `llm_model`
+3. The plugin creates an `LLMVisionOCRService` from those kwargs
+4. Four OCR-enhanced converters are registered at **priority -1.0** — before the built-in converters at priority 0.0
+
+When a file is converted:
+
+1. The OCR converter accepts the file
+2. It extracts embedded images from the document
+3. Each image is sent to the LLM with an extraction prompt
+4. The returned text is inserted inline, preserving document structure
+5. If the LLM call fails, conversion continues without that image's text
+
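The registration and dispatch steps above can be pictured with a small stand-in registry. This is a sketch under stated assumptions, not MarkItDown's actual implementation: `ToyConverter`, `ToyRegistry`, and their methods are hypothetical names introduced for illustration. The only details taken from the text are that the OCR converters register at priority -1.0, that the built-ins sit at 0.0, and that lower values are tried first.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyConverter:
    """Hypothetical stand-in for a converter; not a MarkItDown class."""
    name: str
    extensions: tuple

    def accepts(self, filename: str) -> bool:
        return filename.lower().endswith(self.extensions)


@dataclass
class ToyRegistry:
    """Hypothetical registry that tries lower priority values first."""
    _entries: list = field(default_factory=list)

    def register(self, converter: ToyConverter, priority: float = 0.0) -> None:
        # Store the insertion index so equal priorities keep a stable order.
        self._entries.append((priority, len(self._entries), converter))

    def pick(self, filename: str) -> Optional[ToyConverter]:
        # Walk converters from lowest to highest priority value.
        for _, _, converter in sorted(self._entries, key=lambda e: (e[0], e[1])):
            if converter.accepts(filename):
                return converter
        return None


registry = ToyRegistry()
registry.register(ToyConverter("builtin-pdf", (".pdf",)), priority=0.0)
registry.register(ToyConverter("ocr-pdf", (".pdf",)), priority=-1.0)

chosen = registry.pick("report.pdf")
print(chosen.name if chosen else None)
```

Even though the built-in converter was registered first, the toy registry picks `ocr-pdf` because -1.0 sorts ahead of 0.0.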
+## Supported File Formats
+
+### PDF
+
+- Embedded images are extracted by position (via `page.images` / page XObjects) and OCR'd inline, interleaved with the surrounding text in vertical reading order.
+- **Scanned PDFs** (pages with no extractable text) are detected automatically: each page is rendered at 300 DPI and sent to the LLM as a full-page image.
+- **Malformed PDFs** that pdfplumber/pdfminer cannot open (e.g. truncated EOF) are retried with PyMuPDF page rendering, so content is still recovered.
+
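The two PDF behaviours described above (inline ordering by vertical position, and scanned-page detection) can be sketched without any PDF library. The names `PageItem`, `interleave`, and `page_needs_full_ocr` are hypothetical, and the `top` offsets are made-up values; the plugin's real data structures may differ.

```python
from dataclasses import dataclass


@dataclass
class PageItem:
    """Hypothetical record: a text line or an OCR block with its top offset."""
    top: float      # distance from the top of the page
    content: str


def page_needs_full_ocr(extracted_text: str) -> bool:
    """Treat a page as scanned when it has no extractable text."""
    return not extracted_text.strip()


def interleave(text_lines: list, ocr_blocks: list) -> str:
    """Merge text lines and OCR blocks in top-to-bottom order."""
    ordered = sorted(text_lines + ocr_blocks, key=lambda item: item.top)
    return "\n".join(item.content for item in ordered)


text_lines = [
    PageItem(50.0, "Quarterly report"),
    PageItem(400.0, "See chart above."),
]
ocr_blocks = [PageItem(120.0, "*[Image OCR]\nRevenue: 42\n[End OCR]*")]

merged = interleave(text_lines, ocr_blocks)
print(merged)
print(page_needs_full_ocr("   \n"))
```

The OCR block lands between the two text lines because its `top` value falls between theirs, and a whitespace-only page is flagged for full-page OCR.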
+### DOCX
+
+- Images are extracted via document part relationships (`doc.part.rels`).
+- OCR runs before the DOCX→HTML→Markdown pipeline executes. Placeholder tokens are injected into the HTML so that the Markdown converter does not escape the OCR markers; after conversion, the placeholders are replaced with the formatted `*[Image OCR]...[End OCR]*` blocks.
+- Document flow (headings, paragraphs, tables) is fully preserved around the OCR blocks.
+
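The placeholder technique described for DOCX can be illustrated in a few lines. This is a minimal sketch, not the plugin's code: `escape_markdown` is a stand-in for a Markdown converter that backslash-escapes special characters, and the `<img rId5>` marker and `OCRPLACEHOLDER…X` token format are hypothetical.

```python
import re


def escape_markdown(text: str) -> str:
    """Stand-in converter step: backslash-escape Markdown special characters."""
    return re.sub(r"([*\[\]_])", r"\\\1", text)


def format_ocr_block(ocr_text: str) -> str:
    """Wrap extracted text in the OCR block markers."""
    return f"*[Image OCR]\n{ocr_text}\n[End OCR]*"


def convert(body: str, ocr_text_by_image: dict) -> str:
    blocks = {}
    # 1. Swap each image marker for an alphanumeric token the escaper ignores.
    for index, (image_id, ocr_text) in enumerate(ocr_text_by_image.items()):
        token = f"OCRPLACEHOLDER{index}X"
        body = body.replace(f"<img {image_id}>", token)
        blocks[token] = format_ocr_block(ocr_text)
    # 2. Run the stand-in conversion, which escapes special characters.
    markdown = escape_markdown(body)
    # 3. Replace the tokens with the final blocks so their markers stay intact.
    for token, block in blocks.items():
        markdown = markdown.replace(token, block)
    return markdown


result = convert("Intro *text*\n<img rId5>\nOutro", {"rId5": "Total: [42]"})
print(result)
```

Inserting the OCR block before step 2 would have escaped its `*[` and `]*` markers; deferring the substitution until after conversion leaves them untouched.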
+### PPTX
+
+- Picture shapes, placeholder shapes with images, and images inside groups are all supported.
+- Shapes are processed in reading order per slide: top-to-bottom, then left-to-right.
+- If an `llm_client` is configured, the LLM is asked for a description first; OCR is used as the fallback when no description is returned.
+
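One plausible reading of the per-slide shape ordering is a sort on the top edge with the left edge as a tie-breaker. The sketch below assumes that interpretation; `ShapeBox` is a hypothetical stand-in for a slide shape and the coordinates are invented.

```python
from dataclasses import dataclass


@dataclass
class ShapeBox:
    """Hypothetical stand-in for a slide shape's name and position."""
    name: str
    top: int
    left: int


def reading_order(shapes: list) -> list:
    """Sort by top edge first, then by left edge to break ties."""
    return sorted(shapes, key=lambda s: (s.top, s.left))


shapes = [
    ShapeBox("caption", top=500, left=100),
    ShapeBox("right-picture", top=100, left=600),
    ShapeBox("left-picture", top=100, left=50),
]
ordered_names = [s.name for s in reading_order(shapes)]
print(ordered_names)
```

The two pictures share a top edge, so the left edge decides between them, and the caption comes last.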
+### XLSX
+
+- Images embedded in worksheets (`sheet._images`) are extracted per sheet.
+- Cell position is calculated from the image anchor coordinates (column/row → Excel letter notation).
+- Images are listed under a `### Images in this sheet:` section after the sheet's data table — they are not interleaved into the table rows.
+
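The column/row to Excel-notation step can be sketched as a bijective base-26 conversion. This assumes the anchor supplies 0-based column and row indices; `column_letter` and `cell_reference` are hypothetical helper names, not functions from the plugin.

```python
def column_letter(col_index: int) -> str:
    """Convert a 0-based column index to Excel letters (0 -> A, 26 -> AA)."""
    if col_index < 0:
        raise ValueError("column index must be non-negative")
    letters = ""
    n = col_index + 1  # switch to 1-based for the bijective base-26 conversion
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def cell_reference(col_index: int, row_index: int) -> str:
    """Build an A1-style reference from 0-based column and row indices."""
    return f"{column_letter(col_index)}{row_index + 1}"


print(cell_reference(0, 0))
print(cell_reference(26, 4))
```

The `n - 1` inside `divmod` is what distinguishes this from ordinary base-26: Excel columns have no zero digit, so `Z` is followed by `AA` rather than `BA`.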
+### Output format
+
+Every extracted OCR block is wrapped as:
+
+```text
+*[Image OCR]
+<extracted text>
+[End OCR]*
+```
+
+## Troubleshooting
+
+### OCR text missing from output
+
+The most likely cause is a missing `llm_client` or `llm_model`. Verify:
+
+```python
+from openai import OpenAI
+from markitdown import MarkItDown
+
+md = MarkItDown(
+    enable_plugins=True,
+    llm_client=OpenAI(),   # required
+    llm_model="gpt-4o",    # required
+)
+```
+
+### Plugin not loading
+
+Confirm the plugin is installed and discovered:
+
+```bash
+markitdown --list-plugins   # should show: ocr
+```
+
+### API errors
+
+The plugin propagates LLM API errors as warnings and continues conversion. Check your API key, quota, and that the chosen model supports vision inputs.
+
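The warn-and-continue behaviour can be sketched with the standard `warnings` module. Here `flaky_ocr` is a hypothetical stand-in for the LLM call that fails for one image; the plugin's actual error handling and warning text may differ.

```python
import warnings


def flaky_ocr(image_name: str) -> str:
    """Stand-in for an LLM call; fails for one image to simulate an API error."""
    if image_name == "broken.png":
        raise RuntimeError("simulated API error")
    return f"text from {image_name}"


def ocr_all(image_names: list) -> list:
    """OCR each image, warning on failure and skipping that image's text."""
    results = []
    for name in image_names:
        try:
            results.append(flaky_ocr(name))
        except Exception as exc:  # broad on purpose: any client error is non-fatal
            warnings.warn(f"OCR failed for {name}: {exc}")
    return results


# Record warnings so they can be inspected instead of printed to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    texts = ocr_all(["a.png", "broken.png", "b.png"])

print(texts)
print(len(caught))
```

Running with `python -W error` turns such warnings into exceptions, which can help when tracking down a silently missing OCR block.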
+## Development
+
+### Running Tests
+
+```bash
+cd packages/markitdown-ocr
+pytest tests/ -v
+```
+
+### Building from Source
+
+```bash
+git clone https://github.com/microsoft/markitdown.git
+cd markitdown/packages/markitdown-ocr
+pip install -e .
+```
+
+## Contributing
+
+Contributions are welcome! See the [MarkItDown repository](https://github.com/microsoft/markitdown) for guidelines.
+
+## License
+
+MIT — see [LICENSE](LICENSE).
+
+## Changelog
+
+### 0.1.0 (Initial Release)
+
+- LLM Vision OCR for PDF, DOCX, PPTX, XLSX
+- Full-page OCR fallback for scanned PDFs
+- Context-aware inline text insertion
+- Priority-based converter replacement (no code changes required)
@@ -0,0 +1,57 @@
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+
+[project]
+name = "markitdown-ocr"
+dynamic = ["version"]
+description = 'OCR plugin for MarkItDown - Extracts text from images in PDF, DOCX, PPTX, and XLSX via LLM Vision'
+readme = "README.md"
+requires-python = ">=3.10"
+license = "MIT"
+keywords = ["markitdown", "ocr", "pdf", "docx", "xlsx", "pptx", "llm", "vision"]
+authors = [
+  { name = "Contributors", email = "noreply@github.com" },
+]
+classifiers = [
+  "Development Status :: 4 - Beta",
+  "Programming Language :: Python",
+  "Programming Language :: Python :: 3.10",
+  "Programming Language :: Python :: 3.11",
+  "Programming Language :: Python :: 3.12",
+  "Programming Language :: Python :: 3.13",
+  "Programming Language :: Python :: Implementation :: CPython",
+]
+
+# Core dependencies — matches the file-format libraries markitdown already uses
+dependencies = [
+  "markitdown>=0.1.0",
+  "pdfminer.six>=20251230",
+  "pdfplumber>=0.11.9",
+  "PyMuPDF>=1.24.0",
+  "mammoth~=1.11.0",
+  "python-docx",
+  "python-pptx",
+  "pandas",
+  "openpyxl",
+  "Pillow>=9.0.0",
+]
+
+# llm_client is passed in by the user (same as for markitdown image descriptions);
+# install openai or any OpenAI-compatible SDK separately.
+[project.optional-dependencies]
+llm = [
+  "openai>=1.0.0",
+]
+
+[project.urls]
+Documentation = "https://github.com/microsoft/markitdown#readme"
+Issues = "https://github.com/microsoft/markitdown/issues"
+Source = "https://github.com/microsoft/markitdown"
+
+[tool.hatch.version]
+path = "src/markitdown_ocr/__about__.py"
+
+# CRITICAL: Plugin entry point - MarkItDown will discover this plugin through this entry point
+[project.entry-points."markitdown.plugin"]
+ocr = "markitdown_ocr"
@@ -0,0 +1,4 @@
+# SPDX-FileCopyrightText: 2025-present Contributors
+# SPDX-License-Identifier: MIT
+
+__version__ = "0.1.0"
@@ -0,0 +1,31 @@
+# SPDX-FileCopyrightText: 2025-present Contributors
+# SPDX-License-Identifier: MIT
+
+"""
+markitdown-ocr: OCR plugin for MarkItDown
+
+Adds LLM Vision-based text extraction from images embedded in PDF, DOCX, PPTX, and XLSX files.
+"""
+
+from ._plugin import __plugin_interface_version__, register_converters
+from .__about__ import __version__
+from ._ocr_service import (
+    OCRResult,
+    LLMVisionOCRService,
+)
+from ._pdf_converter_with_ocr import PdfConverterWithOCR
+from ._docx_converter_with_ocr import DocxConverterWithOCR
+from ._pptx_converter_with_ocr import PptxConverterWithOCR
+from ._xlsx_converter_with_ocr import XlsxConverterWithOCR
+
+__all__ = [
+    "__version__",
+    "__plugin_interface_version__",
+    "register_converters",
+    "OCRResult",
+    "LLMVisionOCRService",
+    "PdfConverterWithOCR",
+    "DocxConverterWithOCR",
+    "PptxConverterWithOCR",
+    "XlsxConverterWithOCR",
+]