Commit 8158a1c

leseb and claude authored
feat(file_processors): add inline::auto composite file processor (#5673)
# What does this PR do?

Adds a new `inline::auto` composite file processor that dispatches to the appropriate backend based on file MIME type. Currently routes PDF and text files to the built-in PyPDF processor and rejects unsupported formats with a clear 422 error listing supported types. The architecture is extensible for adding additional format backends (e.g. docling) in the future.

Switches the starter and ci-tests distributions from `inline::pypdf` to `inline::auto` as the default file processor. Admins who want direct control over which formats are processed can still configure a specific provider (`inline::pypdf`, `inline::docling`, `remote::docling-serve`) directly.

Builds on #5670 which fixed pypdf's silent fallback for unsupported formats.

## Test Plan

Run the file processor unit tests:

```bash
uv run pytest tests/unit/providers/file_processor/ -v
```

Output:

```
tests/unit/providers/file_processor/test_auto.py::test_routes_pdf_to_pypdf PASSED
tests/unit/providers/file_processor/test_auto.py::test_routes_text_to_pypdf PASSED
tests/unit/providers/file_processor/test_auto.py::test_routes_csv_to_pypdf PASSED
tests/unit/providers/file_processor/test_auto.py::test_routes_markdown_to_pypdf PASSED
tests/unit/providers/file_processor/test_auto.py::test_rejects_docx_with_422 PASSED
tests/unit/providers/file_processor/test_auto.py::test_rejects_pptx_with_422 PASSED
tests/unit/providers/file_processor/test_auto.py::test_rejects_xlsx_with_422 PASSED
tests/unit/providers/file_processor/test_auto.py::test_error_message_lists_supported_types PASSED
tests/unit/providers/file_processor/test_auto.py::test_error_message_includes_mime_type PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_rejects_docx_with_422 PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_rejects_pptx_with_422 PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_rejects_xlsx_with_422 PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_allows_pdf PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_allows_text_files PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_allows_csv_files PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_allows_markdown_files PASSED
35 passed in 0.64s
```

---------

Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
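The "clear 422 error" mentioned in the description can be illustrated in isolation. The snippet below is a minimal sketch, not the project's code: `unsupported_detail` is a hypothetical helper name, while the constant and the message wording are copied from the `auto.py` file in this commit. It shows the two properties that the `test_error_message_*` tests appear to check, namely that the message names the offending MIME type and lists the supported types.

```python
from typing import Optional

# Copied from auto.py in this commit.
SUPPORTED_TEXT_DESCRIPTION = "PDF and text files (txt, csv, md, etc.)"


def unsupported_detail(mime_type: Optional[str]) -> str:
    """Build the 422 detail string (hypothetical helper, wording from auto.py)."""
    return (
        f"File type '{mime_type or 'unknown'}' is not supported. "
        f"Supported types: {SUPPORTED_TEXT_DESCRIPTION}."
    )


docx = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
print(unsupported_detail(docx))
# A missing MIME type is reported as 'unknown' rather than 'None'.
print(unsupported_detail(None))
```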
1 parent f222f29 commit 8158a1c

14 files changed

Lines changed: 347 additions & 22 deletions

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+---
+description: "Composite file processor that automatically dispatches to the appropriate backend based on file MIME type. Routes PDF and text files to PyPDF. Unsupported formats are rejected with a clear error listing the supported types."
+sidebar_label: Auto
+title: inline::auto
+---
+
+# inline::auto
+
+## Description
+
+Composite file processor that automatically dispatches to the appropriate backend based on file MIME type. Routes PDF and text files to PyPDF. Unsupported formats are rejected with a clear error listing the supported types.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `default_chunk_size_tokens` | `int` | No | 800 | Default chunk size in tokens when chunking_strategy type is 'auto' |
+| `default_chunk_overlap_tokens` | `int` | No | 400 | Default chunk overlap in tokens when chunking_strategy type is 'auto' |
+| `extract_metadata` | `bool` | No | True | Whether to extract PDF metadata (title, author, etc.) |
+| `clean_text` | `bool` | No | True | Whether to clean extracted text (remove extra whitespace, normalize line breaks) |
+
+## Sample Configuration
+
+```yaml
+{}
+```
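As a rough illustration of the configuration table in this file, the sketch below models the four documented fields as a plain dataclass. `AutoConfigSketch` and `load_config` are hypothetical names, not the project's `AutoFileProcessorConfig` (whose definition is not shown in this diff); only the field names and defaults are taken from the table. It shows why an empty `{}` sample configuration is enough: every field falls back to its default.

```python
from dataclasses import asdict, dataclass


@dataclass
class AutoConfigSketch:
    """Hypothetical stand-in; field names and defaults copied from the table."""

    default_chunk_size_tokens: int = 800
    default_chunk_overlap_tokens: int = 400
    extract_metadata: bool = True
    clean_text: bool = True


def load_config(raw: dict) -> AutoConfigSketch:
    # Keys missing from `raw` keep the documented defaults.
    return AutoConfigSketch(**raw)


# The sample configuration is `{}`, so every field takes its default.
defaults = load_config({})
# Overriding one field leaves the others untouched.
custom = load_config({"default_chunk_size_tokens": 512})

print(asdict(defaults))
print(asdict(custom))
```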

src/ogx/distributions/ci-tests/build.yaml

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ distribution_spec:
     files:
     - provider_type: inline::localfs
     file_processors:
-    - provider_type: inline::pypdf
+    - provider_type: inline::auto
     safety:
     - provider_type: inline::llama-guard
     - provider_type: inline::code-scanner

src/ogx/distributions/ci-tests/config.yaml

Lines changed: 2 additions & 2 deletions
@@ -191,8 +191,8 @@ providers:
         table_name: files_metadata
         backend: sql_default
   file_processors:
-  - provider_id: pypdf
-    provider_type: inline::pypdf
+  - provider_id: auto
+    provider_type: inline::auto
   safety:
   - provider_id: llama-guard
     provider_type: inline::llama-guard

src/ogx/distributions/ci-tests/run-with-postgres-store.yaml

Lines changed: 2 additions & 2 deletions
@@ -191,8 +191,8 @@ providers:
         table_name: files_metadata
         backend: sql_default
   file_processors:
-  - provider_id: pypdf
-    provider_type: inline::pypdf
+  - provider_id: auto
+    provider_type: inline::auto
   safety:
   - provider_id: llama-guard
     provider_type: inline::llama-guard

src/ogx/distributions/starter/build.yaml

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ distribution_spec:
     files:
     - provider_type: inline::localfs
     file_processors:
-    - provider_type: inline::pypdf
+    - provider_type: inline::auto
     safety:
     - provider_type: inline::llama-guard
     - provider_type: inline::code-scanner

src/ogx/distributions/starter/config.yaml

Lines changed: 2 additions & 2 deletions
@@ -185,8 +185,8 @@ providers:
         table_name: files_metadata
         backend: sql_default
   file_processors:
-  - provider_id: pypdf
-    provider_type: inline::pypdf
+  - provider_id: auto
+    provider_type: inline::auto
   safety:
   - provider_id: llama-guard
     provider_type: inline::llama-guard

src/ogx/distributions/starter/run-with-postgres-store.yaml

Lines changed: 2 additions & 2 deletions
@@ -185,8 +185,8 @@ providers:
         table_name: files_metadata
         backend: sql_default
   file_processors:
-  - provider_id: pypdf
-    provider_type: inline::pypdf
+  - provider_id: auto
+    provider_type: inline::auto
   safety:
   - provider_id: llama-guard
     provider_type: inline::llama-guard

src/ogx/distributions/starter/starter.py

Lines changed: 5 additions & 5 deletions
@@ -22,7 +22,7 @@
 from ogx.core.storage.sqlstore.sqlstore import PostgresSqlStoreConfig
 from ogx.core.utils.dynamic import instantiate_class_type
 from ogx.distributions.template import DistributionTemplate, RunConfigSettings
-from ogx.providers.inline.file_processor.pypdf.config import PyPDFFileProcessorConfig
+from ogx.providers.inline.file_processor.auto.config import AutoFileProcessorConfig
 from ogx.providers.inline.files.localfs.config import LocalfsFilesImplConfig
 from ogx.providers.inline.inference.sentence_transformers import (
     SentenceTransformersInferenceConfig,
@@ -148,7 +148,7 @@ def get_distribution_template(name: str = "starter") -> DistributionTemplate:
             BuildProvider(provider_type="remote::infinispan"),
         ],
         "files": [BuildProvider(provider_type="inline::localfs")],
-        "file_processors": [BuildProvider(provider_type="inline::pypdf")],
+        "file_processors": [BuildProvider(provider_type="inline::auto")],
         "safety": [
             BuildProvider(provider_type="inline::llama-guard"),
             BuildProvider(provider_type="inline::code-scanner"),
@@ -267,9 +267,9 @@ def get_distribution_template(name: str = "starter") -> DistributionTemplate:
         "files": [files_provider],
         "file_processors": [
             Provider(
-                provider_id="pypdf",
-                provider_type="inline::pypdf",
-                config=PyPDFFileProcessorConfig.sample_run_config(),
+                provider_id="auto",
+                provider_type="inline::auto",
+                config=AutoFileProcessorConfig.sample_run_config(),
             ),
         ],
         "tool_runtime": [
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+# Copyright (c) The OGX Contributors.
+# All rights reserved.
+#
+# This source code is licensed under the terms described in the LICENSE file in
+# the root directory of this source tree.
+
+from typing import Any
+
+from ogx_api import Api
+
+from .config import AutoFileProcessorConfig
+
+
+async def get_provider_impl(config: AutoFileProcessorConfig, deps: dict[Api, Any]):
+    """Get the auto file processor implementation."""
+    from .auto import AutoFileProcessor
+
+    assert isinstance(config, AutoFileProcessorConfig), f"Unexpected config type: {type(config)}"
+
+    files_api = deps[Api.files]
+
+    impl = AutoFileProcessor(config, files_api)
+    return impl
+
+
+__all__ = ["AutoFileProcessorConfig", "get_provider_impl"]
Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
+# Copyright (c) The OGX Contributors.
+# All rights reserved.
+#
+# This source code is licensed under the terms described in the LICENSE file in
+# the root directory of this source tree.
+
+import mimetypes
+
+from fastapi import HTTPException, UploadFile
+
+from ogx.log import get_logger
+from ogx.providers.inline.file_processor.pypdf.config import PyPDFFileProcessorConfig
+from ogx.providers.inline.file_processor.pypdf.pypdf import PyPDFFileProcessor
+from ogx_api.file_processors import ProcessFileRequest, ProcessFileResponse
+from ogx_api.files import RetrieveFileRequest
+
+from .config import AutoFileProcessorConfig
+
+log = get_logger(name=__name__, category="providers::file_processors")
+
+SUPPORTED_TEXT_DESCRIPTION = "PDF and text files (txt, csv, md, etc.)"
+
+
+class AutoFileProcessor:
+    """Composite file processor that dispatches to backends based on MIME type.
+
+    Routes PDF and text files to the built-in PyPDF processor. Unsupported
+    formats are rejected with a 422 error listing the supported types.
+    """
+
+    def __init__(self, config: AutoFileProcessorConfig, files_api) -> None:
+        self.config = config
+        self.files_api = files_api
+
+        pypdf_config = PyPDFFileProcessorConfig(
+            default_chunk_size_tokens=config.default_chunk_size_tokens,
+            default_chunk_overlap_tokens=config.default_chunk_overlap_tokens,
+            extract_metadata=config.extract_metadata,
+            clean_text=config.clean_text,
+        )
+        self.pypdf = PyPDFFileProcessor(pypdf_config, files_api)
+
+    async def process_file(
+        self,
+        request: ProcessFileRequest,
+        file: UploadFile | None = None,
+    ) -> ProcessFileResponse:
+        filename = await self._resolve_filename(request, file)
+        mime_type, _ = mimetypes.guess_type(filename)
+        mime_category = mime_type.split("/")[0] if (mime_type and "/" in mime_type) else None
+
+        if mime_type == "application/pdf" or mime_category == "text":
+            return await self.pypdf.process_file(
+                file=file,
+                file_id=request.file_id,
+                options=request.options,
+                chunking_strategy=request.chunking_strategy,
+            )
+
+        raise HTTPException(
+            status_code=422,
+            detail=(
+                f"File type '{mime_type or 'unknown'}' is not supported. Supported types: {SUPPORTED_TEXT_DESCRIPTION}."
+            ),
+        )
+
+    async def _resolve_filename(self, request: ProcessFileRequest, file: UploadFile | None) -> str:
+        if file is not None:
+            name: str | None = file.filename
+            if name is not None:
+                return name
+        if request.file_id is not None:
+            file_info = await self.files_api.openai_retrieve_file(RetrieveFileRequest(file_id=request.file_id))
+            resolved: str = file_info.filename
+            return resolved
+        return "unknown"
+
+    async def shutdown(self) -> None:
+        pass
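
The routing decision in `process_file` depends only on the resolved filename. The following is a minimal, self-contained sketch of that decision using only the standard library. `route_for` is a hypothetical helper name, and its string return values stand in for the real PyPDF call and the `HTTPException`. Note that `mimetypes.guess_type` infers the type from the file extension rather than the file contents, and its lookup table can vary by platform.

```python
import mimetypes


def route_for(filename: str) -> str:
    """Return "pypdf" for PDF/text filenames, else "unsupported".

    Hypothetical helper that mirrors the branch condition in process_file.
    """
    mime_type, _ = mimetypes.guess_type(filename)
    mime_category = mime_type.split("/")[0] if (mime_type and "/" in mime_type) else None
    if mime_type == "application/pdf" or mime_category == "text":
        return "pypdf"
    return "unsupported"


for name in ["report.pdf", "notes.txt", "archive.zip", "unknown"]:
    print(name, "->", route_for(name))
```

The fallback filename `"unknown"` has no extension, so a request that carries neither an upload filename nor a `file_id` ends in the unsupported branch.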
