fix(file_processors): reject unsupported file types in pypdf processor by leseb · Pull Request #5670 · ogx-ai/ogx

leseb · 2026-04-30T14:20:42Z

What does this PR do?

The pypdf file processor was silently falling back to text extraction for binary formats like DOCX, PPTX, and XLSX, producing garbage results. This PR makes it raise a 422 error with a helpful message directing users to use the inline::docling or remote::docling-serve file processors for those formats. PDF and text-based files (txt, csv, md, etc.) continue to work as before.
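For illustration, the reject-instead-of-fallback behavior could look roughly like the sketch below. This is not the code from this PR: the allow-list, the `UnsupportedFileTypeError` class, and the `validate_file_type` helper are all hypothetical names, and the check is done on the file extension purely to keep the example self-contained.

```python
from pathlib import Path

# Hypothetical allow-list; the real processor's supported set may differ.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".csv", ".md"}


class UnsupportedFileTypeError(Exception):
    """Hypothetical error type carrying an HTTP-style status code."""

    status_code = 422


def validate_file_type(filename: str) -> str:
    """Return the lowercased extension, or raise if it is not allow-listed."""
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        # Fail loudly instead of attempting text extraction on binary data.
        raise UnsupportedFileTypeError(
            f"Unsupported file type {ext or '(none)'!r} for file {filename!r}"
        )
    return ext
```

With this shape, `validate_file_type("report.pdf")` returns the extension, while `validate_file_type("slides.pptx")` raises an error whose `status_code` is 422, so the caller never reaches the extraction step.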

Test Plan

Run the new unit tests:

uv run --extra starter pytest tests/unit/providers/file_processor/test_pypdf_validation.py -v

Output:

tests/unit/providers/file_processor/test_pypdf_validation.py::test_rejects_docx_with_422 PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_rejects_pptx_with_422 PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_rejects_xlsx_with_422 PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_allows_pdf PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_allows_text_files PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_allows_csv_files PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_allows_markdown_files PASSED
7 passed

…r instead of silent fallback

The pypdf file processor was silently falling back to text extraction for binary formats like DOCX, PPTX, and XLSX, producing garbage results. Now it raises a 422 error with a helpful message directing users to use the docling file processors for those formats.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Sébastien Han <seb@redhat.com>

mattf

it's good to not silently fail.

how does the receiver of this message change their request to use docling or docling-serve?

leseb · 2026-04-30T15:22:38Z

> it's good to not silently fail.
>
> how does the receiver of this message change their request to use docling or docling-serve?

good point, the problem is that the receiver might not be in a position to make that change, but could perhaps relay that information to the admin who configured OGX?
ideally, though, we would detect the format and route to the appropriate provider, so the message as it stands does not really help.

leseb · 2026-04-30T15:28:58Z

ok so i think i'll just go with an error stating the formats we support, and add support for more file types later.

Only state which formats are supported instead of suggesting alternative providers that the end user may not be able to configure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Sébastien Han <seb@redhat.com>

…-ai#5673)

# What does this PR do?

Adds a new `inline::auto` composite file processor that dispatches to the appropriate backend based on file MIME type. Currently routes PDF and text files to the built-in PyPDF processor and rejects unsupported formats with a clear 422 error listing supported types. The architecture is extensible for adding additional format backends (e.g. docling) in the future.

Switches the starter and ci-tests distributions from `inline::pypdf` to `inline::auto` as the default file processor. Admins who want direct control over which formats are processed can still configure a specific provider (`inline::pypdf`, `inline::docling`, `remote::docling-serve`) directly.

Builds on ogx-ai#5670 which fixed pypdf's silent fallback for unsupported formats.

## Test Plan

Run the file processor unit tests:

```bash
uv run pytest tests/unit/providers/file_processor/ -v
```

Output:

```
tests/unit/providers/file_processor/test_auto.py::test_routes_pdf_to_pypdf PASSED
tests/unit/providers/file_processor/test_auto.py::test_routes_text_to_pypdf PASSED
tests/unit/providers/file_processor/test_auto.py::test_routes_csv_to_pypdf PASSED
tests/unit/providers/file_processor/test_auto.py::test_routes_markdown_to_pypdf PASSED
tests/unit/providers/file_processor/test_auto.py::test_rejects_docx_with_422 PASSED
tests/unit/providers/file_processor/test_auto.py::test_rejects_pptx_with_422 PASSED
tests/unit/providers/file_processor/test_auto.py::test_rejects_xlsx_with_422 PASSED
tests/unit/providers/file_processor/test_auto.py::test_error_message_lists_supported_types PASSED
tests/unit/providers/file_processor/test_auto.py::test_error_message_includes_mime_type PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_rejects_docx_with_422 PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_rejects_pptx_with_422 PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_rejects_xlsx_with_422 PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_allows_pdf PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_allows_text_files PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_allows_csv_files PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_allows_markdown_files PASSED
35 passed in 0.64s
```

---------

Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
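The MIME-type dispatch described in that commit message can be pictured with a small routing table. The sketch below is only a guess at the general shape, not the `inline::auto` implementation: `ROUTES`, `process`, `UnsupportedMimeTypeError`, and the two stub backends are hypothetical, and the stubs stand in for the real PyPDF processor.

```python
from typing import Callable, Dict

# A backend takes raw file bytes and returns extracted text.
Backend = Callable[[bytes], str]


class UnsupportedMimeTypeError(Exception):
    """Hypothetical error type carrying an HTTP-style status code."""

    status_code = 422


def extract_pdf(data: bytes) -> str:
    # Stub standing in for a real PDF backend.
    return f"<pdf: {len(data)} bytes>"


def extract_text(data: bytes) -> str:
    # Text-like formats are decoded as UTF-8 in this sketch.
    return data.decode("utf-8", errors="replace")


# Hypothetical routing table; new backends would be added as new entries.
ROUTES: Dict[str, Backend] = {
    "application/pdf": extract_pdf,
    "text/plain": extract_text,
    "text/csv": extract_text,
    "text/markdown": extract_text,
}


def process(data: bytes, mime_type: str) -> str:
    """Dispatch to the backend registered for mime_type, or reject it."""
    backend = ROUTES.get(mime_type)
    if backend is None:
        supported = ", ".join(sorted(ROUTES))
        raise UnsupportedMimeTypeError(
            f"Unsupported MIME type {mime_type!r}; supported types: {supported}"
        )
    return backend(data)
```

Keeping the routes in a plain mapping means supporting another format is a one-line change, and the error message always lists whatever is currently registered along with the rejected MIME type.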

leseb requested review from bbrowning, cdoern, franciscojavierarceo, mattf and raghotham as code owners April 30, 2026 14:20

leseb enabled auto-merge April 30, 2026 14:28

mattf reviewed Apr 30, 2026


mattf approved these changes Apr 30, 2026


leseb added this pull request to the merge queue Apr 30, 2026

Merged via the queue into ogx-ai:main with commit 400a534 Apr 30, 2026
72 checks passed

leseb deleted the leseb/pypdf-reject-unsupported branch April 30, 2026 16:23

leseb mentioned this pull request Apr 30, 2026

feat(file_processors): add inline::auto composite file processor #5673

Merged

leseb merged 2 commits into ogx-ai:main from leseb:leseb/pypdf-reject-unsupported


2 participants
