fix(file_processors): reject unsupported file types in pypdf processor #5670
Merged
leseb merged 2 commits into ogx-ai:main on Apr 30, 2026
…r instead of silent fallback The pypdf file processor was silently falling back to text extraction for binary formats like DOCX, PPTX, and XLSX, producing garbage results. Now it raises a 422 error with a helpful message directing users to use the docling file processors for those formats. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Sébastien Han <seb@redhat.com>
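To illustrate why the silent fallback produced garbage, here is a minimal sketch. It relies on the fact that DOCX, PPTX, and XLSX files are ZIP containers; `build_zip_container` and `naive_text_fallback` are hypothetical names invented for this illustration and are not taken from the OGX code base.

```python
import io
import zipfile


def build_zip_container() -> bytes:
    """Build a small ZIP archive in memory, standing in for a DOCX file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # Real DOCX files keep their body in word/document.xml.
        zf.writestr("word/document.xml", "<p>Hello world</p>" * 50)
    return buf.getvalue()


def naive_text_fallback(data: bytes) -> str:
    """Hypothetical fallback: decode whatever bytes arrive as UTF-8."""
    return data.decode("utf-8", errors="replace")


data = build_zip_container()
text = naive_text_fallback(data)

print(data[:4])               # ZIP local file header signature
print("Hello world" in text)  # the body is compressed, so it is not readable as text
```

The decoded string still contains the archive member name, but the document body itself is compressed and comes out as replacement characters and noise, which is the kind of output the fix is meant to prevent.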
mattf
reviewed
Apr 30, 2026
Collaborator
mattf
left a comment
it's good to not silently fail.
how does the receiver of this message change their request to use docling or docling-serve?
Collaborator
Author
good point, the problem is that the receiver might not be in a position to make that change, but could perhaps relay that information to the admin that configured OGX?
Collaborator
Author
ok so i think i'll just go with an error listing the formats we support and later add support for more file types.
Only state which formats are supported instead of suggesting alternative providers that the end user may not be able to configure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Sébastien Han <seb@redhat.com>
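A minimal sketch of the behavior this commit describes, assuming the check is made on the file extension. The names `SUPPORTED_EXTENSIONS`, `UnsupportedFileTypeError`, and `validate_file_type` are hypothetical, and the actual OGX implementation and error wording may differ.

```python
from pathlib import Path

# Assumption: the formats named in the PR description (PDF plus text-based files).
SUPPORTED_EXTENSIONS = (".pdf", ".txt", ".csv", ".md")


class UnsupportedFileTypeError(Exception):
    """Hypothetical error that an API layer would map to HTTP 422."""

    status_code = 422


def validate_file_type(filename: str) -> None:
    """Raise if the file extension is not one the processor can handle."""
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        supported = ", ".join(SUPPORTED_EXTENSIONS)
        # Only list what is supported; do not point at providers the
        # end user may not be able to configure.
        raise UnsupportedFileTypeError(
            f"Unsupported file type '{suffix or filename}'. "
            f"Supported formats: {supported}"
        )
```

Calling `validate_file_type("report.docx")` raises the error, while `validate_file_type("notes.md")` returns without raising.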
mattf
approved these changes
Apr 30, 2026
leseb
added a commit
to leseb/ogx
that referenced
this pull request
May 4, 2026
…-ai#5673)

# What does this PR do?

Adds a new `inline::auto` composite file processor that dispatches to the appropriate backend based on file MIME type. Currently routes PDF and text files to the built-in PyPDF processor and rejects unsupported formats with a clear 422 error listing supported types. The architecture is extensible for adding additional format backends (e.g. docling) in the future.

Switches the starter and ci-tests distributions from `inline::pypdf` to `inline::auto` as the default file processor. Admins who want direct control over which formats are processed can still configure a specific provider (`inline::pypdf`, `inline::docling`, `remote::docling-serve`) directly.

Builds on ogx-ai#5670, which fixed pypdf's silent fallback for unsupported formats.

## Test Plan

Run the file processor unit tests:

```bash
uv run pytest tests/unit/providers/file_processor/ -v
```

Output:

```
tests/unit/providers/file_processor/test_auto.py::test_routes_pdf_to_pypdf PASSED
tests/unit/providers/file_processor/test_auto.py::test_routes_text_to_pypdf PASSED
tests/unit/providers/file_processor/test_auto.py::test_routes_csv_to_pypdf PASSED
tests/unit/providers/file_processor/test_auto.py::test_routes_markdown_to_pypdf PASSED
tests/unit/providers/file_processor/test_auto.py::test_rejects_docx_with_422 PASSED
tests/unit/providers/file_processor/test_auto.py::test_rejects_pptx_with_422 PASSED
tests/unit/providers/file_processor/test_auto.py::test_rejects_xlsx_with_422 PASSED
tests/unit/providers/file_processor/test_auto.py::test_error_message_lists_supported_types PASSED
tests/unit/providers/file_processor/test_auto.py::test_error_message_includes_mime_type PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_rejects_docx_with_422 PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_rejects_pptx_with_422 PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_rejects_xlsx_with_422 PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_allows_pdf PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_allows_text_files PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_allows_csv_files PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_allows_markdown_files PASSED
35 passed in 0.64s
```

---------

Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
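The MIME-type dispatch described in that commit could look roughly like the sketch below. Everything here is an assumption made for illustration: the route table, the `pypdf_backend_stub` placeholder, and the `dispatch` function are hypothetical and are not the `inline::auto` implementation.

```python
from typing import Callable, Dict

# A backend takes raw file bytes and returns extracted text.
Backend = Callable[[bytes], str]


def pypdf_backend_stub(data: bytes) -> str:
    """Placeholder standing in for the built-in PyPDF processor."""
    return f"pypdf processed {len(data)} bytes"


# Assumption: MIME types for the formats the commit says are routed to PyPDF.
ROUTES: Dict[str, Backend] = {
    "application/pdf": pypdf_backend_stub,
    "text/plain": pypdf_backend_stub,
    "text/csv": pypdf_backend_stub,
    "text/markdown": pypdf_backend_stub,
}


class UnsupportedMimeTypeError(Exception):
    """Hypothetical error that an API layer would map to HTTP 422."""

    status_code = 422


def dispatch(mime_type: str, data: bytes) -> str:
    """Route the file to the backend registered for its MIME type."""
    backend = ROUTES.get(mime_type)
    if backend is None:
        supported = ", ".join(sorted(ROUTES))
        # Include the offending MIME type and the supported list in the message.
        raise UnsupportedMimeTypeError(
            f"Unsupported MIME type '{mime_type}'. Supported types: {supported}"
        )
    return backend(data)
```

With a table like this, adding another backend later would amount to registering more MIME types in `ROUTES`, which is one way to read the "extensible" claim in the commit message.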
zy1o
pushed a commit
to zy1o/ogx
that referenced
this pull request
May 4, 2026
…-ai#5673)
What does this PR do?
The pypdf file processor was silently falling back to text extraction for binary formats like DOCX, PPTX, and XLSX, producing garbage results. This PR makes it raise a 422 error with a helpful message directing users to use the `inline::docling` or `remote::docling-serve` file processors for those formats. PDF and text-based files (txt, csv, md, etc.) continue to work as before.
Test Plan
Run the new unit tests:
Output: