fix(file_processors): reject unsupported file types in pypdf processor #5670

Merged
leseb merged 2 commits into ogx-ai:main from leseb:leseb/pypdf-reject-unsupported
Apr 30, 2026

@leseb leseb commented Apr 30, 2026

What does this PR do?

The pypdf file processor was silently falling back to text extraction for binary formats like DOCX, PPTX, and XLSX, producing garbage results. This PR makes it raise a 422 error with a helpful message directing users to use the inline::docling or remote::docling-serve file processors for those formats. PDF and text-based files (txt, csv, md, etc.) continue to work as before.
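The behavior change described above can be sketched roughly as follows. This is a minimal illustration, not the actual OGX code: the function name `validate_for_pypdf`, the exception class `UnsupportedFileType`, and the extension sets are all hypothetical, and the real processor may detect formats differently (for example by MIME type rather than extension).

```python
from pathlib import Path

# Assumption: illustrative extension sets, not the processor's real allow-list.
PDF_EXTENSIONS = {".pdf"}
TEXT_EXTENSIONS = {".txt", ".csv", ".md"}


class UnsupportedFileType(Exception):
    """Hypothetical stand-in for whatever error the API layer maps to HTTP 422."""

    status_code = 422


def validate_for_pypdf(filename: str) -> str:
    """Return "pdf" or "text" for supported files; raise instead of falling back."""
    suffix = Path(filename).suffix.lower()
    if suffix in PDF_EXTENSIONS:
        return "pdf"
    if suffix in TEXT_EXTENSIONS:
        return "text"
    # Before the fix, binary formats such as DOCX reached the text path
    # and produced garbage; here they are rejected explicitly.
    raise UnsupportedFileType(f"Unsupported file type: '{suffix or filename}'")
```

The key design point is that the text path is only taken for an explicit allow-list, so an unknown binary format can no longer be decoded as text by accident.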

Test Plan

Run the new unit tests:

uv run --extra starter pytest tests/unit/providers/file_processor/test_pypdf_validation.py -v

Output:

tests/unit/providers/file_processor/test_pypdf_validation.py::test_rejects_docx_with_422 PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_rejects_pptx_with_422 PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_rejects_xlsx_with_422 PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_allows_pdf PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_allows_text_files PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_allows_csv_files PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_allows_markdown_files PASSED
7 passed

…r instead of silent fallback

The pypdf file processor was silently falling back to text extraction for
binary formats like DOCX, PPTX, and XLSX, producing garbage results. Now
it raises a 422 error with a helpful message directing users to use the
docling file processors for those formats.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Sébastien Han <seb@redhat.com>

@mattf mattf left a comment

it's good to not silently fail.

how does the receiver of this message change their request to use docling or docling-serve?

leseb commented Apr 30, 2026

> it's good to not silently fail.
>
> how does the receiver of this message change their request to use docling or docling-serve?

good point, the problem is that the receiver might not be in a position to make that change, but could perhaps relay that information to the admin who configured OGX?
but ideally we would understand the format and then route to the appropriate provider, so the message as it stands does not really help.

leseb commented Apr 30, 2026

ok so i think i'll just go with an error stating the formats we support and add support for more file types later.

Only state which formats are supported instead of suggesting alternative
providers that the end user may not be able to configure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Sébastien Han <seb@redhat.com>
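The follow-up commit above changes only the wording of the error. A rough sketch of what such a message builder could look like is shown here; the helper name `unsupported_format_message` and the format list are assumptions for illustration, not the text or code used in OGX.

```python
# Assumption: illustrative list of formats; the real processor defines its own.
SUPPORTED_FORMATS = ("pdf", "txt", "csv", "md")


def unsupported_format_message(filename: str) -> str:
    """Build an error message that names supported formats only.

    It deliberately avoids suggesting other providers, since the end user
    receiving the 422 may not be able to change the server configuration.
    """
    supported = ", ".join(SUPPORTED_FORMATS)
    return f"Unsupported file type for '{filename}'. Supported formats: {supported}."
```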
@leseb leseb added this pull request to the merge queue Apr 30, 2026
Merged via the queue into ogx-ai:main with commit 400a534 Apr 30, 2026
72 checks passed
@leseb leseb deleted the leseb/pypdf-reject-unsupported branch April 30, 2026 16:23
leseb added a commit to leseb/ogx that referenced this pull request May 4, 2026
…-ai#5673)

# What does this PR do?

Adds a new `inline::auto` composite file processor that dispatches to
the appropriate backend based on file MIME type. Currently routes PDF
and text files to the built-in PyPDF processor and rejects unsupported
formats with a clear 422 error listing supported types. The architecture
is extensible for adding additional format backends (e.g. docling) in
the future.
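As a hedged sketch of the MIME-based dispatch described above: the class `AutoFileProcessor`, its `register` and `process` methods, the `pypdf_backend` placeholder, and the specific MIME strings are hypothetical names chosen for illustration, not the actual `inline::auto` implementation.

```python
from typing import Callable, Dict

# A backend takes raw file bytes and returns extracted text.
Backend = Callable[[bytes], str]


def pypdf_backend(data: bytes) -> str:
    """Placeholder for the real PyPDF-based extraction."""
    return f"pypdf processed {len(data)} bytes"


class UnsupportedMimeType(Exception):
    """Hypothetical error that the API layer would map to HTTP 422."""

    status_code = 422


class AutoFileProcessor:
    """Routes a file to a backend based on its MIME type."""

    def __init__(self) -> None:
        self._routes: Dict[str, Backend] = {}

    def register(self, mime_type: str, backend: Backend) -> None:
        self._routes[mime_type] = backend

    def process(self, data: bytes, mime_type: str) -> str:
        backend = self._routes.get(mime_type)
        if backend is None:
            supported = ", ".join(sorted(self._routes))
            raise UnsupportedMimeType(
                f"Unsupported MIME type '{mime_type}'. Supported types: {supported}"
            )
        return backend(data)


# Assumption: these MIME types stand in for "PDF and text files".
processor = AutoFileProcessor()
for mime in ("application/pdf", "text/plain", "text/csv", "text/markdown"):
    processor.register(mime, pypdf_backend)
```

Under this shape, adding a new format backend later would amount to one more `register` call, which is one way to read the "extensible" claim above.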

Switches the starter and ci-tests distributions from `inline::pypdf` to
`inline::auto` as the default file processor. Admins who want direct
control over which formats are processed can still configure a specific
provider (`inline::pypdf`, `inline::docling`, `remote::docling-serve`)
directly.

Builds on ogx-ai#5670 which fixed pypdf's silent fallback for unsupported
formats.

## Test Plan

Run the file processor unit tests:

```bash
uv run pytest tests/unit/providers/file_processor/ -v
```

Output:
```
tests/unit/providers/file_processor/test_auto.py::test_routes_pdf_to_pypdf PASSED
tests/unit/providers/file_processor/test_auto.py::test_routes_text_to_pypdf PASSED
tests/unit/providers/file_processor/test_auto.py::test_routes_csv_to_pypdf PASSED
tests/unit/providers/file_processor/test_auto.py::test_routes_markdown_to_pypdf PASSED
tests/unit/providers/file_processor/test_auto.py::test_rejects_docx_with_422 PASSED
tests/unit/providers/file_processor/test_auto.py::test_rejects_pptx_with_422 PASSED
tests/unit/providers/file_processor/test_auto.py::test_rejects_xlsx_with_422 PASSED
tests/unit/providers/file_processor/test_auto.py::test_error_message_lists_supported_types PASSED
tests/unit/providers/file_processor/test_auto.py::test_error_message_includes_mime_type PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_rejects_docx_with_422 PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_rejects_pptx_with_422 PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_rejects_xlsx_with_422 PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_allows_pdf PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_allows_text_files PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_allows_csv_files PASSED
tests/unit/providers/file_processor/test_pypdf_validation.py::test_allows_markdown_files PASSED

35 passed in 0.64s
```

---------

Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
zy1o pushed a commit to zy1o/ogx that referenced this pull request May 4, 2026