fix: detect headings in PDF conversion via font-size analysis by octo-patch · Pull Request #1659 · microsoft/markitdown

octo-patch · 2026-04-04T05:48:43Z

Problem

When converting PDFs (especially tagged or structured PDFs with multiple heading levels), MarkItDown produced flat plain text with no heading markers. Users lost the document structure — all text was rendered at the same level regardless of whether it was a heading or body text.

Solution

Added _extract_text_with_headings() which uses pdfminer's extract_pages layout API to inspect per-character font sizes. The function:

1. Collects every text block and its dominant font size.
2. Determines the body text size as the statistical mode of all font sizes.
3. Treats any block whose font size is at least 15 % larger than the body size as a heading.
4. Maps up to six distinct heading sizes to # … ###### (largest → H1).
5. Falls back to pdfminer.high_level.extract_text() on any error.
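
The classification steps above can be illustrated with a minimal sketch that works on already-extracted (text, font size) pairs. This is not the code from the PR: the name blocks_to_markdown, the constants, and the choice to compute the mode per block and to leave sizes beyond the sixth level as body text are all assumptions made for illustration.

```python
from collections import Counter

HEADING_RATIO = 1.15  # assumed: "15 % larger than body size" expressed as a ratio
MAX_LEVELS = 6        # Markdown offers # through ######


def blocks_to_markdown(blocks):
    """Render (text, font_size) pairs as Markdown, marking larger blocks as headings.

    Hypothetical helper for illustration; not the PR's _extract_text_with_headings().
    """
    blocks = [(text.strip(), size) for text, size in blocks if text.strip()]
    if not blocks:
        return ""

    # Body size: the most common font size across blocks (the statistical mode).
    body_size = Counter(size for _, size in blocks).most_common(1)[0][0]

    # Distinct sizes at or above the threshold, largest first, capped at six.
    heading_sizes = sorted(
        {size for _, size in blocks if size >= body_size * HEADING_RATIO},
        reverse=True,
    )[:MAX_LEVELS]
    level_of = {size: index + 1 for index, size in enumerate(heading_sizes)}

    lines = []
    for text, size in blocks:
        level = level_of.get(size)  # None means body text (assumption for overflow sizes too)
        lines.append(f"{'#' * level} {text}" if level else text)
    return "\n\n".join(lines)
```

With a 24 pt title, an 18 pt section heading, and 12 pt body blocks, the title gets a single # and the section heading gets ##, while a document whose blocks all share one size comes back without any markers.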

Plain-text PDFs (no form/table pages detected by pdfplumber) now use _extract_text_with_headings() instead of the bare pdfminer.high_level.extract_text() call, so heading structure is preserved where font size information is available.
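
For the extraction side, a rough sketch of reading per-character font sizes through pdfminer's layout API might look like the following. The helper names dominant_size and collect_blocks are hypothetical, and rounding sizes to one decimal place is an assumption; the PR's actual implementation may group and round differently.

```python
from collections import Counter


def dominant_size(sizes):
    """Return the most common font size in a block, or None if it has no characters."""
    return Counter(sizes).most_common(1)[0][0] if sizes else None


def collect_blocks(pdf_path):
    """Yield (text, dominant_font_size) for each text block in the PDF.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so dominant_size() stays usable without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue  # skip figures, rules, and other non-text layout objects
            sizes = [
                round(char.size, 1)  # assumed rounding to absorb tiny float differences
                for line in element
                if isinstance(line, LTTextLine)
                for char in line
                if isinstance(char, LTChar)
            ]
            size = dominant_size(sizes)
            if size is not None:
                yield element.get_text().strip(), size
```

Taking the dominant size per block, rather than the size of the first character, keeps a heading from being misread when it contains a stray superscript or a differently sized symbol.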

Testing

All existing PDF tests pass: test_pdf_masterformat.py, test_pdf_tables.py, test_pdf_memory.py.
New test_pdf_headings.py covers:
- PDF with 24 pt H1 and 18 pt H2 produces # … and ## … markers.
- Body text (12 pt, the mode) is not converted to a heading.
- Uniform-font PDFs (all same size) produce no heading markers.
- Function always returns a non-empty string.


…icrosoft#1640) PDFs with varied font sizes now produce Markdown headings. A new _extract_text_with_headings() function uses pdfminer's layout API to compare each text block's font size against the document's body-text size (mode). Blocks 15% or more larger are mapped to heading levels H1 through H6 (largest maps to H1). Plain-text PDFs now use this function instead of the raw pdfminer.high_level.extract_text() call. Falls back to pdfminer plain extraction on any error.

octo-patch added 2 commits April 3, 2026 16:19

fix: handle None sys.stdout.encoding in CLI output (fixes microsoft#1597)

a467bb6

When sys.stdout.encoding is None (e.g. when stdout is redirected to a binary stream), calling str.encode(None) raises a TypeError. Fall back to utf-8 in that case so the CLI does not crash.
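
A minimal sketch of the fallback described in this commit, assuming the CLI round-trips its output through the stream's encoding before printing. The function names resolve_encoding and encode_for_stream are hypothetical and not taken from the commit.

```python
import sys


def resolve_encoding(stream, default="utf-8"):
    """Return the stream's encoding, or the default when it is missing or None."""
    return getattr(stream, "encoding", None) or default


def encode_for_stream(text, stream=None):
    """Make text safe to write to the stream by replacing unencodable characters.

    Hypothetical helper for illustration only.
    """
    stream = sys.stdout if stream is None else stream
    encoding = resolve_encoding(stream)
    # str.encode(None) would raise TypeError, so the encoding is resolved first.
    return text.encode(encoding, errors="replace").decode(encoding)
```

Resolving the encoding once, before the encode and decode calls, means both use the same value and neither receives None.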

octo-patch wants to merge 2 commits into microsoft:main from octo-patch:fix/issue-1640-pdf-heading-detection
