fix: detect headings in PDF conversion via font-size analysis #1659

Open
octo-patch wants to merge 2 commits into microsoft:main from octo-patch:fix/issue-1640-pdf-heading-detection

@octo-patch

Fixes #1640

Problem

When converting PDFs (especially tagged or structured PDFs with multiple heading levels), MarkItDown produced flat plain text with no heading markers. Users lost the document structure — all text was rendered at the same level regardless of whether it was a heading or body text.

Solution

Added _extract_text_with_headings() which uses pdfminer's extract_pages layout API to inspect per-character font sizes. The function:

  1. Collects every text block and its dominant font size.
  2. Determines the body text size as the statistical mode of all font sizes.
  3. Any block whose font size is ≥ 15 % larger than the body size is treated as a heading.
  4. Up to six distinct heading sizes are mapped to # through ###### (largest → H1).
  5. Falls back to pdfminer.high_level.extract_text() on any error.
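
The steps above can be sketched without pdfminer by assuming the text blocks have already been collected as (text, font size) pairs. This is an illustration of the described heuristic, not the code in this PR: `blocks_to_markdown`, the rounding of sizes, and the clamping of any seventh or later heading size to H6 are all assumptions.

```python
from collections import Counter

HEADING_RATIO = 1.15  # "15 % larger than the body size", from the description
MAX_LEVELS = 6        # Markdown supports H1 through H6


def blocks_to_markdown(blocks):
    """Convert (text, font_size) pairs, in reading order, to Markdown.

    Hypothetical helper: the real function reads sizes from pdfminer's
    layout objects instead of taking them as input.
    """
    if not blocks:
        return ""
    # Assumption: round sizes so near-identical floats count as one size.
    sizes = [round(size, 1) for _, size in blocks]
    # Body size is the statistical mode of all block sizes.
    body_size = Counter(sizes).most_common(1)[0][0]
    # Distinct sizes at or above the threshold, largest first.
    heading_sizes = sorted(
        {s for s in sizes if s >= body_size * HEADING_RATIO}, reverse=True
    )
    # Largest size becomes H1; sizes past the sixth are clamped to H6
    # (assumption, the description does not say what happens to them).
    level_for = {s: min(i + 1, MAX_LEVELS) for i, s in enumerate(heading_sizes)}

    lines = []
    for (text, _), size in zip(blocks, sizes):
        text = " ".join(text.split())  # collapse internal whitespace
        if not text:
            continue
        level = level_for.get(size)
        lines.append(f"{'#' * level} {text}" if level else text)
    return "\n\n".join(lines)
```

With one 24 pt block, one 18 pt block, and several 12 pt blocks, the 24 pt block gets a `#` prefix and the 18 pt block gets `##`, while a document whose blocks all share one size gets no heading markers at all.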

Plain-text PDFs (no form/table pages detected by pdfplumber) now use _extract_text_with_headings() instead of the bare pdfminer.high_level.extract_text() call, so heading structure is preserved where font size information is available.
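
The fallback behaviour can be pictured as a small wrapper. This is a sketch under assumptions: `extract_with_fallback` and its callable parameters are hypothetical names, standing in for the heading-aware extractor and the plain pdfminer call.

```python
def extract_with_fallback(primary, fallback, source):
    """Try the heading-aware extractor; use plain extraction on any error.

    `primary` and `fallback` are callables taking the PDF source
    (hypothetical stand-ins for the two extraction paths).
    """
    try:
        return primary(source)
    except Exception:
        # Any failure in layout analysis degrades to plain text
        # instead of failing the whole conversion.
        return fallback(source)
```

Catching every exception is deliberately broad here: heading markers are a best-effort improvement, so losing them is preferable to losing the whole conversion.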

Testing

  • All existing PDF tests pass: test_pdf_masterformat.py, test_pdf_tables.py, test_pdf_memory.py.
  • New test_pdf_headings.py covers:
    • PDF with 24 pt H1 and 18 pt H2 produces # … and ## … markers.
    • Body text (12 pt, the mode) is not converted to a heading.
    • Uniform-font PDFs (all same size) produce no heading markers.
    • Function always returns a non-empty string.

When sys.stdout.encoding is None (e.g. when stdout is redirected to a
binary stream), calling str.encode(None) raises a TypeError. Fall back
to utf-8 in that case so the CLI does not crash.
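
A minimal sketch of that fallback, assuming a hypothetical helper `encode_for_stdout` (the actual CLI code may be structured differently):

```python
import sys


def encode_for_stdout(text, stream=None):
    """Encode text with the stream's encoding, defaulting to utf-8.

    Hypothetical helper; `stream` defaults to sys.stdout.
    """
    stream = sys.stdout if stream is None else stream
    # The encoding attribute can be None or missing on binary or
    # redirected streams; str.encode(None) would raise TypeError.
    encoding = getattr(stream, "encoding", None) or "utf-8"
    return text.encode(encoding)
```

Using `or "utf-8"` covers both a missing attribute and an attribute explicitly set to None.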
…icrosoft#1640)

PDFs with varied font sizes now produce Markdown headings. A new
_extract_text_with_headings() function uses pdfminer's layout API to
compare each text block's font size against the document's body-text
size (mode). Blocks 15% or more larger are mapped to heading levels H1
through H6 (largest maps to H1). Plain-text PDFs now use this function
instead of the raw pdfminer.high_level.extract_text() call.
Falls back to pdfminer plain extraction on any error.

Development

Successfully merging this pull request may close these issues.

Not Preserving Original Heading Structure
