API server projected Pageindex #22

lichman0405 · 2025-06-04T10:02:25Z

What I've Been Working On: PageIndex - A PDF Document Structure Analyzer

https://github.com/lichman0405/pageindex_fastapi.git

I've been developing PageIndex, a tool that analyzes PDF documents and extracts their table of contents (TOC) structure. It uses Large Language Models (LLMs) to interpret document content, then generates hierarchical JSON output representing the document's structure: chapters, sub-chapters, and their corresponding page numbers.
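
To give a feel for what "hierarchical JSON output" can look like, here is a minimal sketch. The field names (`title`, `start_page`, `end_page`, `nodes`) and the sample document are assumptions made for illustration, not the exact schema PageIndex emits.

```python
import json

# Hypothetical node layout: every key name and value here is an assumption
# made for illustration, not the project's actual output schema.
toc = {
    "title": "Sample Report",
    "start_page": 1,
    "end_page": 40,
    "nodes": [
        {"title": "1 Introduction", "start_page": 1, "end_page": 5, "nodes": []},
        {
            "title": "2 Methods",
            "start_page": 6,
            "end_page": 20,
            "nodes": [
                {"title": "2.1 Data", "start_page": 6, "end_page": 12, "nodes": []},
                {"title": "2.2 Model", "start_page": 13, "end_page": 20, "nodes": []},
            ],
        },
    ],
}


def flatten(node, depth=0):
    """Yield (depth, title, start_page) for a node and its descendants, depth-first."""
    yield depth, node["title"], node["start_page"]
    for child in node.get("nodes", []):
        yield from flatten(child, depth + 1)


outline = list(flatten(toc))

# Print an indented outline, then the same structure serialized as JSON.
for depth, title, page in outline:
    print("  " * depth + f"{title} (p. {page})")
print(json.dumps(toc, indent=2))
```

A nested layout like this keeps each sub-chapter under its parent, so a consumer can walk the tree recursively or flatten it into an outline as shown.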

This project is a decoupled version of Vectify AI's original PageIndex, with significant enhancements in logging capabilities.

Key Features:

- Smart TOC Detection: PageIndex intelligently identifies TOC pages and extracts page numbers from PDFs.
- Hierarchical Structure Generation: It builds a comprehensive, hierarchical structure of the document, making it easy to navigate.
- LLM-Driven Analysis: The core of PageIndex relies on LLMs for accurate content understanding and structural analysis.
- Flexible Output: You can choose to include node IDs, summaries, original text, and other information in the generated JSON.
- Asynchronous Processing: The tool supports concurrent processing for improved efficiency.
- Detailed Logging: I've integrated the Rich library for clear console output and detailed logs, which helps with monitoring and debugging.
- Web API Support: PageIndex offers a RESTful API with asynchronous task processing, allowing for easy integration into other systems.
- File Management: It supports online PDF uploads and subsequent download of the structured results.
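
The asynchronous processing item above can be pictured with a small `asyncio` sketch. It is an illustration under assumptions, not the project's code: `analyze_chunk` is a hypothetical stand-in for one LLM request, and the concurrency limit of 3 is arbitrary.

```python
import asyncio


async def analyze_chunk(chunk_id: int) -> str:
    """Hypothetical stand-in for one LLM call; sleeps instead of doing network I/O."""
    await asyncio.sleep(0.01)
    return f"chunk-{chunk_id}-done"


async def analyze_all(chunk_ids, max_concurrency: int = 3):
    """Run analyze_chunk over all ids with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(chunk_id):
        async with semaphore:
            return await analyze_chunk(chunk_id)

    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(bounded(c) for c in chunk_ids))


results = asyncio.run(analyze_all(range(5)))
print(results)
```

Capping concurrency with a semaphore is one common way to stay within a provider's rate limits while still overlapping requests.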

How It Works:

PageIndex processes a document in six steps:

1. PDF Parsing: Extracts text content from PDFs using libraries like PyPDF2 and PyMuPDF.
2. TOC Detection: Identifies the table of contents pages.
3. Structure Analysis: Uses LLMs to analyze and understand the document's hierarchical structure.
4. Page Mapping: Maps the identified structure to the actual page numbers within the PDF.
5. Validation and Correction: Validates the accuracy of the extraction results and automatically corrects detected errors.
6. Output Generation: Generates the standardized JSON output.
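
As an illustration of the page mapping and validation steps, the sketch below shows one simple heuristic: estimate the offset between the page numbers printed in the TOC and the physical PDF pages, apply it, and flag entries that disagree. This is an assumption about how such a step could work, not a description of PageIndex's actual logic (which relies on LLMs); the function names and sample data are hypothetical.

```python
from collections import Counter


def estimate_page_offset(toc_entries, physical_hits):
    """Return the most common (physical page - printed page) difference, or None.

    toc_entries: list of (title, printed_page) pairs read from the TOC.
    physical_hits: dict mapping a title to the 1-based PDF page where its
    heading text was actually found.
    """
    diffs = [physical_hits[t] - p for t, p in toc_entries if t in physical_hits]
    if not diffs:
        return None
    return Counter(diffs).most_common(1)[0][0]


def map_pages(toc_entries, offset):
    """Shift every printed page number by the estimated offset."""
    return [(title, printed + offset) for title, printed in toc_entries]


# Hypothetical sample data: eight pages of front matter precede printed page 1.
toc_entries = [("Introduction", 1), ("Methods", 5), ("Results", 12), ("Conclusion", 20)]
physical_hits = {"Introduction": 9, "Methods": 13, "Results": 21}

offset = estimate_page_offset(toc_entries, physical_hits)
mapped = map_pages(toc_entries, offset)

# Validation: entries whose observed page disagrees with the mapped page.
mismatches = [t for t, p in mapped if t in physical_hits and physical_hits[t] != p]
print(offset, mapped, mismatches)
```

Entries listed in `mismatches` are the ones a correction step would need to re-check.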

Usage and API Support:

You can use PageIndex either through a command-line interface for direct processing or through its Web API for integration with other systems. The API supports multiple LLM providers, including DeepSeek (default), OpenAI GPT, Anthropic Claude, and Google Gemini.
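
Provider selection with a default can be sketched as follows. The identifiers (`SUPPORTED_PROVIDERS`, `DEFAULT_PROVIDER`, `resolve_provider`) and the lowercase provider keys are hypothetical; only the list of providers and the DeepSeek default come from the description above.

```python
# Hypothetical provider keys; the real configuration names may differ.
SUPPORTED_PROVIDERS = ("deepseek", "openai", "anthropic", "gemini")
DEFAULT_PROVIDER = "deepseek"


def resolve_provider(requested=None):
    """Return a normalized provider key, falling back to the default when unset."""
    if requested is None or not requested.strip():
        return DEFAULT_PROVIDER
    name = requested.strip().lower()
    if name not in SUPPORTED_PROVIDERS:
        raise ValueError(
            f"Unsupported provider {requested!r}; expected one of {SUPPORTED_PROVIDERS}"
        )
    return name


print(resolve_provider())            # no provider requested, so the default is used
print(resolve_provider(" OpenAI "))  # whitespace and case are normalized
```

Rejecting unknown names early gives the caller a clear error before any document processing starts.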

I've also focused on production deployment, with recommendations for Gunicorn, Docker, and Docker Compose, and notes on storing task status reliably in Redis.
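
To show why task status storage matters, here is a minimal in-memory task store sketch. The class and method names are hypothetical and this is not the project's implementation. State held like this lives inside a single process, so it is lost on restart and is not shared between multiple Gunicorn workers; a Redis-backed store exposing the same three methods would avoid both problems.

```python
import threading
import uuid


class InMemoryTaskStore:
    """Hypothetical task status store; a Redis-backed class could offer the same methods."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tasks = {}

    def create(self) -> str:
        """Register a new task in the 'pending' state and return its id."""
        task_id = uuid.uuid4().hex
        with self._lock:
            self._tasks[task_id] = {"status": "pending", "result": None}
        return task_id

    def update(self, task_id: str, status: str, result=None) -> None:
        """Overwrite the status (and optional result) of an existing task."""
        with self._lock:
            if task_id not in self._tasks:
                raise KeyError(task_id)
            self._tasks[task_id] = {"status": status, "result": result}

    def get(self, task_id: str):
        """Return a copy of the task record, or None if the id is unknown."""
        with self._lock:
            task = self._tasks.get(task_id)
            return dict(task) if task is not None else None


store = InMemoryTaskStore()
task_id = store.create()
store.update(task_id, "completed", result={"title": "Sample Report", "nodes": []})
print(store.get(task_id))
```

An API handler would call `create` when a PDF is uploaded, a background worker would call `update`, and the status endpoint would call `get`.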

This project aims to provide a robust and flexible solution for programmatically understanding and structuring PDF documents.
