API server projected Pageindex #22
lichman0405
started this conversation in
Show and tell
What I've Been Working On: PageIndex - A PDF Document Structure Analyzer
https://github.com/lichman0405/pageindex_fastapi.git
I've been developing PageIndex, a tool that analyzes PDF documents and extracts their table of contents (TOC) structure. It uses Large Language Models (LLMs) to interpret document content and generates hierarchical JSON output representing the document's structure, including chapters, sub-chapters, and their corresponding page numbers.
This project is a decoupled version of Vectify AI's original PageIndex, with significant enhancements in logging capabilities.
Key Features:
Smart TOC Detection: PageIndex identifies TOC pages and extracts page numbers from PDFs.
Hierarchical Structure Generation: It builds a comprehensive, hierarchical structure of the document, making it easy to navigate.
LLM-Driven Analysis: The core of PageIndex relies on LLMs for accurate content understanding and structural analysis.
Flexible Output: You can choose to include node IDs, summaries, original text, and other information in the generated JSON.
Asynchronous Processing: The tool supports concurrent processing for improved efficiency.
Detailed Logging: I've integrated the Rich library for clear console output and detailed logs, which helps with monitoring and debugging.
Web API Support: PageIndex offers a RESTful API with asynchronous task processing, allowing for easy integration into other systems.
File Management: It supports online PDF uploads and subsequent download of the structured results.
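To make the flexible output option concrete, here is a minimal sketch of how a hierarchical node could be serialized with optional fields. The `TocNode` class, the `to_dict` helper, and every field name (`start_page`, `end_page`, `node_id`, `summary`) are assumptions made for illustration; they are not taken from the PageIndex code base, so check the repository for the actual schema.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class TocNode:
    """Hypothetical node in the document tree (field names are assumed)."""
    title: str
    start_page: int
    end_page: int
    node_id: Optional[str] = None
    summary: Optional[str] = None
    children: List["TocNode"] = field(default_factory=list)


def to_dict(node: TocNode, include_node_id: bool = True,
            include_summary: bool = False) -> Dict[str, Any]:
    """Serialize a node, emitting optional fields only when requested."""
    out: Dict[str, Any] = {
        "title": node.title,
        "start_page": node.start_page,
        "end_page": node.end_page,
    }
    if include_node_id and node.node_id is not None:
        out["node_id"] = node.node_id
    if include_summary and node.summary is not None:
        out["summary"] = node.summary
    if node.children:
        out["children"] = [
            to_dict(child, include_node_id, include_summary)
            for child in node.children
        ]
    return out


if __name__ == "__main__":
    chapter = TocNode(
        "Chapter 1", 1, 10, node_id="0001", summary="Intro",
        children=[TocNode("1.1 Scope", 2, 5, node_id="0002")],
    )
    print(json.dumps(to_dict(chapter), indent=2))
```

Passing the flags down the recursion keeps the whole tree consistent, so a caller that asks for no node IDs gets none at any depth.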
How It Works:
PageIndex processes a document in six steps:
PDF Parsing: Extracts text content from PDFs using libraries like PyPDF2 and PyMuPDF.
TOC Detection: Identifies the pages that contain the table of contents.
Structure Analysis: Uses LLMs to analyze and understand the document's hierarchical structure.
Page Mapping: Maps the identified structure to the actual page numbers within the PDF.
Validation and Correction: Validates the extraction results and attempts to correct the errors it detects.
Output Generation: Generates the standardized JSON output.
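The structure analysis and page mapping steps can be sketched roughly as follows. This is not the project's implementation: it assumes the earlier steps have already produced a flat list of `(level, title, printed_page)` entries, and that printed page numbers differ from physical PDF page numbers by a constant `page_offset`. The `build_tree` function and both of those names are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# (level, title, printed_page); level 1 is a top-level chapter.
Entry = Tuple[int, str, int]


def build_tree(entries: List[Entry], page_offset: int = 0) -> List[Dict[str, Any]]:
    """Nest flat TOC entries by level and shift printed pages to physical pages.

    Assumes every level is >= 1 and that page_offset is constant
    across the whole document (both are simplifying assumptions).
    """
    root: Dict[str, Any] = {"title": "root", "level": 0, "children": []}
    stack = [root]  # path from the root to the most recently added node
    for level, title, printed_page in entries:
        node = {
            "title": title,
            "level": level,
            "page": printed_page + page_offset,
            "children": [],
        }
        # Pop until the top of the stack is a strictly shallower node.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]


if __name__ == "__main__":
    flat = [
        (1, "Introduction", 1),
        (2, "Background", 2),
        (2, "Scope", 4),
        (1, "Methods", 7),
    ]
    # Suppose two unnumbered front-matter pages precede printed page 1.
    for chapter in build_tree(flat, page_offset=2):
        print(chapter["title"], chapter["page"], len(chapter["children"]))
```

A single constant offset is the simplest case; documents with roman-numeral front matter or inserted plates would need a per-section offset, which is presumably where a validation step earns its keep.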
Usage and API Support:
You can use PageIndex either via a command-line interface for direct processing or through its Web API for more integrated solutions. The API supports multiple LLM providers, including DeepSeek (default), OpenAI GPT, Anthropic Claude, and Google Gemini.
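The asynchronous task pattern behind such an API can be illustrated without any web framework. The sketch below is an assumption about the general shape (submit a job, get a task ID, poll its status), not a description of PageIndex's actual endpoints: `analyze_pdf` is a stand-in for the real analysis, and `TASKS`, `submit`, `wait_for`, and the status strings are all hypothetical names.

```python
import asyncio
import uuid
from typing import Any, Dict

# In-memory task store; a production deployment would likely use an
# external store such as Redis instead (assumption for this sketch).
TASKS: Dict[str, Dict[str, Any]] = {}
_BACKGROUND = set()  # keep references so running tasks are not garbage-collected


async def analyze_pdf(path: str) -> Dict[str, Any]:
    """Placeholder for the real analysis; just simulates some work."""
    await asyncio.sleep(0.01)
    return {"source": path, "structure": []}


async def _run(task_id: str, path: str) -> None:
    """Execute the job and record its outcome in the task store."""
    TASKS[task_id]["status"] = "processing"
    try:
        TASKS[task_id]["result"] = await analyze_pdf(path)
        TASKS[task_id]["status"] = "completed"
    except Exception as exc:
        TASKS[task_id]["status"] = "failed"
        TASKS[task_id]["error"] = str(exc)


def submit(path: str) -> str:
    """Register a task and start it in the background; returns the task ID.

    Must be called from inside a running event loop.
    """
    task_id = uuid.uuid4().hex
    TASKS[task_id] = {"status": "pending", "result": None, "error": None}
    task = asyncio.create_task(_run(task_id, path))
    _BACKGROUND.add(task)
    task.add_done_callback(_BACKGROUND.discard)
    return task_id


async def wait_for(task_id: str, poll_interval: float = 0.01) -> Dict[str, Any]:
    """Poll the store until the task reaches a terminal status."""
    while TASKS[task_id]["status"] in ("pending", "processing"):
        await asyncio.sleep(poll_interval)
    return TASKS[task_id]


async def demo(path: str) -> Dict[str, Any]:
    """Submit one job and wait for its record."""
    task_id = submit(path)
    return await wait_for(task_id)


if __name__ == "__main__":
    print(asyncio.run(demo("report.pdf")))
```

Returning a task ID immediately keeps the request short even when the LLM calls take minutes; the client then polls for the status or fetches the result once it is ready.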
I've also focused on production environment deployment, providing recommendations for Gunicorn, Docker, and Docker Compose, along with considerations for robust task status storage using Redis.
This project aims to provide a robust and flexible solution for programmatically understanding and structuring PDF documents.