
GovDocChatBot

📄 AI-Powered PDF Chatbot for Government Document Access

Overview

GovDocChatBot is a smart chatbot solution designed to process government documents (PDFs) and enable users to ask natural language questions and get accurate answers based only on the content of those documents. It enhances public access, data transparency, and efficiency in retrieving government information.

It uses OCR, vector embeddings, LLMs, and retrieval-augmented generation (RAG) techniques to ensure fast and context-aware responses.
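To illustrate how a RAG answer can stay grounded in the documents, here is a minimal sketch of the last two stages: re-ranking candidate chunks and assembling a context-restricted prompt. It is not the repository's code. The names `rerank`, `build_prompt`, and `word_overlap` are hypothetical, and `word_overlap` is a toy stand-in for a real relevance model such as a cross-encoder.

```python
from typing import Callable, Sequence


def word_overlap(question: str, chunk: str) -> float:
    """Toy relevance score: count of shared lowercase words.

    Assumption: a real system would use a learned model here instead.
    """
    return float(len(set(question.lower().split()) & set(chunk.lower().split())))


def rerank(question: str, chunks: Sequence[str],
           score: Callable[[str, str], float], top_k: int = 3) -> list[str]:
    """Sort candidate chunks by score(question, chunk) and keep the best top_k."""
    ranked = sorted(chunks, key=lambda chunk: score(question, chunk), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, context_chunks: Sequence[str]) -> str:
    """Assemble a prompt that tells the model to answer only from the given context."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The resulting prompt string would then be sent to the LLM. Keeping the scorer as a plain function argument makes it easy to swap the toy scorer for a stronger one without touching the rest of the flow.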

🔍 Features

✅ PDF text and image (OCR) extraction

✅ Recursive character splitting for chunk management

✅ Vector embedding storage using ChromaDB

✅ AI-powered Q&A via LLM (Mistral through Ollama)

✅ Context-aware search with cross-encoder re-ranking

✅ Seamless retrieval from large datasets of government documents
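The recursive character splitting mentioned above can be sketched in plain Python. This is a simplified illustration of the idea, not the project's implementation and not LangChain's splitter: it tries coarse separators first (paragraphs, then lines, then spaces) and only falls back to a hard cut when a piece is still too long. The function name `recursive_split` and the default `chunk_size` are assumptions, and chunk overlap is omitted for brevity.

```python
def recursive_split(text: str, chunk_size: int = 200,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Coarse separators are preferred; finer ones are used only for pieces
    that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current.strip():
        chunks.append(current)
    return chunks
```

Preferring paragraph and line boundaries keeps related sentences together, which tends to give the embedding model more coherent chunks than fixed-width cuts.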

📦 Prerequisites

Python 3.10 or later

Ollama installed, with the Mistral model and an embedding model (nomic-embed-text) pulled

Required Python libraries:

chromadb, pymupdf, ollama, langchain, sentence_transformers, flask
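A quick way to confirm the prerequisites is a small check script. The sketch below is an illustration rather than part of the repository; `missing_modules` and `python_ok` are hypothetical helper names. Note that PyMuPDF is imported as `fitz`, so the import name differs from the package name.

```python
import importlib.util
import sys

# Import names for the libraries listed above (PyMuPDF is imported as "fitz").
REQUIRED_MODULES = ("chromadb", "fitz", "ollama", "langchain",
                    "sentence_transformers", "flask")


def python_ok(minimum: tuple[int, int] = (3, 10)) -> bool:
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info >= minimum


def missing_modules(names: tuple[str, ...] = REQUIRED_MODULES) -> list[str]:
    """Return the top-level modules from names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_modules()` returns an empty list when every library is importable; anything it lists still needs to be installed.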

⚙️ Installation

Clone the repository:

git clone https://github.com/Inbaselvan-ayyanar/GovDocChatBot.git

cd GovDocChatBot

Install dependencies:

pip install -r requirements.txt

Run Ollama and download the models:

ollama pull mistral

ollama pull nomic-embed-text

🚀 Usage

Prepare PDF documents:

Place your PDF files in the input directory or specify their path.

Train the system (optional if already trained):

python Train.py

Run the chatbot:

python flask4.py

Ask questions through the web interface or API:

The chatbot will return responses based on the document content.

⚙️ Configuration

Edit the constants in the scripts:

CHROMA_DB_DIR: Location for vector database storage.

model: Ollama model used for answering (default is "mistral").

embedding_function: Uses "nomic-embed-text" by default.
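If editing constants in several scripts becomes error-prone, one option is to gather them in a single settings object with environment overrides. The sketch below is a suggestion only, not how the repository is organised. The environment variable names `OLLAMA_MODEL` and `OLLAMA_EMBED_MODEL` and the `./chroma_db` default path are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    # Assumed placeholder path; the repository's actual default is not stated.
    chroma_db_dir: str = "./chroma_db"
    model: str = "mistral"                      # default named in the README
    embedding_model: str = "nomic-embed-text"   # default named in the README


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings, letting (hypothetical) environment variables override defaults."""
    defaults = Settings()
    return Settings(
        chroma_db_dir=env.get("CHROMA_DB_DIR", defaults.chroma_db_dir),
        model=env.get("OLLAMA_MODEL", defaults.model),
        embedding_model=env.get("OLLAMA_EMBED_MODEL", defaults.embedding_model),
    )
```

With this layout, both the ingestion script and the web app could call `load_settings()` and stay in agreement about which models and database directory are in use.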

🔧 Troubleshooting

❗ Ensure that Ollama is running and both models are pulled.

❗ Check that documents are clean and readable for proper OCR.

❗ If answers seem off, re-train with fresh document ingestion using Train.py.
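The first check above can be automated. The sketch below queries Ollama's local REST endpoint for installed models (`/api/tags` on the default port 11434) and reports which required models are missing. It assumes Ollama is listening on its default address; `missing_models` and `check_ollama` are hypothetical helper names, not part of the repository.

```python
import json
import urllib.request

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # assumes Ollama's default address
REQUIRED_MODELS = ("mistral", "nomic-embed-text")


def missing_models(tags_payload: dict,
                   required: tuple[str, ...] = REQUIRED_MODELS) -> list[str]:
    """Return the required models that are absent from an /api/tags response."""
    # Names include a tag suffix such as "mistral:latest"; compare the base name only.
    installed = {m["name"].split(":")[0] for m in tags_payload.get("models", [])}
    return [name for name in required if name not in installed]


def check_ollama(url: str = OLLAMA_TAGS_URL, timeout: float = 5.0) -> list[str]:
    """Ask a running Ollama server which required models are missing.

    Raises urllib.error.URLError if the server cannot be reached.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return missing_models(json.load(response))
```

If `check_ollama()` raises a connection error, Ollama is most likely not running; if it returns a non-empty list, pull the listed models with `ollama pull`.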

📬 Contact

For queries or support, contact: [email protected]
