English | 中文说明
A lightweight RAG evaluation scaffold that runs the full pipeline — “dataset → vector store → RAG workflow → evaluation” — with a single command. It ships with a Streamlit-based visualization console and exposes a decoupled vector-store management layer (VectorManager), RAG Runner layer, and evaluation layer, all connected via a unified invoke interface.
The default example includes a Chinese QA dataset, a basic RAG workflow, and configuration examples for model services (Qwen by default). The overall design is not tied to any single vendor; you can plug in any compatible chat and embedding service via configuration.
- As a RAG / LLM engineering scaffold

  For users with experimentation or engineering needs, this project helps you, under a unified protocol, to:
  - build and manage vector stores;
  - orchestrate one or more RAG workflows that output a unified structure;
  - call the evaluation engine to compare different workflows, datasets, and metrics.
- As a 0-to-1 learning project for RAG

  For users without end-to-end engineering experience, this project already stitches together the full pipeline “dataset → vector store → retrieval → generation → evaluation”.
  You can start from the default implementation and gradually replace or extend modules, without constantly jumping between the documentation of different components.
- One-line vector-store construction and VectorManager retrieval

  Through a unified entry point, you can, with a single line:
  - load the benchmark dataset according to configuration;
  - build or update the vector store;
  - obtain a VectorManager instance for retrieval, document CRUD, and retriever construction.

  Configuration is centralized in config/application.yaml, including data paths, chunking strategy, vector-store persistence path, and more.
- Full decoupling of dataset/vector layer, RAG layer, and evaluation layer
  - The dataset/vector layer only handles “samples → chunks → vector store” and exposes VectorManager upwards.
  - The RAG layer only cares about “how to use VectorManager for retrieval and call the LLM to form a workflow”, and exposes results via a Runner's invoke.
  - The evaluation layer depends only on the Runner protocol and the sample format; it is agnostic to the underlying vector store and workflow implementation, making it easy to swap datasets, models, and pipelines.
- Unified invoke protocol
  - The vector-store management tool provides retrieval via a unified interface (for example, VectorManager().invoke(query: str, k: int = 5) -> List[Document]), focusing solely on “given a query, which documents are returned”;
  - All Runners implement a unified Runner().invoke(question: str) -> dict interface and return an agreed structure (question, generation, contexts, etc.);
  - The evaluation engine always starts evaluation via EvalEngine().invoke(runner), which internally handles batch execution and concrete evaluation methods.
- Low entry barrier and high extensibility

  The default chunking logic and workflow structure are kept as straightforward as possible, so the code can be read and modified directly:
  - Beginners can start from the default chunking and workflow, tweak prompts or retrieval strategies, or add/remove simple nodes;
  - Advanced users can ignore the default implementation entirely, take a retriever from VectorManager, implement a protocol-compliant Runner (keeping the invoke signature and output structure), and plug it directly into the evaluation layer to compare different RAG schemes, as sketched right after this list.
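Putting the pieces above together, the three layers chain through their invoke calls roughly as follows. This is a minimal sketch: MyRunner stands for any protocol-compliant Runner of your own (as described in the Runner protocol section below); it is not a class shipped by the project.

from rag_eval import VectorDatabaseBuilder, EvalEngine

# 1. Dataset/vector layer: build (or load) the vector store and get its manager
vector_manager = VectorDatabaseBuilder().invoke()

# 2. RAG layer: any object exposing invoke(question: str) -> dict works here
runner = MyRunner(vector_manager)  # MyRunner: your own Runner following the protocol shown later

# 3. Evaluation layer: batch-runs the Runner and computes metrics
eval_result = EvalEngine().invoke(runner)
eval_result.show_console(top_n=5)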
- Python version

  Python 3.10 or above is recommended.

- Install dependencies

  In the project root:

  pip install -r requirements.txt
- Configure models and API keys
  - In the configuration file, select or declare the model provider and model names (chat model, embedding model, etc.);
  - Set the corresponding API key in your system environment variables, for example:

    On Windows (current session):

    set API_KEY_XXX=your-api-key
    python quickstart.py

    On Linux or macOS:

    export API_KEY_XXX="your-api-key"
    python quickstart.py

  The exact environment variable names and configuration examples are defined in config/application.yaml and the related docs.
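Before running anything, it can help to confirm that the key is actually visible to Python. A minimal check (API_KEY_XXX is the placeholder variable name used above; substitute the name your configuration expects):

import os

# Replace API_KEY_XXX with the variable name declared in config/application.yaml
if not os.environ.get("API_KEY_XXX"):
    raise SystemExit("API key not set; export it before running quickstart.py")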
In the project root:

python quickstart.py

With the web frontend:

streamlit run streamlit_app.py

The default flow includes:
- Build the benchmark dataset and vector store
  - Build evaluation samples from raw data;
  - Split the context fields into chunks according to the configured chunking strategy;
  - Build or update the local vector store and persist it to the configured directory.
- Load evaluation samples

  Load normalized samples from dataset.samples_path.
- Run the default Runner in batch RAG mode

  The default Runner uses VectorManager internally for context retrieval and the configured model for answer generation, and outputs fields like question, generation, and contexts according to the agreed schema.
- Run the evaluation engine
  - Normalize Runner outputs into standard record structures;
  - Use RAGAS and other evaluation methods to compute global and per-sample metrics;
  - Print results to the console and optionally export CSV files as configured.
With dependencies, configuration, and API keys set correctly, this single command runs the full loop from data to evaluation report.
This section focuses on concepts and interface contracts. See the example scripts in the repository for concrete usage.
A typical flow:
- Use the unified entry VectorDatabaseBuilder().invoke() to construct the vector store and obtain a VectorManager (internally loading configuration from config/application.yaml by convention).

- Use VectorManager's methods to:
  - append new documents under conditions (add_documents(...));
  - delete or rebuild collections (delete_collection(...));
  - get a retriever or directly retrieve a list of documents (get_retriever(...) or invoke(...));
  - combine with your own models to perform QA or other analysis.

VectorManager hides the underlying vector-store implementation details, so RAG workflows only depend on a unified retrieval interface.
Example:

from rag_eval import VectorDatabaseBuilder

# Build the vector store according to configuration and get its manager (VectorManager)
vector_manager = VectorDatabaseBuilder().invoke()

# Use invoke for similarity search; returns a list of documents
docs = vector_manager.invoke("Which two companies co-developed Sengoku Musou 3?", k=5)

If you want to use a custom dataset, please refer to rag_eval/dataset_tools/__init__.py.
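The management methods listed above can be exercised on the same instance. Only the method names come from this project; the argument shapes below are assumptions for illustration:

# Append more documents to the collection; here we simply reuse documents that
# invoke(...) already returned so the snippet stays self-contained
new_docs = vector_manager.invoke("Sengoku Musou 3", k=1)
vector_manager.add_documents(new_docs)

# Obtain a retriever object to plug into your own workflow
retriever = vector_manager.get_retriever()

# Drop the collection when you want to rebuild it from scratch
vector_manager.delete_collection()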
Runner protocol (simplified):
class MyRunner:
    def __init__(self, vector_manager, ...):
        self.vector_manager = vector_manager
        # other model, prompt, and workflow configs

    def invoke(self, question: str) -> dict:
        """Return a structured result for the evaluation layer:
        question: original question
        generation: final model answer (string)
        contexts: list of retrieved contexts used to answer this question
        """
        ...
        return {
            "question": question,
            "generation": answer_text,
            "contexts": context_list,
        }

The evaluation layer depends only on the input–output protocol of invoke, so:
- as long as the returned structure matches the contract, it can be evaluated;
- model choice, prompt composition, and whether you use multi-step workflows are all opaque to the evaluation layer.
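For concreteness, a minimal Runner satisfying this contract might look like the sketch below. The generate function is a stand-in for whichever chat-model client you have configured (it is not part of this project), and page_content is only assumed for langchain-style Document objects:

from rag_eval import VectorDatabaseBuilder

def generate(prompt: str) -> str:
    # Stand-in for your configured chat model; replace with a real client call
    raise NotImplementedError

class SimpleRunner:
    def __init__(self, vector_manager, k: int = 5):
        self.vector_manager = vector_manager
        self.k = k

    def invoke(self, question: str) -> dict:
        # Retrieve contexts through the unified VectorManager interface
        docs = self.vector_manager.invoke(question, k=self.k)
        contexts = [getattr(doc, "page_content", str(doc)) for doc in docs]

        # Build a simple prompt and call the model
        prompt = (
            "Answer the question using only the context below.\n\n"
            + "\n\n".join(contexts)
            + f"\n\nQuestion: {question}"
        )
        answer = generate(prompt)

        # Return the agreed structure for the evaluation layer
        return {"question": question, "generation": answer, "contexts": contexts}

runner = SimpleRunner(VectorDatabaseBuilder().invoke())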
Typical entry point (simplified):
from rag_eval import EvalEngine

runner = MyRunner(...)
eval_result = EvalEngine().invoke(runner)
eval_result.show_console(top_n=5)

Internally, EvalEngine:
- Loads evaluation samples (optionally controlled by configuration, such as sample limits);
- Calls the Runner’s invoke in batch;
- Invokes RAGAS and other metrics;
- Aggregates global and per-sample metrics and provides console and frontend display helpers.
To use the graphical interface for debugging or demo:
- In the project root:

  streamlit run streamlit_app.py
- The frontend typically provides two modes:
  - Evaluation mode
    - Runs the evaluation flow under the current configuration and shows global metrics and per-sample scores;
    - Lets you choose the number of samples and trigger evaluation, with progress and estimated time displayed on the page;
    - Supports inspecting a single sample’s question, ground_truth, generation, and contexts.
  - Chat mode
    - Reuses the current Runner configuration for interactive RAG QA;
    - Helps you qualitatively inspect retrieval and answer quality alongside quantitative evaluation results.
The frontend does not change the underlying interface contracts. It simply wraps the existing Runner, evaluation engine, and results into an easy-to-use control panel.
The actual layout may evolve across versions; always refer to the repository for the latest structure. A typical layout:
config/
  application.yaml            # Global config: datasets, vector store, models, evaluation, etc.
  agents.yaml                 # Model roles, prompts, agent configs, etc.
datasets/
  raw/                        # Raw datasets
  processed/                  # Processed samples / chunks
rag_eval/
  __init__.py
  core/                       # Core types and interface definitions
    __init__.py
    interfaces.py
    types.py
  dataset_tools/              # Tools from raw data to samples and chunks
    __init__.py
    cmrc2018/
      ...                     # CMRC2018-related scripts
  embeddings/
    __init__.py
    factory.py                # Embedding factory and wrappers
  eval_engine/                # Batch execution and evaluation engine (EvalEngine, etc.)
    __init__.py
    engine.py
    eval_result.py
    rag_batch_runner.py
    ragas_eval.py
  rag/                        # Default RAG workflow and Runner
    __init__.py
    normal_rag.py
    runner.py
  vector/                     # Vector-store builder and VectorManager
    __init__.py
    vector_builder.py
    vector_store_manager.py
  utils/
    ...                       # Common utilities
quickstart.py                 # One-command end-to-end example
quickstart.ipynb              # Corresponding notebook example (optional)
streamlit_app.py              # Streamlit console entry
Main configuration lives in config/application.yaml, typically including:
1. Dataset paths and sample size limits;
2. Vector-store backend config (persistence directory, collection name, embedding model, etc.);
3. Retrieval parameters (for example, top_k);
4. Model and API settings;
5. Evaluation output paths and evaluation parameters.
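If you are unsure which sections your copy of the file actually defines, a quick way to inspect it is to load it with PyYAML (assuming PyYAML is installed; this is only an inspection helper, not part of the project API):

import yaml

with open("config/application.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# List the top-level sections (dataset, vector store, models, evaluation, ...)
for section, value in config.items():
    print(section, "->", type(value).__name__)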
- The current version already provides:
  - full decoupling of dataset/vector layer, RAG layer, and evaluation layer;
  - one-line vector-store construction and VectorManager-based management;
  - a unified invoke protocol shared by Runners and the evaluation engine;
  - a one-command end-to-end run plus advanced customization entry points and improved frontend progress feedback;
  - a Streamlit console for debugging and demo.
- Planned extensions include:
  - additional benchmark datasets and example Runners;
  - more evaluation metrics and richer configuration for evaluation;
  - more complete configuration templates and examples for different model services.
If you only want to quickly run an end-to-end RAG-plus-evaluation pipeline, start from quickstart.py.
If you care more about engineering and extensibility, start from VectorManager, the Runner protocol, and the Streamlit frontend, and gradually replace or extend the modules.