OfficeQA is a benchmark from Databricks for evaluating model and agent performance on end-to-end grounded reasoning tasks.
Additional details:
- Questions require the U.S. Treasury Bulletin documents to answer.
- OfficeQA contains 246 questions and corresponding ground-truth answers.
- The dataset is released under CC BY-SA 4.0; code and scripts are released under the Apache 2.0 license.
OfficeQA evaluates how well AI systems can reason over real-world documents to answer complex questions. The benchmark uses historical U.S. Treasury Bulletin PDFs (1939-2025), which contain dense financial tables, charts, and text data.
Repository Contents:
- officeqa.csv - The benchmark dataset with 246 questions
- reward.py - Evaluation script for scoring model outputs
- treasury_bulletin_pdfs/ - Original source PDF documents (696 files, ~20GB)
- treasury_bulletins_parsed/ - Parsed and transformed versions (see more details below)
Dataset Schema (officeqa.csv):
| Column | Description |
|---|---|
| uid | Unique question identifier |
| question | The question to answer |
| answer | Ground truth answer |
| source_docs | Original URL(s) from the Federal Reserve Archive |
| source_files | Corresponding parsed filename(s) (e.g., treasury_bulletin_1941_01.txt) |
| difficulty | easy or hard |
To get started, clone the repository:

```bash
git clone https://github.com/databricks/officeqa.git
cd officeqa
```

NOTE: Cloning may take a long time due to the large number of PDF documents in treasury_bulletin_pdfs/.
Then load the benchmark dataset:

```python
import pandas as pd

df = pd.read_csv('officeqa.csv')
print(f"Total questions: {len(df)}")
print(f"Easy: {len(df[df['difficulty'] == 'easy'])}")
print(f"Hard: {len(df[df['difficulty'] == 'hard'])}")
```
We provide the Treasury Bulletin corpus in multiple formats:

Raw PDFs (treasury_bulletin_pdfs/):

The raw PDF documents as downloaded from the Federal Reserve Archive. Use these if your system can process PDFs directly.
- 696 PDF files covering 1939-2025
- Total size: ~20GB
Parsed documents (treasury_bulletins_parsed/):

Pre-parsed versions of the PDFs. The files are distributed as zip archives to stay within Git file size limits.
Structure:
```
treasury_bulletins_parsed/
├── jsons/                        # Parsed JSON files with full structure
│   ├── treasury_bulletins_parsed_part001.zip
│   ├── treasury_bulletins_parsed_part002.zip
│   └── treasury_bulletins_parsed_part003.zip
├── transformed/                  # Agent-friendly text format
│   └── treasury_bulletins_transformed.zip
├── unzip.py                      # Script to extract all files
└── transform_parsed_files.py     # Script to transform JSON → text
```
To extract the files:
```bash
cd treasury_bulletins_parsed
python unzip.py
```

This will extract:

- jsons/*.json - Full parsed documents with bounding boxes, tables as HTML, and element metadata
- transformed/*.txt - Simplified text format with tables converted to Markdown (more readable for LLMs)
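Once the archives are extracted, you can spot-check a parsed document and its transformed counterpart. The following is a minimal sketch; the filename is the one used in the example later in this README, and no particular JSON structure is assumed:

```python
import json
from pathlib import Path

parsed_dir = Path("treasury_bulletins_parsed")

# Load one parsed JSON document and peek at its top-level structure
json_path = parsed_dir / "jsons" / "treasury_bulletin_1941_01.json"
with open(json_path) as f:
    doc = json.load(f)
print(type(doc).__name__, list(doc)[:5] if isinstance(doc, dict) else f"{len(doc)} elements")

# Load the corresponding transformed text (tables rendered as Markdown)
txt_path = parsed_dir / "transformed" / "treasury_bulletin_1941_01.txt"
print(txt_path.read_text()[:500])  # preview the first 500 characters
```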
The transform_parsed_files.py script is what was used to convert the files in jsons/ into the files in transformed/. It is provided for convenience in case you want to modify how the parsed documents are transformed for agent consumption.
| Format | Best for | Size |
|---|---|---|
| PDFs | Systems with native PDF support, or parsing from scratch | ~20GB |
| Parsed JSON | Full structural information, coordinates | ~600MB |
| Transformed TXT | LLM/agent consumption, cleaner text | ~200MB |
The source_files column in officeqa.csv provides the direct filenames (e.g., treasury_bulletin_1941_01.txt) for easy reference. If you need to understand the URL-to-filename conversion logic, here's how it works:
URL format: https://fraser.stlouisfed.org/title/treasury-bulletin-407/{MONTH}-{YEAR}-{ID}?page={PAGE}
Filename format: treasury_bulletin_{YEAR}_{MONTH_NUM}.{ext}
Month name to number mapping:
```
january  → 01    july      → 07
february → 02    august    → 08
march    → 03    september → 09
april    → 04    october   → 10
may      → 05    november  → 11
june     → 06    december  → 12
```
Example:
- URL: https://fraser.stlouisfed.org/title/treasury-bulletin-407/january-1941-6529
- JSON file: treasury_bulletins_parsed/jsons/treasury_bulletin_1941_01.json
- Text file: treasury_bulletins_parsed/transformed/treasury_bulletin_1941_01.txt
- PDF file: treasury_bulletin_pdfs/treasury_bulletin_1941_01.pdf
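If you need to reproduce that mapping programmatically, here is a minimal sketch. The helper url_to_filename is illustrative, not part of the repository, and it assumes every source URL ends in a {MONTH}-{YEAR}-{ID} slug as shown above:

```python
from urllib.parse import urlparse

# Month-name to month-number mapping from the table above
MONTHS = {
    "january": "01", "february": "02", "march": "03", "april": "04",
    "may": "05", "june": "06", "july": "07", "august": "08",
    "september": "09", "october": "10", "november": "11", "december": "12",
}

def url_to_filename(url: str, ext: str = "txt") -> str:
    """Convert a Federal Reserve Archive URL into a parsed filename (illustrative sketch)."""
    # The last path segment looks like "{MONTH}-{YEAR}-{ID}", e.g. "january-1941-6529";
    # any "?page=..." query string is dropped by urlparse().path.
    slug = urlparse(url).path.rstrip("/").split("/")[-1]
    month_name, year, _id = slug.split("-")
    return f"treasury_bulletin_{year}_{MONTHS[month_name]}.{ext}"

print(url_to_filename(
    "https://fraser.stlouisfed.org/title/treasury-bulletin-407/january-1941-6529"
))  # -> treasury_bulletin_1941_01.txt
```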
To evaluate model outputs against the ground truth, use reward.py:

```python
from reward import score_answer

# Score a single prediction
score = score_answer(
    ground_truth="123.45",
    prediction="123.45",
    tolerance=0.01  # 1% tolerance for numerical answers
)
print(f"Score: {score}")  # 1.0 for correct, 0.0 for incorrect
```

The reward.py script provides fuzzy matching for numerical answers with configurable tolerance levels:

- 0.0% - Exact match
- 0.1% - Within 0.1% relative error
- 1.0% - Within 1% relative error
- 5.0% - Within 5% relative error, and so on
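To score a full run rather than a single answer, one possible pattern is a loop like the following. This is a sketch, not a provided script; it assumes your system's outputs are collected in a dict keyed by uid and uses score_answer exactly as above:

```python
import pandas as pd
from reward import score_answer

df = pd.read_csv("officeqa.csv")

# Replace this stub with your model/agent's outputs, keyed by question uid.
predictions = {row.uid: "" for row in df.itertuples()}

scores = []
for row in df.itertuples():
    pred = predictions.get(row.uid, "")
    scores.append(score_answer(
        ground_truth=str(row.answer),
        prediction=str(pred),
        tolerance=0.01,  # 1% relative tolerance, as in the example above
    ))

df["score"] = scores
print(f"Overall accuracy: {df['score'].mean():.3f}")
print(df.groupby("difficulty")["score"].mean())  # accuracy on the easy/hard splits
```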
