Commit bd860c1

New Tags, Retries, New Eval, Truncation (#17)
* --reduced-test
* edits
* submission
* tests
* required changes
1 parent 4633edd commit bd860c1

File tree

8 files changed: +828 −207 lines changed


eval/dabstep.py

Lines changed: 136 additions & 98 deletions
@@ -4,6 +4,8 @@
 from dataclasses import dataclass
 from pathlib import Path
 import random
+from tqdm import tqdm
+
 
 from datasets import load_dataset, concatenate_datasets
 from open_data_scientist.codeagent import ReActDataScienceAgent
@@ -28,101 +30,120 @@ def process_task(task, submit, data_dir: str | None = None):
 
     PROMPT = f"""You are an expert data analyst tasked with answering factoid questions by analyzing the following dataset files:
 
-AVAILABLE FILES:
-/app/downloaded_data/data/context/acquirer_countries.csv
-/app/downloaded_data/data/context/fees.json
-/app/downloaded_data/data/context/manual.md
-/app/downloaded_data/data/context/merchant_category_codes.csv
-/app/downloaded_data/data/context/merchant_data.json
-/app/downloaded_data/data/context/payments-readme.md
-/app/downloaded_data/data/context/payments.csv
-
-IMPORTANT: Always use the full/absolute paths shown above to access files. Relative paths will not work.
-
-ANALYSIS PROCESS:
-1) CRITICAL FIRST STEP: You MUST thoroughly read and internalize the manual.md file COMPLETELY before proceeding.
-   - The manual contains domain-specific definitions that are ESSENTIAL for correct interpretation
-   - Terms like "fees", "transactions", and other concepts have specialized meanings in this context
-   - Misunderstanding these definitions will GUARANTEE incorrect answers
-   - Create a mental model of how all concepts interrelate based on the manual's definitions
-   - Pay special attention to any hierarchical relationships, exclusions, or conditional statements
-
-2) When reading the question, map it back to the exact terminology and definitions from the manual
-   - Do NOT rely on your general knowledge about these terms
-   - The correct answer depends on using the EXACT definitions from the manual
-   - Identify which specific section of the manual is most relevant to the question
-
-3) FOR COMPLEX MULTI-STEP QUESTIONS: Break down the question into logical sub-components
-   - Identify all the specific filters needed (merchant names, time periods, fee IDs, etc.)
-   - Determine the sequence of operations required (filter → calculate → aggregate → compare)
-   - For hypothetical scenarios (e.g., "what if fee changed to X"), clearly identify:
-     * Current state calculation
-     * Hypothetical state calculation
-     * Delta/difference calculation
-   - For time-based questions, ensure you understand the exact date ranges and formatting
-   - For merchant-specific questions, verify exact merchant name matching (case-sensitive)
-   - For fee-related questions, distinguish between fee applicability vs. fee amounts vs. fee calculations
-
-4) Next, read the payments-readme.md file to understand the payment data structure and relevant terminology.
-
-5) For each additional file you need to access:
-   - For CSV files: Check the column headers first to understand the data structure
-   - For JSON files: Examine the schema by looking at a small sample (first few entries)
-   - For text/markdown files: Read through the entire content for critical information
-
-6) When working with large files, start by understanding their structure before attempting to process all the data.
-
-7) Data validation and quality checks:
-   - Check for missing values, duplicates, or data inconsistencies
-   - Verify data types match expectations (strings, numbers, dates, etc.)
-   - Look for outliers or anomalies that might affect your analysis
-   - Cross-reference data between files to ensure consistency
-
-8) VERIFICATION STEP: Before finalizing your answer, always:
-   - Re-read the relevant sections of the manual to confirm your interpretation
-   - Double-check your calculations and data aggregations
-   - For multi-step calculations, verify each intermediate result makes sense
-   - For time-based filtering, confirm you're using the correct date format and range
-   - For merchant-specific queries, verify exact name matches
-   - For fee calculations, confirm you're applying the right fee rules and formulas
-   - Verify your answer makes logical sense given the context
-   - Ensure you're answering the exact question asked (not a related but different question)
-
-QUESTION TO ANSWER:
-{question}
-
-ANSWER GUIDELINES:
-{guidelines}
-
-CRITICAL REQUIREMENTS:
-- Be precise with numerical answers (include appropriate decimal places, units, etc.)
-- If calculations are involved, show your work clearly step-by-step
-- For complex multi-step problems, show all intermediate calculations
-- If the answer requires aggregation, explicitly state what you're aggregating
-- For categorical answers, use exact terminology from the manual/data
-- If data is missing or incomplete, state this clearly rather than guessing
-- For hypothetical scenarios, clearly distinguish current vs. hypothetical calculations
-- STRICTLY ADHERE TO THE GUIDELINES for formatting your output
-
-FINAL ANSWER FORMAT:
-After your analysis, provide your final answer in the exact format specified in the ANSWER GUIDELINES. You might want to generate the formatted answer in python first and then copy the formatted answer to your Final Answer section.
-
-If you encounter any errors accessing files or processing data, clearly state what went wrong rather than providing a guess.
+AVAILABLE FILES:
+/app/downloaded_data/data/context/acquirer_countries.csv
+/app/downloaded_data/data/context/fees.json
+/app/downloaded_data/data/context/manual.md
+/app/downloaded_data/data/context/merchant_category_codes.csv
+/app/downloaded_data/data/context/merchant_data.json
+/app/downloaded_data/data/context/payments-readme.md
+/app/downloaded_data/data/context/payments.csv
+
+IMPORTANT: Always use the full/absolute paths shown above to access files. Relative paths will not work.
+
+ANALYSIS PROCESS:
+1) CRITICAL FIRST STEP: You MUST thoroughly read and internalize the manual.md file COMPLETELY before proceeding.
+   - The manual contains domain-specific definitions that are ESSENTIAL for correct interpretation
+   - Terms like "fees", "transactions", and other concepts have specialized meanings in this context
+   - Misunderstanding these definitions will GUARANTEE incorrect answers
+   - Create a mental model of how all concepts interrelate based on the manual's definitions
+   - Pay special attention to any hierarchical relationships, exclusions, or conditional statements
+
+2) When reading the question, map it back to the exact terminology and definitions from the manual
+   - Do NOT rely on your general knowledge about these terms
+   - The correct answer depends on using the EXACT definitions from the manual
+   - Identify which specific section of the manual is most relevant to the question
+
+3) FOR COMPLEX MULTI-STEP QUESTIONS: Break down the question into logical sub-components
+   - Identify all the specific filters needed (merchant names, time periods, fee IDs, etc.)
+   - Determine the sequence of operations required (filter → calculate → aggregate → compare)
+   - For hypothetical scenarios (e.g., "what if fee changed to X"), clearly identify:
+     * Current state calculation
+     * Hypothetical state calculation
+     * Delta/difference calculation
+   - For time-based questions, ensure you understand the exact date ranges and formatting
+   - For merchant-specific questions, verify exact merchant name matching (case-sensitive)
+   - For fee-related questions, distinguish between fee applicability vs. fee amounts vs. fee calculations
+
+4) Next, read the payments-readme.md file to understand the payment data structure and relevant terminology.
+
+5) For each additional file you need to access:
+   - For CSV files: Check the column headers first to understand the data structure
+   - For JSON files: Examine the schema by looking at a small sample (first few entries)
+   - For text/markdown files: Read through the entire content for critical information
+
+6) When working with large files, start by understanding their structure before attempting to process all the data.
+
+7) Data validation and quality checks:
+   - Check for missing values, duplicates, or data inconsistencies
+   - Verify data types match expectations (strings, numbers, dates, etc.)
+   - Look for outliers or anomalies that might affect your analysis
+   - Cross-reference data between files to ensure consistency
+
+8) VERIFICATION STEP: Before finalizing your answer, always:
+   - Re-read the relevant sections of the manual to confirm your interpretation
+   - Double-check your calculations and data aggregations
+   - For multi-step calculations, verify each intermediate result makes sense
+   - For time-based filtering, confirm you're using the correct date format and range
+   - For merchant-specific queries, verify exact name matches
+   - For fee calculations, confirm you're applying the right fee rules and formulas
+   - Verify your answer makes logical sense given the context
+   - Ensure you're answering the exact question asked (not a related but different question)
+
+ANALYTICAL GUIDELINES:
+- When asked to find values across multiple applicable rules or data points, ensure you include ALL relevant items in your analysis
+- Do not arbitrarily select a single item when multiple items apply
+- Count all matching items, not just the first one found
+- Pay special attention to null values in rule definitions - they mean "applies to all values" not "no match"
+- When filtering rules, be less restrictive rather than more restrictive
+- Consider that some entities may not have specific rules and may use default/fallback rules
+- Verify your rule matching logic by checking if you're finding reasonable numbers of applicable rules
+- When you find 0 applicable rules, reconsider your filtering criteria - this often indicates overly restrictive logic
+- When processing multiple data points, verify that you're including all relevant items
+- When comparing options across different characteristics, ensure you're using the correct rules for each option
+- Don't assume that the lowest calculated value is automatically the correct answer - verify the rules actually apply
+- Consider all relevant characteristics when determining rule applicability
+- Cross-reference rules with actual data to ensure realistic scenarios
+
+QUESTION TO ANSWER:
+{question}
+
+ANSWER GUIDELINES:
+{guidelines}
+
+CRITICAL REQUIREMENTS:
+- Be precise with numerical answers (include appropriate decimal places, units, etc.)
+- If calculations are involved, show your work clearly step-by-step
+- For complex multi-step problems, show all intermediate calculations
+- If the answer requires aggregation, explicitly state what you're aggregating
+- For categorical answers, use exact terminology from the manual/data
+- If data is missing or incomplete, state this clearly rather than guessing
+- For hypothetical scenarios, clearly distinguish current vs. hypothetical calculations
+- STRICTLY ADHERE TO THE GUIDELINES for formatting your output
+
+FINAL ANSWER FORMAT:
+After your analysis, provide your final answer in the exact format specified in the ANSWER GUIDELINES. You might want to generate the formatted answer in python first and then copy the formatted answer to your Final Answer section.
+
+If you encounter any errors accessing files or processing data, clearly state what went wrong rather than providing a guess.
 """
 
-    print(f"Processing question: {question[:50]}...")
+    print(f"Processing question: {question[:100]}...")
 
-    agent = ReActDataScienceAgent(executor="internal")
+    agent = ReActDataScienceAgent(executor="internal", max_iterations=30)
 
     try:
         llm_answer = agent.run(PROMPT)
+
+        # Extract reasoning traces from agent history
+        reasoning_trace = json.dumps(agent.history)
+
     except Exception as e:
         print(f"Task {tid} generated an exception: {e}")
         llm_answer = "Error: Task failed with exception"
+        reasoning_trace = f"Error occurred: {str(e)}"
 
-    reasoning_trace = "Not available"
-
-    if not submit:
+    # Always compute correctness if answer is available, regardless of submit flag
+    if "answer" in task:
         answer = task["answer"]
         is_correct = (
             answer == llm_answer
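The key change in this hunk is that `reasoning_trace` is now assigned on both the success and failure paths, so it is always defined after the try/except. A minimal sketch of that pattern, with `run_with_trace` and `fake_agent` as made-up stand-ins for the real agent API:

```python
# Sketch of the pattern this hunk introduces: a trace is captured whether
# the agent run succeeds or raises. run_agent stands in for agent.run.
import json

def run_with_trace(run_agent, prompt: str) -> tuple[str, str]:
    history: list[dict] = []
    try:
        answer = run_agent(prompt, history)
        trace = json.dumps(history)      # success: serialize the history
    except Exception as e:
        answer = "Error: Task failed with exception"
        trace = f"Error occurred: {e}"   # failure: record the exception
    return answer, trace

def fake_agent(prompt, history):
    history.append({"step": 1, "thought": "read manual.md"})
    return "42"

answer, trace = run_with_trace(fake_agent, "question")
print(answer)  # -> 42
```

Either way the caller gets a `(answer, trace)` pair, which removes the old fallback `reasoning_trace = "Not available"`.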
@@ -193,27 +214,34 @@ def main(
 
     number_of_examples = len(dataset)
     results = []
+
     with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
         future_to_task = {
             executor.submit(process_task, task, submit, data_dir): task
             for task in dataset
         }
-        for future in concurrent.futures.as_completed(future_to_task):
-            try:
-                result = future.result()
-                results.append(result)
-                print(f"Completed task: {result.is_correct}")
-            except Exception as e:
-                # This should rarely happen now since exceptions are handled in process_task
-                task = future_to_task[future]
-                print(f"Unexpected error in task execution: {e}")
+
+        with tqdm(total=number_of_examples, desc="Processing tasks") as pbar:
+            for future in concurrent.futures.as_completed(future_to_task):
+                try:
+                    result = future.result()
+                    results.append(result)
+                    status = "✓" if result.is_correct else "✗"
+                    pbar.set_postfix_str(f"Task {result.tid}: {status}")
+                    pbar.update(1)
+                except Exception as e:
+                    # This should rarely happen now since exceptions are handled in process_task
+                    task = future_to_task[future]
+                    pbar.set_postfix_str(f"Task {task['task_id']}: ERROR - {str(e)[:30]}...")
+                    pbar.update(1)
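The tqdm-over-`as_completed` pattern introduced above can be sketched in isolation; a toy `square` task replaces `process_task` here, and the bar is updated once per finished future whether it succeeded or failed:

```python
# Minimal sketch of wrapping concurrent.futures.as_completed in a tqdm
# progress bar, mirroring the loop added in this commit.
import concurrent.futures
from tqdm import tqdm

def square(x: int) -> int:
    return x * x

results = []
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    future_to_arg = {executor.submit(square, x): x for x in range(5)}
    with tqdm(total=len(future_to_arg), desc="Processing tasks") as pbar:
        for future in concurrent.futures.as_completed(future_to_arg):
            try:
                results.append(future.result())
            except Exception:
                # Surface the failing input on the bar instead of crashing.
                pbar.set_postfix_str(f"arg {future_to_arg[future]}: ERROR")
            pbar.update(1)  # advance on success and failure alike

print(sorted(results))  # -> [0, 1, 4, 9, 16]
```

Because `as_completed` yields futures in completion order, results must be sorted (or keyed by task id, as the real code does) if order matters.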

     if submit:
         results_to_submit = [
             {
                 "task_id": result.tid,
                 "agent_answer": str(result.llm_answer),
                 "reasoning_trace": str(result.reasoning_trace),
+                "correct_answer": str(result.answer) if result.answer is not None else None,
             }
             for result in results
         ]
@@ -225,6 +253,7 @@ def main(
                     "task_id": task["task_id"],
                     "agent_answer": "Error",
                     "reasoning_trace": "skipped",
+                    "correct_answer": task.get("answer"),
                 }
             )
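The submission records above are written out as JSON Lines (one JSON object per line). A minimal sketch of that serialization, with made-up record values and an in-memory buffer standing in for the results.jsonl file handle:

```python
# Sketch of the JSONL submission format: one JSON object per line, using
# the record fields this commit emits (values are illustrative).
import io
import json

records = [
    {"task_id": 7, "agent_answer": "42", "reasoning_trace": "[]", "correct_answer": "42"},
    {"task_id": 8, "agent_answer": "Error", "reasoning_trace": "skipped", "correct_answer": None},
]

buf = io.StringIO()  # stands in for the open results.jsonl file
for rec in records:
    buf.write(json.dumps(rec) + "\n")

lines = buf.getvalue().splitlines()
print(len(lines))  # -> 2
```

Note that `json.dumps` turns the Python `None` in `correct_answer` into JSON `null`, which is why the missing-answer case round-trips cleanly.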

@@ -233,10 +262,19 @@ def main(
         Path(__file__).parent.parent / "submissions" / "DABstep" / "results.jsonl",
     )
 
-    correct_count = sum(r.is_correct for r in results if r.is_correct is not None)
-    print(
-        f"\nResults: {correct_count}/{number_of_examples} correct answers ({correct_count / number_of_examples * 100:.2f}%)"
-    )
+    # Calculate results for tasks that have answers available
+    tasks_with_answers = [r for r in results if r.is_correct is not None]
+    correct_count = sum(r.is_correct for r in tasks_with_answers)
+    total_with_answers = len(tasks_with_answers)
+
+    if total_with_answers > 0:
+        print(
+            f"\nResults: {correct_count}/{total_with_answers} correct answers ({correct_count / total_with_answers * 100:.2f}%)"
+        )
+        if total_with_answers < number_of_examples:
+            print(f"Note: {number_of_examples - total_with_answers} tasks did not have answer keys available")
+    else:
+        print(f"\nProcessed {number_of_examples} tasks (no answer keys available for accuracy calculation)")
 
 
 if __name__ == "__main__":

eval/results/dabstep_submission_v2.jsonl

Lines changed: 450 additions & 0 deletions
Large diffs are not rendered by default.
