# Mind2Web Benchmark

A web agent benchmark based on [OSU-NLP-Group/Mind2Web](https://github.com/OSU-NLP-Group/Mind2Web). It evaluates ElizaOS agents on real-world web navigation and interaction tasks.
## Features

- **Canonical ElizaOS Integration**: Uses `runtime.message_service.handle_message()` for the full agent loop (see the sketch after this list)
- **Multiple Model Providers**: Groq (fast/cheap), OpenAI, Anthropic
- **Comprehensive Metrics**: Task success, step accuracy, element accuracy, operation accuracy
- **Multiple Splits**: Cross-Task, Cross-Website, Cross-Domain evaluation
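As a rough illustration of the first point: the benchmark drives each step through the normal ElizaOS message pipeline instead of calling the model directly. The sketch below is a minimal approximation under assumed types; the message shape and the response access are hypothetical, and only the `runtime.message_service.handle_message()` entry point comes from this benchmark.

```python
# Minimal sketch of one benchmark step through the ElizaOS agent loop.
# ASSUMPTIONS: the dict-shaped message and the .get("text") response
# access are illustrative; only handle_message() itself is guaranteed.
async def run_step(runtime, instruction: str, page_elements: str) -> str:
    message = {
        "user_id": "mind2web-benchmark",
        "room_id": "mind2web",
        "content": {
            "text": f"{instruction}\n\nPage elements:\n{page_elements}",
        },
    }
    response = await runtime.message_service.handle_message(message)
    # The reply text is the agent's predicted browser action.
    return response.get("text", "")
```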
## Quick Start

```bash
# From repo root
PYTHONPATH=packages python -m benchmarks.mind2web --sample
```

### With a real LLM

```bash
# Set your Groq API key
export GROQ_API_KEY=your_key_here

# Run benchmark
PYTHONPATH=packages python -m benchmarks.mind2web --sample --real-llm --provider groq --model openai/gpt-oss-120b
```

```bash
# Or use OpenAI
export OPENAI_API_KEY=your_key_here
PYTHONPATH=packages python -m benchmarks.mind2web --sample --real-llm --provider openai
```

### With HuggingFace data

```bash
# Install datasets package
pip install datasets

# Run with HuggingFace data
PYTHONPATH=packages python -m benchmarks.mind2web --hf --real-llm --max-tasks 50
```

## Usage

```
Usage: python -m benchmarks.mind2web [OPTIONS]

Data Source:
  --sample              Use built-in sample tasks (default)
  --hf                  Load from HuggingFace (requires datasets package)
  --split SPLIT         Dataset split: train, test_task, test_website, test_domain

Task Selection:
  --max-tasks N         Maximum tasks to run
  --trials N            Trials per task (default: 1)
  --max-steps N         Maximum steps per task (default: 20)

Model Configuration:
  --real-llm            Use real LLM via ElizaOS (requires API key)
  --provider PROVIDER   groq, openai, openrouter, anthropic, or auto (default)
  --model MODEL         Model name for OpenAI-compatible providers
  --temperature T       LLM temperature (default: 0.0)

Output:
  --output DIR          Output directory for results
  --json                Print results as JSON
  --verbose             Enable verbose logging
```
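The CLI can also be driven from Python, for example in CI. The sketch below uses only the flags documented above and assumes `--json` writes a single JSON document to stdout:

```python
# Run the benchmark as a subprocess and parse its JSON output.
# ASSUMPTION: --json prints one JSON document to stdout and nothing else.
import json
import os
import subprocess

env = {**os.environ, "PYTHONPATH": "packages"}
proc = subprocess.run(
    ["python", "-m", "benchmarks.mind2web",
     "--sample", "--max-tasks", "5", "--json"],
    env=env,
    capture_output=True,
    text=True,
    check=True,  # raise if the benchmark exits non-zero
)
results = json.loads(proc.stdout)
print(results)
```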
## Metrics

| Metric | Description |
|---|---|
| Task Success Rate | Percentage of tasks where ALL steps are correct |
| Step Accuracy | Percentage of individual steps that are fully correct |
| Element Accuracy | Percentage of steps with correct target element |
| Operation Accuracy | Percentage of steps with correct operation (CLICK/TYPE/SELECT) |
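To pin down how these metrics relate, here is a minimal sketch of the computation. The `StepResult` shape is hypothetical, but the definitions follow the table: a step is fully correct only when both element and operation match, and a task succeeds only when every one of its steps is correct.

```python
# Hypothetical per-step result; only the metric definitions come from
# the table above. Assumes non-empty inputs.
from dataclasses import dataclass

@dataclass
class StepResult:
    element_correct: bool    # predicted target element matches ground truth
    operation_correct: bool  # predicted CLICK/TYPE/SELECT matches ground truth

    @property
    def step_correct(self) -> bool:
        # A step counts as correct only if both parts match.
        return self.element_correct and self.operation_correct

def summarize(tasks: list[list[StepResult]]) -> dict[str, float]:
    steps = [s for task in tasks for s in task]
    return {
        "task_success_rate": sum(all(s.step_correct for s in t) for t in tasks) / len(tasks),
        "step_accuracy": sum(s.step_correct for s in steps) / len(steps),
        "element_accuracy": sum(s.element_correct for s in steps) / len(steps),
        "operation_accuracy": sum(s.operation_correct for s in steps) / len(steps),
    }
```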
## Dataset Splits

| Split | Description |
|---|---|
| `test_task` | Cross-Task: Same websites, new task types |
| `test_website` | Cross-Website: New websites within same domains |
| `test_domain` | Cross-Domain: Entirely new domains |
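If you want to inspect the data outside the benchmark, it can be loaded directly with the `datasets` package. The `osunlp/Mind2Web` identifier, the available splits, and the field name below are assumptions about the upstream OSU-NLP-Group release, not something `dataset.py` documents here:

```python
# Peek at the upstream Mind2Web data directly.
# ASSUMPTION: dataset id and split availability depend on the upstream
# release; "train" is used as the safe default here.
from datasets import load_dataset

ds = load_dataset("osunlp/Mind2Web", split="train")
print(ds[0]["confirmed_task"])  # ASSUMPTION: field name from the upstream schema
```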
## Architecture

```
Mind2Web Benchmark
├── eliza_agent.py   # ElizaOS agent with MIND2WEB_ACTION action
├── dataset.py       # Mind2Web dataset loader (HF + local + samples)
├── evaluator.py     # Step and task evaluation
├── runner.py        # Benchmark orchestration
├── cli.py           # Command-line interface
└── types.py         # Type definitions
```
### Key Components

- **Provider** (`MIND2WEB_CONTEXT`): Injects task instruction, current page elements, and action history
- **Action** (`MIND2WEB_ACTION`): Executes browser operations (CLICK, TYPE, SELECT); see the parsing sketch after this list
- **Evaluation**: Compares predicted actions against ground truth
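A sketch of what the action side might look like. The textual format parsed here (`CLICK [id]`, `TYPE [id] "text"`) is an assumed convention for illustration; only the CLICK/TYPE/SELECT operation set and the `MIND2WEB_ACTION` name come from this benchmark.

```python
# Hypothetical parser for the agent's predicted action string.
# ASSUMPTION: the 'OP [element_id] "value"' text format is invented for
# illustration; only the operation set is taken from the benchmark.
from dataclasses import dataclass

VALID_OPS = {"CLICK", "TYPE", "SELECT"}

@dataclass
class BrowserAction:
    operation: str             # CLICK, TYPE, or SELECT
    element_id: str            # target element from the page snapshot
    value: str | None = None   # typed text or selected option

def parse_action(raw: str) -> BrowserAction:
    """Parse e.g. 'CLICK [btn-42]' or 'TYPE [q] "laptops"'."""
    op, _, rest = raw.strip().partition(" ")
    if op not in VALID_OPS:
        raise ValueError(f"unknown operation: {op!r}")
    element, _, value = rest.strip().partition(" ")
    return BrowserAction(op, element.strip("[]"), value.strip().strip('"') or None)
```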
## Example Output

```
============================================================
Mind2Web Benchmark Results
============================================================
Tasks: 3, Trials: 3
Task Success Rate: 66.7%
Step Accuracy: 85.0%
Element Accuracy: 90.0%
Avg Latency: 1234ms
Results saved to: ./benchmark_results/mind2web/2026-01-14_12-30-45
============================================================
```
## Development

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Type check
mypy benchmarks/mind2web

# Lint
ruff check benchmarks/mind2web
```

## License

MIT