Can Language Models Resolve SRE Tasks?
SRE-skills-bench evaluates LLMs on tasks commonly performed by Site Reliability Engineers, helping reliability practitioners choose the right model for the job, whether it's powering IDE assistants, automating operational workflows, or improving incident response. Think of SRE-skills-bench as the SWE-bench of Site Reliability Engineering.
Read our latest findings with Gemini 3 Pro on our blog post.
At the Rootly AI Labs, we run SRE-skills-bench on frontier models the day they are released, and we share our findings on our social media platforms (LinkedIn, X). We also present our benchmarks at leading ML research conferences, including as workshop papers at NeurIPS, ICML, and ACL.
The table below represents the average accuracy of each model across all SRE-related tasks included in the benchmark.
👉 Visit our website sreskillsbench.com to access all the findings.
- [Dec. 2, 2025]: presenting our work at ER – NeurIPS in San Diego, USA.
- [Nov. 24, 2025]: released ~3,000 new tasks testing LLMs on compute, network, and storage actions across AWS, GCP, and Azure.
- [Jul. 27, 2025]: presented our work at KnowFM – ACL 2025 in Vienna, Austria.
- [Jul. 19, 2025]: presented our work at New In ML – ICML 2025 in Vancouver, Canada.
To reproduce our results, or to benchmark other models yourself, follow the setup below.
This project uses mise for tool version management.
```shell
# Install mise (if not already installed)
curl https://mise.run | sh

# Install required tools
mise trust
mise install
```

Create a virtual environment and install OpenBench:

```shell
uv venv
source .venv/bin/activate
uv pip install openbench

# Set your API key (any provider!)
export GROQ_API_KEY=your_key  # or OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.

# Run Rootly's benchmark
bench eval gmcq --model "groq/llama-3.1-8b-instant" --T subtask=mastodon
```

To run evaluations across all SRE tasks using the automation script:
```shell
# Copy the example env file and add your API keys
cp scripts/.env.example scripts/.env

# Edit scripts/.env with your API keys

# Run the evaluation script
cd scripts
./run-all-sre-skills-bench-tasks.sh
```

Results will be saved to a timestamped CSV file.
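A results CSV like the one above can be summarized per model with a few lines of Python. This is a minimal sketch: the column names (`model`, `task`, `accuracy`) and the sample rows are hypothetical, not the script's actual output schema.

```python
import csv
import io
from collections import defaultdict

# Hypothetical results file; the real CSV's columns may differ.
SAMPLE_CSV = """model,task,accuracy
groq/llama-3.1-8b-instant,gmcq,0.62
groq/llama-3.1-8b-instant,terraform,0.41
openai/gpt-4o,gmcq,0.81
openai/gpt-4o,terraform,0.58
"""

def average_accuracy_by_model(csv_text):
    """Average each model's accuracy across all tasks."""
    scores = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        scores[row["model"]].append(float(row["accuracy"]))
    return {model: sum(vals) / len(vals) for model, vals in scores.items()}

print(average_accuracy_by_model(SAMPLE_CSV))
# → {'groq/llama-3.1-8b-instant': 0.515, 'openai/gpt-4o': 0.695}
```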
SRE-skills-bench evaluates models on tasks that represent real, day-to-day SRE responsibilities.
Each task category includes multiple test cases with expected outputs, graded programmatically or via structured evaluation. For each task, we open-source 40% of the dataset, available on our HF repo 🤗.
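Programmatic grading of a multiple-choice answer can be as simple as extracting the chosen letter and comparing it to the key. The regex and letter range below are illustrative, not the benchmark's actual grader:

```python
import re

def grade_mcq(model_output, correct_choice):
    """Grade a multiple-choice answer programmatically: extract the
    first standalone choice letter (A-D) from the model's output and
    compare it against the expected answer."""
    match = re.search(r"\b([A-D])\b", model_output.strip().upper())
    return match is not None and match.group(1) == correct_choice

print(grade_mcq("The answer is B.", "B"))        # → True
print(grade_mcq("I would pick C here.", "B"))    # → False
```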
GMCQ evaluates a model's ability to understand code changes in pull requests, a skill that can assist SREs during rapid responses to critical incidents. GMCQ's dataset consists of real-world pull requests and code diffs from six popular GitHub repositories that actively publish new version releases. Each question pairs a real pull request's issue description with four real code diffs, all sourced from the same repository. Only one diff corresponds to that pull request, and the model must identify it. To perform well on this benchmark, the model must understand code functionality from textual instructions and limited context.
This GMCQ benchmark was presented by the Rootly AI Labs as a workshop paper at ICML 2025 and ACL 2025.
This benchmark evaluates a model's ability to understand common SRE code requests. Each question provides the model with a specific request and presents four Terraform code diffs, with accompanying instructions, that resolve similar requests. The model must select the correct code diff.
This benchmark covers a wide array of scenarios, including compute, network, Kubernetes, and security requests on AWS, GCP, and Azure. To perform well, a model must demonstrate a generalizable understanding of SRE requests across many tasks and target platforms, making this benchmark a useful signal for identifying models that can assist SREs in their day-to-day work.
This benchmark evaluates a model's ability to generate executable Terraform code from natural language prompts. Unlike the multiple-choice Terraform SRE Benchmark, this tests end-to-end code generation and execution.
Key Features:
- 11 real-world Terraform tasks covering VPC, EC2, S3, IAM, and Security Groups
- Full Terraform lifecycle testing (fmt, init, validate, plan, apply, destroy)
- LocalStack integration for safe, reproducible testing without real AWS resources
- Comprehensive reporting with failure categorization (SYNTAX, INIT, VALIDATE, PLAN, APPLY, etc.)
- Multi-provider LLM support (OpenAI, Anthropic, OpenRouter)
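The lifecycle testing and failure categorization above can be sketched as a loop over ordered stages, stopping at the first failure. The Terraform subcommands are real, but the function names and the injectable `runner` are illustrative assumptions, not the benchmark's actual harness:

```python
import subprocess

# Ordered Terraform lifecycle stages, mirroring the benchmark's
# failure categories (stage-to-category mapping is illustrative).
STAGES = [
    ("SYNTAX",   ["terraform", "fmt", "-check"]),
    ("INIT",     ["terraform", "init", "-input=false"]),
    ("VALIDATE", ["terraform", "validate"]),
    ("PLAN",     ["terraform", "plan", "-input=false"]),
    ("APPLY",    ["terraform", "apply", "-auto-approve"]),
    ("DESTROY",  ["terraform", "destroy", "-auto-approve"]),
]

def run_lifecycle(workdir, runner=subprocess.run):
    """Run each stage in order; return ("PASS", None) if all succeed,
    otherwise ("FAIL", <category>) for the first failing stage."""
    for category, cmd in STAGES:
        proc = runner(cmd, cwd=workdir, capture_output=True)
        if proc.returncode != 0:
            return ("FAIL", category)
    return ("PASS", None)
```

Injecting `runner` keeps the categorization logic testable without invoking Terraform or LocalStack.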
Usage:
```shell
# Install dependencies
uv pip install -e ".[terraform-generation]"

# Start LocalStack (required)
docker compose up -d

# Set API keys
export OPENAI_API_KEY=your_key
export ANTHROPIC_API_KEY=your_key
export OPENROUTER_API_KEY=your_key  # Optional

# Run benchmark
./scripts/run-terraform-generation-bench.sh

# Or use the CLI directly
# (note: you may need to remove the --tasks line at runtime)
python -m terraform_generation_bench.benchmark_cli suite \
  --models models.json \
  --tasks all \
  --runs-per-model 1
```

Findings:
- Most models (99%) successfully generate code, but fail during Terraform execution
- Common failure points: INIT (33%), SYNTAX (21%), VALIDATE (13%)
- Top performers: DeepSeek Chat (27%), Mistral Large (18%), Llama 3 70B (18%)
- Only 25% of models pass at least one task across all 11 scenarios
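Failure-category percentages like those above can be derived from per-run outcomes. The sample data and function below are hypothetical and do not reproduce the reported numbers:

```python
from collections import Counter

# Hypothetical per-run outcomes; the real benchmark records more fields.
RUNS = [
    "INIT", "INIT", "INIT",
    "SYNTAX", "SYNTAX",
    "VALIDATE",
    "PASS", "PASS", "PASS",
]

def failure_breakdown(outcomes):
    """Return each failure category as a percentage of all runs."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {cat: round(100 * n / total)
            for cat, n in counts.items() if cat != "PASS"}

print(failure_breakdown(RUNS))
# → {'INIT': 33, 'SYNTAX': 22, 'VALIDATE': 11}
```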
This benchmark complements the existing Terraform SRE Benchmark by testing code generation rather than code understanding, providing a more comprehensive evaluation of LLM capabilities for SRE tasks.
SRE-skills-bench is built with ❤️ by the Rootly AI Labs for engineering teams everywhere. The Rootly AI Labs is a fellow-led community designed to redefine reliability engineering. We develop innovative prototypes, create open-source tools, and produce research that's shared to advance the standards of operational excellence. We want to thank Anthropic, Google Cloud, and Google DeepMind for their support.