Can Language Models Resolve SRE Tasks?
SRE-skills-bench evaluates LLMs on tasks commonly performed by Site Reliability Engineers, helping reliability practitioners choose the right model for the job, whether it's powering IDE assistants, automating operational workflows, or improving incident response. Think of SRE-skills-bench as the SWE-bench of Site Reliability Engineering.
Read our latest findings with Gemini 3 Pro on our blog post.
At the Rootly AI Labs, we run SRE-skills-bench on frontier models the day they are released, and we share our findings on our social media platforms (LinkedIn, X). We also present our benchmarks at leading ML research conferences, including as workshop papers at NeurIPS, ICML, and ACL.
The table below represents the average accuracy of each model across all SRE-related tasks included in the benchmark.
👉 Visit our website sreskillsbench.com to access all the findings.
- [Dec. 2, 2025]: presenting our work at ER – NeurIPS in San Diego, USA.
- [Nov. 24, 2025]: released ~3,000 new tasks testing LLMs on compute, network, and storage actions across AWS, GCP, and Azure.
- [Jul. 27, 2025]: presented our work at KnowFM – ACL 2025 in Vienna, Austria.
- [Jul. 19, 2025]: presented our work at New In ML – ICML 2025 in Vancouver, Canada.
To reproduce our results, or to benchmark other models yourself, follow the setup below.
This project uses mise for tool version management.
```shell
# Install mise (if not already installed)
curl https://mise.run | sh

# Install required tools
mise trust
mise install
```

Create a virtual environment and install OpenBench:

```shell
uv venv
source .venv/bin/activate
uv pip install openbench

# Set your API key (any provider!)
export GROQ_API_KEY=your_key  # or OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.

# Run Rootly's benchmark
bench eval gmcq --model "groq/llama-3.1-8b-instant" --T subtask=mastodon
```

To run evaluations across all SRE tasks using the automation script:
```shell
# Copy the example env file and add your API keys
cp scripts/.env.example scripts/.env

# Edit scripts/.env with your API keys

# Run the evaluation script
cd scripts
./run-all-sre-skills-bench-tasks.sh
```

Results will be saved to a timestamped CSV file.
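A results CSV like the one above can be summarized per model with a few lines of Python. This is a minimal sketch: the column names (`model`, `task`, `accuracy`) and the sample rows are hypothetical, not the script's actual output schema.

```python
import csv
import io
from collections import defaultdict

# Hypothetical results file; the real CSV's columns may differ.
SAMPLE_CSV = """model,task,accuracy
groq/llama-3.1-8b-instant,gmcq,0.62
groq/llama-3.1-8b-instant,terraform,0.41
openai/gpt-4o,gmcq,0.81
openai/gpt-4o,terraform,0.58
"""

def average_accuracy_by_model(csv_text):
    """Average each model's accuracy across all tasks."""
    scores = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        scores[row["model"]].append(float(row["accuracy"]))
    return {model: sum(vals) / len(vals) for model, vals in scores.items()}

print(average_accuracy_by_model(SAMPLE_CSV))
# → {'groq/llama-3.1-8b-instant': 0.515, 'openai/gpt-4o': 0.695}
```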
SRE-skills-bench evaluates models on tasks that represent real, day-to-day SRE responsibilities.
Each task category includes multiple test cases with expected outputs, graded programmatically or via structured evaluation. For each task, we open-source 40% of the dataset, available on our HF repo 🤗.
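Programmatic grading of a multiple-choice answer can be as simple as extracting the chosen letter and comparing it to the key. The regex and letter range below are illustrative, not the benchmark's actual grader:

```python
import re

def grade_mcq(model_output, correct_choice):
    """Grade a multiple-choice answer programmatically: extract the
    first standalone choice letter (A-D) from the model's output and
    compare it against the expected answer."""
    match = re.search(r"\b([A-D])\b", model_output.strip().upper())
    return match is not None and match.group(1) == correct_choice

print(grade_mcq("The answer is B.", "B"))        # → True
print(grade_mcq("I would pick C here.", "B"))    # → False
```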
GMCQ evaluates a model's ability to understand code changes in pull requests, a skill that can assist SREs during rapid responses to critical incidents. GMCQ's dataset consists of real-world pull requests and code diffs from six popular GitHub repositories that actively publish new version releases. Each question pairs a real pull request's issue description with four real code diffs, all sourced from the same repository. Only one diff corresponds to that pull request, and the model must identify it. To perform well on this benchmark, the model must understand code functionality from textual instructions and limited context.
This GMCQ benchmark was presented by the Rootly AI Labs as a workshop paper at ICML 2025 and ACL 2025.
This benchmark evaluates a model's ability to understand common SRE code requests. Each question provides the model with a specific request and presents four Terraform code diffs, with accompanying instructions, that resolve similar requests. The model must select the correct code diff.
This benchmark covers a wide array of scenarios, including compute, network, Kubernetes, and security requests on AWS, GCP, and Azure. To perform well, a model must demonstrate a generalizable understanding of SRE requests across many tasks and target platforms, making this benchmark a useful signal for identifying models that can assist SREs in their day-to-day work.
This benchmark evaluates a model's ability to generate executable Terraform code from natural language prompts. Unlike the multiple-choice Terraform SRE Benchmark, this tests end-to-end code generation and execution.
Key Features:
- 11 real-world Terraform tasks covering VPC, EC2, S3, IAM, and Security Groups
- Full Terraform lifecycle testing (fmt, init, validate, plan, apply, destroy)
- LocalStack integration for safe, reproducible testing without real AWS resources
- Comprehensive reporting with failure categorization (SYNTAX, INIT, VALIDATE, PLAN, APPLY, etc.)
- Multi-provider LLM support (OpenAI, Anthropic, OpenRouter)
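The lifecycle testing and failure categorization above can be sketched as a loop over ordered stages, stopping at the first failure. The Terraform subcommands are real, but the function names and the injectable `runner` are illustrative assumptions, not the benchmark's actual harness:

```python
import subprocess

# Ordered Terraform lifecycle stages, mirroring the benchmark's
# failure categories (stage-to-category mapping is illustrative).
STAGES = [
    ("SYNTAX",   ["terraform", "fmt", "-check"]),
    ("INIT",     ["terraform", "init", "-input=false"]),
    ("VALIDATE", ["terraform", "validate"]),
    ("PLAN",     ["terraform", "plan", "-input=false"]),
    ("APPLY",    ["terraform", "apply", "-auto-approve"]),
    ("DESTROY",  ["terraform", "destroy", "-auto-approve"]),
]

def run_lifecycle(workdir, runner=subprocess.run):
    """Run each stage in order; return ("PASS", None) if all succeed,
    otherwise ("FAIL", <category>) for the first failing stage."""
    for category, cmd in STAGES:
        proc = runner(cmd, cwd=workdir, capture_output=True)
        if proc.returncode != 0:
            return ("FAIL", category)
    return ("PASS", None)
```

Injecting `runner` keeps the categorization logic testable without invoking Terraform or LocalStack.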
Usage:
```shell
# Install dependencies
uv pip install -e ".[terraform-generation]"

# Start LocalStack (required)
docker compose up -d

# Set API keys
export OPENAI_API_KEY=your_key
export ANTHROPIC_API_KEY=your_key
export OPENROUTER_API_KEY=your_key  # Optional

# Run benchmark
./scripts/run-terraform-generation-bench.sh

# Or use the CLI directly
# (note: you may need to remove the --tasks line at runtime)
python -m terraform_generation_bench.benchmark_cli suite \
  --models models.json \
  --tasks all \
  --runs-per-model 1
```

Findings:
- Most models (99%) successfully generate code, but fail during Terraform execution
- Common failure points: INIT (33%), SYNTAX (21%), VALIDATE (13%)
- Top performers: DeepSeek Chat (27%), Mistral Large (18%), Llama 3 70B (18%)
- Only 25% of models pass at least one task across all 11 scenarios
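Failure-category percentages like those above can be derived from per-run outcomes. The sample data and function below are hypothetical and do not reproduce the reported numbers:

```python
from collections import Counter

# Hypothetical per-run outcomes; the real benchmark records more fields.
RUNS = [
    "INIT", "INIT", "INIT",
    "SYNTAX", "SYNTAX",
    "VALIDATE",
    "PASS", "PASS", "PASS",
]

def failure_breakdown(outcomes):
    """Return each failure category as a percentage of all runs."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {cat: round(100 * n / total)
            for cat, n in counts.items() if cat != "PASS"}

print(failure_breakdown(RUNS))
# → {'INIT': 33, 'SYNTAX': 22, 'VALIDATE': 11}
```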
This benchmark complements the existing Terraform SRE Benchmark by testing code generation rather than code understanding, providing a more comprehensive evaluation of LLM capabilities for SRE tasks.
SRE-skills-bench is built with ❤️ by the Rootly AI Labs for engineering teams everywhere. The Rootly AI Labs is a fellow-led community designed to redefine reliability engineering. We develop innovative prototypes, create open-source tools, and produce research that's shared to advance the standards of operational excellence. We want to thank Anthropic, Google Cloud, and Google DeepMind for their support.