The Terraform Generation Benchmark evaluates LLMs on their ability to generate executable Terraform code from natural language prompts. This benchmark tests end-to-end code generation and execution, complementing the existing multiple-choice Terraform SRE Benchmark.
| Aspect | Terraform SRE Benchmark | Terraform Generation Benchmark |
|---|---|---|
| Format | Multiple choice (4 code diffs) | Code generation from prompts |
| Testing | Selection accuracy | Full Terraform execution |
| Validation | Correct choice selection | Real resource creation in LocalStack |
| Scope | Understanding code | Generating working code |
```
src/terraform_generation_bench/
├── __init__.py
├── llm_client.py            # Multi-provider LLM client (OpenAI, Anthropic, OpenRouter)
├── terraform_generator.py   # Code extraction from LLM responses
├── benchmark.py             # Benchmark runner
├── benchmark_cli.py         # CLI interface
├── report_generator.py      # Report generation (JSON, HTML, Markdown)
└── runner/
    ├── __init__.py
    ├── run_task.py          # Terraform pipeline execution
    ├── checks.py            # Post-apply validation
    └── utils.py             # Utility functions
```
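The multi-provider client in `llm_client.py` abstracts over the different vendor SDKs. As a rough illustration only (the function below and its signature are assumptions, not the actual implementation), dispatch across the three providers could look like this:

```python
# Hedged sketch of a multi-provider completion call; the real llm_client.py may differ.
import os
from openai import OpenAI
from anthropic import Anthropic

def complete(provider: str, model: str, prompt: str) -> str:
    """Return the raw text completion for a prompt from the chosen provider."""
    if provider == "openai":
        client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        return resp.choices[0].message.content
    if provider == "anthropic":
        client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
        resp = client.messages.create(
            model=model, max_tokens=4096,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text
    if provider == "openrouter":
        # OpenRouter exposes an OpenAI-compatible API under a different base URL.
        client = OpenAI(
            api_key=os.environ["OPENROUTER_API_KEY"],
            base_url="https://openrouter.ai/api/v1",
        )
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        return resp.choices[0].message.content
    raise ValueError(f"Unknown provider: {provider}")
```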
```
tasks/terraform_generation/
├── task_vpc_3subnets_3ec2/
│   ├── spec.yaml            # Task specification
│   └── prompt.txt           # LLM prompt template
├── task_ec2_instance_profile/
├── task_iam_role_policy/
├── task_s3_bucket_policy/
├── task_s3_cors_configuration/
├── task_s3_lifecycle_versioning/
├── task_s3_public_access_block/
├── task_security_group_complex/
├── task_vpc_internet_gateway/
├── task_vpc_multiple_route_tables/
└── task_vpc_nat_gateway/
```
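Each task directory pairs a `spec.yaml` with a `prompt.txt`. Purely as an illustration (the field names and helper below are assumptions, not the actual schema), loading a task might look like:

```python
# Illustrative only: the real spec.yaml schema and loader may differ.
from pathlib import Path
import yaml

def load_task(task_dir: str) -> dict:
    """Read a task's spec and prompt template from its directory."""
    root = Path(task_dir)
    spec = yaml.safe_load((root / "spec.yaml").read_text())
    prompt = (root / "prompt.txt").read_text()
    return {"spec": spec, "prompt": prompt}

task = load_task("tasks/terraform_generation/task_vpc_3subnets_3ec2")
```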
Each task runs through the following pipeline:
- Code Generation: LLM generates Terraform code from prompt
- Format Check:
terraform fmt -check - Initialize:
terraform init - Validate:
terraform validate - Plan:
terraform plan - Apply:
terraform apply(creates resources in LocalStack) - Post-Apply Checks: Verify resources exist and are correctly configured
- Idempotency Check: Second
terraform plan(should show no changes) - Destroy:
terraform destroy(cleanup)
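For orientation, here is a simplified sketch of how such a pipeline could be driven with `subprocess`. The stage structure and flag choices (e.g. `-input=false`) are assumptions for illustration; the actual `run_task.py` may be organized differently and also performs the post-apply checks omitted here:

```python
# Simplified sketch of the execution pipeline; not the actual run_task.py.
import subprocess

STAGES = [
    ("FMT", ["terraform", "fmt", "-check"]),
    ("INIT", ["terraform", "init", "-input=false"]),
    ("VALIDATE", ["terraform", "validate"]),
    ("PLAN", ["terraform", "plan", "-input=false"]),
    ("APPLY", ["terraform", "apply", "-auto-approve", "-input=false"]),
]

def run_pipeline(workdir: str) -> tuple[bool, str]:
    """Run each Terraform stage in order; return (passed, failing_stage)."""
    for stage, cmd in STAGES:
        proc = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
        if proc.returncode != 0:
            return False, stage
    # Idempotency: a second plan with -detailed-exitcode returns 2 if changes remain.
    plan = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir, capture_output=True, text=True,
    )
    if plan.returncode != 0:
        return False, "IDEMPOTENCY"
    subprocess.run(["terraform", "destroy", "-auto-approve"], cwd=workdir)
    return True, ""
```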
Failures are categorized by the pipeline stage in which they occur:

- SYNTAX: Terraform syntax errors
- INIT: Provider initialization failures
- VALIDATE: Configuration validation errors
- PLAN: Planning phase errors
- APPLY: Resource creation failures
- CHECKS: Post-apply validation failures
- IDEMPOTENCY: Second apply shows changes
- DESTROY: Cleanup failures
- Generation: LLM API errors
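The CHECKS stage queries LocalStack directly to confirm that the applied resources match the task specification. As a hedged illustration (the actual `checks.py` and its function names may differ), a post-apply check could use boto3 pointed at the LocalStack endpoint:

```python
# Hedged example of a post-apply check against LocalStack; not the actual checks.py.
import boto3

def check_subnet_count(expected: int, region: str = "us-east-1") -> bool:
    """Verify that at least the expected number of subnets exists in LocalStack."""
    ec2 = boto3.client(
        "ec2",
        region_name=region,
        endpoint_url="http://localhost:4566",   # LocalStack edge endpoint
        aws_access_key_id="test",               # LocalStack accepts dummy credentials
        aws_secret_access_key="test",
    )
    subnets = ec2.describe_subnets()["Subnets"]
    return len(subnets) >= expected
```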
Results are stored in `results/<model>/<task>/<run>/`:

- `benchmark_result.json`: Complete benchmark results
- `check.json`: Post-apply validation results
- `logs/`: Terraform command logs
Reports are generated in `reports/`:

- `comprehensive.md`: Overall model performance across all tasks
- Individual task reports
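As a rough illustration of how the comprehensive report could be assembled from the per-run JSON files (the `passed` field below is an assumption about `benchmark_result.json`, not its actual schema):

```python
# Assumes a boolean "passed" field in benchmark_result.json; the real schema may differ.
import json
from collections import defaultdict
from pathlib import Path

def pass_rates(results_dir: str = "results") -> dict[str, float]:
    """Compute per-model pass rates from stored benchmark_result.json files."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # model -> [passed, total]
    for path in Path(results_dir).glob("*/*/*/benchmark_result.json"):
        model = path.parts[-4]   # results/<model>/<task>/<run>/benchmark_result.json
        result = json.loads(path.read_text())
        totals[model][0] += int(bool(result.get("passed")))
        totals[model][1] += 1
    return {m: passed / total for m, (passed, total) in totals.items() if total}
```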
```bash
# Run single benchmark
python -m terraform_generation_bench.benchmark_cli benchmark \
  --provider openai \
  --model gpt-4 \
  --task-id task_vpc_3subnets_3ec2

# Run benchmark suite
python -m terraform_generation_bench.benchmark_cli suite \
  --models models.json \
  --tasks all \
  --runs-per-model 1

# Generate comprehensive report
python -m terraform_generation_bench.benchmark_cli report \
  --format comprehensive \
  --output reports/comprehensive.md
```

`models.json` lists the models to benchmark:

```json
[
  {"provider": "openai", "model": "gpt-4"},
  {"provider": "anthropic", "model": "claude-3-5-sonnet-20241022"},
  {"provider": "openrouter", "model": "google/gemini-2.5-flash"}
]
```

Set the API keys for the providers you use:

```bash
export OPENAI_API_KEY=your_key
export ANTHROPIC_API_KEY=your_key
export OPENROUTER_API_KEY=your_key  # Optional, used as fallback
```

LocalStack is required for safe testing. Start it with:

```bash
docker compose up -d
```

Verify it's running:

```bash
curl http://localhost:4566/_localstack/health
```
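If runs start before LocalStack is ready, early tasks can fail at INIT or APPLY for reasons unrelated to the model. A small hedged helper (not part of the benchmark itself; name and polling strategy are assumptions) can poll the health endpoint shown above before launching a suite:

```python
# Hedged convenience helper; the benchmark CLI may or may not perform this check itself.
import time
import urllib.request

def wait_for_localstack(url: str = "http://localhost:4566/_localstack/health",
                        timeout: float = 60.0) -> bool:
    """Poll the LocalStack health endpoint until it responds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # Not up yet; retry after a short pause.
        time.sleep(2)
    return False
```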
Based on testing 16 models across 11 tasks:

- 99% of failures occur during Terraform execution (not code generation)
- Most common failures: INIT (33%), SYNTAX (21%), VALIDATE (13%)
- Top performers: DeepSeek Chat (27%), Mistral Large (18%), Llama 3 70B (18%)
- Only 25% of models pass at least one task
This indicates that while LLMs can generate code, they struggle with:
- Correct provider configuration
- Valid Terraform syntax
- Proper resource relationships and dependencies