The Terraform Generation Benchmark evaluates LLMs on their ability to generate executable Terraform code from natural language prompts. This benchmark tests end-to-end code generation and execution, complementing the existing multiple-choice Terraform SRE Benchmark.
| Aspect | Terraform SRE Benchmark | Terraform Generation Benchmark |
|---|---|---|
| Format | Multiple choice (4 code diffs) | Code generation from prompts |
| Testing | Selection accuracy | Full Terraform execution |
| Validation | Correct choice selection | Real resource creation in LocalStack |
| Scope | Understanding code | Generating working code |
```
src/terraform_generation_bench/
├── __init__.py
├── llm_client.py            # Multi-provider LLM client (OpenAI, Anthropic, OpenRouter)
├── terraform_generator.py   # Code extraction from LLM responses
├── benchmark.py             # Benchmark runner
├── benchmark_cli.py         # CLI interface
├── report_generator.py      # Report generation (JSON, HTML, Markdown)
└── runner/
    ├── __init__.py
    ├── run_task.py          # Terraform pipeline execution
    ├── checks.py            # Post-apply validation
    └── utils.py             # Utility functions
```
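The multi-provider client in `llm_client.py` abstracts over the different vendor SDKs. As a rough illustration only (the function below and its signature are assumptions, not the actual implementation), dispatch across the three providers could look like this:

```python
# Hedged sketch of a multi-provider completion call; the real llm_client.py may differ.
import os
from openai import OpenAI
from anthropic import Anthropic

def complete(provider: str, model: str, prompt: str) -> str:
    """Return the raw text completion for a prompt from the chosen provider."""
    if provider == "openai":
        client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        return resp.choices[0].message.content
    if provider == "anthropic":
        client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
        resp = client.messages.create(
            model=model, max_tokens=4096,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text
    if provider == "openrouter":
        # OpenRouter exposes an OpenAI-compatible API under a different base URL.
        client = OpenAI(
            api_key=os.environ["OPENROUTER_API_KEY"],
            base_url="https://openrouter.ai/api/v1",
        )
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        return resp.choices[0].message.content
    raise ValueError(f"Unknown provider: {provider}")
```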
```
tasks/terraform_generation/
├── task_vpc_3subnets_3ec2/
│   ├── spec.yaml            # Task specification
│   └── prompt.txt           # LLM prompt template
├── task_ec2_instance_profile/
├── task_iam_role_policy/
├── task_s3_bucket_policy/
├── task_s3_cors_configuration/
├── task_s3_lifecycle_versioning/
├── task_s3_public_access_block/
├── task_security_group_complex/
├── task_vpc_internet_gateway/
├── task_vpc_multiple_route_tables/
└── task_vpc_nat_gateway/
```
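Each task directory pairs a `spec.yaml` with a `prompt.txt`. Purely as an illustration (the field names and helper below are assumptions, not the actual schema), loading a task might look like:

```python
# Illustrative only: the real spec.yaml schema and loader may differ.
from pathlib import Path
import yaml

def load_task(task_dir: str) -> dict:
    """Read a task's spec and prompt template from its directory."""
    root = Path(task_dir)
    spec = yaml.safe_load((root / "spec.yaml").read_text())
    prompt = (root / "prompt.txt").read_text()
    return {"spec": spec, "prompt": prompt}

task = load_task("tasks/terraform_generation/task_vpc_3subnets_3ec2")
```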
Each task runs through the following pipeline:
- Code Generation: LLM generates Terraform code from prompt
- Format Check:
terraform fmt -check - Initialize:
terraform init - Validate:
terraform validate - Plan:
terraform plan - Apply:
terraform apply(creates resources in LocalStack) - Post-Apply Checks: Verify resources exist and are correctly configured
- Idempotency Check: Second
terraform plan(should show no changes) - Destroy:
terraform destroy(cleanup)
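For orientation, here is a simplified sketch of how such a pipeline could be driven with `subprocess`. The stage structure and flag choices (e.g. `-input=false`) are assumptions for illustration; the actual `run_task.py` may be organized differently and also performs the post-apply checks omitted here:

```python
# Simplified sketch of the execution pipeline; not the actual run_task.py.
import subprocess

STAGES = [
    ("FMT", ["terraform", "fmt", "-check"]),
    ("INIT", ["terraform", "init", "-input=false"]),
    ("VALIDATE", ["terraform", "validate"]),
    ("PLAN", ["terraform", "plan", "-input=false"]),
    ("APPLY", ["terraform", "apply", "-auto-approve", "-input=false"]),
]

def run_pipeline(workdir: str) -> tuple[bool, str]:
    """Run each Terraform stage in order; return (passed, failing_stage)."""
    for stage, cmd in STAGES:
        proc = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
        if proc.returncode != 0:
            return False, stage
    # Idempotency: a second plan with -detailed-exitcode returns 2 if changes remain.
    plan = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir, capture_output=True, text=True,
    )
    if plan.returncode != 0:
        return False, "IDEMPOTENCY"
    subprocess.run(["terraform", "destroy", "-auto-approve"], cwd=workdir)
    return True, ""
```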
Failures are categorized by the pipeline stage in which they occur:

- SYNTAX: Terraform syntax errors
- INIT: Provider initialization failures
- VALIDATE: Configuration validation errors
- PLAN: Planning phase errors
- APPLY: Resource creation failures
- CHECKS: Post-apply validation failures
- IDEMPOTENCY: Second apply shows changes
- DESTROY: Cleanup failures
- Generation: LLM API errors
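The CHECKS stage queries LocalStack directly to confirm that the applied resources match the task specification. As a hedged illustration (the actual `checks.py` and its function names may differ), a post-apply check could use boto3 pointed at the LocalStack endpoint:

```python
# Hedged example of a post-apply check against LocalStack; not the actual checks.py.
import boto3

def check_subnet_count(expected: int, region: str = "us-east-1") -> bool:
    """Verify that at least the expected number of subnets exists in LocalStack."""
    ec2 = boto3.client(
        "ec2",
        region_name=region,
        endpoint_url="http://localhost:4566",   # LocalStack edge endpoint
        aws_access_key_id="test",               # LocalStack accepts dummy credentials
        aws_secret_access_key="test",
    )
    subnets = ec2.describe_subnets()["Subnets"]
    return len(subnets) >= expected
```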
Results are stored in `results/<model>/<task>/<run>/`:

- `benchmark_result.json`: Complete benchmark results
- `check.json`: Post-apply validation results
- `logs/`: Terraform command logs
Reports are generated in `reports/`:

- `comprehensive.md`: Overall model performance across all tasks
- Individual task reports
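As a rough illustration of how the comprehensive report could be assembled from the per-run JSON files (the `passed` field below is an assumption about `benchmark_result.json`, not its actual schema):

```python
# Assumes a boolean "passed" field in benchmark_result.json; the real schema may differ.
import json
from collections import defaultdict
from pathlib import Path

def pass_rates(results_dir: str = "results") -> dict[str, float]:
    """Compute per-model pass rates from stored benchmark_result.json files."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # model -> [passed, total]
    for path in Path(results_dir).glob("*/*/*/benchmark_result.json"):
        model = path.parts[-4]   # results/<model>/<task>/<run>/benchmark_result.json
        result = json.loads(path.read_text())
        totals[model][0] += int(bool(result.get("passed")))
        totals[model][1] += 1
    return {m: passed / total for m, (passed, total) in totals.items() if total}
```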
```bash
# Run single benchmark
python -m terraform_generation_bench.benchmark_cli benchmark \
  --provider openai \
  --model gpt-4 \
  --task-id task_vpc_3subnets_3ec2

# Run benchmark suite
python -m terraform_generation_bench.benchmark_cli suite \
  --models models.json \
  --tasks all \
  --runs-per-model 1

# Generate comprehensive report
python -m terraform_generation_bench.benchmark_cli report \
  --format comprehensive \
  --output reports/comprehensive.md
```

`models.json` lists the models to benchmark:

```json
[
  {"provider": "openai", "model": "gpt-4"},
  {"provider": "anthropic", "model": "claude-3-5-sonnet-20241022"},
  {"provider": "openrouter", "model": "google/gemini-2.5-flash"}
]
```

Set the API keys for the providers you use:

```bash
export OPENAI_API_KEY=your_key
export ANTHROPIC_API_KEY=your_key
export OPENROUTER_API_KEY=your_key  # Optional, used as fallback
```

LocalStack is required for safe testing. Start it with:

```bash
docker compose up -d
```

Verify it's running:

```bash
curl http://localhost:4566/_localstack/health
```
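If runs start before LocalStack is ready, early tasks can fail at INIT or APPLY for reasons unrelated to the model. A small hedged helper (not part of the benchmark itself; name and polling strategy are assumptions) can poll the health endpoint shown above before launching a suite:

```python
# Hedged convenience helper; the benchmark CLI may or may not perform this check itself.
import time
import urllib.request

def wait_for_localstack(url: str = "http://localhost:4566/_localstack/health",
                        timeout: float = 60.0) -> bool:
    """Poll the LocalStack health endpoint until it responds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # Not up yet; retry after a short pause.
        time.sleep(2)
    return False
```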
Based on testing 16 models across 11 tasks:

- 99% of failures occur during Terraform execution (not code generation)
- Most common failures: INIT (33%), SYNTAX (21%), VALIDATE (13%)
- Top performers: DeepSeek Chat (27%), Mistral Large (18%), Llama 3 70B (18%)
- Only 25% of models pass at least one task
This indicates that while LLMs can generate code, they struggle with:
- Correct provider configuration
- Valid Terraform syntax
- Proper resource relationships and dependencies