
Benchmarks

Joshua Davis edited this page Mar 31, 2026 · 1 revision

Overview

The benchmark suite measures AI-generated code quality across 14 dimensions. It is project-agnostic and can be applied to any multi-stage build pipeline. Both GitHub Copilot and Claude Code are evaluated against identical input prompts produced by the az prototype build pipeline.

How Scores Are Generated

Both tools receive identical input prompts produced by the extension's prompt construction pipeline. GitHub Copilot processes these natively during a live build run; the same prompts are then extracted from the debug log and submitted verbatim to Claude Code. Each tool's responses are scored independently against the 14 benchmarks, each of which has four to five weighted sub-factors totaling 100 points. Because both tools see exactly the same input, score differences reflect genuine differences in output quality.
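The weighted sub-factor scoring can be sketched as follows. This is a minimal illustration only: the sub-factor names and weights below are hypothetical assumptions, not the actual rubric, which lives in benchmarks/README.md.

```python
# Illustrative sketch of weighted sub-factor scoring.
# Sub-factor names and weights are hypothetical; the real
# rubrics are defined in benchmarks/README.md.

def score_benchmark(sub_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine 0-100 sub-factor scores into one 0-100 benchmark score.

    Weights must sum to 1.0 so the result stays on the 100-point scale.
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(sub_scores[name] * w for name, w in weights.items())

# Hypothetical B-TECH rubric with four weighted sub-factors.
weights = {"syntax": 0.4, "deployability": 0.3, "idioms": 0.2, "tests": 0.1}
sub_scores = {"syntax": 95, "deployability": 80, "idioms": 70, "tests": 60}

print(score_benchmark(sub_scores, weights))  # 82.0
```

Each tool's response set yields one such 0-100 score per benchmark, which is what the reports compare.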

14 Benchmarks

| ID | Name | Description |
| --- | --- | --- |
| B-INST | Instruction Adherence | Does the output implement exactly what was requested? |
| B-CNST | Constraint Compliance | Are NEVER/MUST/CRITICAL directives followed? |
| B-TECH | Technical Correctness | Is the code syntactically valid and deployable? |
| B-SEC | Security Posture | Are security best practices followed? |
| B-OPS | Operational Readiness | Are deploy scripts production-grade? |
| B-DEP | Dependency Hygiene | Are dependencies minimal and correctly versioned? |
| B-SCOPE | Scope Discipline | Does the output stay within requested boundaries? |
| B-QUAL | Code Quality | Is the code well-organized and maintainable? |
| B-OUT | Output Completeness | Are all required interfaces properly defined? |
| B-CONS | Cross-Stage Consistency | Are patterns uniform across all stages? |
| B-DOC | Documentation Quality | Are docs complete, accurate, and actionable? |
| B-REL | Response Reliability | Is the response complete and parseable? |
| B-RBAC | RBAC & Identity | Are identity/role patterns correct per service? |
| B-ANTI | Anti-Pattern Absence | Are known bad patterns absent from output? |

Each benchmark has 4-5 weighted sub-factors. Full scoring rubrics are in benchmarks/README.md.

Testing Workflow

  1. Run az prototype build --debug via GitHub Copilot
  2. Extract stage prompts and responses from the debug log
  3. Submit each prompt to Claude Code and save the responses
  4. Score both response sets against the 14 benchmarks
  5. Generate reports

See benchmarks/INSTRUCTIONS.md for detailed steps, extraction scripts, and copy-paste analysis instructions.
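Step 2 of the workflow can be sketched as below. The debug-log delimiter format shown is a hypothetical assumption for illustration only; the real extraction scripts and log format are documented in benchmarks/INSTRUCTIONS.md.

```python
# Minimal sketch of prompt extraction (step 2). The delimiters
# below are HYPOTHETICAL; consult benchmarks/INSTRUCTIONS.md for
# the real debug-log format and extraction scripts.
import re

LOG = """\
=== STAGE PROMPT: infra ===
Generate the Bicep templates...
=== END PROMPT ===
=== STAGE PROMPT: app ===
Generate the application code...
=== END PROMPT ===
"""

def extract_prompts(log_text: str) -> dict[str, str]:
    """Return {stage_name: prompt_text} for each delimited block."""
    pattern = r"=== STAGE PROMPT: (\w+) ===\n(.*?)\n=== END PROMPT ==="
    return {stage: body for stage, body in re.findall(pattern, log_text, re.S)}

prompts = extract_prompts(LOG)
print(sorted(prompts))  # ['app', 'infra']
```

Each extracted prompt is then submitted verbatim to Claude Code so both tools are judged on identical input.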

Reports

| File | Purpose | Updated |
| --- | --- | --- |
| benchmarks/YYYY-MM-DD-HH-mm-ss.html | Per-run benchmark report with stage tabs | Every run |
| benchmarks/overall.html | Trends dashboard with per-benchmark detail tabs | On instruction only |
| benchmarks/YYYY-MM-DD_Benchmark_Report.pdf | PDF report from TEMPLATE.docx with embedded charts | On instruction only |

Individual run reports may be generated at any time for testing. The trends dashboard and PDF are only updated when explicitly instructed.

PDF Generation

```shell
python scripts/generate_pdf.py
```

This populates benchmarks/TEMPLATE.docx with scores, generates 29 matplotlib charts (1 overall trend + 14 factor comparisons + 14 score trends), embeds them, converts to PDF via docx2pdf, and cleans up the temporary DOCX.
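The 29-chart inventory breaks down as 1 + 14 + 14. The sketch below enumerates it; the filenames are illustrative assumptions, and only the counts come from the description above.

```python
# Sketch of the 29-chart inventory: 1 overall trend chart plus one
# factor-comparison and one score-trend chart per benchmark.
# Filenames are hypothetical; only the counts come from the docs.
BENCHMARKS = [
    "B-INST", "B-CNST", "B-TECH", "B-SEC", "B-OPS", "B-DEP", "B-SCOPE",
    "B-QUAL", "B-OUT", "B-CONS", "B-DOC", "B-REL", "B-RBAC", "B-ANTI",
]

charts = ["overall_trend.png"]                        # 1 overall trend
charts += [f"{b}_factors.png" for b in BENCHMARKS]    # 14 factor comparisons
charts += [f"{b}_trend.png" for b in BENCHMARKS]      # 14 score trends

print(len(charts))  # 29
```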

Score Ratings

| Rating | Range | Action |
| --- | --- | --- |
| Excellent | 90-100 | Monitor for regressions |
| Good | 75-89 | Review specific sub-criteria |
| Acceptable | 60-74 | Prioritize improvements |
| Poor | 40-59 | Investigate root causes |
| Failing | 0-39 | Immediate investigation required |
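The rating bands map directly to score thresholds, as in this small sketch (the function name is an illustration, not part of the suite):

```python
# Maps a 0-100 benchmark score to its rating band, per the table above.
def rating(score: float) -> str:
    if score >= 90:
        return "Excellent"
    if score >= 75:
        return "Good"
    if score >= 60:
        return "Acceptable"
    if score >= 40:
        return "Poor"
    return "Failing"

print(rating(82))  # Good
```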
