Benchmarks
The benchmark suite measures AI-generated code quality across 14 dimensions. It is project-agnostic and can be applied to any multi-stage build pipeline. Both GitHub Copilot and Claude Code are evaluated against identical input prompts produced by the az prototype build pipeline.
The prompts come from the extension's prompt construction pipeline. GitHub Copilot processes them natively during a live build run; the same prompts are then extracted from the debug log and submitted verbatim to Claude Code. Each tool's responses are scored independently against the 14 benchmarks, and each benchmark has 4-5 weighted sub-factors totaling 100 points. Because both tools see exactly the same input, score differences can be attributed to output quality rather than to differences in prompting.
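As a rough illustration of how weighted sub-factors could roll up into a 100-point benchmark score, here is a minimal sketch. The `SubFactor` type, the `benchmark_score` function, and the sub-factor names and weights are all hypothetical; the actual rubrics live in benchmarks/README.md and may be computed differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubFactor:
    name: str        # hypothetical sub-factor label
    weight: int      # maximum points this sub-factor can contribute
    fraction: float  # share of the weight earned, from 0.0 to 1.0


def benchmark_score(sub_factors: list[SubFactor]) -> float:
    """Sum the points earned per sub-factor; weights must total 100."""
    total_weight = sum(sf.weight for sf in sub_factors)
    if total_weight != 100:
        raise ValueError(f"weights total {total_weight}, expected 100")
    for sf in sub_factors:
        if not 0.0 <= sf.fraction <= 1.0:
            raise ValueError(f"{sf.name}: fraction must be within 0.0-1.0")
    return sum(sf.weight * sf.fraction for sf in sub_factors)


# Illustrative values only; not taken from the real rubrics.
example = [
    SubFactor("factor_a", 40, 1.0),
    SubFactor("factor_b", 30, 0.5),
    SubFactor("factor_c", 20, 0.75),
    SubFactor("factor_d", 10, 0.0),
]
print(benchmark_score(example))  # 40 + 15 + 15 + 0 = 70.0
```

Validating that the weights total 100 up front keeps scores comparable across benchmarks, which matters when two tools are ranked on the same scale.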
| ID | Name | Description |
|---|---|---|
| B-INST | Instruction Adherence | Does the output implement exactly what was requested? |
| B-CNST | Constraint Compliance | Are NEVER/MUST/CRITICAL directives followed? |
| B-TECH | Technical Correctness | Is the code syntactically valid and deployable? |
| B-SEC | Security Posture | Are security best practices followed? |
| B-OPS | Operational Readiness | Are deploy scripts production-grade? |
| B-DEP | Dependency Hygiene | Are dependencies minimal and correctly versioned? |
| B-SCOPE | Scope Discipline | Does the output stay within requested boundaries? |
| B-QUAL | Code Quality | Is the code well-organized and maintainable? |
| B-OUT | Output Completeness | Are all required interfaces properly defined? |
| B-CONS | Cross-Stage Consistency | Are patterns uniform across all stages? |
| B-DOC | Documentation Quality | Are docs complete, accurate, and actionable? |
| B-REL | Response Reliability | Is the response complete and parseable? |
| B-RBAC | RBAC & Identity | Are identity/role patterns correct per service? |
| B-ANTI | Anti-Pattern Absence | Are known bad patterns absent from output? |
Each benchmark has 4-5 weighted sub-factors. Full scoring rubrics are in benchmarks/README.md.
- Run `az prototype build --debug` via GitHub Copilot
- Extract stage prompts and responses from the debug log
- Submit each prompt to Claude Code and save the responses
- Score both response sets against the 14 benchmarks
- Generate reports
See benchmarks/INSTRUCTIONS.md for detailed steps, extraction scripts, and copy-paste analysis instructions.
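The scoring step in the workflow above produces one score per benchmark for each tool. A minimal sketch of comparing the two result sets side by side follows; `compare_scores` is a hypothetical helper, and the scores shown are placeholders rather than real results.

```python
# Benchmark IDs as listed in the benchmark table.
BENCHMARK_IDS = [
    "B-INST", "B-CNST", "B-TECH", "B-SEC", "B-OPS", "B-DEP", "B-SCOPE",
    "B-QUAL", "B-OUT", "B-CONS", "B-DOC", "B-REL", "B-RBAC", "B-ANTI",
]


def compare_scores(copilot: dict[str, float], claude: dict[str, float]):
    """Return (id, copilot, claude, delta) rows; delta is claude - copilot."""
    rows = []
    for bid in BENCHMARK_IDS:
        if bid not in copilot or bid not in claude:
            raise KeyError(f"missing score for {bid}")
        rows.append((bid, copilot[bid], claude[bid], claude[bid] - copilot[bid]))
    return rows


# Placeholder scores for illustration only; not measured results.
copilot_scores = {bid: 80.0 for bid in BENCHMARK_IDS}
claude_scores = {**copilot_scores, "B-SEC": 85.0}

for bid, a, b, delta in compare_scores(copilot_scores, claude_scores):
    print(f"{bid:8} {a:6.1f} {b:6.1f} {delta:+6.1f}")
```

Raising on a missing benchmark ID, rather than defaulting to zero, avoids silently penalizing a tool when a response set is incomplete.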
| File | Purpose | Updated |
|---|---|---|
| benchmarks/YYYY-MM-DD-HH-mm-ss.html | Per-run benchmark report with stage tabs | Every run |
| benchmarks/overall.html | Trends dashboard with per-benchmark detail tabs | On instruction only |
| benchmarks/YYYY-MM-DD_Benchmark_Report.pdf | PDF report from TEMPLATE.docx with embedded charts | On instruction only |
Individual run reports may be generated at any time for testing. The trends dashboard and PDF are only updated when explicitly instructed.
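The file name patterns in the table can be produced with standard `datetime` formatting. This is a sketch only: it assumes `HH` means a zero-padded 24-hour clock and `mm` means minutes, and the helper names are hypothetical rather than taken from the project's scripts.

```python
from datetime import date, datetime
from pathlib import PurePosixPath

REPORT_DIR = PurePosixPath("benchmarks")


def run_report_path(started_at: datetime) -> PurePosixPath:
    """Per-run HTML report: benchmarks/YYYY-MM-DD-HH-mm-ss.html."""
    return REPORT_DIR / f"{started_at:%Y-%m-%d-%H-%M-%S}.html"


def pdf_report_path(day: date) -> PurePosixPath:
    """PDF report: benchmarks/YYYY-MM-DD_Benchmark_Report.pdf."""
    return REPORT_DIR / f"{day:%Y-%m-%d}_Benchmark_Report.pdf"


# Arbitrary example timestamp.
stamp = datetime(2025, 1, 2, 3, 4, 5)
print(run_report_path(stamp))         # benchmarks/2025-01-02-03-04-05.html
print(pdf_report_path(stamp.date()))  # benchmarks/2025-01-02_Benchmark_Report.pdf
```

Because every field is zero-padded and ordered from year to second, sorting the file names alphabetically also sorts the runs chronologically.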
To generate the PDF report, run `python scripts/generate_pdf.py`. This populates benchmarks/TEMPLATE.docx with scores, generates 29 matplotlib charts (1 overall trend + 14 factor comparisons + 14 score trends), embeds them, converts to PDF via docx2pdf, and cleans up the temporary DOCX.
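The chart count follows directly from the 14 benchmarks: one overall trend, plus one factor comparison and one score trend per benchmark. The sketch below enumerates such a plan; the chart names are hypothetical and are not the names used by scripts/generate_pdf.py.

```python
# Benchmark IDs as listed in the benchmark table.
BENCHMARK_IDS = [
    "B-INST", "B-CNST", "B-TECH", "B-SEC", "B-OPS", "B-DEP", "B-SCOPE",
    "B-QUAL", "B-OUT", "B-CONS", "B-DOC", "B-REL", "B-RBAC", "B-ANTI",
]


def chart_plan(benchmark_ids: list[str]) -> list[str]:
    """List one overall trend plus two charts per benchmark."""
    plan = ["overall_trend"]
    plan += [f"{bid}_factor_comparison" for bid in benchmark_ids]
    plan += [f"{bid}_score_trend" for bid in benchmark_ids]
    return plan


plan = chart_plan(BENCHMARK_IDS)
print(len(plan))  # 1 + 14 + 14 = 29
```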
| Rating | Range | Action |
|---|---|---|
| Excellent | 90-100 | Monitor for regressions |
| Good | 75-89 | Review specific sub-criteria |
| Acceptable | 60-74 | Prioritize improvements |
| Poor | 40-59 | Investigate root causes |
| Failing | 0-39 | Immediate investigation required |
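The rating table maps directly onto a lookup by lower bound. In this minimal sketch the `rate` function is hypothetical, and treating a fractional score such as 89.5 as belonging to the lower band is an assumption, since the table only lists whole-number ranges.

```python
# (lower bound, rating, action), ordered from highest band to lowest.
RATING_BANDS = [
    (90, "Excellent", "Monitor for regressions"),
    (75, "Good", "Review specific sub-criteria"),
    (60, "Acceptable", "Prioritize improvements"),
    (40, "Poor", "Investigate root causes"),
    (0, "Failing", "Immediate investigation required"),
]


def rate(score: float) -> tuple[str, str]:
    """Return (rating, action) for a benchmark score between 0 and 100."""
    if not 0 <= score <= 100:
        raise ValueError(f"score {score} is outside 0-100")
    for lower, rating, action in RATING_BANDS:
        if score >= lower:
            return rating, action
    raise AssertionError("unreachable: the lowest band starts at 0")


print(rate(92))  # ('Excellent', 'Monitor for regressions')
print(rate(74))  # ('Acceptable', 'Prioritize improvements')
```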