|
1 | 1 | # Evalbuff |
2 | 2 |
|
3 | | -Evalbuff is an automated system that iteratively improves a coding agent's performance by optimizing project documentation. It runs overnight, discovers what an agent gets wrong, writes docs to fix those gaps, and keeps only the changes that measurably improve scores. |
| 3 | +Evalbuff improves a coding agent's performance by iteratively optimizing project documentation. It watches an agent fail, writes docs to fix the pattern, and keeps only the changes that measurably help. |
4 | 4 |
|
5 | | -## The Idea |
| 5 | +## Two Modes |
6 | 6 |
|
7 | | -Most coding agents read project documentation before making changes. Better docs lead to better code. But writing good docs is hard — you don't know what an agent needs to know until you watch it fail. |
| 7 | +### 1. Commit Learning Mode (default) |
8 | 8 |
|
9 | | -Evalbuff closes this loop automatically: |
| 9 | +Walks through your repo's git history commit-by-commit, using each commit as a learning opportunity: |
10 | 10 |
|
11 | | -1. **Run** a coding agent on real eval tasks (reconstructing git commits) |
12 | | -2. **Judge** the output with AI judges that apply living quality criteria |
13 | | -3. **Analyze** failures — feed the judge's weaknesses to a doc-writer agent |
14 | | -4. **Test** whether a proposed doc edit actually improves the agent's score |
15 | | -5. **Keep** doc changes that help, revert ones that don't |
16 | | -6. **Repeat** until the budget runs out or scores plateau |
| 11 | +1. Start at HEAD~500 (configurable) and process commits one at a time, oldest first |
| 12 | +2. For each commit, craft a human-like prompt that vaguely describes the change (via LLM) |
| 13 | +3. Run N agents in parallel (default 5) on that prompt against the parent commit |
| 14 | +4. Judge all runs — using the actual commit diff as ground truth |
| 15 | +5. Always analyze failures and propose doc changes (ensuring they're generic enough to help future tasks, not just this one) |
| 16 | +6. Re-run N agents with the proposed docs |
| 17 | +7. If scores improve, keep the docs and try to propose more improvements |
| 18 | +8. If scores don't improve, reject the docs and move to the next commit |
| 19 | +9. State is saved after each commit — resume at any time |
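The keep/reject core of the loop above can be sketched as follows. The agent runs, judging, and doc writing are abstracted into callbacks, and every name here is illustrative rather than the actual evalbuff API:

```typescript
// Sketch of the per-task doc optimization loop (steps 4-8 above).
// All types and helper shapes are hypothetical, for illustration only.
type Scorer = () => number; // run N agents + judge them, return average score
type DocProposal = { path: string; body: string };

function optimizeDocs(
  score: Scorer,
  propose: () => DocProposal,
  apply: (d: DocProposal) => void,
  revert: (d: DocProposal) => void,
): DocProposal[] {
  const kept: DocProposal[] = [];
  let baseline = score(); // baseline: N agents without the new doc
  // Keep proposing docs until one fails to beat the current baseline.
  for (;;) {
    const doc = propose();
    apply(doc);
    const rerun = score(); // N agents again, now with the proposed doc
    if (rerun > baseline) {
      baseline = rerun; // doc helped: keep it, try another improvement
      kept.push(doc);
    } else {
      revert(doc); // doc did not help: reject it, move to the next task
      return kept;
    }
  }
}
```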
17 | 20 |
|
18 | | -The result: a `docs/` directory and `AGENTS.md` table of contents that encode exactly what the agent needs to know to perform well on your codebase. Any agent that reads project docs benefits — Claude Code, Codex, Codebuff, or anything else with a CLI. |
| 21 | +The result: a `docs/` directory that encodes patterns the agent needs to know, learned from real historical changes. |
19 | 22 |
|
20 | | -## Why Documentation? |
| 23 | +### 2. Prompt Mode |
21 | 24 |
|
22 | | -We chose documentation as the improvement lever because: |
| 25 | +Run a specific coding prompt and improve docs for it — no git history needed: |
23 | 26 |
|
24 | | -- **Agent-agnostic.** Every modern coding agent reads project docs. Improving docs improves all agents, not just one. |
25 | | -- **Interpretable.** Unlike fine-tuning weights or tweaking system prompts, docs are human-readable. You can review what evalbuff learned and decide if it makes sense. |
26 | | -- **Composable.** Doc improvements stack. A doc about error handling patterns doesn't conflict with a doc about naming conventions. |
27 | | -- **Persistent.** Docs live in the repo and benefit every future session, not just the current one. |
| 27 | +1. Take a prompt describing a coding task |
| 28 | +2. Run N agents in parallel on the prompt against the current HEAD |
| 29 | +3. Judge all runs — no ground truth; scoring relies entirely on the judge's e2e testing |
| 30 | +4. Analyze and propose doc changes |
| 31 | +5. Re-run and keep or reject docs as in commit learning mode |
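One plausible way the judge scores from N runs and multiple reviewers (`--reviewers claude,codex`) could be combined into a single task score is plain averaging. The shapes below are assumptions for illustration, not the actual `judge.ts` schema:

```typescript
// Average each run across its reviewers, then average across runs.
// The { reviewer -> score } record shape is an assumption.
type ReviewerScores = Record<string, number>; // reviewer name -> score (0-10)

function taskScore(runs: ReviewerScores[]): number {
  const perRun = runs.map((r) => {
    const scores = Object.values(r);
    return scores.reduce((a, b) => a + b, 0) / scores.length;
  });
  return perRun.reduce((a, b) => a + b, 0) / perRun.length;
}
```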
28 | 32 |
|
29 | | -## Living Quality Criteria |
30 | | - |
31 | | -Evalbuff uses a leveling system so it doesn't try to optimize everything at once: |
| 33 | +Useful for targeted doc improvement around known pain points. |
32 | 34 |
|
33 | | -| Level | Criteria Added | When | |
34 | | -|-------|---------------|------| |
35 | | -| L1 | Correctness, Completeness, Basic Style | Start | |
36 | | -| L2 | + Pattern Consistency | After L1 avg >= 8.0 over 10 tasks | |
37 | | -| L3 | + Test Quality | After L2 avg >= 8.0 over 10 tasks | |
38 | | -| L4 | + Optimal Design | After L3 avg >= 8.0 over 10 tasks | |
39 | | -| L5 | + Fluency | After L4 avg >= 8.0 over 10 tasks | |
40 | | - |
41 | | -This prevents the system from penalizing an agent for style issues when it can't even get the code to compile. Criteria are injected directly into the AI judge prompts. |
42 | | - |
43 | | -## Architecture |
| 35 | +## How It Works |
44 | 36 |
|
45 | 37 | ``` |
46 | | -┌─────────────────────────────────────────────────────┐ |
47 | | -│ Orchestrator │ |
48 | | -│ (run-evalbuff.ts) │ |
49 | | -│ │ |
50 | | -│ for each eval task: │ |
51 | | -│ 1. Clone repo into isolated temp dir │ |
52 | | -│ 2. Copy current docs/ into the clone │ |
53 | | -│ 3. Run agent CLI on the task prompt │ |
54 | | -│ 4. Judge the diff against ground truth │ |
55 | | -│ 5. If score < threshold: │ |
56 | | -│ a. Analyze failure → propose doc edit │ |
57 | | -│ b. Re-run agent with new doc │ |
58 | | -│ c. Re-judge → keep doc if score improved │ |
59 | | -│ 6. Update criteria level if scores are high │ |
60 | | -│ 7. Log entry to JSONL, save state │ |
61 | | -│ │ |
62 | | -│ Generate morning report │ |
63 | | -└─────────────────────────────────────────────────────┘ |
| 38 | +for each task (commit or prompt): |
| 39 | + ┌─────────────────────────────────────────────────────┐ |
| 40 | + │ 1. Run N agents in parallel (baseline) │ |
| 41 | + │ 2. Judge all N runs → average score │ |
| 42 | + │ 3. Analyze worst run → propose generic doc │ |
| 43 | + │ 4. Apply doc to repo │ |
| 44 | + │ 5. Re-run N agents with new doc │ |
| 45 | + │ 6. Score improved? Keep doc, try more improvements │ |
| 46 | + │ Score same/worse? Reject doc, next task │ |
| 47 | + └─────────────────────────────────────────────────────┘ |
64 | 48 | ``` |
65 | 49 |
|
66 | | -### Components |
67 | | - |
68 | | -| File | Role | |
69 | | -|------|------| |
70 | | -| `run-evalbuff.ts` | Main orchestrator loop with budget caps and resumable state | |
71 | | -| `cli-runner.ts` | Agent-agnostic CLI runner — spawns any agent command, captures git diff | |
72 | | -| `judge.ts` | AI judging system (GPT-5.1 + Gemini) with criteria injection | |
73 | | -| `docs-optimizer.ts` | Failure analysis, doc writing, doc application, score comparison | |
74 | | -| `criteria.ts` | Living quality criteria with L1-L5 promotion logic | |
75 | | -| `morning-report.ts` | Generates markdown summary from overnight JSONL log | |
76 | | -| `test-repo-utils.ts` | Creates isolated git repos per eval task | |
77 | | -| `agent-runner.ts` | BuffBench-style agent runner (for Codebuff SDK agents) | |
78 | | -| `types.ts` | Shared types (EvalCommitV2, EvalDataV2, etc.) | |
| 50 | +Key design decisions: |
| 51 | +- **Low-cost agent** (`codebuff --agent base2-free` by default) — runs many times cheaply |
| 52 | +- **N parallel runs** for statistical significance — one run is noisy; five give a decent signal |
| 53 | +- **Always analyze** — no score threshold; every task is a learning opportunity |
| 54 | +- **Generic docs only** — the doc writer is instructed to skip task-specific advice and focus on patterns |
| 55 | +- **Iterative improvement** — keeps proposing docs until one is rejected, then moves on |
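The "N parallel runs" decision can be sketched with node's `child_process`. The real `cli-runner.ts` also captures each agent's git diff and output, which this sketch omits; only the parallel spawning and per-agent timeout are shown:

```typescript
import { spawn } from "node:child_process";

// Run one agent CLI to completion; resolve with its exit code.
function runOnce(
  cmd: string,
  args: string[],
  cwd: string,
  timeoutMs: number,
): Promise<number> {
  return new Promise((resolve) => {
    // `timeout` makes node kill the child if it runs too long
    const child = spawn(cmd, args, { cwd, stdio: "ignore", timeout: timeoutMs });
    child.on("close", (code) => resolve(code ?? -1)); // null code = killed
    child.on("error", () => resolve(-1));
  });
}

// N independent agent processes, judged together afterwards.
function runAgentsInParallel(
  cmd: string,
  args: string[],
  cwd: string,
  n = 5,
  timeoutMs = 300_000,
): Promise<number[]> {
  return Promise.all(
    Array.from({ length: n }, () => runOnce(cmd, args, cwd, timeoutMs)),
  );
}
```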
79 | 56 |
|
80 | 57 | ## Usage |
81 | 58 |
|
82 | | -### Command Line |
| 59 | +### Commit Learning Mode |
83 | 60 |
|
84 | 61 | ```bash |
85 | 62 | bun run evalbuff/src/run-evalbuff.ts \ |
86 | 63 | --repo /path/to/target-repo \ |
87 | | - --agent "claude -p" \ |
88 | | - --evals evals/buffbench/eval-codebuff.json,evals/buffbench/eval-manifold.json \ |
89 | | - --max-iterations 50 \ |
90 | | - --max-cost 50 \ |
91 | | - --score-threshold 7.0 \ |
92 | | - --agent-timeout 300000 |
| 64 | + --agent "codebuff --agent base2-free" \ |
| 65 | + --commits 500 \ |
| 66 | + --parallelism 5 \ |
| 67 | + --max-cost 100 |
93 | 68 | ``` |
94 | 69 |
|
95 | | -Or via the workspace script: |
| 70 | +### Prompt Mode |
96 | 71 |
|
97 | 72 | ```bash |
98 | | -bun run --filter @codebuff/evalbuff run -- \ |
| 73 | +bun run evalbuff/src/run-evalbuff.ts \ |
99 | 74 | --repo /path/to/target-repo \ |
100 | | - --agent "codex exec --full-auto" \ |
101 | | - --evals evals/buffbench/eval-codebuff.json |
| 75 | + --agent "codebuff --agent base2-free" \ |
| 76 | + --prompt "Add a dark mode toggle to the settings page" \ |
| 77 | + --parallelism 5 |
102 | 78 | ``` |
103 | 79 |
|
104 | 80 | ### Arguments |
105 | 81 |
|
106 | 82 | | Argument | Default | Description | |
107 | 83 | |----------|---------|-------------| |
108 | 84 | | `--repo` | required | Path to the target repo where docs/ will be written | |
109 | | -| `--agent` | required | Agent CLI command (prompt is appended as last arg) | |
110 | | -| `--evals` | required | Comma-separated paths to eval JSON files | |
111 | | -| `--max-iterations` | 50 | Stop after this many tasks | |
112 | | -| `--max-cost` | 50 | Stop after spending this many USD (estimated) | |
113 | | -| `--score-threshold` | 7.0 | Only attempt doc edits for scores below this | |
114 | | -| `--agent-timeout` | 300000 | Per-task agent timeout in ms (5 min default) | |
| 85 | +| `--agent` | `codebuff --agent base2-free` | Agent CLI command (prompt appended as last arg) | |
| 86 | +| `--prompt` | — | If set, runs in prompt mode instead of learn mode | |
| 87 | +| `--commits` | 500 | How many commits back to start from (learn mode) | |
| 88 | +| `--parallelism` | 5 | Number of agents to run in parallel per task | |
| 89 | +| `--max-cost` | 100 | Stop after spending this many USD (estimated) | |
| 90 | +| `--agent-timeout` | 300000 | Per-agent timeout in ms (5 min default) | |
| 91 | +| `--init-command` | — | Command to run in each test repo (e.g., `npm install`) | |
115 | 92 | | `--criteria` | auto | Path to criteria JSON (auto-created if omitted) | |
| 93 | +| `--reviewers` | `claude,codex` | Comma-separated reviewer agent types | |
116 | 94 |
|
117 | | -### Overnight Run |
| 95 | +### Resuming |
118 | 96 |
|
119 | | -For an overnight run, set generous limits and let it go: |
| 97 | +State is saved to `evalbuff-state.json` in the target repo after each commit. Re-running with the same `--repo` automatically resumes from where it left off — it knows which commit was last processed and continues from there. |
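A minimal sketch of that resume logic, assuming a state file shape of `{ lastSha, completed }` (the actual `evalbuff-state.json` schema may differ):

```typescript
import { existsSync, readFileSync } from "node:fs";

// Hypothetical state shape; the real file may store more fields.
type State = { lastSha: string | null; completed: number };

function loadState(path: string): State {
  if (!existsSync(path)) return { lastSha: null, completed: 0 }; // fresh run
  return JSON.parse(readFileSync(path, "utf8")) as State;
}

// Skip every commit up to and including the last processed one.
function remaining(commits: string[], state: State): string[] {
  if (state.lastSha === null) return commits;
  const i = commits.indexOf(state.lastSha);
  return i === -1 ? commits : commits.slice(i + 1);
}
```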
| 98 | + |
| 99 | +### Overnight Run |
120 | 100 |
|
121 | 101 | ```bash |
122 | 102 | nohup bun run evalbuff/src/run-evalbuff.ts \ |
123 | 103 | --repo /path/to/repo \ |
124 | | - --agent "claude -p" \ |
125 | | - --evals evals/buffbench/eval-codebuff.json \ |
126 | | - --max-iterations 200 \ |
127 | | - --max-cost 100 \ |
| 104 | + --commits 500 \ |
| 105 | + --parallelism 5 \ |
| 106 | + --max-cost 200 \ |
128 | 107 | > evalbuff-overnight.log 2>&1 & |
129 | 108 | ``` |
130 | 109 |
|
131 | | -Check results in the morning: |
132 | | -- `<repo>/evalbuff-report-YYYY-MM-DD.md` — morning report |
133 | | -- `<repo>/evalbuff-log.jsonl` — detailed per-task log |
134 | | -- `<repo>/docs/` — the docs that were kept |
135 | | -- `<repo>/AGENTS.md` — table of contents |
136 | | - |
137 | | -### Resumable |
138 | | - |
139 | | -Evalbuff saves state to `evalbuff-state.json` in the target repo. If interrupted, re-running with the same arguments will skip completed tasks and continue where it left off. |
140 | | - |
141 | | -## How It Decides What Docs to Write |
142 | | - |
143 | | -When an agent scores below the threshold on a task, evalbuff: |
144 | | - |
145 | | -1. **Feeds the judge's weaknesses** to a doc-writer LLM agent |
146 | | -2. The doc writer sees: the task prompt, ground truth diff, agent's diff, judge analysis, and all current docs |
147 | | -3. It produces a **targeted doc file** — specific to the gap between what the agent did and what it should have done |
148 | | -4. The doc is written to `docs/<suggested-path>.md` and `AGENTS.md` is updated |
149 | | - |
150 | | -The doc writer is instructed to be specific and actionable — referencing concrete file paths, function names, and patterns. Generic advice like "follow best practices" is explicitly rejected. |
151 | | - |
152 | 110 | ## What Gets Produced |
153 | 111 |
|
154 | | -After a run, the target repo will contain: |
155 | | - |
156 | 112 | ``` |
157 | 113 | target-repo/ |
158 | | -├── docs/ |
| 114 | +├── docs/ # Generated documentation |
159 | 115 | │ ├── patterns/ |
160 | | -│ │ └── error-handling.md # Evalbuff-generated |
| 116 | +│ │ └── error-handling.md |
161 | 117 | │ ├── conventions/ |
162 | | -│ │ └── naming.md # Evalbuff-generated |
| 118 | +│ │ └── naming.md |
163 | 119 | │ └── architecture/ |
164 | | -│ └── data-flow.md # Evalbuff-generated |
165 | | -├── AGENTS.md # Table of contents |
166 | | -├── evalbuff-state.json # Resumable state |
167 | | -├── evalbuff-log.jsonl # Per-task log |
168 | | -├── evalbuff-criteria.json # Current criteria level |
169 | | -└── evalbuff-report-2026-03-25.md # Morning report |
| 120 | +│ └── data-flow.md |
| 121 | +├── AGENTS.md # Table of contents |
| 122 | +├── evalbuff-state.json # Resumable state (last commit SHA) |
| 123 | +├── evalbuff-log.jsonl # Per-task log |
| 124 | +├── evalbuff-criteria.json # Current criteria level |
| 125 | +└── evalbuff-report-2026-03-26.md # Report |
170 | 126 | ``` |
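The JSONL log is easy to post-process. A small sketch for extracting a score trajectory, assuming each entry carries a `score` field (the actual log schema may include more):

```typescript
// Parse evalbuff-log.jsonl content into a list of per-task scores.
// The { task, score } entry shape is an assumption for illustration.
function scoreTrajectory(jsonl: string): number[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0) // skip blank trailing lines
    .map((line) => JSON.parse(line) as { task: string; score: number })
    .map((entry) => entry.score);
}
```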
171 | 127 |
|
172 | | -### Morning Report |
173 | | - |
174 | | -The morning report includes: |
175 | | -- Summary table (iterations, cost, duration, score deltas) |
176 | | -- Doc changes table (which docs were tried, score impact, kept/reverted) |
177 | | -- Error log |
178 | | -- Score trajectory visualization |
179 | | - |
180 | | -## Eval Data Format |
181 | | - |
182 | | -Evalbuff reuses BuffBench's `EvalDataV2` format. Eval tasks are real git commits from open source repos, turned into prompts: |
183 | | - |
184 | | -```json |
185 | | -{ |
186 | | - "repoUrl": "https://github.com/org/repo", |
187 | | - "evalCommits": [ |
188 | | - { |
189 | | - "id": "task-abc123", |
190 | | - "sha": "abc123", |
191 | | - "parentSha": "def456", |
192 | | - "prompt": "Add error handling to the API endpoint...", |
193 | | - "fileDiffs": [{ "path": "src/api.ts", "diff": "..." }], |
194 | | - "supplementalFiles": ["src/types.ts"] |
195 | | - } |
196 | | - ] |
197 | | -} |
198 | | -``` |
199 | | - |
200 | | -Generate new evals with BuffBench's eval generation tools, then point evalbuff at the JSON files. |
| 128 | +## Living Quality Criteria |
201 | 129 |
|
202 | | -## Relationship to BuffBench |
| 130 | +Judges use a leveling system to avoid over-optimizing prematurely: |
203 | 131 |
|
204 | | -BuffBench benchmarks agents against each other. Evalbuff improves a single agent's performance over time. |
| 132 | +| Level | Criteria Added | Promotion | |
| 133 | +|-------|---------------|-----------| |
| 134 | +| L1 | Builds, tests pass, basic completeness | Start | |
| 135 | +| L2 | + Feature works E2E, logs clean | After L1 avg >= 8.0 over 10 tasks | |
| 136 | +| L3 | + Edge cases, UI verification | After L2 avg >= 8.0 | |
| 137 | +| L4 | + Cross-component integration, performance | After L3 avg >= 8.0 | |
| 138 | +| L5 | + Production readiness | After L4 avg >= 8.0 | |
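The promotion rule above can be sketched as a pure function. The 10-task window is stated explicitly only for L2; this sketch assumes the same window applies at every level:

```typescript
// Advance one level when the average over the last `window` tasks hits the bar.
// L5 is the ceiling; fewer than `window` recent tasks means no promotion yet.
function nextLevel(
  level: number,
  recentScores: number[],
  bar = 8.0,
  window = 10,
): number {
  if (level >= 5 || recentScores.length < window) return level;
  const last = recentScores.slice(-window);
  const avg = last.reduce((a, b) => a + b, 0) / last.length;
  return avg >= bar ? level + 1 : level;
}
```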
205 | 139 |
|
206 | | -| | BuffBench | Evalbuff | |
207 | | -|---|-----------|----------| |
208 | | -| **Goal** | Compare agents | Improve an agent | |
209 | | -| **Output** | Scores + rankings | Documentation | |
210 | | -| **Loop** | Single pass | Iterative | |
211 | | -| **Judges** | 3 (GPT, Gemini, Claude) | 2 (GPT, Gemini) | |
212 | | -| **Agent coupling** | Codebuff SDK | Any CLI agent | |
| 140 | +## Architecture |
213 | 141 |
|
214 | | -Evalbuff was deep-copied from BuffBench and modified — they share types and eval data format but are independent codebases. |
| 142 | +| File | Role | |
| 143 | +|------|------| |
| 144 | +| `run-evalbuff.ts` | Main orchestrator — learn mode + prompt mode | |
| 145 | +| `commit-task-generator.ts` | Extract tasks from git history, generate prompts from commits | |
| 146 | +| `cli-runner.ts` | Agent-agnostic CLI runner — spawns any agent, captures diff | |
| 147 | +| `judge.ts` | AI judging with/without ground truth, multi-reviewer aggregation | |
| 148 | +| `docs-optimizer.ts` | Failure analysis, generic doc writing, doc application/revert | |
| 149 | +| `criteria.ts` | Living quality criteria with L1-L5 promotion | |
| 150 | +| `morning-report.ts` | Report generation from JSONL log | |
| 151 | +| `test-repo-utils.ts` | Isolated git repo lifecycle management | |