Commit f0636fc

jahooma and claude authored
Rework evalbuff: commit learning, parallel agents, trace compression (#481)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent e79c6a1 commit f0636fc

File tree

13 files changed (+1934, -968 lines)

AGENTS.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -42,3 +42,4 @@ Make an efficient learning agent that can do anything.
 - [`docs/environment-variables.md`](docs/environment-variables.md) — Env var rules, DI helpers, loading order
 - [`docs/agents-and-tools.md`](docs/agents-and-tools.md) — Agent system, shell shims, tool definitions
 - [`docs/patterns/handle-steps-generators.md`](docs/patterns/handle-steps-generators.md) — handleSteps generator patterns and spawn_agents tool calls
+- [docs/evalbuff/interpreting-task-prompts.md](docs/evalbuff/interpreting-task-prompts.md)
```
docs/evalbuff/interpreting-task-prompts.md

Lines changed: 63 additions & 0 deletions (new file)
# Interpreting Task Prompts (Especially Eval-Generated Ones)

When working with task prompts, especially those auto-generated from commit history for evaluation purposes, the prompt text may not accurately describe the actual work needed.

## The Problem

Evalbuff generates task prompts by analyzing commits. Sometimes the prompt will say "create documentation about X" when the actual ground truth is "fix test scripts in package.json and CI workflow files." This happens when:

1. The commit message is misleading (e.g., "Simplify AGENTS.md" when it actually removes test scripts)
2. The prompt generator focuses on visible file additions rather than the semantic meaning of the change
3. The task is stated in terms of what a developer might ASK for, not what they actually need
## Solution: Always Check Ground Truth First

Before implementing ANY task:

1. **Check if there's a ground truth diff available** - look for references to expected changes, test files, or "what should have been done"
2. **Examine file paths and extensions in the ground truth**:
   - `.json` files (especially `package.json`) → likely config/dependency changes
   - `.yml`/`.yaml` files in `.github/workflows/` → CI/CD configuration changes
   - `.md` files → documentation (but could also be removing or editing existing docs)
   - `.ts`/`.js` files → code changes
3. **Read the actual diff content, not just the prompt** - the diff shows EXACTLY what changed
4. **Distinguish between creation vs. modification**:
   - Does the ground truth show `new file mode` or additions to existing files?
   - Is this refactoring, removal, or net-new functionality?
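The extension heuristic above can be sketched in a few lines of TypeScript. This is an illustration only; `classifyPath` and `looksLikeDocsTask` are hypothetical helpers, not part of evalbuff:

```typescript
// Hypothetical sketch of the extension heuristic above -- not an evalbuff API.
type ChangeKind = "config" | "ci" | "docs" | "code" | "unknown";

function classifyPath(path: string): ChangeKind {
  if (path.includes(".github/workflows/")) return "ci"; // CI/CD configuration
  if (path.endsWith(".json")) return "config";          // e.g. package.json
  if (path.endsWith(".yml") || path.endsWith(".yaml")) return "ci";
  if (path.endsWith(".md")) return "docs";
  if (path.endsWith(".ts") || path.endsWith(".js")) return "code";
  return "unknown";
}

// If the ground truth touches no .md files, a "create docs" prompt
// is probably misleading.
function looksLikeDocsTask(groundTruthPaths: string[]): boolean {
  return groundTruthPaths.some((p) => classifyPath(p) === "docs");
}
```

Running this over the paths in a ground truth diff gives a quick sanity check before any implementation work starts.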
## Example: The AGENTS.md Confusion

Prompt said:

> "Can you create an AGENTS.md file at the root that provides an overview..."

Ground truth showed:

```diff
--- a/.agents/package.json
+++ b/.agents/package.json
- "test:e2e": "bun test e2e"
--- a/.github/workflows/nightly-e2e.yml
+++ b/.github/workflows/nightly-e2e.yml
- run: cd .agents && bun run test:e2e
+ run: cd agents && bun run test:e2e
```

The actual task was about:

- Removing a test script from package.json
- Fixing directory references in a CI workflow
- NOT about creating documentation

The agent should have recognized the ground truth shows `.json` and `.yml` config files, not `.md` documentation files.
## When In Doubt

If the prompt seems to conflict with file paths/types in the ground truth:

1. Trust the ground truth diff over the prompt text
2. Read the actual file contents being changed
3. Understand the PURPOSE of the change (fixing tests, updating config, refactoring) before implementing
4. Ask clarifying questions if the task is genuinely ambiguous

## Red Flags

- Prompt says "create docs" but ground truth shows only config file changes → likely NOT a docs task
- Prompt says "add feature X" but ground truth removes code → likely a cleanup/refactor task
- Prompt uses vague language ("simplify", "improve") → read the diff to understand the specific technical change
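The first two red flags lend themselves to a mechanical check. The sketch below is illustrative; `promptMismatch` and its input shape are hypothetical, not part of evalbuff:

```typescript
// Hypothetical sketch: flag a likely prompt / ground-truth mismatch.
interface GroundTruthFile {
  path: string;
  added: number;   // lines added in the ground truth diff
  removed: number; // lines removed in the ground truth diff
}

function promptMismatch(prompt: string, files: GroundTruthFile[]): string | null {
  const mentionsDocs = /\b(docs?|documentation|README|\.md)\b/i.test(prompt);
  const touchesMarkdown = files.some((f) => f.path.endsWith(".md"));
  if (mentionsDocs && !touchesMarkdown) {
    return "prompt says docs, but ground truth changes no .md files";
  }
  const mentionsAdd = /\b(add|create|implement)\b/i.test(prompt);
  const netRemoval = files.every((f) => f.removed > 0 && f.added === 0);
  if (mentionsAdd && files.length > 0 && netRemoval) {
    return "prompt says add, but ground truth only removes code";
  }
  return null; // no obvious conflict
}
```

A non-null result is a cue to re-read the diff before implementing anything.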

evalbuff/README.md

Lines changed: 93 additions & 156 deletions
````diff
@@ -1,214 +1,151 @@
 # Evalbuff
 
-Evalbuff is an automated system that iteratively improves a coding agent's performance by optimizing project documentation. It runs overnight, discovers what an agent gets wrong, writes docs to fix those gaps, and keeps only the changes that measurably improve scores.
+Evalbuff improves a coding agent's performance by iteratively optimizing project documentation. It watches an agent fail, writes docs to fix the pattern, and keeps only the changes that measurably help.
 
-## The Idea
+## Two Modes
 
-Most coding agents read project documentation before making changes. Better docs lead to better code. But writing good docs is hard — you don't know what an agent needs to know until you watch it fail.
+### 1. Commit Learning Mode (default)
 
-Evalbuff closes this loop automatically:
+Walks through your repo's git history commit-by-commit, using each commit as a learning opportunity:
 
-1. **Run** a coding agent on real eval tasks (reconstructing git commits)
-2. **Judge** the output with AI judges that apply living quality criteria
-3. **Analyze** failures — feed the judge's weaknesses to a doc-writer agent
-4. **Test** whether a proposed doc edit actually improves the agent's score
-5. **Keep** doc changes that help, revert ones that don't
-6. **Repeat** until the budget runs out or scores plateau
+1. Start at HEAD~500 (configurable) and process commits one at a time, oldest first
+2. For each commit, craft a human-like prompt that vaguely describes the change (via LLM)
+3. Run N agents in parallel (default 5) on that prompt against the parent commit
+4. Judge all runs — using the actual commit diff as ground truth
+5. Always analyze failures and propose doc changes (ensuring they're generic enough to help future tasks, not just this one)
+6. Re-run N agents with the proposed docs
+7. If scores improve, keep the docs and try to propose more improvements
+8. If scores don't improve, reject the docs and move to the next commit
+9. State is saved after each commit — resume at any time
 
-The result: a `docs/` directory and `AGENTS.md` table of contents that encode exactly what the agent needs to know to perform well on your codebase. Any agent that reads project docs benefits — Claude Code, Codex, Codebuff, or anything else with a CLI.
+The result: a `docs/` directory that encodes patterns the agent needs to know, learned from real historical changes.
 
-## Why Documentation?
+### 2. Prompt Mode
 
-We chose documentation as the improvement lever because:
+Run a specific coding prompt and improve docs for it — no git history needed:
 
-- **Agent-agnostic.** Every modern coding agent reads project docs. Improving docs improves all agents, not just one.
-- **Interpretable.** Unlike fine-tuning weights or tweaking system prompts, docs are human-readable. You can review what evalbuff learned and decide if it makes sense.
-- **Composable.** Doc improvements stack. A doc about error handling patterns doesn't conflict with a doc about naming conventions.
-- **Persistent.** Docs live in the repo and benefit every future session, not just the current one.
+1. Given a prompt describing a coding task
+2. Run N agents in parallel on the prompt against the current HEAD
+3. Judge all runs — no ground truth, relies entirely on e2e testing by the judge
+4. Analyze and propose doc changes
+5. Re-run and keep/reject as with learn mode
 
-## Living Quality Criteria
-
-Evalbuff uses a leveling system so it doesn't try to optimize everything at once:
+Useful for targeted doc improvement around known pain points.
 
-| Level | Criteria Added | When |
-|-------|---------------|------|
-| L1 | Correctness, Completeness, Basic Style | Start |
-| L2 | + Pattern Consistency | After L1 avg >= 8.0 over 10 tasks |
-| L3 | + Test Quality | After L2 avg >= 8.0 over 10 tasks |
-| L4 | + Optimal Design | After L3 avg >= 8.0 over 10 tasks |
-| L5 | + Fluency | After L4 avg >= 8.0 over 10 tasks |
-
-This prevents the system from penalizing an agent for style issues when it can't even get the code to compile. Criteria are injected directly into the AI judge prompts.
-
-## Architecture
+## How It Works
 
 ```
-┌─────────────────────────────────────────────────────┐
-│                    Orchestrator                     │
-│                  (run-evalbuff.ts)                  │
-│                                                     │
-│  for each eval task:                                │
-│    1. Clone repo into isolated temp dir             │
-│    2. Copy current docs/ into the clone             │
-│    3. Run agent CLI on the task prompt              │
-│    4. Judge the diff against ground truth           │
-│    5. If score < threshold:                         │
-│       a. Analyze failure → propose doc edit         │
-│       b. Re-run agent with new doc                  │
-│       c. Re-judge → keep doc if score improved      │
-│    6. Update criteria level if scores are high      │
-│    7. Log entry to JSONL, save state                │
-│                                                     │
-│  Generate morning report                            │
-└─────────────────────────────────────────────────────┘
+for each task (commit or prompt):
+┌─────────────────────────────────────────────────────┐
+│ 1. Run N agents in parallel (baseline)              │
+│ 2. Judge all N runs → average score                 │
+│ 3. Analyze worst run → propose generic doc          │
+│ 4. Apply doc to repo                                │
+│ 5. Re-run N agents with new doc                     │
+│ 6. Score improved? Keep doc, try more improvements  │
+│    Score same/worse? Reject doc, next task          │
+└─────────────────────────────────────────────────────┘
 ```
 
-### Components
-
-| File | Role |
-|------|------|
-| `run-evalbuff.ts` | Main orchestrator loop with budget caps and resumable state |
-| `cli-runner.ts` | Agent-agnostic CLI runner — spawns any agent command, captures git diff |
-| `judge.ts` | AI judging system (GPT-5.1 + Gemini) with criteria injection |
-| `docs-optimizer.ts` | Failure analysis, doc writing, doc application, score comparison |
-| `criteria.ts` | Living quality criteria with L1-L5 promotion logic |
-| `morning-report.ts` | Generates markdown summary from overnight JSONL log |
-| `test-repo-utils.ts` | Creates isolated git repos per eval task |
-| `agent-runner.ts` | BuffBench-style agent runner (for Codebuff SDK agents) |
-| `types.ts` | Shared types (EvalCommitV2, EvalDataV2, etc.) |
+Key design decisions:
+- **Low-cost agent** (`codebuff --agent base2-free` by default) — runs many times cheaply
+- **N parallel runs** for statistical significance — one run is noisy, five gives a decent signal
+- **Always analyze** — no score threshold; every task is a learning opportunity
+- **Generic docs only** — the doc writer is instructed to skip task-specific advice and focus on patterns
+- **Iterative improvement** — keeps proposing docs until one is rejected, then moves on
 
 ## Usage
 
-### Command Line
+### Commit Learning Mode
 
 ```bash
 bun run evalbuff/src/run-evalbuff.ts \
   --repo /path/to/target-repo \
-  --agent "claude -p" \
-  --evals evals/buffbench/eval-codebuff.json,evals/buffbench/eval-manifold.json \
-  --max-iterations 50 \
-  --max-cost 50 \
-  --score-threshold 7.0 \
-  --agent-timeout 300000
+  --agent "codebuff --agent base2-free" \
+  --commits 500 \
+  --parallelism 5 \
+  --max-cost 100
 ```
 
-Or via the workspace script:
+### Prompt Mode
 
 ```bash
-bun run --filter @codebuff/evalbuff run -- \
+bun run evalbuff/src/run-evalbuff.ts \
   --repo /path/to/target-repo \
-  --agent "codex exec --full-auto" \
-  --evals evals/buffbench/eval-codebuff.json
+  --agent "codebuff --agent base2-free" \
+  --prompt "Add a dark mode toggle to the settings page" \
+  --parallelism 5
 ```
 
 ### Arguments
 
 | Argument | Default | Description |
 |----------|---------|-------------|
 | `--repo` | required | Path to the target repo where docs/ will be written |
-| `--agent` | required | Agent CLI command (prompt is appended as last arg) |
-| `--evals` | required | Comma-separated paths to eval JSON files |
-| `--max-iterations` | 50 | Stop after this many tasks |
-| `--max-cost` | 50 | Stop after spending this many USD (estimated) |
-| `--score-threshold` | 7.0 | Only attempt doc edits for scores below this |
-| `--agent-timeout` | 300000 | Per-task agent timeout in ms (5 min default) |
+| `--agent` | `codebuff --agent base2-free` | Agent CLI command (prompt appended as last arg) |
+| `--prompt` | | If set, runs in prompt mode instead of learn mode |
+| `--commits` | 500 | How many commits back to start from (learn mode) |
+| `--parallelism` | 5 | Number of agents to run in parallel per task |
+| `--max-cost` | 100 | Stop after spending this many USD (estimated) |
+| `--agent-timeout` | 300000 | Per-agent timeout in ms (5 min default) |
+| `--init-command` | | Command to run in each test repo (e.g., `npm install`) |
 | `--criteria` | auto | Path to criteria JSON (auto-created if omitted) |
+| `--reviewers` | `claude,codex` | Comma-separated reviewer agent types |
 
-### Overnight Run
+### Resuming
 
-For an overnight run, set generous limits and let it go:
+State is saved to `evalbuff-state.json` in the target repo after each commit. Re-running with the same `--repo` automatically resumes from where it left off — it knows which commit was last processed and continues from there.
+
+### Overnight Run
 
 ```bash
 nohup bun run evalbuff/src/run-evalbuff.ts \
   --repo /path/to/repo \
-  --agent "claude -p" \
-  --evals evals/buffbench/eval-codebuff.json \
-  --max-iterations 200 \
-  --max-cost 100 \
+  --commits 500 \
+  --parallelism 5 \
+  --max-cost 200 \
   > evalbuff-overnight.log 2>&1 &
 ```
 
-Check results in the morning:
-- `<repo>/evalbuff-report-YYYY-MM-DD.md` — morning report
-- `<repo>/evalbuff-log.jsonl` — detailed per-task log
-- `<repo>/docs/` — the docs that were kept
-- `<repo>/AGENTS.md` — table of contents
-
-### Resumable
-
-Evalbuff saves state to `evalbuff-state.json` in the target repo. If interrupted, re-running with the same arguments will skip completed tasks and continue where it left off.
-
-## How It Decides What Docs to Write
-
-When an agent scores below the threshold on a task, evalbuff:
-
-1. **Feeds the judge's weaknesses** to a doc-writer LLM agent
-2. The doc writer sees: the task prompt, ground truth diff, agent's diff, judge analysis, and all current docs
-3. It produces a **targeted doc file** — specific to the gap between what the agent did and what it should have done
-4. The doc is written to `docs/<suggested-path>.md` and `AGENTS.md` is updated
-
-The doc writer is instructed to be specific and actionable — referencing concrete file paths, function names, and patterns. Generic advice like "follow best practices" is explicitly rejected.
-
 ## What Gets Produced
 
-After a run, the target repo will contain:
-
 ```
 target-repo/
-├── docs/
+├── docs/                          # Generated documentation
 │   ├── patterns/
-│   │   └── error-handling.md     # Evalbuff-generated
+│   │   └── error-handling.md
 │   ├── conventions/
-│   │   └── naming.md             # Evalbuff-generated
+│   │   └── naming.md
 │   └── architecture/
-│       └── data-flow.md          # Evalbuff-generated
-├── AGENTS.md                     # Table of contents
-├── evalbuff-state.json           # Resumable state
-├── evalbuff-log.jsonl            # Per-task log
-├── evalbuff-criteria.json        # Current criteria level
-└── evalbuff-report-2026-03-25.md # Morning report
+│       └── data-flow.md
+├── AGENTS.md                      # Table of contents
+├── evalbuff-state.json            # Resumable state (last commit SHA)
+├── evalbuff-log.jsonl             # Per-task log
+├── evalbuff-criteria.json         # Current criteria level
+└── evalbuff-report-2026-03-26.md  # Report
```
 
-### Morning Report
-
-The morning report includes:
-- Summary table (iterations, cost, duration, score deltas)
-- Doc changes table (which docs were tried, score impact, kept/reverted)
-- Error log
-- Score trajectory visualization
-
-## Eval Data Format
-
-Evalbuff reuses BuffBench's `EvalDataV2` format. Eval tasks are real git commits from open source repos, turned into prompts:
-
-```json
-{
-  "repoUrl": "https://github.com/org/repo",
-  "evalCommits": [
-    {
-      "id": "task-abc123",
-      "sha": "abc123",
-      "parentSha": "def456",
-      "prompt": "Add error handling to the API endpoint...",
-      "fileDiffs": [{ "path": "src/api.ts", "diff": "..." }],
-      "supplementalFiles": ["src/types.ts"]
-    }
-  ]
-}
-```
-Generate new evals with BuffBench's eval generation tools, then point evalbuff at the JSON files.
+## Living Quality Criteria
 
-## Relationship to BuffBench
+Judges use a leveling system to avoid over-optimizing prematurely:
 
-BuffBench benchmarks agents against each other. Evalbuff improves a single agent's performance over time.
+| Level | Criteria Added | Promotion |
+|-------|---------------|-----------|
+| L1 | Builds, tests pass, basic completeness | Start |
+| L2 | + Feature works E2E, logs clean | After L1 avg >= 8.0 over 10 tasks |
+| L3 | + Edge cases, UI verification | After L2 avg >= 8.0 |
+| L4 | + Cross-component integration, performance | After L3 avg >= 8.0 |
+| L5 | + Production readiness | After L4 avg >= 8.0 |
 
-| | BuffBench | Evalbuff |
-|---|-----------|----------|
-| **Goal** | Compare agents | Improve an agent |
-| **Output** | Scores + rankings | Documentation |
-| **Loop** | Single pass | Iterative |
-| **Judges** | 3 (GPT, Gemini, Claude) | 2 (GPT, Gemini) |
-| **Agent coupling** | Codebuff SDK | Any CLI agent |
+## Architecture
 
-Evalbuff was deep-copied from BuffBench and modified — they share types and eval data format but are independent codebases.
+| File | Role |
+|------|------|
+| `run-evalbuff.ts` | Main orchestrator — learn mode + prompt mode |
+| `commit-task-generator.ts` | Extract tasks from git history, generate prompts from commits |
+| `cli-runner.ts` | Agent-agnostic CLI runner — spawns any agent, captures diff |
+| `judge.ts` | AI judging with/without ground truth, multi-reviewer aggregation |
+| `docs-optimizer.ts` | Failure analysis, generic doc writing, doc application/revert |
+| `criteria.ts` | Living quality criteria with L1-L5 promotion |
+| `morning-report.ts` | Report generation from JSONL log |
+| `test-repo-utils.ts` | Isolated git repo lifecycle management |
````