
Add evalbuff: iterative agent improvement via docs optimization #479

Merged
jahooma merged 13 commits into main from jahooma/install-gstack on Mar 26, 2026

Conversation


@jahooma jahooma commented Mar 26, 2026

Summary

  • New evals/evalbuff/ directory — deep-copied from BuffBench and modified for iterative docs optimization
  • Agent-agnostic CLI runner that shells out to any coding agent (Claude Code, Codex, Codebuff, etc.)
  • Living quality criteria (L1-L5) that get stricter as scores improve — injected into AI judge prompts
  • Docs optimizer loop: analyze failures → propose doc edits → re-run agent → keep edits that improve scores
  • Morning report summarizing overnight results (score trajectory, docs kept/reverted, cost)
  • Resumable state with configurable budget caps (max iterations + max cost)
  • Archived prior evalbuff brainstorm (stash@{3}) to evals/evalbuff/old/
  • Added run-evalbuff script to evals/package.json

Test plan

  • bunx tsc --noEmit -p evals/tsconfig.json passes (verified — no evalbuff type errors)
  • Run evalbuff on a small eval set with --max-iterations 2 to verify end-to-end flow
  • Verify morning report generation from JSONL log
  • Verify criteria promotion triggers after sustained high scores

🤖 Generated with Claude Code

Evalbuff is an automated overnight loop that improves coding agent
performance by optimizing project documentation. It runs eval tasks,
judges outputs with living quality criteria (L1-L5), analyzes failures,
proposes targeted doc edits, and keeps only changes that measurably
improve scores. Agent-agnostic — works with any CLI coding agent.

Key components:
- cli-runner: agent-agnostic CLI runner (shells out to any command)
- criteria: living quality criteria with L1-L5 promotion logic
- judge: modified from BuffBench with criteria injection
- docs-optimizer: failure analysis + doc writing + score comparison
- morning-report: markdown summary from overnight JSONL log
- run-evalbuff: main orchestrator with budget caps and resumable state
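A minimal sketch of the orchestrator's budget-cap check, assuming a state object with an iteration counter and a running cost total. The field names are assumptions, not the actual evalbuff state schema.

```typescript
// Illustrative budget check: the loop continues only while both the
// iteration cap and the cost cap have headroom.
interface RunState { iteration: number; totalCostUsd: number }
interface Budget { maxIterations: number; maxCostUsd: number }

function withinBudget(state: RunState, budget: Budget): boolean {
  return state.iteration < budget.maxIterations &&
         state.totalCostUsd < budget.maxCostUsd;
}
```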

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jahooma and others added 12 commits March 26, 2026 11:30
Tests for criteria (promotion logic, level accumulation), docs-optimizer
(apply/overwrite/reject/AGENTS.md creation, compareScores, readCurrentDocs),
cli-runner (happy path, diff capture, crash, timeout, CLI not found),
and morning-report (normal/empty/error reports, score trajectory, JSONL append).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
6 integration tests covering: full iteration flow, doc edit attempts,
maxIterations budget cap, cost budget cap, resume from state file,
and doc revert on score regression. Uses bun mock.module to avoid
real LLM calls and remote repo cloning.

Also guards run-evalbuff.ts CLI entry point with import.meta.main
and adds test:evalbuff script that runs unit + integration tests
in separate processes to avoid mock.module leakage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verifies the complete evalbuff pipeline: 3 eval tasks run through the
orchestrator with mock LLM judges, doc edits applied and committed,
morning report generated, state tracking, and AGENTS.md TOC created.

Total test coverage: 42 tests (35 unit + 6 integration + 1 E2E).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Major changes:

Judge: Replaced CodebuffClient SDK-based LLM judges with real CLI coding
agents (Claude Code, Codex, Gemini) that run IN the repo. Reviewer agents
can build, run tests, start the dev server, use browser tools, curl
endpoints, check logs — actual E2E verification, not just diff reading.
Structured output via result file (evalbuff-review-result.json) with
fallback to stdout JSON extraction.
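A sketch of that result-file-with-stdout-fallback read. The file name `evalbuff-review-result.json` comes from the PR; the function shape and the greedy JSON regex are illustrative assumptions.

```typescript
// Read the reviewer's structured result: prefer the result file in the
// repo, fall back to extracting a JSON object from captured stdout.
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";

function readReviewResult(repoDir: string, stdout: string): unknown {
  const resultPath = join(repoDir, "evalbuff-review-result.json");
  if (existsSync(resultPath)) {
    return JSON.parse(readFileSync(resultPath, "utf8"));
  }
  // fallback: grab the first-to-last brace span from stdout
  const match = stdout.match(/\{[\s\S]*\}/);
  if (!match) throw new Error("no JSON found in reviewer output");
  return JSON.parse(match[0]);
}
```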

Criteria: Shifted from code style (correctness, completeness, pattern
consistency, fluency) to E2E verification levels:
- L1: Builds, existing tests pass, basic completeness
- L2: Feature works E2E (browser/curl/client), logs clean
- L3: Edge cases & error states tested E2E, UI verification
- L4: Cross-component integration, performance, no regressions
- L5: Production readiness (migrations, env vars, error recovery)
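The "living" part of the criteria is the promotion rule: once scores stay high, the judge graduates to the next, stricter level. A minimal sketch, where the threshold and window size are assumptions rather than evalbuff's actual values:

```typescript
// Hypothetical promotion rule: advance one criteria level after a
// sustained run of high scores; never promote past L5.
type Level = 1 | 2 | 3 | 4 | 5;

function promote(
  level: Level,
  recentScores: number[],
  threshold = 0.9,
  window = 3,
): Level {
  const recent = recentScores.slice(-window);
  const sustained = recent.length >= window &&
                    recent.every(s => s >= threshold);
  return sustained && level < 5 ? ((level + 1) as Level) : level;
}
```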

Orchestrator: Judge now runs inside withTestRepo callback so reviewer
agents have access to the live repo. CodebuffClient only used for
doc writer (analyzeFailure). Added --reviewers CLI flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…fClient

Removes the CodebuffClient/SDK dependency from analyzeFailure. Uses Claude
CLI with a temp file for the prompt (avoids CLI arg length limits). Adds
JSON extraction with markdown fence stripping and validation.
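The fence-stripping step can be as small as the sketch below; the exact regexes evalbuff uses are an assumption.

```typescript
// Strip a surrounding ```json ... ``` markdown fence (if any) so the
// remainder can be handed to JSON.parse. Plain JSON passes through.
function stripFences(text: string): string {
  return text
    .replace(/^\s*```(?:json)?\s*\n?/, "")
    .replace(/\n?```\s*$/, "")
    .trim();
}
```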

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Write reviewer prompt to file instead of CLI args (avoids length limits)
- Use rsync + node_modules symlink instead of cp -r (1.7GB → fast)
- Don't pass eval env to reviewers (test API keys break real agents)
- Strip API key env vars from coding agent env too
- Remove CodebuffClient dependency from orchestrator
- Fix cost estimate: was $1/sec, now $0.01/sec
- Always log stderr/stdout on reviewer failure
- Remove --output-format/--json flags from reviewer commands
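The env-stripping fix above can be sketched like this. The matching pattern is an assumption; evalbuff may match a different set of variable names.

```typescript
// Illustrative env sanitizer: drop API-key-shaped variables before
// spawning a reviewer agent, so the eval harness's test keys don't
// shadow the agent's real credentials.
function stripApiKeys(
  env: Record<string, string | undefined>,
): Record<string, string | undefined> {
  const out: Record<string, string | undefined> = {};
  for (const [k, v] of Object.entries(env)) {
    if (!/API_KEY|_TOKEN$/i.test(k)) out[k] = v;
  }
  return out;
}
```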

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Creates a local git repo with a simple subtract bug, generates an eval
task, and runs the full evalbuff loop with real CLI agents. No mocks.

Usage: bun run evals/evalbuff/run-e2e-test.ts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Evalbuff is now its own workspace package (@codebuff/evalbuff) instead of
a subdirectory of evals. Adds package.json, tsconfig.json, and updates
workspace config. All 42 tests pass from the new location.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve AGENTS.md conflict: keep main's full content, add evalbuff
to repo map and docs section.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jahooma jahooma merged commit ef01d52 into main Mar 26, 2026
34 checks passed
@jahooma jahooma deleted the jahooma/install-gstack branch March 26, 2026 22:14
