Add evalbuff: iterative agent improvement via docs optimization#479
Merged
Conversation
Evalbuff is an automated overnight loop that improves coding agent performance by optimizing project documentation. It runs eval tasks, judges outputs with living quality criteria (L1-L5), analyzes failures, proposes targeted doc edits, and keeps only changes that measurably improve scores. It is agent-agnostic and works with any CLI coding agent.

Key components:
- cli-runner: agent-agnostic CLI runner (shells out to any command)
- criteria: living quality criteria with L1-L5 promotion logic
- judge: modified from BuffBench with criteria injection
- docs-optimizer: failure analysis + doc writing + score comparison
- morning-report: markdown summary from the overnight JSONL log
- run-evalbuff: main orchestrator with budget caps and resumable state

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
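The keep-or-revert doc-edit loop described above can be sketched as below. All names (`evalbuffLoop`, `runEvals`, `proposeDocEdit`) and the scoring interface are illustrative assumptions, not the actual implementation:

```typescript
// Hypothetical sketch of evalbuff's core loop: run evals, try a doc edit,
// keep the edit only if the score improves, otherwise revert it.
// All names here are illustrative assumptions, not the real API.
function evalbuffLoop(
  runEvals: () => number,            // runs eval tasks + judge, returns a mean score
  proposeDocEdit: () => () => void,  // applies a doc edit, returns a revert function
  maxIterations: number,             // budget cap on iterations
): number {
  let best = runEvals(); // baseline score before any edits
  for (let i = 0; i < maxIterations; i++) {
    const revert = proposeDocEdit();
    const score = runEvals();
    if (score > best) {
      best = score; // measurable improvement: keep the doc edit
    } else {
      revert();     // no improvement: revert the doc edit
    }
  }
  return best;
}
```

The real orchestrator also enforces a cost budget and persists resumable state, which this sketch omits.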
Tests for criteria (promotion logic, level accumulation), docs-optimizer (apply/overwrite/reject/AGENTS.md creation, compareScores, readCurrentDocs), cli-runner (happy path, diff capture, crash, timeout, CLI not found), and morning-report (normal/empty/error reports, score trajectory, JSONL append). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
6 integration tests covering: full iteration flow, doc edit attempts, maxIterations budget cap, cost budget cap, resume from state file, and doc revert on score regression. Uses bun mock.module to avoid real LLM calls and remote repo cloning. Also guards run-evalbuff.ts CLI entry point with import.meta.main and adds test:evalbuff script that runs unit + integration tests in separate processes to avoid mock.module leakage. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verifies the complete evalbuff pipeline: 3 eval tasks run through the orchestrator with mock LLM judges, doc edits applied and committed, morning report generated, state tracking, and AGENTS.md TOC created. Total test coverage: 42 tests (35 unit + 6 integration + 1 E2E). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Major changes:

Judge: Replaced the CodebuffClient SDK-based LLM judges with real CLI coding agents (Claude Code, Codex, Gemini) that run IN the repo. Reviewer agents can build, run tests, start the dev server, use browser tools, curl endpoints, and check logs — actual E2E verification, not just diff reading. Structured output via a result file (evalbuff-review-result.json), with fallback to JSON extraction from stdout.

Criteria: Shifted from code style (correctness, completeness, pattern consistency, fluency) to E2E verification levels:
- L1: Builds, existing tests pass, basic completeness
- L2: Feature works E2E (browser/curl/client), logs clean
- L3: Edge cases and error states tested E2E, UI verification
- L4: Cross-component integration, performance, no regressions
- L5: Production readiness (migrations, env vars, error recovery)

Orchestrator: The judge now runs inside the withTestRepo callback so reviewer agents have access to the live repo. CodebuffClient is only used for the doc writer (analyzeFailure). Added a --reviewers CLI flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
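The result-file-with-stdout-fallback parsing could look roughly like this. The function name and the shape of the review object are assumptions; the actual result schema is not shown in this PR:

```typescript
// Hypothetical sketch of reading a reviewer's structured output.
// Prefers the contents of evalbuff-review-result.json; falls back to
// extracting a JSON object from stdout. Names and schema are assumed.
function parseReviewResult(
  resultFileContents: string | null, // evalbuff-review-result.json, if the agent wrote it
  stdout: string,                    // raw agent stdout as a fallback
): { score: number } | null {
  if (resultFileContents !== null) {
    try {
      return JSON.parse(resultFileContents);
    } catch {
      // malformed result file: fall through to stdout extraction
    }
  }
  const match = stdout.match(/\{[\s\S]*\}/); // first "{" to last "}"
  if (!match) return null;
  try {
    return JSON.parse(match[0]);
  } catch {
    return null;
  }
}
```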
…fClient Removes the CodebuffClient/SDK dependency from analyzeFailure. Uses Claude CLI with a temp file for the prompt (avoids CLI arg length limits). Adds JSON extraction with markdown fence stripping and validation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
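The fence-stripping JSON extraction mentioned above could be sketched as follows (the helper name is an assumption):

```typescript
// Hypothetical sketch: strip a surrounding markdown code fence
// (``` or ```json) before parsing, since LLMs often wrap JSON in fences.
function extractJson(raw: string): unknown {
  const stripped = raw
    .replace(/^\s*```(?:json)?\s*/i, "") // leading fence, optional "json" tag
    .replace(/\s*```\s*$/, "")           // trailing fence
    .trim();
  return JSON.parse(stripped); // throws if still not valid JSON (caller validates)
}
```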
- Write the reviewer prompt to a file instead of CLI args (avoids length limits)
- Use rsync + a node_modules symlink instead of cp -r (1.7GB → fast)
- Don't pass the eval env to reviewers (test API keys break real agents)
- Strip API key env vars from the coding agent env too
- Remove the CodebuffClient dependency from the orchestrator
- Fix the cost estimate: was $1/sec, now $0.01/sec
- Always log stderr/stdout on reviewer failure
- Remove --output-format/--json flags from reviewer commands

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
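The env-stripping and corrected cost estimate might look like this. The key-matching pattern and the $0.01/sec rate come from the notes above; function names are assumptions:

```typescript
// Hypothetical sketch of two fixes listed above. Names are assumed.

// Strip API-key-style env vars before spawning a coding agent,
// so the eval harness's test keys don't leak into real agents.
function stripApiKeys(
  env: Record<string, string | undefined>,
): Record<string, string | undefined> {
  return Object.fromEntries(
    Object.entries(env).filter(([key]) => !/API_KEY/i.test(key)),
  );
}

// Corrected rough cost estimate: $0.01 per second of agent runtime
// (the previous version mistakenly used $1/sec).
function estimateCostUsd(durationSec: number): number {
  return durationSec * 0.01;
}
```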
Creates a local git repo with a simple subtract bug, generates an eval task, and runs the full evalbuff loop with real CLI agents. No mocks. Usage: bun run evals/evalbuff/run-e2e-test.ts Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Evalbuff is now its own workspace package (@codebuff/evalbuff) instead of a subdirectory of evals. Adds package.json, tsconfig.json, and updates workspace config. All 42 tests pass from the new location. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve AGENTS.md conflict: keep main's full content, add evalbuff to repo map and docs section. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- evals/evalbuff/ directory — deep-copied from BuffBench and modified for iterative docs optimization
- evals/evalbuff/old/run-evalbuff script to evals/package.json

Test plan
- bunx tsc --noEmit -p evals/tsconfig.json passes (verified — no evalbuff type errors)
- Ran with --max-iterations 2 to verify the end-to-end flow

🤖 Generated with Claude Code