
Add evalbuff: iterative agent improvement via docs optimization #479

Merged
jahooma merged 13 commits into main from jahooma/install-gstack on Mar 26, 2026

Conversation


@jahooma jahooma commented Mar 26, 2026

Summary

  • New evals/evalbuff/ directory — deep-copied from BuffBench and modified for iterative docs optimization
  • Agent-agnostic CLI runner that shells out to any coding agent (Claude Code, Codex, Codebuff, etc.)
  • Living quality criteria (L1-L5) that get stricter as scores improve — injected into AI judge prompts
  • Docs optimizer loop: analyze failures → propose doc edits → re-run agent → keep edits that improve scores
  • Morning report summarizing overnight results (score trajectory, docs kept/reverted, cost)
  • Resumable state with configurable budget caps (max iterations + max cost)
  • Archived prior evalbuff brainstorm (stash@{3}) to evals/evalbuff/old/
  • Added run-evalbuff script to evals/package.json

Test plan

  • bunx tsc --noEmit -p evals/tsconfig.json passes (verified — no evalbuff type errors)
  • Run evalbuff on a small eval set with --max-iterations 2 to verify end-to-end flow
  • Verify morning report generation from JSONL log
  • Verify criteria promotion triggers after sustained high scores

🤖 Generated with Claude Code

Evalbuff is an automated overnight loop that improves coding agent
performance by optimizing project documentation. It runs eval tasks,
judges outputs with living quality criteria (L1-L5), analyzes failures,
proposes targeted doc edits, and keeps only changes that measurably
improve scores. Agent-agnostic — works with any CLI coding agent.

Key components:
- cli-runner: agent-agnostic CLI runner (shells out to any command)
- criteria: living quality criteria with L1-L5 promotion logic
- judge: modified from BuffBench with criteria injection
- docs-optimizer: failure analysis + doc writing + score comparison
- morning-report: markdown summary from overnight JSONL log
- run-evalbuff: main orchestrator with budget caps and resumable state
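A minimal sketch of the orchestrator's budget-cap check, assuming a state object with an iteration counter and a running cost total. The field names are assumptions, not the actual evalbuff state schema.

```typescript
// Illustrative budget check: the loop continues only while both the
// iteration cap and the cost cap have headroom.
interface RunState { iteration: number; totalCostUsd: number }
interface Budget { maxIterations: number; maxCostUsd: number }

function withinBudget(state: RunState, budget: Budget): boolean {
  return state.iteration < budget.maxIterations &&
         state.totalCostUsd < budget.maxCostUsd;
}
```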

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jahooma and others added 12 commits March 26, 2026 11:30
Tests for criteria (promotion logic, level accumulation), docs-optimizer
(apply/overwrite/reject/AGENTS.md creation, compareScores, readCurrentDocs),
cli-runner (happy path, diff capture, crash, timeout, CLI not found),
and morning-report (normal/empty/error reports, score trajectory, JSONL append).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
6 integration tests covering: full iteration flow, doc edit attempts,
maxIterations budget cap, cost budget cap, resume from state file,
and doc revert on score regression. Uses bun mock.module to avoid
real LLM calls and remote repo cloning.

Also guards run-evalbuff.ts CLI entry point with import.meta.main
and adds test:evalbuff script that runs unit + integration tests
in separate processes to avoid mock.module leakage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verifies the complete evalbuff pipeline: 3 eval tasks run through the
orchestrator with mock LLM judges, doc edits applied and committed,
morning report generated, state tracking, and AGENTS.md TOC created.

Total test coverage: 42 tests (35 unit + 6 integration + 1 E2E).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Major changes:

Judge: Replaced CodebuffClient SDK-based LLM judges with real CLI coding
agents (Claude Code, Codex, Gemini) that run IN the repo. Reviewer agents
can build, run tests, start the dev server, use browser tools, curl
endpoints, check logs — actual E2E verification, not just diff reading.
Structured output via result file (evalbuff-review-result.json) with
fallback to stdout JSON extraction.
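A sketch of that result-file-with-stdout-fallback read. The file name `evalbuff-review-result.json` comes from the PR; the function shape and the greedy JSON regex are illustrative assumptions.

```typescript
// Read the reviewer's structured result: prefer the result file in the
// repo, fall back to extracting a JSON object from captured stdout.
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";

function readReviewResult(repoDir: string, stdout: string): unknown {
  const resultPath = join(repoDir, "evalbuff-review-result.json");
  if (existsSync(resultPath)) {
    return JSON.parse(readFileSync(resultPath, "utf8"));
  }
  // fallback: grab the first-to-last brace span from stdout
  const match = stdout.match(/\{[\s\S]*\}/);
  if (!match) throw new Error("no JSON found in reviewer output");
  return JSON.parse(match[0]);
}
```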

Criteria: Shifted from code style (correctness, completeness, pattern
consistency, fluency) to E2E verification levels:
- L1: Builds, existing tests pass, basic completeness
- L2: Feature works E2E (browser/curl/client), logs clean
- L3: Edge cases & error states tested E2E, UI verification
- L4: Cross-component integration, performance, no regressions
- L5: Production readiness (migrations, env vars, error recovery)
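The "living" part of the criteria is the promotion rule: once scores stay high, the judge graduates to the next, stricter level. A minimal sketch, where the threshold and window size are assumptions rather than evalbuff's actual values:

```typescript
// Hypothetical promotion rule: advance one criteria level after a
// sustained run of high scores; never promote past L5.
type Level = 1 | 2 | 3 | 4 | 5;

function promote(
  level: Level,
  recentScores: number[],
  threshold = 0.9,
  window = 3,
): Level {
  const recent = recentScores.slice(-window);
  const sustained = recent.length >= window &&
                    recent.every(s => s >= threshold);
  return sustained && level < 5 ? ((level + 1) as Level) : level;
}
```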

Orchestrator: Judge now runs inside withTestRepo callback so reviewer
agents have access to the live repo. CodebuffClient only used for
doc writer (analyzeFailure). Added --reviewers CLI flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…fClient

Removes the CodebuffClient/SDK dependency from analyzeFailure. Uses Claude
CLI with a temp file for the prompt (avoids CLI arg length limits). Adds
JSON extraction with markdown fence stripping and validation.
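The fence-stripping step can be as small as the sketch below; the exact regexes evalbuff uses are an assumption.

```typescript
// Strip a surrounding ```json ... ``` markdown fence (if any) so the
// remainder can be handed to JSON.parse. Plain JSON passes through.
function stripFences(text: string): string {
  return text
    .replace(/^\s*```(?:json)?\s*\n?/, "")
    .replace(/\n?```\s*$/, "")
    .trim();
}
```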

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Write reviewer prompt to file instead of CLI args (avoids length limits)
- Use rsync + node_modules symlink instead of cp -r (1.7GB → fast)
- Don't pass eval env to reviewers (test API keys break real agents)
- Strip API key env vars from coding agent env too
- Remove CodebuffClient dependency from orchestrator
- Fix cost estimate: was $1/sec, now $0.01/sec
- Always log stderr/stdout on reviewer failure
- Remove --output-format/--json flags from reviewer commands
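The env-stripping fix above can be sketched like this. The matching pattern is an assumption; evalbuff may match a different set of variable names.

```typescript
// Illustrative env sanitizer: drop API-key-shaped variables before
// spawning a reviewer agent, so the eval harness's test keys don't
// shadow the agent's real credentials.
function stripApiKeys(
  env: Record<string, string | undefined>,
): Record<string, string | undefined> {
  const out: Record<string, string | undefined> = {};
  for (const [k, v] of Object.entries(env)) {
    if (!/API_KEY|_TOKEN$/i.test(k)) out[k] = v;
  }
  return out;
}
```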

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Creates a local git repo with a simple subtract bug, generates an eval
task, and runs the full evalbuff loop with real CLI agents. No mocks.

Usage: bun run evals/evalbuff/run-e2e-test.ts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Evalbuff is now its own workspace package (@codebuff/evalbuff) instead of
a subdirectory of evals. Adds package.json, tsconfig.json, and updates
workspace config. All 42 tests pass from the new location.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve AGENTS.md conflict: keep main's full content, add evalbuff
to repo map and docs section.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jahooma jahooma merged commit ef01d52 into main Mar 26, 2026
34 checks passed
@jahooma jahooma deleted the jahooma/install-gstack branch March 26, 2026 22:14
