fix(cli): make --threshold override per-test score requirement by christso · Pull Request #885 · EntityProcess/agentv

christso · 2026-03-31T12:56:33Z

Closes #882

Summary

--threshold now configures the per-test score requirement (default 0.8) instead of comparing mean score. The RESULT verdict and exit code are now consistent.

Before (contradictory):

RESULT: FAIL  (28/31 passed, mean score: 0.927)
Suite score: 0.93 (threshold: 0.80) — PASS     ← exit code 0

After (consistent):

RESULT: PASS  (28/31 scored >= 0.8, mean: 0.927)    ← exit code 0

With stricter threshold (--threshold 0.95):

RESULT: FAIL  (20/31 scored >= 0.95, mean: 0.927)   ← exit code 1

Changes

calculateEvaluationSummary() accepts optional threshold to recompute passed/failed from raw scores
formatEvaluationSummary() shows the threshold in the RESULT line: scored >= 0.8
Exit code matches RESULT verdict — no separate threshold check
Removed formatThresholdSummary() (no longer needed)
Updated CLI help text and docs
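
The first three changes can be illustrated with a minimal sketch. It assumes raw scores are plain numbers in the range 0 to 1; `ScoreSummary`, `summarizeScores`, `formatCounts`, and `exitCodeFor` are hypothetical names for illustration and are not the actual signatures of `calculateEvaluationSummary()` or `formatEvaluationSummary()`, which this page does not show.

```typescript
// Hypothetical shapes: the real agentv types are not shown in this PR.
interface ScoreSummary {
  total: number;
  passed: number;
  failed: number;
  mean: number;
  threshold: number;
}

// Default per-test score requirement stated in the PR summary.
const DEFAULT_THRESHOLD = 0.8;

// Recompute passed/failed counts from raw scores against the threshold.
function summarizeScores(
  scores: number[],
  threshold: number = DEFAULT_THRESHOLD,
): ScoreSummary {
  const passed = scores.filter((s) => s >= threshold).length;
  const sum = scores.reduce((acc, s) => acc + s, 0);
  return {
    total: scores.length,
    passed,
    failed: scores.length - passed,
    mean: scores.length === 0 ? 0 : sum / scores.length,
    threshold,
  };
}

// Format the parenthesised part of the RESULT line, e.g. "28/31 scored >= 0.8, mean: 0.927".
function formatCounts(summary: ScoreSummary): string {
  return `${summary.passed}/${summary.total} scored >= ${summary.threshold}, mean: ${summary.mean.toFixed(3)}`;
}

// The exit code is derived from the single verdict, with no separate threshold check.
function exitCodeFor(verdict: "PASS" | "FAIL"): number {
  return verdict === "PASS" ? 0 : 1;
}
```

Deriving the exit code from one verdict value, rather than from a second comparison, is what removes the possibility of the RESULT line and the exit code disagreeing.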

Test plan

All unit tests pass (including new threshold tests)
Build succeeds
Pre-push hooks pass

🤖 Generated with Claude Code

cloudflare-workers-and-pages · 2026-03-31T12:57:28Z

Deploying agentv with Cloudflare Pages

Latest commit:	`cd69e5d`
Status:	✅ Deploy successful!
Preview URL:	https://4433fa69.agentv.pages.dev
Branch Preview URL:	https://fix-result-verdict-use-mean.agentv.pages.dev

The RESULT: PASS/FAIL line used all-must-pass logic (every individual case must score >= 0.8), while --threshold used mean-based scoring. This caused confusing contradictory output:

RESULT: FAIL (28/31 passed, mean score: 0.927)
Suite score: 0.93 (threshold: 0.80) — PASS

Now the RESULT line uses mean >= 0.8, consistent with --threshold.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The RESULT line and --threshold check now both use pass rate (fraction of cases scoring >= 0.8) instead of inconsistent metrics. Previously the RESULT line used all-must-pass while --threshold used mean score.

Before:

RESULT: FAIL (28/31 passed, mean score: 0.927)
Suite score: 0.93 (threshold: 0.80) — PASS

After:

RESULT: FAIL (pass rate: 90.3%, 28/31 passed, mean score: 0.927)
Suite pass rate: 90.3% (threshold: 80.0%) — PASS

Both paths now consistently use pass rate. The RESULT line is informational (all-must-pass), while --threshold gates CI exit code against a configurable pass rate minimum.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
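
The pass-rate arithmetic in this intermediate commit can be checked with a short sketch. `passRate`, `formatPercent`, and `meetsPassRate` are hypothetical helper names, not functions from the agentv codebase; the later commits on this PR replaced this design with a per-test score threshold.

```typescript
// Fraction of cases that passed; an empty suite is treated as 0 here (an assumption).
function passRate(passed: number, total: number): number {
  return total === 0 ? 0 : passed / total;
}

// Render a fraction as a percentage with one decimal place, e.g. 0.9032 -> "90.3%".
function formatPercent(rate: number): string {
  return `${(rate * 100).toFixed(1)}%`;
}

// Gate on a minimum pass rate, as --threshold did in this intermediate design.
function meetsPassRate(passed: number, total: number, minimum: number): boolean {
  return passRate(passed, total) >= minimum;
}

const rate = passRate(28, 31);
console.log(formatPercent(rate)); // 28/31 rounds to 90.3%
console.log(meetsPassRate(28, 31, 0.8)); // true: 90.3% clears the 80.0% minimum
```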

RESULT: FAIL (28 passed, 3 failed, mean score: 0.927)

The failed count makes it immediately obvious why the verdict is FAIL.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The --threshold flag now gates on pass rate, not mean score. Update the CLI help text and docs site to match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

--threshold now configures the per-test score requirement (default 0.8) instead of comparing mean score. The RESULT verdict and exit code are now consistent: exit 1 when any test scores below the threshold.

Before (contradictory):

RESULT: FAIL (28/31 passed, mean score: 0.927)
Suite score: 0.93 (threshold: 0.80) — PASS ← exit code 0

After (consistent):

RESULT: PASS (28/31 scored >= 0.8, mean: 0.927) ← exit code 0

With --threshold 0.95:

RESULT: FAIL (20/31 scored >= 0.95, mean: 0.927) ← exit code 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The threshold now flows from CLI → orchestrator → classifyQualityStatus, so the live progress line (e.g., "0.750 FAIL") and executionStatus in JSONL output both respect the custom threshold. Previously these were hardcoded to PASS_THRESHOLD (0.8) regardless of --threshold.

Added threshold field to RunEvaluationOptions, RunEvalCaseOptions, and all intermediate call sites (runBatchEvaluation, evaluateCandidate).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
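
A rough sketch of that flow, under stated assumptions: the names `PASS_THRESHOLD`, `classifyQualityStatus`, and `RunEvaluationOptions` come from the commit message, but the signatures, the `"PASS" | "FAIL"` return type, and the `formatProgress` helper are guesses made for illustration, not the actual agentv code.

```typescript
// Default named in the commit message; previously the only value ever used.
const PASS_THRESHOLD = 0.8;

// Assumed shape: only the optional threshold field is described in the commit.
interface RunEvaluationOptions {
  threshold?: number;
}

// Assumed signature: classify a score against a caller-supplied threshold.
function classifyQualityStatus(
  score: number,
  threshold: number = PASS_THRESHOLD,
): "PASS" | "FAIL" {
  return score >= threshold ? "PASS" : "FAIL";
}

// Hypothetical helper producing a progress fragment such as "0.750 FAIL".
function formatProgress(score: number, options: RunEvaluationOptions = {}): string {
  const status = classifyQualityStatus(score, options.threshold ?? PASS_THRESHOLD);
  return `${score.toFixed(3)} ${status}`;
}

console.log(formatProgress(0.75)); // "0.750 FAIL" with the default 0.8
console.log(formatProgress(0.75, { threshold: 0.7 })); // "0.750 PASS" with a custom threshold
```

Passing the threshold as an explicit option at each call site, instead of reading a module-level constant, keeps the progress line and the final summary on the same cutoff.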

christso mentioned this pull request Mar 31, 2026

bug(cli): --threshold compares mean score instead of per-test score #882

Open

christso and others added 3 commits March 31, 2026 13:01

fix(cli): simplify RESULT line — show passed/failed counts clearly

6803b40

christso force-pushed the fix/result-verdict-use-mean-score branch from 00f0b19 to 6803b40 March 31, 2026 13:06

christso changed the title ~~fix(cli): use mean score for RESULT verdict instead of all-must-pass~~ fix(cli): use pass rate for --threshold and clarify RESULT verdict Mar 31, 2026

docs: update --threshold docs and CLI help to reflect pass rate

54c2a46

christso changed the title ~~fix(cli): use pass rate for --threshold and clarify RESULT verdict~~ fix(cli): make --threshold override per-test score requirement Mar 31, 2026

christso and others added 2 commits March 31, 2026 13:36

christso wants to merge 6 commits into main from fix/result-verdict-use-mean-score