fix(cli): make --threshold override per-test score requirement #885
Open
Deploying agentv with

| Latest commit: | cd69e5d |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://4433fa69.agentv.pages.dev |
| Branch Preview URL: | https://fix-result-verdict-use-mean.agentv.pages.dev |
The RESULT: PASS/FAIL line used all-must-pass logic (every individual case must score >= 0.8), while --threshold used mean-based scoring. This caused confusing, contradictory output:

    RESULT: FAIL (28/31 passed, mean score: 0.927)
    Suite score: 0.93 (threshold: 0.80) — PASS

Now the RESULT line uses mean >= 0.8, consistent with --threshold.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
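The disagreement described in this commit can be reproduced by applying the two verdict rules to the same scores. The following is a minimal sketch only: the helper names `allMustPass` and `meanAtLeast`, the constant `THRESHOLD`, and the scores are all made up for illustration and are not taken from agentv's code.

```typescript
// Hypothetical helpers for illustration; not agentv's actual implementation.
const THRESHOLD = 0.8;

// Verdict rule 1: every individual case must reach the threshold.
function allMustPass(scores: number[], threshold: number = THRESHOLD): boolean {
  return scores.every((s) => s >= threshold);
}

// Verdict rule 2: only the mean score must reach the threshold.
function meanAtLeast(scores: number[], threshold: number = THRESHOLD): boolean {
  if (scores.length === 0) return false;
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return mean >= threshold;
}

// Made-up scores: three strong cases and one weak one.
const scores = [1.0, 1.0, 0.9, 0.5];

console.log(allMustPass(scores)); // false: the 0.5 case is below 0.8
console.log(meanAtLeast(scores)); // true: the mean is about 0.85
```

A single weak case fails the first rule while barely moving the mean, which is how one run could print both FAIL and PASS.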
The RESULT line and --threshold check now both use pass rate (fraction of cases scoring >= 0.8) instead of inconsistent metrics. Previously the RESULT line used all-must-pass while --threshold used mean score.

Before:

    RESULT: FAIL (28/31 passed, mean score: 0.927)
    Suite score: 0.93 (threshold: 0.80) — PASS

After:

    RESULT: FAIL (pass rate: 90.3%, 28/31 passed, mean score: 0.927)
    Suite pass rate: 90.3% (threshold: 80.0%) — PASS

Both paths now consistently use pass rate. The RESULT line is informational (all-must-pass), while --threshold gates the CI exit code against a configurable minimum pass rate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    RESULT: FAIL (28 passed, 3 failed, mean score: 0.927)

The failed count makes it immediately obvious why the verdict is FAIL.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
00f0b19 to 6803b40
The --threshold flag now gates on pass rate, not mean score. Update the CLI help text and docs site to match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
--threshold now configures the per-test score requirement (default 0.8) instead of comparing mean score. The RESULT verdict and exit code are now consistent: exit 1 when any test scores below the threshold.

Before (contradictory):

    RESULT: FAIL (28/31 passed, mean score: 0.927)
    Suite score: 0.93 (threshold: 0.80) — PASS   ← exit code 0

After (consistent):

    RESULT: PASS (28/31 scored >= 0.8, mean: 0.927)   ← exit code 0

With --threshold 0.95:

    RESULT: FAIL (20/31 scored >= 0.95, mean: 0.927)   ← exit code 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
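The per-test rule stated in this commit message ("exit 1 when any test scores below the threshold") could be sketched as below. This is an assumption-laden illustration: the `EvaluationSummary` shape and the `summarize` and `exitCode` helpers are hypothetical names, not agentv's real API, and the scores are invented.

```typescript
// Hypothetical shapes for illustration; agentv's real types may differ.
interface EvaluationSummary {
  total: number;
  passed: number;
  failed: number;
  mean: number;
  threshold: number;
}

const DEFAULT_THRESHOLD = 0.8;

// Recompute passed/failed from raw scores against the per-test threshold.
function summarize(
  rawScores: number[],
  threshold: number = DEFAULT_THRESHOLD,
): EvaluationSummary {
  const passed = rawScores.filter((s) => s >= threshold).length;
  const mean =
    rawScores.length === 0
      ? 0
      : rawScores.reduce((a, b) => a + b, 0) / rawScores.length;
  return {
    total: rawScores.length,
    passed,
    failed: rawScores.length - passed,
    mean,
    threshold,
  };
}

// The exit code follows the same rule as the verdict:
// non-zero as soon as any test falls below the threshold.
function exitCode(summary: EvaluationSummary): number {
  return summary.failed > 0 ? 1 : 0;
}

// Made-up scores for illustration.
const sample = [0.85, 0.9, 0.97, 1.0];

console.log(exitCode(summarize(sample))); // 0: all four reach 0.8
console.log(exitCode(summarize(sample, 0.95))); // 1: 0.85 and 0.9 fall below 0.95
```

Deriving both the verdict and the exit code from the same `failed` count is what keeps the two from disagreeing.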
The threshold now flows from CLI → orchestrator → classifyQualityStatus, so the live progress line (e.g., "0.750 FAIL") and executionStatus in JSONL output both respect the custom threshold. Previously these were hardcoded to PASS_THRESHOLD (0.8) regardless of --threshold.

Added threshold field to RunEvaluationOptions, RunEvalCaseOptions, and all intermediate call sites (runBatchEvaluation, evaluateCandidate).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
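The plumbing described in this commit might look roughly like the sketch below. The names `PASS_THRESHOLD`, `RunEvaluationOptions`, and `classifyQualityStatus` come from the commit message, but their signatures here are assumed, and `progressLine` is a hypothetical helper added only to show the effect on the live progress output.

```typescript
// Hypothetical sketch; names mirror the commit message but signatures are assumed.
const PASS_THRESHOLD = 0.8;

type QualityStatus = "PASS" | "FAIL";

interface RunEvaluationOptions {
  threshold?: number; // optional override supplied by the CLI
}

// Classify one score; falls back to PASS_THRESHOLD when no override is given.
function classifyQualityStatus(
  score: number,
  threshold: number = PASS_THRESHOLD,
): QualityStatus {
  return score >= threshold ? "PASS" : "FAIL";
}

// Hypothetical helper: format a progress line for one case using the run-wide threshold.
function progressLine(score: number, options: RunEvaluationOptions = {}): string {
  return `${score.toFixed(3)} ${classifyQualityStatus(score, options.threshold)}`;
}

console.log(progressLine(0.75)); // "0.750 FAIL"
console.log(progressLine(0.75, { threshold: 0.7 })); // "0.750 PASS"
```

Making the threshold an optional parameter with a default keeps existing call sites working while letting the CLI value reach every place a status is computed.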
Closes #882
Summary
--threshold now configures the per-test score requirement (default 0.8) instead of comparing mean score. The RESULT verdict and exit code are now consistent.

Before (contradictory):

After (consistent):

With stricter threshold (--threshold 0.95):

Changes

- calculateEvaluationSummary() accepts optional threshold to recompute passed/failed from raw scores
- formatEvaluationSummary() shows the threshold in the RESULT line: scored >= 0.8
- formatThresholdSummary() (no longer needed)

Test plan
🤖 Generated with Claude Code