fix(cli): make --threshold override per-test score requirement #885

Open
christso wants to merge 6 commits into main from fix/result-verdict-use-mean-score
christso (Collaborator) commented Mar 31, 2026

Closes #882

Summary

--threshold now configures the per-test score requirement (default 0.8) instead of being compared against the suite's mean score. The RESULT verdict and the exit code are now consistent.

Before (contradictory):

RESULT: FAIL  (28/31 passed, mean score: 0.927)
Suite score: 0.93 (threshold: 0.80) — PASS     ← exit code 0

After (consistent):

RESULT: PASS  (28/31 scored >= 0.8, mean: 0.927)    ← exit code 0

With stricter threshold (--threshold 0.95):

RESULT: FAIL  (20/31 scored >= 0.95, mean: 0.927)   ← exit code 1

Changes

  • calculateEvaluationSummary() accepts an optional threshold and uses it to recompute the passed/failed counts from raw scores
  • formatEvaluationSummary() shows the threshold in the RESULT line: scored >= 0.8
  • Exit code matches RESULT verdict — no separate threshold check
  • Removed formatThresholdSummary() (no longer needed)
  • Updated CLI help text and docs
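
A minimal sketch of the recomputation described in the list above, assuming raw scores are available as a plain array. The names `summarizeScores`, `formatCounts`, and `EvaluationSummarySketch` are hypothetical stand-ins, not the signatures of calculateEvaluationSummary() or formatEvaluationSummary(), which are not shown on this page. It covers only the counts and the mean, not how the PASS/FAIL verdict is derived from them.

```typescript
// Hypothetical shape; the real summary type in the PR is not shown here.
interface EvaluationSummarySketch {
  total: number;
  passed: number;
  failed: number;
  mean: number;
  threshold: number;
}

// Default per-test requirement stated in the PR summary.
const DEFAULT_THRESHOLD = 0.8;

// Recompute passed/failed from raw scores against the given threshold.
function summarizeScores(
  scores: number[],
  threshold: number = DEFAULT_THRESHOLD,
): EvaluationSummarySketch {
  const passed = scores.filter((s) => s >= threshold).length;
  const mean =
    scores.length === 0 ? 0 : scores.reduce((a, b) => a + b, 0) / scores.length;
  return { total: scores.length, passed, failed: scores.length - passed, mean, threshold };
}

// Build the parenthetical part of the RESULT line, including the threshold.
function formatCounts(summary: EvaluationSummarySketch): string {
  return `${summary.passed}/${summary.total} scored >= ${summary.threshold}, mean: ${summary.mean.toFixed(3)}`;
}

const exampleScores = [0.75, 0.9, 1.0];
console.log(formatCounts(summarizeScores(exampleScores)));       // 2/3 scored >= 0.8, mean: 0.883
console.log(formatCounts(summarizeScores(exampleScores, 0.95))); // 1/3 scored >= 0.95, mean: 0.883
```

Raising the threshold changes only the passed/failed counts; the mean is unaffected, which is why the RESULT line reports both.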

Test plan

  • All unit tests pass (including new threshold tests)
  • Build succeeds
  • Pre-push hooks pass

🤖 Generated with Claude Code

cloudflare-workers-and-pages bot commented Mar 31, 2026

Deploying agentv with Cloudflare Pages

Latest commit: cd69e5d
Status: ✅  Deploy successful!
Preview URL: https://4433fa69.agentv.pages.dev
Branch Preview URL: https://fix-result-verdict-use-mean.agentv.pages.dev

christso and others added 3 commits March 31, 2026 13:01
The RESULT: PASS/FAIL line used all-must-pass logic (every individual
case must score >= 0.8), while --threshold used mean-based scoring.
This caused confusing contradictory output:

  RESULT: FAIL  (28/31 passed, mean score: 0.927)
  Suite score: 0.93 (threshold: 0.80) — PASS

Now the RESULT line uses mean >= 0.8, consistent with --threshold.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The RESULT line and --threshold check now both use pass rate (fraction
of cases scoring >= 0.8) instead of inconsistent metrics. Previously
the RESULT line used all-must-pass while --threshold used mean score.

Before:
  RESULT: FAIL  (28/31 passed, mean score: 0.927)
  Suite score: 0.93 (threshold: 0.80) — PASS

After:
  RESULT: FAIL  (pass rate: 90.3%, 28/31 passed, mean score: 0.927)
  Suite pass rate: 90.3% (threshold: 80.0%) — PASS

Both paths now consistently use pass rate. The RESULT line is
informational (all-must-pass), while --threshold gates CI exit code
against a configurable pass rate minimum.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RESULT: FAIL  (28 passed, 3 failed, mean score: 0.927)

The failed count makes it immediately obvious why the verdict is FAIL.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
christso force-pushed the fix/result-verdict-use-mean-score branch from 00f0b19 to 6803b40 March 31, 2026 13:06
christso changed the title from "fix(cli): use mean score for RESULT verdict instead of all-must-pass" to "fix(cli): use pass rate for --threshold and clarify RESULT verdict" Mar 31, 2026
The --threshold flag now gates on pass rate, not mean score. Update
the CLI help text and docs site to match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
christso changed the title from "fix(cli): use pass rate for --threshold and clarify RESULT verdict" to "fix(cli): make --threshold override per-test score requirement" Mar 31, 2026
christso and others added 2 commits March 31, 2026 13:36
--threshold now configures the per-test score requirement (default 0.8)
instead of comparing mean score. The RESULT verdict and exit code are
now consistent: exit 1 when any test scores below the threshold.

Before (contradictory):
  RESULT: FAIL  (28/31 passed, mean score: 0.927)
  Suite score: 0.93 (threshold: 0.80) — PASS  ← exit code 0

After (consistent):
  RESULT: PASS  (28/31 scored >= 0.8, mean: 0.927)  ← exit code 0

With --threshold 0.95:
  RESULT: FAIL  (20/31 scored >= 0.95, mean: 0.927)  ← exit code 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The threshold now flows from CLI → orchestrator → classifyQualityStatus,
so the live progress line (e.g., "0.750 FAIL") and executionStatus in
JSONL output both respect the custom threshold. Previously these were
hardcoded to PASS_THRESHOLD (0.8) regardless of --threshold.

Added threshold field to RunEvaluationOptions, RunEvalCaseOptions, and
all intermediate call sites (runBatchEvaluation, evaluateCandidate).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
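
The threading described in the last commit can be sketched as follows. This is an illustration under assumptions: `classifyScore`, `progressLine`, `RunOptionsSketch`, and the `"PASS" | "FAIL"` return type are hypothetical, and the actual signature and return values of classifyQualityStatus are not shown on this page. Only the PASS_THRESHOLD default of 0.8 and the optional threshold field come from the commit message.

```typescript
// Default from the commit message: PASS_THRESHOLD (0.8).
const PASS_THRESHOLD = 0.8;

// Hypothetical status type; the real executionStatus values are not shown in the PR.
type QualityStatusSketch = "PASS" | "FAIL";

// Hypothetical options object carrying the optional threshold from the CLI downward.
interface RunOptionsSketch {
  threshold?: number;
}

// Classify a single score; falls back to PASS_THRESHOLD when no threshold is given.
function classifyScore(
  score: number,
  threshold: number = PASS_THRESHOLD,
): QualityStatusSketch {
  return score >= threshold ? "PASS" : "FAIL";
}

// Render a live progress entry such as "0.750 FAIL" using the threaded threshold.
function progressLine(score: number, options: RunOptionsSketch = {}): string {
  return `${score.toFixed(3)} ${classifyScore(score, options.threshold)}`;
}

console.log(progressLine(0.75));                     // 0.750 FAIL
console.log(progressLine(0.75, { threshold: 0.7 })); // 0.750 PASS
```

Keeping the threshold optional at every layer means callers that never pass --threshold keep the previous 0.8 behavior.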
Development

Successfully merging this pull request may close these issues.

bug(cli): --threshold compares mean score instead of per-test score