Add sampling modes for bt eval #103

Open
Parker Henderson (parkerhendo) wants to merge 2 commits into main from eval-flags

Parker Henderson (parkerhendo) commented Apr 8, 2026

TL;DR

Added sampling modes to bt eval that allow running evaluations on subsets of data for faster smoke testing.

What changed?

Added three new sampling modes for evaluations:

  • --first N flag runs only the first N examples from the dataset
  • --sample N flag runs a deterministic random sample of N examples
  • --sample-seed SEED flag controls the random seed used with --sample

When sampling is used, the evaluation is automatically marked as non-final and displays a clear label indicating it's a smoke run. The sampling is implemented using reservoir sampling for random selection and works with arrays, iterables, async iterables, and Braintrust datasets.
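For readers unfamiliar with reservoir sampling, here is a minimal sketch of how a seeded, single-pass sample over an iterable could work. It is not the code in this PR: `mulberry32`, `takeFirst`, and `reservoirSample` are hypothetical names, the choice of PRNG is an assumption, and an async-iterable variant would use `for await` in place of `for ... of`.

```typescript
// Small seeded PRNG (mulberry32). Assumption: the PR does not say which
// generator it uses; any deterministic PRNG would serve the same purpose.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // float in [0, 1)
  };
}

// Hypothetical helper mirroring `--first N`: stop after the first n items.
function takeFirst<T>(source: Iterable<T>, n: number): T[] {
  const out: T[] = [];
  if (n <= 0) return out;
  for (const item of source) {
    out.push(item);
    if (out.length >= n) break;
  }
  return out;
}

// Hypothetical helper mirroring `--sample N --sample-seed SEED`.
// Classic reservoir sampling (Algorithm R): one pass, O(n) memory,
// no need to know the dataset length up front.
function reservoirSample<T>(source: Iterable<T>, n: number, seed: number): T[] {
  if (!Number.isInteger(n) || n < 0) {
    throw new RangeError("n must be a non-negative integer");
  }
  const rand = mulberry32(seed);
  const reservoir: T[] = [];
  if (n === 0) return reservoir;
  let seen = 0;
  for (const item of source) {
    seen += 1;
    if (reservoir.length < n) {
      // Fill phase: keep the first n items unconditionally.
      reservoir.push(item);
    } else {
      // Replace phase: item number `seen` survives with probability n / seen.
      const j = Math.floor(rand() * seen);
      if (j < n) reservoir[j] = item;
    }
  }
  return reservoir;
}
```

Because the only source of randomness is the seeded generator, calling `reservoirSample` twice with the same input order and the same seed selects the same examples. If the dataset has fewer than `n` items, the sketch returns all of them.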

The evaluation summary now includes metadata about the run mode (full, first, or sample), whether it's a final run, sample count, and seed information.
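As a rough illustration of the metadata described above, the run mode could be derived from the flags along these lines. Every name here is hypothetical (`RunModeSummary`, `SamplingFlags`, `summarizeRunMode`, `describeRun`, `DEFAULT_SAMPLE_SEED`); the actual field names and label wording in the PR may differ. The sketch also assumes that `--first` and `--sample` are mutually exclusive and that an unspecified seed falls back to a fixed default, neither of which the description states.

```typescript
type RunMode = "full" | "first" | "sample";

// Hypothetical summary shape; field names are illustrative only.
interface RunModeSummary {
  mode: RunMode;
  isFinal: boolean;
  sampleCount?: number;
  sampleSeed?: number;
}

// Hypothetical parsed form of --first, --sample, and --sample-seed.
interface SamplingFlags {
  first?: number;
  sample?: number;
  sampleSeed?: number;
}

// Assumption: some fixed default keeps --sample deterministic without a seed.
const DEFAULT_SAMPLE_SEED = 0;

function summarizeRunMode(flags: SamplingFlags): RunModeSummary {
  if (flags.first !== undefined && flags.sample !== undefined) {
    // Assumption: the two sampling modes cannot be combined.
    throw new Error("--first and --sample cannot be used together");
  }
  if (flags.first !== undefined) {
    return { mode: "first", isFinal: false, sampleCount: flags.first };
  }
  if (flags.sample !== undefined) {
    return {
      mode: "sample",
      isFinal: false,
      sampleCount: flags.sample,
      sampleSeed: flags.sampleSeed ?? DEFAULT_SAMPLE_SEED,
    };
  }
  // No sampling flags: the full dataset runs and the result is final.
  return { mode: "full", isFinal: true };
}

// Hypothetical label text; the PR only says the label marks a smoke run.
function describeRun(s: RunModeSummary): string {
  if (s.isFinal) return "full run (final)";
  const detail =
    s.mode === "first"
      ? `first ${s.sampleCount} examples`
      : `random sample of ${s.sampleCount} examples, seed ${s.sampleSeed}`;
  return `smoke run (non-final): ${detail}`;
}
```

Keeping the seed in the summary means a sampled run can be reproduced later from its recorded metadata alone.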

How to test?

# Test first N examples
bt eval --first 20 qa.eval.ts

# Test random sampling with seed
bt eval --sample 20 --sample-seed 7 qa.eval.ts

# Test full dataset (existing behavior)
bt eval qa.eval.ts

Verify that sampled runs show the non-final smoke-run label in the output and that --sample selects the same examples when rerun with the same seed.

Why make this change?

This enables faster iteration during evaluation development by allowing developers to quickly test their evaluation logic on small subsets of data before running expensive full evaluations. The clear labeling prevents confusion about whether results represent complete evaluation runs.

github-actions bot commented Apr 8, 2026

Latest downloadable build artifacts for this PR commit cb5a761a8564:

Available artifact names
  • artifacts-build-global
  • artifacts-build-local-x86_64-pc-windows-msvc
  • artifacts-build-local-x86_64-apple-darwin
  • artifacts-build-local-x86_64-unknown-linux-musl
  • artifacts-build-local-aarch64-apple-darwin
  • artifacts-build-local-aarch64-unknown-linux-gnu
  • artifacts-build-local-x86_64-unknown-linux-gnu
  • artifacts-build-local-aarch64-unknown-linux-musl
  • artifacts-plan-dist-manifest
  • cargo-dist-cache
