Add sampling modes for `bt eval` by parkerhendo · Pull Request #103 · braintrustdata/bt

Parker Henderson (parkerhendo) · 2026-04-08T22:15:05Z

TL;DR

Added sampling modes to bt eval that allow running evaluations on subsets of data for faster smoke testing.

What changed?

Added three new flags that control sampling for evaluations:

--first N flag runs only the first N examples from the dataset
--sample N flag runs a deterministic random sample of N examples
--sample-seed SEED flag controls the random seed used with --sample

When sampling is used, the evaluation is automatically marked as non-final and displays a clear label indicating it's a smoke run. The sampling is implemented using reservoir sampling for random selection and works with arrays, iterables, async iterables, and Braintrust datasets.
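To make the selection logic concrete, here is a minimal sketch of first-N selection and seeded reservoir sampling (Algorithm R). It is not the code from this PR: the names `mulberry32`, `createReservoir`, `takeFirst`, and `sampleN` are hypothetical, and the choice of mulberry32 as the seeded generator is an assumption made only so the example is reproducible.

```typescript
// Hypothetical sketch; names and PRNG choice are assumptions, not the PR's code.

// Small seeded PRNG (mulberry32) returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Reservoir of fixed size: every item pushed so far has an equal chance
// of being kept, without knowing the stream length in advance.
function createReservoir<T>(size: number, seed: number) {
  const rand = mulberry32(seed);
  const kept: T[] = [];
  let seen = 0;
  return {
    push(item: T): void {
      seen += 1;
      if (kept.length < size) {
        kept.push(item);
        return;
      }
      // Replace a random slot with probability size / seen.
      const j = Math.floor(rand() * seen);
      if (j < size) kept[j] = item;
    },
    items(): T[] {
      return kept.slice();
    },
  };
}

type Source<T> = Iterable<T> | AsyncIterable<T>;

// First-N mode: stop reading the source as soon as n items are collected.
async function takeFirst<T>(source: Source<T>, n: number): Promise<T[]> {
  const out: T[] = [];
  if (n <= 0) return out;
  for await (const item of source) {
    out.push(item);
    if (out.length >= n) break;
  }
  return out;
}

// Sample mode: one pass over the whole source, keeping n items.
async function sampleN<T>(source: Source<T>, n: number, seed: number): Promise<T[]> {
  const reservoir = createReservoir<T>(n, seed);
  for await (const item of source) reservoir.push(item);
  return reservoir.items();
}
```

Because `for await` accepts both sync and async iterables, the same two helpers cover arrays, generators, and streamed datasets. Note the trade-off: first-N can stop early, while reservoir sampling has to read the entire source once.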

The evaluation summary now includes metadata about the run mode (full, first, or sample), whether it's a final run, sample count, and seed information.
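As a rough illustration of the metadata described above, the sketch below models one possible shape for it. The field names (`mode`, `isFinal`, `sampleCount`, `sampleSeed`), the default seed of 0, the label wording, and the rule that `--first` and `--sample` cannot be combined are all assumptions for the example; the PR description does not specify them.

```typescript
// Hypothetical shape; field names, default seed, and labels are assumptions.
type RunMode = "full" | "first" | "sample";

interface RunMeta {
  mode: RunMode;
  isFinal: boolean; // only a full run counts as final
  sampleCount?: number; // N from --first or --sample
  sampleSeed?: number; // only meaningful in "sample" mode
}

interface SamplingFlags {
  first?: number;
  sample?: number;
  sampleSeed?: number;
}

// Derive run metadata from the parsed CLI flags.
function buildRunMeta(flags: SamplingFlags): RunMeta {
  if (flags.first !== undefined && flags.sample !== undefined) {
    // Assumed rule: the two modes are mutually exclusive.
    throw new Error("--first and --sample cannot be combined");
  }
  if (flags.first !== undefined) {
    return { mode: "first", isFinal: false, sampleCount: flags.first };
  }
  if (flags.sample !== undefined) {
    return {
      mode: "sample",
      isFinal: false,
      sampleCount: flags.sample,
      sampleSeed: flags.sampleSeed ?? 0, // assumed default seed
    };
  }
  return { mode: "full", isFinal: true };
}

// Human-readable label so a smoke run is never mistaken for a full run.
function runLabel(meta: RunMeta): string {
  switch (meta.mode) {
    case "first":
      return `smoke run (non-final): first ${meta.sampleCount} examples`;
    case "sample":
      return `smoke run (non-final): sample of ${meta.sampleCount}, seed ${meta.sampleSeed}`;
    default:
      return "full run (final)";
  }
}
```

For example, flags equivalent to `--sample 20 --sample-seed 7` would yield a non-final record with `mode: "sample"`, which downstream tooling could use to exclude smoke runs from comparisons.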

How to test?

# Test first N examples
bt eval --first 20 qa.eval.ts

# Test random sampling with seed
bt eval --sample 20 --sample-seed 7 qa.eval.ts

# Test full dataset (existing behavior)
bt eval qa.eval.ts

Verify that sampled runs show appropriate non-final labels in the output and that the sampling produces consistent results when using the same seed.

Why make this change?

This enables faster iteration during evaluation development by allowing developers to quickly test their evaluation logic on small subsets of data before running expensive full evaluations. The clear labeling prevents confusion about whether results represent complete evaluation runs.

github-actions · 2026-04-08T22:24:28Z

Latest downloadable build artifacts for this PR commit cb5a761a8564:

Workflow run: https://github.com/braintrustdata/bt/actions/runs/24202900986
Download all artifacts (GitHub CLI): gh run download 24202900986 --repo braintrustdata/bt
Installers are published from main automatically. To publish one for a PR branch, run release-canary manually via workflow_dispatch.

Available artifact names

`artifacts-build-global`
`artifacts-build-local-x86_64-pc-windows-msvc`
`artifacts-build-local-x86_64-apple-darwin`
`artifacts-build-local-x86_64-unknown-linux-musl`
`artifacts-build-local-aarch64-apple-darwin`
`artifacts-build-local-aarch64-unknown-linux-gnu`
`artifacts-build-local-x86_64-unknown-linux-gnu`
`artifacts-build-local-aarch64-unknown-linux-musl`
`artifacts-plan-dist-manifest`
`cargo-dist-cache`

feat: add sampling modes for eval command

2b6b323

refactor(eval-runner): improve RNG and sampling source handling

cb5a761

Parker Henderson (parkerhendo) marked this pull request as ready for review April 10, 2026 18:50

Parker Henderson (parkerhendo) requested review from Ankur Goyal (ankrgyl) and Nathan Selvidge (nselvidge) April 10, 2026 18:50

Parker Henderson (parkerhendo) wants to merge 2 commits into main from eval-flags
