Add sampling modes for bt eval #103
Open
Parker Henderson (parkerhendo) wants to merge 2 commits into main
TL;DR
Added sampling modes to `bt eval` that allow running evaluations on subsets of data for faster smoke testing.
What changed?
Added three new flags that control sampling for evaluations:
- `--first N` runs only the first N examples from the dataset
- `--sample N` runs a deterministic random sample of N examples
- `--sample-seed SEED` controls the random seed used with `--sample`

When sampling is used, the evaluation is automatically marked as non-final and displays a clear label indicating it's a smoke run. Random selection is implemented with reservoir sampling, and sampling works with arrays, iterables, async iterables, and Braintrust datasets.
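For reviewers less familiar with the technique, here is a minimal Python sketch of the two selection strategies described above: taking the first N examples, and seeded reservoir sampling (Algorithm R) over a stream of unknown length. The function names `take_first` and `reservoir_sample` are hypothetical and chosen for illustration; this is not the code in this PR, and the PR's RNG and output ordering may differ.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def take_first(examples: Iterable[T], n: int) -> Iterator[T]:
    """Yield at most the first n examples without consuming the rest."""
    return islice(examples, n)


def reservoir_sample(
    examples: Iterable[T], n: int, seed: Optional[int] = None
) -> List[T]:
    """Uniformly sample up to n items from an iterable in a single pass.

    Uses Algorithm R. With a fixed seed and the same input order,
    the result is reproducible.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    rng = random.Random(seed)  # private RNG, global random state is untouched
    reservoir: List[T] = []
    for index, example in enumerate(examples):
        if index < n:
            # Fill the reservoir with the first n items.
            reservoir.append(example)
        else:
            # Keep the new item with probability n / (index + 1).
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < n:
                reservoir[slot] = example
    return reservoir


# Usage: the same seed over the same data yields the same sample.
sample_a = reservoir_sample(range(1000), 5, seed=42)
sample_b = reservoir_sample(range(1000), 5, seed=42)
assert sample_a == sample_b
```

Reservoir sampling suits this use case because it needs only one pass and O(N) memory, so it works on iterables whose length is not known up front. Note that in this sketch the sampled items are not guaranteed to come back in dataset order.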
The evaluation summary now includes metadata about the run mode (`full`, `first`, or `sample`), whether it's a final run, sample count, and seed information.
How to test?
Verify that sampled runs show appropriate non-final labels in the output and that the sampling produces consistent results when using the same seed.
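To make the check concrete, the sketch below models the run metadata listed in the description (mode, final flag, sample count, seed) and the kind of non-final label a reviewer would look for. The `RunInfo` class, its field names, and the label wording are all assumptions for illustration; the actual summary format produced by this PR is not shown in the description.

```python
from dataclasses import dataclass
from typing import Optional

VALID_MODES = ("full", "first", "sample")


@dataclass(frozen=True)
class RunInfo:
    """Hypothetical container for the run metadata in the eval summary."""

    mode: str                           # "full", "first", or "sample"
    sample_count: Optional[int] = None  # N from --first or --sample
    seed: Optional[int] = None          # only meaningful for "sample"

    def __post_init__(self) -> None:
        if self.mode not in VALID_MODES:
            raise ValueError(f"unknown mode: {self.mode!r}")

    @property
    def is_final(self) -> bool:
        # Only a full run counts as final; any sampled run is a smoke run.
        return self.mode == "full"

    def label(self) -> str:
        """Return a human-readable label (wording is illustrative only)."""
        if self.is_final:
            return "full run (final)"
        if self.mode == "first":
            return f"smoke run (non-final): first {self.sample_count} examples"
        return (
            f"smoke run (non-final): sample of {self.sample_count} "
            f"examples, seed {self.seed}"
        )


# Usage: sampled runs are never final and always carry a visible label.
info = RunInfo(mode="sample", sample_count=20, seed=42)
assert not info.is_final
assert "non-final" in info.label()
```

Deriving `is_final` from the mode, rather than storing it separately, means a sampled run cannot be marked final by mistake in this sketch.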
Why make this change?
This enables faster iteration during evaluation development by allowing developers to quickly test their evaluation logic on small subsets of data before running expensive full evaluations. The clear labeling prevents confusion about whether results represent complete evaluation runs.