This guide is for development of the Braintrust Python SDK in this repository. If you need to learn more about Braintrust itself, see the Braintrust docs: https://www.braintrust.dev/docs
Use this file as the default playbook for work in this repository.

- For SDK work, treat `py/` as the primary workspace.
  - Read files under `py/`.
  - Run commands from `py/`.
  - Prefer `py/` commands over repo-root wrappers unless the task is clearly repo-level.
- Use `mise` as the source of truth for tools and environment.
- Do not guess test commands or version coverage.
  - `py/noxfile.py` is the source of truth for nox session names, provider/version matrices, and local reproduction commands.
  - `.github/workflows/checks.yaml` is the source of truth for which sessions run in CI, on which Python versions, and outside vs. inside the nox shard matrix.
  - For provider and integration work, also check `py/src/braintrust/integrations/versioning.py`.
- Keep changes narrow and validate with the smallest relevant test first.
- Default bug-fix workflow: red -> green.
  - First add or update a test that reproduces the issue.
  - Then implement the fix.
  - Only skip this if the user explicitly asks for a different approach.
- Prefer real integration coverage over mocks.
  - For provider/integration behavior, prefer VCR-backed tests with checked-in cassettes.
  - This includes bugs in tracing/span shaping that happen after the SDK returns a real provider payload. If the behavior depends on the provider's actual response shape, treat it as VCR-first work, not mock-first work.
  - Be actively skeptical of mock/fake tests for provider integrations. Do not reach for mocks just because they are faster or easier to write.
  - Avoid mocks/fakes unless the code is purely local or there is no practical cassette-based option.
- Do not assume optional provider packages are installed.
  - Rely on the active nox session to install what it needs.
- Do not add `from __future__ import annotations` unless absolutely required.
  - It can change runtime annotation behavior in ways that break introspection.
  - Prefer quoted forward references or `TYPE_CHECKING` guards.
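
As a rough illustration of the last point, the sketch below defers a single annotation with a quoted forward reference behind a `TYPE_CHECKING` guard and leaves every other annotation as a real object. The function names and the use of `decimal.Decimal` are placeholders chosen for the example, not code from this repository.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers; skipped at runtime.
    from decimal import Decimal


def to_cents(amount: "Decimal") -> int:
    """Only this one annotation is deferred, via a quoted forward reference."""
    return int(amount * 100)


def double(value: int) -> int:
    """Unquoted annotations stay real objects, so introspection keeps working."""
    return value * 2
```

Here `double.__annotations__["value"]` is still the `int` class, whereas `from __future__ import annotations` would turn every annotation in the module into a string.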

Repository layout:

- `py/`: main Python package, tests, examples, nox sessions, build/release workflow
  - `py/src/braintrust/`: SDK source
    - top-level package files: core SDK
    - `wrappers/`: wrappers
    - `integrations/`: integrations API
    - `contrib/temporal/`: Temporal support
    - `cli/`, `devserver/`: CLI and devserver
    - `type_tests/`: static + runtime type tests
    - colocated `test_*.py`: local unit/integration tests
  - `py/benchmarks/`: pyperf benchmarks
- `integrations/`: separate integration packages
- `docs/`: supporting docs

Repo bootstrap:

    mise install
    make develop

SDK-focused setup:

    cd py
    make install-dev

Install optional provider dependencies only when needed:

    cd py
    make install-optional

When working on the SDK, prefer this sequence:

- `cd py`
- Read the relevant code and tests.
- Check `noxfile.py` for the exact session(s) that cover the change.
- If fixing behavior, add/update a reproducing test first.
- Make the smallest possible change.
- Run the narrowest affected test session first.
- Expand coverage only as needed.
- Before handoff, run broader hygiene checks if the change is large enough to justify them.

Common commands:

    cd py
    make lint
    make test-core
    nox -l

Notes:

- `cd py && make lint` runs pre-commit hooks and then `pylint`.
- `cd py && make pylint` runs only `pylint`.
- After major changes, run `cd py && make fixup` before handoff.
- The repo-root `Makefile` is a convenience wrapper; `py/Makefile` and `py/noxfile.py` are authoritative for SDK work.

Do not guess:
- nox session names
- supported provider versions
- which tests a provider session runs
Check py/noxfile.py and .github/workflows/checks.yaml, then reproduce with the exact local session CI uses.
Examples:

    cd py
    nox -s "test_openai(latest)"
    nox -s "test_openai(latest)" -- -k "test_chat_metrics"

Version-specific behavior matters in this repo.

Before changing provider/integration behavior:

- Read the relevant session(s) in `py/noxfile.py`.
- Read `py/src/braintrust/integrations/versioning.py`.
- Confirm which versions, gates, fallbacks, and feature checks must keep working.
- Do not stop at `latest` if the matrix includes older versions or version-specific branches.
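
To make the risk concrete, here is a minimal sketch of a version gate with a fallback branch. It is not taken from `versioning.py`: the threshold, the payload shapes, and both function names are hypothetical, and the parser only handles plain dotted releases.

```python
def parse_release(version: str) -> tuple[int, ...]:
    """Turn a plain dotted release such as "1.40.2" into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


# Hypothetical threshold, used only for this illustration.
NEW_RESPONSE_SHAPE_SINCE = (1, 0, 0)


def extract_usage(installed_version: str, payload: dict) -> dict:
    """Pick the code path that matches the installed provider version."""
    if parse_release(installed_version) >= NEW_RESPONSE_SHAPE_SINCE:
        # Newer versions (in this sketch): usage is a nested object.
        return payload["usage"]
    # Older fallback branch: usage fields sit at the top level.
    return {"total_tokens": payload["total_tokens"]}
```

A run against `latest` alone would exercise only the first branch, so a regression in the fallback would go unnoticed until an older-version session runs.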

Notes on specific sessions:

- `test_core` runs without optional vendor packages.
- `test_types` runs pyright, mypy, and pytest on `py/src/braintrust/type_tests/`.
- CI runs `pylint` and `test_types` via the dedicated `static_checks` workflow job on Ubuntu across the configured Python matrix, not inside the sharded `nox` job.
- The sharded `nox` workflow excludes `pylint` and `test_types`; use `py/scripts/nox-matrix.py --exclude-session ...` when reproducing shard membership locally.
- wrapper coverage is split across dedicated nox sessions by provider/version.
- `test-wheel` is a wheel sanity check and requires a built wheel first.

Use py/src/braintrust/type_tests/ when changing generic type signatures such as:

- `Eval`
- `EvalCase`
- `EvalScorer`
- `EvalHooks`

Rules:

- add or update a type test for the intended usage pattern
- name files `test_*.py`
- use absolute imports such as `from braintrust.framework import ...`
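
A type test following these rules might look roughly like the sketch below. To stay self-contained it defines a hypothetical generic `Case` class; a real file under `type_tests/` would instead import the SDK type under test with an absolute import.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

Input = TypeVar("Input")
Expected = TypeVar("Expected")


@dataclass
class Case(Generic[Input, Expected]):
    """Hypothetical stand-in for a generic SDK type; not the real EvalCase."""

    input: Input
    expected: Expected


def test_case_keeps_type_parameters() -> None:
    case: Case[str, int] = Case(input="2 + 2", expected=4)
    # Static half: pyright/mypy reject these assignments if the generic
    # parameters are not propagated to the attributes.
    text: str = case.input
    number: int = case.expected
    # Runtime half: pytest checks the values themselves.
    assert text == "2 + 2"
    assert number == 4
```

The same function serves both halves of the check: the annotated assignments are verified by the type checkers, and the assertions run under pytest.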

Run with:

    cd py
    nox -s test_types

For provider and integration behavior, the default path is:

- reproduce with a failing cassette-backed test
- implement the fix
- re-run the affected session
Do not downgrade to a mock/fake regression test just because the bug is in local post-processing of a real provider response. If the response shape is part of the behavior under test, the primary regression test should still be cassette-backed. Mock/unit tests may be added as supplemental coverage, not as the main reproduction, unless recording is genuinely impractical.
When deciding between a VCR test and a mock/fake test for provider behavior, bias heavily toward VCR. The burden of proof is on the mock: if you cannot clearly explain why a cassette-backed test is impractical, you should not be using a mock or fake as the primary regression coverage.
Cassette locations:

- `py/src/braintrust/cassettes/`
- `py/src/braintrust/wrappers/cassettes/`
- `py/src/braintrust/devserver/cassettes/`
- `py/src/braintrust/wrappers/claude_agent_sdk/cassettes/` for Claude Agent SDK subprocess transport recordings

Behavior from py/src/braintrust/conftest.py:

- local default: `record_mode="once"`
- CI default: `record_mode="none"`
- wheel mode skips VCR-marked tests
- fixtures inject dummy API keys and reset global state
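
The local/CI split in record modes can be pictured with the small sketch below. It is an assumption-laden illustration, not the code in `conftest.py`: the function name is hypothetical, and detecting CI through the `CI` environment variable is a guess about one common convention.

```python
import os
from collections.abc import Mapping
from typing import Optional


def default_record_mode(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return "none" in CI and "once" elsewhere.

    Assumption: CI is detected through the CI environment variable; the real
    conftest.py may use a different signal.
    """
    env = os.environ if environ is None else environ
    in_ci = env.get("CI", "").lower() in {"1", "true"}
    return "none" if in_ci else "once"
```

In VCR.py terms, `once` records a cassette only when none exists and replays it otherwise, while `none` never records and fails on any request missing from the cassette. That is why a test that passes locally by recording a new cassette can still fail in CI if the cassette was not checked in.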

Common commands:

    cd py
    nox -s "test_openai(latest)"
    nox -s "test_openai(latest)" -- --disable-vcr
    nox -s "test_openai(latest)" -- --vcr-record=all -k "test_openai_chat_metrics"

Claude Agent SDK note:

- it does not use HTTP VCR
- it talks to the bundled `claude` subprocess over stdin/stdout
- it uses transport-level cassette helpers instead

Common Claude Agent SDK commands:

    cd py
    nox -s "test_claude_agent_sdk(latest)"
    BRAINTRUST_CLAUDE_AGENT_SDK_RECORD_MODE=all nox -s "test_claude_agent_sdk(latest)"
    BRAINTRUST_CLAUDE_AGENT_SDK_RECORD_MODE=all nox -s "test_claude_agent_sdk(latest)" -- -k "test_calculator_with_multiple_operations"

Only re-record HTTP or subprocess cassettes when the behavior change is intentional. If unsure, ask the user.

If you touch a hot path such as serialization, deep-copy, span creation, or logging, consider benchmarks.
Quick commands:

    cd py
    make bench
    make bench BENCH_ARGS="--fast"
    make bench BENCH_ARGS="-o /tmp/before.json"
    make bench BENCH_ARGS="-o /tmp/after.json"
    make bench-compare BENCH_BASE=/tmp/before.json BENCH_NEW=/tmp/after.json

Rules:

- benchmark hot-path changes when practical
- benchmark files live in `py/benchmarks/benches/`
- new files should be named `bench_<name>.py`
- each benchmark file must expose `main(runner: pyperf.Runner | None = None)`
- shared payload builders belong in `py/benchmarks/fixtures.py`

See py/benchmarks/benches/bench_bt_json.py for the pattern.
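
For orientation only, a new benchmark file could be shaped roughly like the sketch below. The payload builder and benchmark name are hypothetical, and the lazy `pyperf` import is an assumption made so the sketch loads without `pyperf` installed; the referenced file remains the authoritative pattern.

```python
import json
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    import pyperf


def make_payload(size: int = 100) -> dict[str, Any]:
    """Hypothetical payload builder; shared ones would live in fixtures.py."""
    return {"items": [{"id": i, "name": f"item-{i}"} for i in range(size)]}


def main(runner: "pyperf.Runner | None" = None) -> None:
    """Register benchmarks on the given runner, creating one if needed."""
    if runner is None:
        # Imported lazily so the module can be loaded without pyperf.
        import pyperf

        runner = pyperf.Runner()
    # bench_func(name, func, *args) times repeated calls of func(*args).
    runner.bench_func("json_dumps_small_payload", json.dumps, make_payload())
```

Accepting an optional runner lets a shared driver pass in one `pyperf.Runner` for all benchmark files, while a standalone script can still call `main()` with no arguments.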

Build from py/:

    cd py
    make build

Caveat:

- `py/scripts/template-version.py` rewrites `py/src/braintrust/version.py` during build
- `py/Makefile` restores that file afterward with `git checkout`

Avoid editing py/src/braintrust/version.py while also running build commands.
- Keep tests close to the code they cover.
- Reuse existing fixtures and cassette patterns.
- Prefer extending an existing cassette-backed test over adding a new mock-heavy test.
- If a change affects examples or integrations, update the nearest example or focused test.
- For CLI/devserver changes, consider whether wheel-mode behavior also needs coverage.

Quick checklist:

- Changing SDK code? Work from `py/`.
- Need a test command? Read `py/noxfile.py`.
- Fixing a bug? Add/update a failing test first.
- Changing provider/integration behavior? Use VCR-backed coverage and check version gates.
- Changing generic typing? Add/update a file in `py/src/braintrust/type_tests/` and run `nox -s test_types`.
- Touching a hot path? Consider `cd py && make bench`.
- Preparing handoff after a major change? Run `cd py && make fixup`.