diff --git a/.agents/skills/commit-message/SKILL.md b/.agents/skills/commit-message/SKILL.md new file mode 100644 index 00000000..a92a0ac1 --- /dev/null +++ b/.agents/skills/commit-message/SKILL.md @@ -0,0 +1,94 @@ +--- +name: commit-message +description: Suggest a Braintrust SDK repo-style commit message from the current diff and conversation. Use when asked to write, suggest, or generate a commit message for the current changes. +--- + +# Commit Message + +Generate a single commit message that matches the style used on `main` in this repo. + +## Repo Style + +Prefer: + +```text +<type>(<scope>): <subject> +``` + +or, when scope does not add much: + +```text +<type>: <subject> +``` + +Recent `main` examples: + +- `feat(openai): trace images api calls` +- `fix(framework): split \`Output\` TypeVar into \`Output\` and \`Expected\`` +- `ref(litellm): migrate litellm wrapper to integrations API` +- `chore: generated SDK types` +- `ci(checks): bump nox shards to 4 and introduce shard weights` +- `test(openai): add vcr regression coverage for stream helpers` +- `docs: document integrations in readme` +- `perf(json): reduce span serialization overhead` + +Notes: + +- This repo uses `ref`, not `refactor`. +- Scope is common and usually names the subsystem, provider, or area being changed. +- Commits on `main` often include a GitHub squash suffix like `(#245)`. Omit that for a normal local commit unless the user explicitly wants a PR title or squash-merge title. +- Most commits in this repo are title-only, but the message should still be useful. Prefer a concise subject that makes the main change obvious, and add a short body when the subject alone would feel too vague.
+ +## Types + +Use the most specific type: + +- `feat` — new feature +- `fix` — bug fix +- `ref` — restructuring without behavior change +- `perf` — performance improvement +- `test` — tests only +- `docs` — documentation only +- `chore` — tooling, generated files, maintenance, config +- `ci` — GitHub Actions, nox sharding, CI wiring +- `style` — formatting only +- `revert` — reverting a prior change + +## Scope Guidance + +Good scopes in this repo usually look like: + +- provider or integration names: `openai`, `anthropic`, `google_genai`, `claude_agent_sdk`, `langchain` +- SDK areas: `framework`, `cli`, `devserver`, `integrations` +- CI/tooling areas: `checks`, `nox`, `release` + +If the change is broad or generated, omit scope instead of forcing one. + +## Rules + +- Keep the message useful but concise. +- Keep the subject concise and imperative. +- Prefer lowercase style and no trailing period. +- Keep the subject around 72 characters or less when practical. +- Describe the primary change, not every file touched. +- Make the subject specific enough that a reviewer can understand the change without opening the diff. +- If the diff mixes unrelated concerns, say so instead of forcing one message. +- Add a short body when non-obvious rationale, behavior changes, or breaking changes would make the message meaningfully clearer. + +## Workflow + +1. Inspect the current change: + +```bash +git diff HEAD +git diff --cached +git status --short +``` + +2. Pick the main intent, choose `type` and optional `scope`, then write one useful but concise commit message. + +## Output + +Return exactly one commit message in a fenced code block. + +If helpful, add one short sentence after the block explaining the type/scope choice. 
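The rules above are mechanical enough to sanity-check in code. A minimal sketch, using the type list and length guidance from this skill (the helper name and regex are illustrative, not part of the repo):

```python
import re

# Types come from this skill document; the regex shape follows the repo style above.
TYPES = {"feat", "fix", "ref", "perf", "test", "docs", "chore", "ci", "style", "revert"}
SUBJECT_RE = re.compile(r"^(?P<type>[a-z]+)(?:\((?P<scope>[a-z0-9_-]+)\))?: (?P<subject>.+)$")

def check_subject(line: str) -> list:
    """Return a list of style problems; an empty list means the subject looks fine."""
    m = SUBJECT_RE.match(line)
    if not m:
        return ["subject does not match 'type(scope): subject' or 'type: subject'"]
    problems = []
    if m.group("type") not in TYPES:
        problems.append(f"unknown type {m.group('type')!r} (this repo uses 'ref', not 'refactor')")
    if m.group("subject").endswith("."):
        problems.append("drop the trailing period")
    if len(line) > 72:
        problems.append("keep the subject around 72 characters or less")
    return problems

print(check_subject("feat(openai): trace images api calls"))  # []
print(check_subject("refactor(core): tidy up imports"))
```

A check like this is only a guardrail; picking the right type and scope still requires reading the diff.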
diff --git a/.agents/skills/sdk-ci-triage/SKILL.md b/.agents/skills/sdk-ci-triage/SKILL.md new file mode 100644 index 00000000..ddb2f270 --- /dev/null +++ b/.agents/skills/sdk-ci-triage/SKILL.md @@ -0,0 +1,319 @@ +--- +name: sdk-ci-triage +description: Triage and reproduce Braintrust Python SDK CI failures. Use when asked why CI failed, to fix broken CI on a PR, to inspect a failing GitHub Actions job, or to map a failing matrix job back to the exact local nox session, provider version, or workflow step that must be reproduced. +--- + +# SDK CI Triage + +Use this skill when the task is about GitHub Actions failures in the Braintrust Python SDK repo. + +This repo's CI is heavily matrixed. Do not guess which local command matches a failing job. Start from the workflow, map the failing job to the exact nox shard or workflow step, then reproduce the narrowest failing command locally. + +## Read First + +Always read: + +- `AGENTS.md` +- `.github/workflows/checks.yaml` +- `py/noxfile.py` +- `py/scripts/nox-matrix.py` +- `py/scripts/session-weights.json` + +Read when relevant: + +- `.github/workflows/adk-py-test.yaml` +- `.github/workflows/langchain-py-test.yaml` +- `py/src/braintrust/conftest.py` for VCR behavior +- the provider test file named in the failing logs +- the provider integration package under `py/src/braintrust/integrations/<provider>/` +- `py/Makefile` when the failure is in lint, build, or wheel-related steps + +## CI Shape In This Repo + +`checks.yaml` currently has these main categories: + +- `lint`: pre-commit diff-based checks +- `ensure-pinned-actions`: workflow hygiene +- `smoke`: install/import matrix across Python and OS +- `nox`: provider and core test matrix, sharded through `py/scripts/nox-matrix.py` +- `adk-py`: reusable workflow for ADK coverage +- `langchain-py`: reusable workflow for LangChain coverage +- `upload-wheel`: build wheel sanity check + +The most common failure source is the `nox` matrix job. + +## Standard Workflow + +1.
Identify the failing PR, run, or job. +2. Inspect the failing job logs with `gh`. +3. Determine which workflow branch failed: + - `lint` + - `smoke` + - `nox` + - reusable workflow (`adk-py`, `langchain-py`) + - `upload-wheel` +4. For `nox` failures, map the matrix job to the exact nox session and pinned provider version from the logs. +5. Reproduce the narrowest failing command locally. +6. Fix the bug. +7. Re-run the narrowest failing command first. +8. Expand only if shared code changed. + +Do not start by running the whole suite locally unless the failure genuinely spans many sessions. + +## GitHub CLI Playbook + +Use `gh` first; do not manually browse logs when the CLI can answer the question directly. + +Common starting points: + +```bash +gh pr checks --repo braintrustdata/braintrust-sdk-python +gh run list --repo braintrustdata/braintrust-sdk-python --limit 10 +gh run view <run-id> --repo braintrustdata/braintrust-sdk-python +gh run view <run-id> --repo braintrustdata/braintrust-sdk-python --log-failed +gh run view --job <job-id> --repo braintrustdata/braintrust-sdk-python --log-failed +``` + +When you need structured data for job names, statuses, and ids: + +```bash +gh api repos/braintrustdata/braintrust-sdk-python/actions/runs/<run-id>/jobs \ + --jq '.jobs[] | {name: .name, status: .status, conclusion: .conclusion, id: .id}' +``` + +When you already know the job id and want raw logs: + +```bash +gh api repos/braintrustdata/braintrust-sdk-python/actions/jobs/<job-id>/logs +``` + +## How To Triage By Job Type + +### `nox` matrix jobs + +Job names look like this: + +```text +nox (3.10, ubuntu-latest, 0) +``` + +That means: + +- Python `3.10` +- OS `ubuntu-latest` +- shard `0` out of 4 + +The workflow runs: + +```bash +mise exec python@<python-version> -- python ./py/scripts/nox-matrix.py <shard> 4 +``` + +Use a dry run first to see which sessions belong to the shard: + +```bash +mise exec python@3.10 -- python ./py/scripts/nox-matrix.py 0 4 --dry-run +``` + +Then inspect the failing logs to find the exact session name, for
example: + +```text +nox > Running session test_google_genai(1.30.0) +... +nox > Session test_google_genai(1.30.0) failed. +``` + +Now reproduce the exact session locally from `py/`: + +```bash +cd py +nox -s "test_google_genai(1.30.0)" +``` + +If the logs show a narrower failing test, use `-- -k` on the session: + +```bash +cd py +nox -s "test_google_genai(1.30.0)" -- -k "interactions" +``` + +Do not substitute `latest` for an older pinned version from CI. + +### `lint` + +The lint job currently runs pre-commit against the PR diff: + +```bash +mise exec -- pre-commit run --from-ref origin/<base-branch> --to-ref HEAD +``` + +For local reproduction, use the same shape when possible: + +```bash +mise exec -- pre-commit run --from-ref origin/main --to-ref HEAD +``` + +If the failure is really from SDK linting, also check: + +```bash +cd py +make lint +make pylint +``` + +### `smoke` + +The smoke job validates install + import across OS and Python versions. + +Local equivalents: + +```bash +mise exec python@3.10 -- uv pip install -e ./py[all] +mise exec python@3.10 -- uv run --active --no-project python -c 'import braintrust' +``` + +Use this path when CI fails before tests even start. + +### `upload-wheel` + +Reproduce from `py/`: + +```bash +cd py +make install-build-deps +make build +``` + +Remember that build rewrites `py/src/braintrust/version.py` temporarily. + +### Reusable workflows (`adk-py`, `langchain-py`) + +Read the reusable workflow file and map it back to the exact nox or make command it runs. Then reproduce that command locally rather than guessing from the job name alone. + +## Common Failure Patterns In This Repo + +### 1. Version-matrix mismatch + +This is the most common CI triage mistake.
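A failing session line can be mapped to its local reproduction command mechanically. A sketch, assuming the `nox > Session ... failed.` log format shown above (the helper itself is hypothetical, not part of the repo's tooling):

```python
import re

# Matches failure lines like: nox > Session test_google_genai(1.30.0) failed.
FAILED_SESSION_RE = re.compile(r"nox > Session (?P<session>[\w.]+\([^)]*\)) failed\.")

def repro_commands(ci_log: str) -> list:
    """Extract failed nox sessions and emit the exact local reproduction commands."""
    return [
        f'cd py && nox -s "{m.group("session")}"'
        for m in FAILED_SESSION_RE.finditer(ci_log)
    ]

log = """
nox > Running session test_google_genai(1.30.0)
nox > Session test_google_genai(1.30.0) failed.
"""
print(repro_commands(log))  # ['cd py && nox -s "test_google_genai(1.30.0)"']
```

The point is to carry the exact pinned version string from the log into the local command, never `latest`.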
+ +Symptoms: + +- import errors on older provider versions +- missing attributes or methods in pinned versions +- tests written against `latest` behavior only + +Checklist: + +- read the relevant version tuple in `py/noxfile.py` +- reproduce the exact pinned session from CI +- inspect `py/src/braintrust/integrations/versioning.py` when version routing is involved +- prefer feature detection or declarative version gating over ad hoc branching + +### 2. Optional dependency assumptions + +Symptoms: + +- local machine passes because extra packages are installed +- CI fails in `test_core` or a focused provider session + +Checklist: + +- do not rely on packages outside the active nox session +- read the session in `py/noxfile.py` to see what gets installed +- remember `test_core` intentionally runs without vendor packages + +### 3. VCR / cassette issues + +Symptoms: + +- cassette mismatch +- tests that work locally only with real credentials +- CI failures in VCR-marked provider tests + +Checklist: + +- read `py/src/braintrust/conftest.py` +- remember local default is `record_mode="once"`, CI is `record_mode="none"` +- only re-record cassettes when behavior intentionally changed +- prefer a narrow `-- -k <expression>` rerun under the exact nox session + +### 4. Auto-instrument / import-order regressions + +Symptoms: + +- subprocess auto-instrument tests fail +- import-before-setup versus setup-before-import behavior diverges +- patcher idempotence regressions + +Checklist: + +- inspect `py/src/braintrust/auto.py` +- inspect `py/src/braintrust/integrations/auto_test_scripts/` +- run the narrowest affected provider session first + +### 5.
OS-specific failures + +Symptoms: + +- Windows-only path or shell issues +- Linux-only import/install behavior + +Checklist: + +- use the exact job's Python version and read the failing job name carefully +- avoid assuming POSIX-only shell behavior in Python-facing logic +- inspect path normalization and file handling code first when Windows alone fails + +## Reproduction Rules + +- Reproduce the exact failing session before broadening. +- Use the exact provider version shown in CI logs. +- Use `cd py` for SDK test commands. +- Use `mise` as the source of truth for Python versions and environment. +- If the failure is in `nox`, do not jump straight to `make test-core` unless logs suggest shared/core fallout. + +Preferred progression: + +```bash +# 1. Inspect the failing shard +mise exec python@3.10 -- python ./py/scripts/nox-matrix.py 0 4 --dry-run + +# 2. Reproduce the exact session +cd py +nox -s "test_google_genai(1.30.0)" + +# 3. Narrow further if needed +nox -s "test_google_genai(1.30.0)" -- -k "interactions" +``` + +## What To Report Back + +When answering a CI-triage question, report: + +1. the exact failing workflow job +2. the exact failing nox session or command +3. the root cause, not just the symptom +4. the smallest local reproduction command +5. the likely fix path + +Good example structure: + +```text +The failing job is `nox (3.10, ubuntu-latest, 0)`. +Within that shard, the failing session is `test_google_genai(1.30.0)`. +The root cause is that the tests import a symbol that does not exist in google-genai 1.30.0, even though it exists in newer versions. +You can reproduce it locally with `cd py && nox -s "test_google_genai(1.30.0)"`. +The fix is to gate the behavior for older versions or stop assuming the newer API exists in the pinned minimum version. 
+``` + +## Pitfalls + +Avoid these common mistakes: + +- guessing the session from the provider name without checking `py/noxfile.py` +- reproducing with `latest` when CI failed on an older pinned version +- running from repo root when the real SDK command belongs in `py/` +- fixing the symptom in tests without understanding the provider-version contract +- re-recording cassettes when the behavior change was not intentional +- broadening to many sessions before the exact failing command is green +- assuming local globally-installed packages match the nox environment diff --git a/.agents/skills/sdk-integrations/SKILL.md b/.agents/skills/sdk-integrations/SKILL.md index 112b0f28..07bdd99e 100644 --- a/.agents/skills/sdk-integrations/SKILL.md +++ b/.agents/skills/sdk-integrations/SKILL.md @@ -5,26 +5,21 @@ description: Create or update Braintrust Python SDK integrations built on the in # SDK Integrations -Use this skill for integrations API work under `py/src/braintrust/integrations/`. +Use this skill for integration work under `py/src/braintrust/integrations/`. -If the provider already has a real implementation under `py/src/braintrust/wrappers/<provider>/` and the task is to move that implementation into the integrations API, switch to `sdk-wrapper-migrations` instead of treating it like a fresh integration. +Use `sdk-wrapper-migrations` instead when the provider already has a real implementation under `py/src/braintrust/wrappers/<provider>/` and the task is to move that implementation into the integrations API. -## Pick The Nearest Example +## Quick Start -Start from one structural reference and one patching reference instead of designing from scratch: +Before editing: -- ADK (`py/src/braintrust/integrations/adk/`) for direct method patching, `target_module`, `CompositeFunctionWrapperPatcher`, manual `wrap_*()` helpers, priority-based context propagation, and input-side `inline_data` to `Attachment` conversion.
-- Agno (`py/src/braintrust/integrations/agno/`) for multi-target patching, version-conditional fallbacks with `superseded_by`, and providers that need several related patchers. -- Anthropic (`py/src/braintrust/integrations/anthropic/`) for constructor patching and a compact provider package with a small public surface. -- Google GenAI (`py/src/braintrust/integrations/google_genai/`) for multimodal serialization, generated media outputs, and output-side `Attachment` handling. +1. Read the shared integration primitives. +2. Read the target provider package. +3. Pick the nearest existing integration as a reference. +4. Decide the span shape before writing patchers. +5. Run the narrowest provider nox session first. -Match an existing repo pattern unless the target provider forces a different shape. - -Choose the example based on the hardest part of the task, not just provider similarity: - -- If the task is mostly about patcher topology, copy the closest patcher layout first. -- If the task is mostly about traced payload shaping, copy the closest tracing implementation first. -- If the task involves generated media or multimodal payloads, start from ADK or Google GenAI before looking at simpler text-only integrations. +Do not design a new integration shape from scratch if an existing provider already matches the problem. 
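As a rough mental model of the declarative setup the steps above lead to — real integrations subclass the repo's `BaseIntegration`; the dataclass and `applies()` logic below are hypothetical stand-ins, not the actual API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IntegrationSketch:
    """Hypothetical stand-in for the repo's BaseIntegration: declarative fields only."""
    name: str
    import_names: list
    patchers: list  # stable patcher names here; real code registers patcher objects
    min_version: Optional[str] = None

    def applies(self, installed: dict) -> bool:
        # Feature detection first: is the provider module even present?
        if not any(mod in installed for mod in self.import_names):
            return False
        # Version gating second, and only when declared.
        if self.min_version is not None:
            version = installed[self.import_names[0]]
            return tuple(map(int, version.split("."))) >= tuple(map(int, self.min_version.split(".")))
        return True

google_genai = IntegrationSketch(
    name="google_genai",
    import_names=["google.genai"],
    patchers=["generate_content", "generate_content_stream"],
    min_version="1.30.0",
)
print(google_genai.applies({"google.genai": "1.31.0"}))  # True
```

The real primitives (`detect_module_version`, `version_satisfies`, `make_specifier`) replace the naive tuple comparison used here.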
## Read First @@ -35,7 +30,7 @@ Always read: - `py/src/braintrust/integrations/__init__.py` - `py/noxfile.py` -Read when updating an existing integration: +Read these when working on an existing integration: - `py/src/braintrust/integrations/<provider>/__init__.py` - `py/src/braintrust/integrations/<provider>/integration.py` @@ -43,38 +38,57 @@ Read when updating an existing integration: - `py/src/braintrust/integrations/<provider>/tracing.py` - `py/src/braintrust/integrations/<provider>/test_*.py` -Read when relevant: +Read these when relevant: -- `py/src/braintrust/auto.py` for `auto_instrument()` work +- `py/src/braintrust/auto.py` for `auto_instrument()` changes - `py/src/braintrust/conftest.py` for VCR behavior -- `py/src/braintrust/integrations/auto_test_scripts/` for subprocess auto-instrument tests +- `py/src/braintrust/integrations/auto_test_scripts/` for subprocess auto-instrument coverage - `py/src/braintrust/integrations/adk/test_adk.py` and `py/src/braintrust/integrations/anthropic/test_anthropic.py` for test layout patterns -- `py/src/braintrust/integrations/adk/tracing.py` and `py/src/braintrust/integrations/google_genai/tracing.py` when the provider accepts binary inputs, emits generated files, or otherwise needs `Attachment` objects in traced input/output +- `py/src/braintrust/integrations/adk/tracing.py` and `py/src/braintrust/integrations/google_genai/tracing.py` when handling multimodal content, binary inputs, or generated media +
+ +## Pick A Reference + +Start from the nearest current integration: -## Working Sequence +- ADK: direct method patching, `target_module`, `CompositeFunctionWrapperPatcher`, manual `wrap_*()` helpers, context propagation, inline data to `Attachment` +- Agno: multi-target patching, several related patchers, version-conditional fallbacks with `superseded_by` +- Anthropic: compact constructor patching and a small public surface +- Google GenAI: multimodal tracing, generated media, output-side `Attachment` handling -Use this order unless the task is obviously narrower: +Choose the reference based on the hardest part of the task: -1. Read the nearest provider package and the shared integration primitives. -2. Decide which public surface is being patched: constructor, top-level function, client method, stream method, or manual `wrap_*()` helper. -3. Decide what the span should look like before writing patchers: - - what belongs in `input` - - what belongs in `output` - - what belongs in `metadata` - - what belongs in `metrics` +- patcher topology +- tracing shape +- streaming behavior +- multimodal or binary payload handling + +## Default Workflow + +Use this order unless the task is clearly narrower: + +1. Read shared primitives and the provider package. +2. Decide which public surface is being patched. +3. Define the span shape: + - `input` + - `output` + - `metadata` + - `metrics` + - `error` when failures matter 4. Implement or update patchers. 5. Implement or update tracing helpers. -6. Add or update focused tests in the provider package. +6. Add or update focused tests. 7. Run the narrowest nox session first, then expand only if shared code changed. -Do not start by wiring patchers and only later asking what the logged span should contain. The traced shape should drive the tracing helper design from the start. +Do not start by wiring wrappers and only later decide what the span should contain. ## Route The Task ### New provider integration 1. 
Create `py/src/braintrust/integrations/<provider>/`. -2. Add the normal split unless the provider is exceptionally small: +2. Use this layout unless the provider is exceptionally small: - `__init__.py` - `integration.py` - `patchers.py` @@ -91,73 +105,77 @@ Do not start by wiring patchers and only later asking what the logged span shoul 1. Read the current provider package before editing. 2. Change only the affected patchers, tracing helpers, exports, tests, and cassettes. 3. Preserve the provider's public setup and `wrap_*()` surface unless the task explicitly changes it. -4. Keep repo-level changes narrow; do not touch `auto.py`, `integrations/__init__.py`, or `py/noxfile.py` unless the task actually requires it. -5. Preserve existing span shape conventions unless the task is intentionally improving or correcting them. +4. Do not touch `auto.py`, `integrations/__init__.py`, or `py/noxfile.py` unless the task requires it. +5. Even if `auto.py` does not change, check whether the behavior change also needs an auto-instrument subprocess test update. +6. Preserve existing span shape conventions unless the task is intentionally correcting them. ### `auto_instrument()` only 1. Update `py/src/braintrust/auto.py`. -2. Use `_instrument_integration(...)` instead of adding a custom `_instrument_*` helper when the integration fits the standard pattern. +2. Prefer `_instrument_integration(...)` over a custom `_instrument_*` helper when the standard pattern fits. 3. Add the integration import near the other integration imports. 4. Add or update the relevant subprocess auto-instrument test. -## Package Layout +## Package Layout Rules -Keep provider-local code inside `py/src/braintrust/integrations/<provider>/`. +Keep provider-specific behavior in `py/src/braintrust/integrations/<provider>/`. -If tracing or normalization logic is genuinely shared across multiple integrations, prefer adding it to `py/src/braintrust/integrations/utils.py` instead of copying it into each provider package.
Avoid duplicating code between integrations unless there is a clear provider-specific reason the behavior must diverge. +Typical ownership: -Typical file ownership: +- `__init__.py`: public exports, `setup_<provider>()`, public `wrap_*()` helpers +- `integration.py`: `BaseIntegration` subclass and patcher registration +- `patchers.py`: patchers and manual `wrap_*()` helpers +- `tracing.py`: request/response normalization, metadata extraction, stream handling, error logging +- `test_*.py`: provider behavior tests +- `cassettes/`: VCR recordings for provider HTTP traffic -- `__init__.py`: export the integration class, `setup_<provider>()`, and public `wrap_*()` helpers -- `integration.py`: define the `BaseIntegration` subclass and register patchers -- `patchers.py`: define patchers and manual `wrap_*()` helpers -- `tracing.py`: keep provider-specific tracing, stream handling, normalization, and metadata extraction; move cross-integration helpers to `py/src/braintrust/integrations/utils.py` -- `test_*.py`: keep provider behavior tests next to the integration -- `cassettes/`: keep VCR recordings next to the integration tests when the provider uses HTTP +Keep `integration.py` thin. -Keep `integration.py` thin. Put provider behavior in provider-local modules, not in shared integration primitives, unless the shared abstraction is genuinely missing. +If logic is genuinely shared across integrations, move it to `py/src/braintrust/integrations/utils.py` instead of copying it into multiple providers. ## Integration Rules -Set the integration class up declaratively: +Set up the integration declaratively: - set `name` - set `import_names` - set `patchers` - set `min_version` or `max_version` only when feature detection is not enough -Keep span creation, metadata extraction, stream aggregation, error logging, and output normalization in `tracing.py`. +Prefer feature detection first and version checks second.
Use: + +- `detect_module_version(...)` +- `version_satisfies(...)` +- `make_specifier(...)` + +Let `BaseIntegration.resolve_patchers()` reject duplicate patcher ids. Do not hide duplicates. -Preserve provider behavior. Do not let tracing-only code change provider return values, control flow, or error behavior except where the task explicitly requires it. +Preserve provider behavior. Tracing code must not change return values, control flow, or error behavior unless the task explicitly requires it. -Generate structured spans. Do not pass raw `args` and `kwargs` straight into traced spans unless the provider API already exposes the exact stable schema you want to log. Instead: +Keep sync and async traced schemas aligned when the provider exposes both. -- Build a provider-shaped `input` object that names the important request fields explicitly, for example model, messages/contents, prompt, config, tools, or options. -- Build an `output` object that captures the useful response payload in normalized form instead of logging opaque SDK objects. -- Put secondary facts in `metadata`, such as provider ids, finish reasons, model versions, safety attributes, or normalized request/response annotations that are useful but not the primary payload. -- Put timings and token/accounting values in `metrics`, such as `start`, `end`, `duration`, `time_to_first_token`, `prompt_tokens`, `completion_tokens`, and `tokens`. -- Drop noisy transport-level or duplicate fields rather than mirroring the full raw call surface. -- Add small provider-local helpers in `tracing.py` to extract `input`, `output`, `metadata`, and `metrics` before opening or closing spans. +## Span Design Rules -Aim for spans that are readable in the UI without requiring someone to reverse-engineer the provider SDK's calling convention from positional arguments. +Build readable spans. Do not dump raw `args` and `kwargs` unless the provider API already exposes a clean schema. 
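The input/output/metadata/metrics split can be sketched concretely. The provider response fields below are made up for illustration; only the four-way split mirrors this skill's conventions:

```python
import time
from typing import Any

def shape_span(kwargs: dict, response: dict, start: float) -> dict:
    """Split one hypothetical chat call into input/output/metadata/metrics."""
    end = time.time()
    return {
        # What the caller intentionally supplied.
        "input": {"model": kwargs.get("model"), "messages": kwargs.get("messages")},
        # The meaningful result, not the raw SDK object.
        "output": response.get("text"),
        # Supporting context that helps interpretation.
        "metadata": {"finish_reason": response.get("finish_reason"), "request_id": response.get("id")},
        # Numeric accounting and timings.
        "metrics": {
            "start": start,
            "end": end,
            "prompt_tokens": response.get("usage", {}).get("prompt_tokens"),
            "completion_tokens": response.get("usage", {}).get("completion_tokens"),
        },
    }

span = shape_span(
    {"model": "some-model", "messages": [{"role": "user", "content": "hi"}]},
    {"text": "hello", "finish_reason": "stop", "id": "req_1",
     "usage": {"prompt_tokens": 3, "completion_tokens": 2}},
    start=time.time(),
)
print(span["output"], span["metadata"]["finish_reason"])
```

Transport-level noise and duplicated request/response fields are simply dropped rather than mirrored.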
-Shape spans by semantics, not by the provider SDK object model: +Use this rubric: -- `input` is the meaningful request a human would describe, not the raw Python call signature. -- `output` is the meaningful result, not a provider response class dumped wholesale. -- `metadata` is for supporting context that helps interpretation but is not the main payload. -- `metrics` is for timings, token counts, and similar numeric accounting. +- `input`: the meaningful user request +- `output`: the meaningful provider result +- `metadata`: supporting context such as ids, finish reasons, safety data, model revisions +- `metrics`: timing and numeric accounting such as token counts or elapsed time +- `error`: exceptions or failure information Good span shaping usually means: -- flattening positional arguments into named fields -- omitting duplicate values that appear in both request and response objects -- normalizing provider-specific classes into dicts/lists/scalars -- aggregating streaming chunks into one final `output` plus stream-specific `metrics` -- preserving useful provider identifiers without leaking transport noise +- flatten positional arguments into named fields +- normalize provider SDK objects into dicts, lists, or scalars +- drop duplicate or noisy transport fields +- aggregate streaming chunks into one final `output` plus stream-specific `metrics` -Use provider-local helper functions instead of building spans inline inside wrappers. A good pattern is: +Keep wrapper bodies thin: prepare traced input, open the span, call the provider, normalize the result, and log `output`/`metadata`/`metrics`. + +Prefer provider-local helpers in `tracing.py`, for example: ```python def _prepare_traced_call(args: list[Any], kwargs: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]: @@ -168,129 +186,84 @@ def _process_result(result: Any, start: float) -> tuple[dict[str, Any], dict[str ... 
``` -Keep wrapper bodies thin: prepare input, open the span, call the provider, normalize the result, then log `output`, `metadata`, and `metrics`. - -When deciding where a field belongs: - -- Put it in `input` if the caller intentionally supplied it. -- Put it in `output` if it is the core result a user would care about. -- Put it in `metadata` if it explains the result but is not the result itself. -- Put it in `metrics` if it is numeric operational accounting or timing. -- Put it in `error` if the call failed and you want the span to record the exception or failure message instead of pretending the failure is ordinary output. - -Distinguish span payload fields from span setup fields: - -- Treat `input`, `output`, `metadata`, `metrics`, and `error` as the main logged payload fields. -- Treat `name` plus `type` or `span_attributes` as span identity/classification, not as payload. -- Use `parent` only when you need to attach the span to an explicit exported parent instead of relying on current-span context. -- Use `start_time` when the true start happened before the wrapper got control and you need accurate duration or time-to-first-token accounting. +Treat binary payloads as attachments, not logged bytes: -Examples: - -- prompt/model/tools/config belong in `input` -- generated text, tool calls, embeddings summary, generated images summary, or normalized message content belong in `output` -- provider request ids, finish reasons, safety annotations, cached-hit indicators, or model revision identifiers belong in `metadata` -- token counts, elapsed time, time-to-first-token, retry counts, or billable character counts belong in `metrics` -- exceptions, provider errors, and wrapper failures belong in `error` - -Across the current integrations, `input`/`output`/`metadata`/`metrics` are the common structured logging fields, and `error` is the main additional event field used during failures. 
Other values should usually live inside one of those containers unless they are truly span-level controls like `name`, `type`, `span_attributes`, `parent`, or `start_time`. - -Treat provider-owned binary payloads as attachments, not raw logged bytes. When traced input or output contains inline media, generated files, or other uploadable content: - -- Convert raw `bytes` into `braintrust.logger.Attachment` objects in provider-local tracing helpers instead of logging raw bytes or large base64 blobs. -- Use the repo's existing message/content shapes when embedding attachments in traced payloads. For multimodal content this is often `{"image_url": {"url": attachment}}`, even when the MIME type is not literally an image. -- Preserve ordinary remote URLs as strings. Only convert provider-owned binary content or data-URL style payloads that Braintrust should upload to object storage. -- Keep structured metadata alongside the attachment, such as MIME type, size, safety attributes, or provider ids, so spans stay inspectable without reading the blob. - -Prefer feature detection first and version checks second. Use: - -- `detect_module_version(...)` -- `version_satisfies(...)` -- `make_specifier(...)` - -Let `BaseIntegration.resolve_patchers()` reject duplicate patcher ids; do not silently paper over duplicates. - -If a provider surface has both sync and async variants, try to keep the traced schema aligned across both paths. Differences in implementation are fine; differences in logged shape should be intentional. +- convert raw `bytes` to `braintrust.logger.Attachment` +- preserve normal remote URLs as strings +- keep useful metadata such as MIME type, size, or provider ids next to the attachment +- follow existing repo content shapes for multimodal payloads ## Patcher Rules -Create one patcher per coherent patch target. Split unrelated targets into separate patchers. 
- -Use `FunctionWrapperPatcher` for one import path or one constructor/method surface, for example: +Create one patcher per coherent patch target. -- `ProviderClient.__init__` -- `client.responses.create` +Prefer: -Use `CompositeFunctionWrapperPatcher` when several closely related targets should appear as one patcher, for example: +- `FunctionWrapperPatcher` for one import path or one constructor/method surface +- `CompositeFunctionWrapperPatcher` for one logical surface spread across multiple related targets +- `CallbackPatcher` for setup side effects after applicability succeeds -- sync and async variants of the same method -- the same logical surface patched across multiple modules +Use `target_module` when the patch target lives outside the module named by `import_names`, especially for optional or deep submodules. -Use `CallbackPatcher` when the integration only needs a setup side effect after applicability succeeds, not in-place wrapping or class replacement. Good fits include registering a global callback handler, invoking a provider-owned configure hook, or installing default integration state. Gate it with `target_module` when the side effect depends on an optional module, and use `state_getter` when idempotence should be derived from integration-managed state instead of a marker on the imported root. +Use `superseded_by` for version-conditional fallbacks instead of custom target-selection logic. -Set `target_module` when the patch target lives outside the module named by `import_names`, especially for optional or deep submodules. Failed `target_module` imports should make the patcher skip cleanly through `applies()`. +Use lower `priority` only when patch ordering really matters, such as context propagation before tracing. -Use `superseded_by` for version-conditional mutual exclusion. Express fallback relationships declaratively instead of reproducing `hasattr` logic in custom `applies()` methods whenever possible. 
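Idempotence through a marker can be sketched as follows; the marker attribute name is made up, and the repo's base patcher has its own mechanism:

```python
from functools import wraps
from typing import Any, Callable

_MARKER = "_sketch_already_patched"  # hypothetical marker name

def wrap_once(fn: Callable) -> Callable:
    """Wrap a callable for tracing, but never wrap the same callable twice."""
    if getattr(fn, _MARKER, False):
        return fn  # already wrapped: patching again is a no-op
    @wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        # Tracing would happen around this call; behavior is preserved.
        return fn(*args, **kwargs)
    setattr(wrapper, _MARKER, True)
    return wrapper

def create() -> str:
    return "ok"

once = wrap_once(create)
twice = wrap_once(once)
print(once is twice, once())
```

The same marker makes repeated `setup()` or `auto_instrument()` calls safe, which is exactly what the subprocess idempotence tests exercise.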
- -Expose manual wrapping helpers through `wrap_target()`: +Manual wrapping helpers should be thin: ```python def wrap_agent(Agent: Any) -> Any: return AgentPatcher.wrap_target(Agent) ``` -Use lower `priority` values only when ordering matters, such as context propagation before tracing patchers. - -Require every patcher to provide: +Require every patcher to have: - a stable `name` -- version gating only when needed - clean existence checks +- version gating only when necessary - idempotence through the base patcher marker -## Testing +## Testing Rules -Keep integration tests in the provider package. +Keep tests in the provider package. -Use `@pytest.mark.vcr` for real provider network behavior. Prefer recorded provider traffic over mocks or fakes. Use mocks or fakes only for cases that are hard to drive through recordings, such as: +Default bug-fix workflow: red -> green. + +- First add or update a focused test that reproduces the integration bug. +- Then implement the fix. +- Only skip this when the task explicitly asks for a different approach. + +Prefer VCR-backed real provider coverage with `@pytest.mark.vcr`. Use mocks or fakes only for cases that are hard to drive through recordings, such as: - narrow error injection - local version-routing logic - patcher existence checks -Write tests against the emitted span shape, not just the provider return value. A tracing change is incomplete if the provider call still works but the logged span becomes noisy, incomplete, or inconsistent. +Test emitted spans, not just provider return values. 
Cover the surfaces that changed: - direct `wrap_*()` behavior -- `setup()` patching for newly created clients or classes +- setup-time patching - sync behavior - async behavior - streaming behavior - idempotence - failure and error logging -- patcher resolution and duplicate detection -- attachment conversion for binary inputs or generated media, including assertions that traced payloads contain `Attachment` objects rather than raw bytes -- span structure, including assertions that the traced span exposes meaningful `input`, `output`, `metadata`, and `metrics` rather than opaque raw call arguments - -For span assertions, prefer checking the specific normalized fields that matter: +- patcher resolution and duplicate detection when relevant +- attachment conversion for binary inputs or generated media +- span structure, especially `input`, `output`, `metadata`, and `metrics` -- the `input` contains the expected model/messages/prompt/config fields -- the `output` contains normalized provider results rather than opaque SDK instances -- the `metadata` contains finish reasons, ids, or annotations in the expected place -- the `metrics` contain the expected timing or token fields when the provider returns them -- binary payloads are represented as `Attachment` objects where applicable +For streaming changes, verify both: -If a change affects streaming, verify both: - -- intermediate behavior still returns the provider's expected iterator or async iterator +- the provider still returns the expected iterator or async iterator - the final logged span contains the aggregated `output` and stream-specific `metrics` -Keep VCR cassettes in `py/src/braintrust/integrations/<provider>/cassettes/`. Re-record only when the behavior change is intentional. +Keep VCR cassettes in `py/src/braintrust/integrations/<provider>/cassettes/`. Re-record only when behavior intentionally changes.
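The two streaming checks above can be exercised against a plain generator. The dict-based `span` here is a stand-in for a real span object:

```python
def traced_stream(chunks, span):
    """Yield chunks unchanged while aggregating output for the final span log."""
    pieces = []
    try:
        for chunk in chunks:
            pieces.append(chunk)
            yield chunk  # the caller still sees the original stream, chunk by chunk
    finally:
        span["output"] = "".join(pieces)
        span["metrics"] = {"stream_chunks": len(pieces)}


span = {}
out = list(traced_stream(iter(["Hel", "lo"]), span))
assert out == ["Hel", "lo"]                   # intermediate behavior preserved
assert span["output"] == "Hello"              # aggregated output logged at the end
assert span["metrics"]["stream_chunks"] == 2  # stream-specific metrics recorded
```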
-When the provider returns binary HTTP responses or generated media, make cassette sanitization part of the change if needed so recorded fixtures do not store raw file bytes. +When the provider returns binary HTTP responses or generated media, sanitize cassettes as needed so fixtures do not store raw file bytes. -When choosing commands, confirm the real session name in `py/noxfile.py` instead of assuming it matches the provider folder. Examples in this repo include `test_agno`, `test_anthropic`, and `test_google_adk`. +When choosing test commands, confirm the actual session name in `py/noxfile.py` instead of assuming it matches the provider folder. ## Commands @@ -302,20 +275,24 @@ cd py && make test-core cd py && make lint ``` -## Validation +## Validation Checklist - Run the narrowest provider session first. +- If the change touches patchers, setup behavior, import timing, or anything that could affect `auto_instrument()`, run the relevant subprocess auto-instrument test from `py/src/braintrust/integrations/auto_test_scripts/`. - Run the relevant auto-instrument subprocess test if `auto.py` changed. - Run `cd py && make test-core` if shared integration code changed. - Run `cd py && make lint` before handoff when shared files or repo-level wiring changed. -## Pitfalls +## Common Mistakes + +Avoid these failures: -- Treating a wrapper migration as fresh integration work. -- Changing shared integration primitives when the provider-specific package should own the behavior. -- Combining unrelated targets into one patcher. -- Forgetting repo-level touch points for new providers: `integrations/__init__.py`, `py/noxfile.py`, and sometimes `auto.py`. -- Forgetting async or streaming coverage. -- Re-recording cassettes when behavior did not intentionally change. -- Adding a custom `_instrument_*` helper where `_instrument_integration()` already fits. -- Forgetting `target_module` for deep or optional submodule patch targets. 
+- treating a wrapper migration as fresh integration work +- changing shared integration primitives when provider-local code should own the behavior +- combining unrelated patch targets into one patcher +- forgetting repo-level wiring for new providers: `integrations/__init__.py`, `py/noxfile.py`, and sometimes `auto.py` +- forgetting the subprocess auto-instrument tests +- forgetting async or streaming coverage +- re-recording cassettes when behavior did not intentionally change +- adding a custom `_instrument_*` helper where `_instrument_integration()` already fits +- forgetting `target_module` for deep or optional patch targets diff --git a/.agents/skills/sdk-vcr-workflows/SKILL.md b/.agents/skills/sdk-vcr-workflows/SKILL.md new file mode 100644 index 00000000..cec9a01e --- /dev/null +++ b/.agents/skills/sdk-vcr-workflows/SKILL.md @@ -0,0 +1,267 @@ +--- +name: sdk-vcr-workflows +description: Work with Braintrust Python SDK VCR and cassette-backed tests. Use when adding or updating cassette-backed provider tests, deciding whether to re-record cassettes, debugging VCR failures, converting mock-heavy coverage to VCR, or handling cassette hygiene for provider integrations and Claude Agent SDK transport recordings. +--- + +# SDK VCR Workflows + +Use this skill for VCR and cassette work in the Braintrust Python SDK repo. + +This repo prefers real recorded integration coverage over mocks for provider behavior. Do not treat cassette work as an afterthought. The normal bug-fix path here is: reproduce with a failing cassette-backed test, implement the fix, and only re-record when the behavior change is intentional. 
+ +## Read First + +Always read: + +- `AGENTS.md` +- `py/noxfile.py` +- `py/src/braintrust/conftest.py` +- the target provider test file under `py/src/braintrust/integrations/<provider>/` or `py/src/braintrust/wrappers/` + +Read when relevant: + +- the provider integration package under `py/src/braintrust/integrations/<provider>/` +- the provider cassette directory under `py/src/braintrust/integrations/<provider>/cassettes/` +- `py/src/braintrust/cassettes/` +- `py/src/braintrust/wrappers/cassettes/` +- `py/src/braintrust/devserver/cassettes/` +- `py/src/braintrust/wrappers/claude_agent_sdk/` for subprocess transport recordings instead of HTTP VCR +- `py/src/braintrust/integrations/auto_test_scripts/` when auto-instrument subprocess coverage and cassettes interact + +Do not guess cassette locations or session names. Check the current tree and `py/noxfile.py` first. + +## Repo Rules To Follow + +These rules must stay aligned with `AGENTS.md`: + +- Work from `py/` for SDK tasks. +- Use `mise` as the source of truth for tools and environment. +- Do not guess nox session names or provider/version coverage. +- Default bug-fix workflow is red -> green. +- Prefer VCR-backed provider tests over mocks or fakes whenever practical. +- Only re-record HTTP or subprocess cassettes when the behavior change is intentional. If unsure, ask the user. +- Do not assume optional provider packages are installed outside the active nox session. + +## How VCR Works In This Repo + +Behavior from `py/src/braintrust/conftest.py` is the source of truth. + +Current defaults: + +- local default: `record_mode="once"` +- CI default: `record_mode="none"` +- wheel mode skips VCR-marked tests +- fixtures inject dummy API keys and reset global state + +Implications: + +- A test that passes locally by silently recording new traffic may still fail in CI if the cassette is missing or stale. +- CI will not save you by recording fresh traffic. If the cassette is wrong, CI should fail.
+- Reproducing a CI VCR failure locally usually means running the exact nox session named in `py/noxfile.py`, not raw pytest in whatever environment happens to exist. + +## Standard Workflow + +1. Identify the exact provider test or cassette-backed failure. +2. Read `py/noxfile.py` and reproduce with the exact provider session and version. +3. If fixing a bug, add or update a focused failing cassette-backed test first. +4. Decide whether the intended behavior should reuse an existing cassette or requires a new or updated one. +5. Re-record only the narrowest affected test when behavior intentionally changes. +6. Re-run the exact provider session after recording. +7. Expand only if shared integration code changed. + +Do not re-record a broad provider suite when one focused test is enough. + +## Choosing The Right Coverage Style + +### Prefer cassette-backed tests when: + +- validating real provider request or response shape +- testing tracing/span contents derived from actual provider payloads +- checking streaming aggregation against real wire behavior +- confirming version-specific provider behavior that mocks could misrepresent +- replacing older fake- or mock-heavy provider tests + +### Use mocks/fakes only when: + +- injecting a narrow error path that is hard to provoke via recordings +- testing purely local patcher resolution or applicability logic +- validating behavior that does not need real provider traffic +- the code under test is entirely internal and a cassette would add little value + +If the task is about provider behavior and you can reasonably record it, prefer VCR. 
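For the "narrow error injection" case, a stdlib `unittest.mock` sketch shows the shape. The client class and helper below are illustrative, not repo code:

```python
from unittest import mock


class Client:
    """Stand-in for a provider client; real coverage would replay a cassette."""

    def create(self, prompt):
        return f"ok:{prompt}"


def traced_create(client, prompt, span):
    """Call the provider and record failures on the span before re-raising."""
    try:
        result = client.create(prompt)
        span["output"] = result
        return result
    except Exception as exc:
        span["error"] = repr(exc)
        raise


client = Client()
failed_span = {}
with mock.patch.object(client, "create", side_effect=TimeoutError("slow")):
    try:
        traced_create(client, "hi", failed_span)
    except TimeoutError:
        pass
assert "TimeoutError" in failed_span["error"]  # error path logged without real traffic

ok_span = {}
assert traced_create(client, "hi", ok_span) == "ok:hi"  # normal path restored after the patch
```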
+ +## Cassette Locations + +Common locations in this repo: + +- `py/src/braintrust/cassettes/` +- `py/src/braintrust/wrappers/cassettes/` +- `py/src/braintrust/devserver/cassettes/` +- `py/src/braintrust/integrations/<provider>/cassettes/` +- `py/src/braintrust/wrappers/claude_agent_sdk/cassettes/` for Claude Agent SDK subprocess transport recordings + +Keep cassettes next to the tests they support. When migrating or moving tests, move the cassettes with them. + +## Commands + +Run from `py/` unless the task is clearly repo-level. + +Common provider VCR commands: + +```bash +cd py +nox -s "test_openai(latest)" +nox -s "test_openai(latest)" -- -k "test_openai_chat_metrics" +nox -s "test_openai(latest)" -- --disable-vcr +nox -s "test_openai(latest)" -- --vcr-record=all -k "test_openai_chat_metrics" +``` + +Generic reproduction shape: + +```bash +cd py +nox -s "test_<provider>(<version>)" +nox -s "test_<provider>(<version>)" -- -k "test_name" +nox -s "test_<provider>(<version>)" -- --vcr-record=all -k "test_name" +``` + +Claude Agent SDK transport-recording commands: + +```bash +cd py +nox -s "test_claude_agent_sdk(latest)" +BRAINTRUST_CLAUDE_AGENT_SDK_RECORD_MODE=all nox -s "test_claude_agent_sdk(latest)" +BRAINTRUST_CLAUDE_AGENT_SDK_RECORD_MODE=all nox -s "test_claude_agent_sdk(latest)" -- -k "test_calculator_with_multiple_operations" +``` + +## Re-recording Rules + +Re-record only when the behavior change is intentional.
+ +Good reasons to re-record: + +- the traced request or response shape intentionally changed +- the provider behavior you are validating intentionally changed +- the old cassette covered the wrong scenario and the test now targets the correct one +- a new focused cassette-backed regression test is being added + +Bad reasons to re-record: + +- masking an unexplained failure +- making a flaky test pass without understanding the difference +- updating many cassettes when only one focused scenario changed +- papering over version mismatches or incorrect assumptions about old provider releases + +If a cassette diff is large, verify that the behavioral change is truly intended before keeping it. + +## Narrow Recording Strategy + +When you need new traffic, record the smallest thing possible: + +1. use the exact session from `py/noxfile.py` +2. target a single test with `-k` when practical +3. record only the affected scenario +4. inspect the cassette diff before moving on + +Preferred shape: + +```bash +cd py +nox -s "test_google_genai(latest)" -- --vcr-record=all -k "test_interactions_create_and_get" +``` + +Avoid broad commands like re-recording an entire provider suite unless the change genuinely affects the whole suite. + +## Debugging VCR Failures + +When a cassette-backed test fails, check these in order: + +1. **Wrong nox session or provider version** + - Did you reproduce under the exact session from `py/noxfile.py`? + - Are you accidentally testing `latest` when CI pins an older version? + +2. **Cassette missing or stale** + - Does the expected cassette file exist in the right directory? + - Did the test move without moving the cassette? + - Does the cassette still match the intended request shape? + +3. **Behavior changed but cassette did not** + - Did the code intentionally change the payload, headers, streaming sequence, or traced output? + - If yes, re-record narrowly. + +4. 
**Behavior did not change, but the test became too strict** + - Prefer fixing brittle assertions before churning cassettes. + - Assert the meaningful span structure, not incidental provider noise. + +5. **Local-only pass, CI fail** + - Remember CI uses `record_mode="none"`. + - A local silent record can hide a missing cassette or wrong request shape. + +6. **Optional dependency or auth confusion** + - Do not rely on globally installed provider packages. + - Use the active nox session. + - Read `py/src/braintrust/conftest.py` before assuming credential behavior. + +## Cassette Hygiene + +Keep cassettes reviewable and intentional. + +Prefer: + +- one focused cassette per scenario or regression when practical +- reusing existing cassette patterns in the same provider package +- keeping fixture names stable and descriptive +- moving cassettes with tests when code moves from wrappers to integrations + +Be careful about: + +- duplicate old cassettes left behind after moves or renames +- storing unnecessary raw binary content when sanitization is possible +- overfitting tests to incidental details in recordings +- broad cassette churn from unfocused recording commands + +When the provider returns binary HTTP responses or generated media, sanitize the recordings as needed so the repo does not keep unnecessary raw file bytes. + +## Claude Agent SDK Exception + +Claude Agent SDK coverage is cassette-backed, but not through HTTP VCR. + +Important differences: + +- it talks to the bundled `claude` subprocess over stdin/stdout +- it uses transport-level cassette helpers instead of HTTP request recording +- use `BRAINTRUST_CLAUDE_AGENT_SDK_RECORD_MODE=all` when re-recording + +Do not try to force ordinary HTTP VCR patterns onto Claude Agent SDK subprocess tests. + +## Relationship To Other Skills + +- Use `sdk-integrations` when the main task is integration implementation, patchers, tracing, or provider package structure. 
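One way to keep span assertions meaningful rather than brittle; the span fields below are illustrative, not a guaranteed schema:

```python
# A traced span with illustrative fields; real field names come from the provider.
span = {
    "input": {"model": "gpt-test", "messages": [{"role": "user", "content": "hi"}]},
    "output": [{"role": "assistant", "content": "hello"}],
    "metadata": {"finish_reason": "stop", "request_id": "req_123"},
    "metrics": {"tokens": 7, "time_to_first_token": 0.01},
}

# Brittle: exact equality breaks whenever incidental fields like request ids change.
# assert span["metadata"] == {"finish_reason": "stop", "request_id": "req_123"}

# Robust: assert the meaningful structure and ignore provider noise.
assert span["input"]["model"] == "gpt-test"
assert span["output"][0]["content"] == "hello"
assert span["metadata"]["finish_reason"] == "stop"
assert {"tokens"} <= set(span["metrics"])
```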
+- Use this skill when the main difficulty is cassette-backed test design, VCR behavior, re-recording decisions, or cassette hygiene. +- Use `sdk-ci-triage` when the entry point is a GitHub Actions failure, even if the fix eventually involves VCR. +- Use `sdk-wrapper-migrations` when moving a legacy wrapper into `py/src/braintrust/integrations/` while preserving tests and cassettes. + +## Validation Checklist + +- Read `py/noxfile.py` before choosing commands. +- Reproduce under the exact provider session and version. +- Use red -> green when fixing a bug. +- Prefer a focused cassette-backed test over mocks/fakes. +- Re-record only when behavior intentionally changed. +- Record the narrowest affected test first. +- Inspect the cassette diff before finishing. +- Re-run the relevant provider session after recording. +- If patching or `auto_instrument()` changed, also check the relevant subprocess auto-instrument coverage. + +## Common Mistakes + +Avoid these failures: + +- running raw pytest in an ad hoc environment instead of the exact nox session +- re-recording against `latest` when CI covers an older pinned version +- letting local `record_mode="once"` hide a missing or stale cassette +- replacing meaningful assertions with cassette churn +- using mocks for provider behavior that should be validated from real recordings +- forgetting that Claude Agent SDK uses subprocess transport recordings, not HTTP VCR +- leaving duplicate stale cassettes behind after moving tests or renaming scenarios +- broad re-records that create unnecessary review noise diff --git a/.agents/skills/sdk-wrapper-migrations/SKILL.md b/.agents/skills/sdk-wrapper-migrations/SKILL.md deleted file mode 100644 index 0aff31da..00000000 --- a/.agents/skills/sdk-wrapper-migrations/SKILL.md +++ /dev/null @@ -1,223 +0,0 @@ ---- -name: sdk-wrapper-migrations -description: Migrate Braintrust Python SDK legacy wrapper implementations to the integrations API. 
Use when moving an existing provider from `py/src/braintrust/wrappers/` into `py/src/braintrust/integrations/` while preserving old import paths, public helpers, tests, cassettes, tracing behavior, auto-instrument hooks, and nox coverage. ---- - -# SDK Wrapper Migrations - -Migrate an existing wrapper-backed provider to the integrations API without breaking the old wrapper import path. - -Prefer this skill only when both of these are true: - -- Find an existing provider implementation under `py/src/braintrust/wrappers/`. -- Need the end state to be an integration package under `py/src/braintrust/integrations/` plus a thin compatibility wrapper. - -Use `sdk-integrations` instead when the task is integration work that does not start from a legacy wrapper. - -Do not reconstruct migrations from old commit history. Start from the current tree and copy the nearest current pattern. - -## Target End State - -Finish with this structure: - -- provider logic in `py/src/braintrust/integrations/<provider>/` -- provider tests in `py/src/braintrust/integrations/<provider>/` -- provider cassettes in `py/src/braintrust/integrations/<provider>/cassettes/` when applicable -- `auto_instrument()` pointing at the integration when the provider participates in auto patching -- the wrapper reduced to compatibility re-exports with the old public surface intact - -Do not leave tracing helpers, patchers, or setup orchestration behind in the wrapper. - -## Current References - -Use the nearest current provider instead of inventing a layout: - -- ADK: use `py/src/braintrust/integrations/adk/` as the main structural reference for package layout, patchers, tracing split, tests, cassettes, auto-test scripts, and thin wrapper delegation. -- Agno: use `py/src/braintrust/integrations/agno/` for multi-method patching, `CompositeFunctionWrapperPatcher`, raw wrapt wrappers in `tracing.py`, and version-conditional fallbacks using `superseded_by`.
-Anthropic: use `py/src/braintrust/integrations/anthropic/` for compact constructor patching and a minimal compatibility wrapper. - -Match one of those patterns unless the provider has a concrete reason to differ. - -## Read First - -Always read: - -- the existing legacy wrapper under `py/src/braintrust/wrappers/<provider>/` or `py/src/braintrust/wrappers/<provider>.py` -- `py/src/braintrust/integrations/base.py` -- `py/src/braintrust/integrations/versioning.py` -- `py/src/braintrust/auto.py` -- `py/noxfile.py` - -Read these current migration examples before editing: - -- `py/src/braintrust/integrations/adk/__init__.py` -- `py/src/braintrust/integrations/adk/integration.py` -- `py/src/braintrust/integrations/adk/patchers.py` -- `py/src/braintrust/integrations/adk/tracing.py` -- `py/src/braintrust/integrations/agno/patchers.py` -- `py/src/braintrust/integrations/agno/tracing.py` -- `py/src/braintrust/integrations/anthropic/__init__.py` -- `py/src/braintrust/integrations/anthropic/integration.py` -- `py/src/braintrust/integrations/anthropic/patchers.py` - -Read when relevant: - -- `py/src/braintrust/conftest.py` for VCR behavior -- `py/src/braintrust/integrations/auto_test_scripts/` for auto-instrument subprocess coverage -- the provider's existing wrapper tests and cassettes - -## Migration Playbook - -1. Inventory the wrapper before moving code. -2. Create the integration package with the public API and layout you intend to keep. -3. Move tracing helpers and wrapper logic into provider-local integration modules. -4. Move tests, cassettes, and auto-instrument subprocess coverage next to the integration. -5. Wire exports, `auto.py`, and `py/noxfile.py` to the new integration location. -6. Collapse the wrapper to compatibility imports and re-exports. -7. Run the narrowest provider test session first, then expand only if shared code changed.
- -## Inventory First - -Before editing, list the wrapper's user-visible and repo-visible surface: - -- setup entry points such as `setup_<provider>()` -- public `wrap_*()` helpers -- deprecated aliases that must still import correctly -- `__all__` -- patch targets and target modules -- sync, async, and streaming code paths -- test files, cassette directories, and auto-test scripts -- any version-routing logic, especially `hasattr`-based fallback behavior - -Do not start moving files until this inventory is explicit. The migration succeeds only if the integration preserves the same behavior and import surface. - -## Package Layout - -Create `py/src/braintrust/integrations/<provider>/` and keep provider-specific behavior there. - -Use this layout unless the provider already has a better current variant: - -- `__init__.py`: export the public API, `setup_<provider>()`, and compatibility aliases -- `integration.py`: define the `BaseIntegration` subclass and register patchers -- `patchers.py`: define patchers and public `wrap_*()` helpers -- `tracing.py`: keep spans, metadata extraction, stream handling, normalization, and helper code -- `test_<provider>.py` or split test files: keep provider behavior tests next to the integration -- `cassettes/`: keep VCR recordings next to the provider tests when the provider uses HTTP - -Keep `integration.py` thin. Do not move provider behavior into shared integration primitives unless the provider truly needs a shared change. - -## Public API Rules - -Preserve the wrapper's public surface exactly unless the task explicitly changes it. - -Keep or migrate: - -- setup function names -- `wrap_*()` helper names -- deprecated aliases -- `__all__` - -Make the integration package the source of truth. Make the wrapper import from the integration package, not the other way around. - -When the legacy wrapper is a single module such as `py/src/braintrust/wrappers/anthropic.py`, reduce that module to compatibility re-exports in place.
When the wrapper is a package directory, reduce its `__init__.py` to compatibility re-exports and delete or stop importing the old implementation modules if they are no longer used. - -## Patching And Tracing Rules - -Move raw tracing behavior into `tracing.py`. - -Keep tracing wrappers as plain wrapt wrapper functions. Do not carry wrapper-era patch-state logic into tracing code: - -- no `is_patched` -- no `mark_patched` -- no `hasattr` branching to choose targets - -Move patch target selection into `patchers.py`. - -Prefer: - -- one `FunctionWrapperPatcher` per coherent target -- `CompositeFunctionWrapperPatcher` only when several related targets should appear as one patcher -- `superseded_by` for version-conditional fallback relationships - -When the legacy wrapper does "wrap `_run` if present, otherwise wrap `run`", convert that to separate patchers instead of reproducing the branching: - -- point the preferred patcher at the higher-priority target directly -- point the fallback patcher at the fallback target -- set `superseded_by` on the fallback patcher - -Use `py/src/braintrust/integrations/agno/patchers.py` as the reference pattern for this conversion. - -Expose manual wrapping through thin public helpers in `patchers.py`, then re-export them from `__init__.py`. - -## Test And Cassette Moves - -Move provider tests with the implementation. Do not strand coverage under `wrappers/`. - -Move or update: - -- provider behavior tests into `py/src/braintrust/integrations/<provider>/` -- cassette directories into `py/src/braintrust/integrations/<provider>/cassettes/` -- auto-instrument subprocess tests into `py/src/braintrust/integrations/auto_test_scripts/` when relevant - -Update imports, cassette paths, and fixtures during the move.
- -Preserve coverage for the changed surfaces: - -- direct `wrap_*()` behavior -- setup-time patching -- sync behavior -- async behavior -- streaming behavior -- idempotence -- failure and logging behavior -- version-routing behavior when applicable - -Re-record cassettes only when behavior intentionally changes. - -## Repo Wiring - -Update only the shared surfaces required by the migration: - -- `py/src/braintrust/integrations/__init__.py` when the provider should be exported there -- `py/src/braintrust/auto.py` when `auto_instrument()` should use the integration -- `py/noxfile.py` so provider sessions point at the migrated integration tests - -Prefer narrow repo-level changes. Do not broaden shared integration code unless the migration cannot work without it. - -## Validation - -Run the smallest relevant session first: - -```bash -cd py && nox -s "test_<provider>(latest)" -cd py && nox -s "test_<provider>(latest)" -- -k "test_name" -cd py && nox -s "test_<provider>(latest)" -- --vcr-record=all -k "test_name" -``` - -Expand only when the migration touches shared code: - -```bash -cd py && make test-core -cd py && make lint -``` - -Also verify: - -- the old wrapper import path still works -- the old `wrap_*()` helpers still work -- deprecated aliases still resolve -- the relevant auto-instrument subprocess tests still pass if `auto.py` changed - -## Migration-Specific Pitfalls - -Avoid these failures: - -- copying wrapper code into `integrations/` without restructuring it around `__init__.py`, `integration.py`, `patchers.py`, and `tracing.py` -- leaving business logic or tracing helpers in the wrapper after the migration -- preserving wrapper-era `hasattr` or patch-state logic in tracing wrappers instead of using patcher primitives -- re-implementing target precedence with custom branching instead of `superseded_by` -- forgetting to move cassettes or auto-test scripts with the tests -- updating tests but forgetting `py/noxfile.py` -- breaking deprecated aliases, `__all__`, or old import paths
-- changing shared integration code more broadly than the provider requires -- re-recording cassettes when behavior did not intentionally change diff --git a/AGENTS.md b/AGENTS.md index ecd80822..6da9e3a1 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -1,56 +1,96 @@ # Braintrust SDK Agent Guide -Guide for contributing to the Braintrust Python SDK repository. +This guide is for development of the Braintrust Python SDK in this repository. +If you need to learn more about Braintrust itself, see the Braintrust docs: https://www.braintrust.dev/docs -## Defaults +Use this file as the default playbook for work in this repository. -- Use `mise` as the source of truth for tools and environment. -- Prefer `py/` commands over root `make` targets when working on the SDK itself. -- Keep changes narrow and run the smallest relevant test session first. -- Do not rely on optional provider packages being installed unless the active nox session installs them. +## Core Rules + +1. **For SDK work, treat `py/` as the primary workspace.** + - Read files under `py/`. + - Run commands from `py/`. + - Prefer `py/` commands over repo-root wrappers unless the task is clearly repo-level. + +2. **Use `mise` as the source of truth for tools and environment.** + +3. **Do not guess test commands or version coverage.** + - `py/noxfile.py` is the source of truth for nox session names, provider/version matrices, and CI coverage. + - For provider and integration work, also check `py/src/braintrust/integrations/versioning.py`. + +4. **Keep changes narrow and validate with the smallest relevant test first.** + +5. **Default bug-fix workflow: red -> green.** + - First add or update a test that reproduces the issue. + - Then implement the fix. + - Only skip this if the user explicitly asks for a different approach. + +6. **Prefer real integration coverage over mocks.** + - For provider/integration behavior, prefer VCR-backed tests with checked-in cassettes. 
+ - Avoid mocks/fakes unless the code is purely local or there is no practical cassette-based option. + +7. **Do not assume optional provider packages are installed.** + - Rely on the active nox session to install what it needs. + +8. **Do not add `from __future__ import annotations` unless absolutely required.** + - It can change runtime annotation behavior in ways that break introspection. + - Prefer quoted forward references or `TYPE_CHECKING` guards. ## Repo Map -- `py/`: main Python package, tests, examples, nox sessions, release build -- `py/benchmarks/`: pyperf performance benchmarks +- `py/`: main Python package, tests, examples, nox sessions, build/release workflow +- `py/src/braintrust/`: SDK source + - top-level package files: core SDK + - `wrappers/`: wrappers + - `integrations/`: integrations API + - `contrib/temporal/`: Temporal support + - `cli/`, `devserver/`: CLI and devserver + - `type_tests/`: static + runtime type tests + - colocated `test_*.py`: local unit/integration tests +- `py/benchmarks/`: pyperf benchmarks - `integrations/`: separate integration packages - `docs/`: supporting docs -Important code areas in `py/src/braintrust/`: - -- core SDK modules: top-level package files -- wrappers/integrations: `wrappers/` -- temporal: `contrib/temporal/` -- CLI/devserver: `cli/`, `devserver/` -- tests: colocated `test_*.py` -- type tests: `type_tests/` - ## Setup -Preferred repo bootstrap: +Repo bootstrap: ```bash mise install make develop ``` -Package-focused setup: +SDK-focused setup: ```bash cd py make install-dev ``` -Install optional provider dependencies only if needed: +Install optional provider dependencies only when needed: ```bash cd py make install-optional ``` -## Commands +## Default Workflow -Preferred SDK workflow: +When working on the SDK, prefer this sequence: + +```bash +cd py +``` + +1. Read the relevant code and tests. +2. Check `noxfile.py` for the exact session(s) that cover the change. +3. 
If fixing behavior, add/update a reproducing test first. +4. Make the smallest possible change. +5. Run the narrowest affected test session first. +6. Expand coverage only as needed. +7. Before handoff, run broader hygiene checks if the change is large enough to justify them. + +Common commands: ```bash cd py @@ -59,9 +99,28 @@ make test-core nox -l ``` -For larger or cross-cutting changes, also run `make pylint` from `py/` before handing work off. +Notes: + +- `cd py && make lint` runs pre-commit hooks and then `pylint`. +- `cd py && make pylint` runs only `pylint`. +- After major changes, run `cd py && make fixup` before handoff. +- The repo-root `Makefile` is a convenience wrapper; `py/Makefile` and `py/noxfile.py` are authoritative for SDK work. -Targeted wrapper/session runs: +## Testing Rules + +### Always check the real CI target + +Do not guess: + +- nox session names +- supported provider versions +- which tests a provider session runs + +Check `py/noxfile.py` and reproduce with the exact local session CI uses. + +### Run the smallest relevant test first + +Examples: ```bash cd py @@ -69,47 +128,55 @@ nox -s "test_openai(latest)" nox -s "test_openai(latest)" -- -k "test_chat_metrics" ``` -Root `Makefile` exists as a convenience wrapper. The authoritative SDK workflow is in `py/Makefile` and `py/noxfile.py`. - -## Tests +### Provider and integration changes -`py/noxfile.py` is the source of truth for compatibility coverage. +Version-specific behavior matters in this repo. -Testing preferences: +Before changing provider/integration behavior: -- Prefer VCR-backed integration tests with checked-in cassettes whenever practical. -- Avoid mocks, fakes, and heavily synthetic tests unless there is no reasonable cassette-based alternative or the code under test is truly internal/purely local. 
-- When fixing a bug or issue, default to a red/green workflow: first add or update a test that reproduces the problem and fails, then implement the fix, unless the user explicitly asks for a different approach. +1. Read the relevant session(s) in `py/noxfile.py`. +2. Read `py/src/braintrust/integrations/versioning.py`. +3. Confirm which versions, gates, fallbacks, and feature checks must keep working. +4. Do not stop at `latest` if the matrix includes older versions or version-specific branches. -Key facts: +### Key test facts - `test_core` runs without optional vendor packages. -- `test_types` runs pyright, mypy, and pytest on `py/src/braintrust/type_tests/`. Use this session when changing generic type signatures in the framework. +- `test_types` runs pyright, mypy, and pytest on `py/src/braintrust/type_tests/`. - wrapper coverage is split across dedicated nox sessions by provider/version. -- `pylint` installs the broad dependency surface before checking files. -- `cd py && make pylint` runs only `pylint`; `cd py && make lint` runs pre-commit hooks first and then `pylint`. - `test-wheel` is a wheel sanity check and requires a built wheel first. -When changing behavior, run the narrowest affected session first, then expand only if needed. - ## Type Tests -`py/src/braintrust/type_tests/` contains tests that are validated by both static type checkers (pyright, mypy) and pytest at runtime. The `test_types` nox session runs all three checks and is auto-discovered by CI. +Use `py/src/braintrust/type_tests/` when changing generic type signatures such as: + +- `Eval` +- `EvalCase` +- `EvalScorer` +- `EvalHooks` -When changing generic type signatures (e.g., `Eval`, `EvalCase`, `EvalScorer`, `EvalHooks`), add or update a test in `type_tests/` to verify the type checker accepts the intended usage patterns. +Rules: -New test files should be named `test_*.py` and use absolute imports (`from braintrust.framework import ...`). 
They are regular pytest files that also happen to be valid pyright/mypy targets. +- add or update a type test for the intended usage pattern +- name files `test_*.py` +- use absolute imports such as `from braintrust.framework import ...` + +Run with: ```bash cd py nox -s test_types ``` -## VCR +## VCR and Cassettes + +For provider and integration behavior, the default path is: -VCR/cassette coverage is the default and preferred testing strategy for provider and integration behavior in this repo. Reach for cassette-backed tests before introducing mocks or fakes, and keep new coverage aligned with the existing VCR patterns unless there is a strong reason not to. +1. reproduce with a failing cassette-backed test +2. implement the fix +3. re-run the affected session -VCR cassette directories: +Cassette locations: - `py/src/braintrust/cassettes/` - `py/src/braintrust/wrappers/cassettes/` @@ -120,8 +187,8 @@ Behavior from `py/src/braintrust/conftest.py`: - local default: `record_mode="once"` - CI default: `record_mode="none"` -- wheel-mode skips VCR-marked tests -- test fixtures inject dummy API keys and reset global state +- wheel mode skips VCR-marked tests +- fixtures inject dummy API keys and reset global state Common commands: @@ -132,9 +199,13 @@ nox -s "test_openai(latest)" -- --disable-vcr nox -s "test_openai(latest)" -- --vcr-record=all -k "test_openai_chat_metrics" ``` -Claude Agent SDK does not use VCR because the SDK talks to the bundled `claude` subprocess over stdin/stdout. Those tests use a transport-level cassette helper instead. 
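The record-mode settings above follow vcrpy's usual semantics. As a rough illustrative sketch of the decision logic (plain stdlib Python, not the repo's actual `conftest.py` code):

```python
# Rough sketch of vcrpy-style record-mode semantics; the authoritative
# behavior lives in vcrpy and py/src/braintrust/conftest.py.
def may_hit_network(mode: str, cassette_exists: bool) -> bool:
    """Return True when a test request may reach the real provider."""
    if mode == "all":
        return True  # always re-record, overwriting the cassette
    if mode == "once":
        return not cassette_exists  # record on first run, replay afterward
    if mode == "none":
        return False  # CI default: replay only; unmatched requests fail
    raise ValueError(f"unknown record mode: {mode}")
```

In practice this means a missing cassette gets recorded automatically in local runs but causes a hard failure in CI, which is why re-recording should always be deliberate.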
+Claude Agent SDK note: -Common Claude Agent SDK cassette commands: +- it does not use HTTP VCR +- it talks to the bundled `claude` subprocess over stdin/stdout +- it uses transport-level cassette helpers instead + +Common Claude Agent SDK commands: ```bash cd py @@ -143,26 +214,32 @@ BRAINTRUST_CLAUDE_AGENT_SDK_RECORD_MODE=all nox -s "test_claude_agent_sdk(latest BRAINTRUST_CLAUDE_AGENT_SDK_RECORD_MODE=all nox -s "test_claude_agent_sdk(latest)" -- -k "test_calculator_with_multiple_operations" ``` -Only re-record HTTP or subprocess cassettes when the behavior change is intentional. If in doubt, ask the user. +Only re-record HTTP or subprocess cassettes when the behavior change is intentional. If unsure, ask the user. ## Benchmarks -Run `cd py && make bench` when touching hot-path code (serialization, deep-copy, span creation, logging). Not required for every change. - -Benchmarks use pyperf. All `bench_*.py` files in `py/benchmarks/benches/` are auto-discovered — no registration needed. +If you touch a hot path such as serialization, deep-copy, span creation, or logging, consider benchmarks. -Key commands: +Quick commands: ```bash cd py -make bench # run all benchmarks -make bench BENCH_ARGS="--fast" # quick sanity check -make bench BENCH_ARGS="-o /tmp/before.json" # save baseline before a change -make bench BENCH_ARGS="-o /tmp/after.json" # save after a change +make bench +make bench BENCH_ARGS="--fast" +make bench BENCH_ARGS="-o /tmp/before.json" +make bench BENCH_ARGS="-o /tmp/after.json" make bench-compare BENCH_BASE=/tmp/before.json BENCH_NEW=/tmp/after.json ``` -New benchmark files go in `py/benchmarks/benches/bench_.py`. Each must expose `main(runner: pyperf.Runner | None = None)`. Shared payload builders go in `py/benchmarks/fixtures.py`. See existing `bench_bt_json.py` for the pattern. 
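A minimal benchmark file can look like the following sketch. The file name `bench_example.py` and the payload/serialize functions are illustrative only; `bench_bt_json.py` in the repo is the real reference for the pattern.

```python
# Hypothetical py/benchmarks/benches/bench_example.py -- illustrative only.
import json

try:
    import pyperf  # provided by the bench tooling; guarded so the sketch imports anywhere
except ImportError:
    pyperf = None


def build_payload() -> dict:
    # Shared payload builders like this belong in py/benchmarks/fixtures.py.
    return {"spans": [{"id": i, "metadata": {"k": "v"}} for i in range(100)]}


def bench_serialize() -> str:
    return json.dumps(build_payload())


def main(runner: "pyperf.Runner | None" = None) -> None:
    # The auto-discovery harness may pass in a shared Runner; otherwise make one.
    if pyperf is None:
        raise RuntimeError("pyperf is required to run benchmarks")
    if runner is None:
        runner = pyperf.Runner()
    runner.bench_func("example_serialize", bench_serialize)


if __name__ == "__main__":
    main()
```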
+Rules:
+
+- benchmark hot-path changes when practical
+- benchmark files live in `py/benchmarks/benches/`
+- new files should be named `bench_<name>.py`
+- each benchmark file must expose `main(runner: pyperf.Runner | None = None)`
+- shared payload builders belong in `py/benchmarks/fixtures.py`
+
+See `py/benchmarks/benches/bench_bt_json.py` for the pattern.
 
 ## Build Notes
 
@@ -173,18 +250,27 @@ cd py
 make build
 ```
 
-Important caveat:
+Caveat:
 
-- `py/scripts/template-version.py` rewrites `py/src/braintrust/version.py` during build.
-- `py/Makefile` restores that file afterward with `git checkout`.
+- `py/scripts/template-version.py` rewrites `py/src/braintrust/version.py` during build
+- `py/Makefile` restores that file afterward with `git checkout`
 
 Avoid editing `py/src/braintrust/version.py` while also running build commands.
 
-## Editing Guidance
+## Editing Guidelines
 
-- Keep tests near the code they cover.
+- Keep tests close to the code they cover.
 - Reuse existing fixtures and cassette patterns.
 - Prefer extending an existing cassette-backed test over adding a new mock-heavy test.
 - If a change affects examples or integrations, update the nearest example or focused test.
 - For CLI/devserver changes, consider whether wheel-mode behavior also needs coverage.
-- Do **not** add `from __future__ import annotations` unless it is absolutely required (e.g., a genuine forward-reference that cannot be resolved any other way). This import changes annotation evaluation semantics at runtime and can silently break `get_type_hints()`, Pydantic models, and other runtime introspection. Prefer quoted string literals (`"MyClass"`) or `TYPE_CHECKING` guards for forward references instead.
+
+## Quick Decision Guide
+
+- **Changing SDK code?** Work from `py/`.
+- **Need a test command?** Read `py/noxfile.py`.
+- **Fixing a bug?** Add/update a failing test first.
+- **Changing provider/integration behavior?** Use VCR-backed coverage and check version gates.
+- **Changing generic typing?** Add/update a file in `py/src/braintrust/type_tests/` and run `nox -s test_types`. +- **Touching a hot path?** Consider `cd py && make bench`. +- **Preparing handoff after a major change?** Run `cd py && make fixup`. diff --git a/py/AGENTS.md b/py/AGENTS.md new file mode 100644 index 00000000..1118e26a --- /dev/null +++ b/py/AGENTS.md @@ -0,0 +1 @@ +See ../AGENTS.md for repository instructions. diff --git a/py/CLAUDE.md b/py/CLAUDE.md new file mode 120000 index 00000000..47dc3e3d --- /dev/null +++ b/py/CLAUDE.md @@ -0,0 +1 @@ +AGENTS.md \ No newline at end of file