49 changes: 38 additions & 11 deletions .agents/skills/sdk-ci-triage/SKILL.md
@@ -34,29 +34,32 @@ Read when relevant:

 - `lint`: pre-commit diff-based checks
 - `ensure-pinned-actions`: workflow hygiene
+- `static_checks`: Ubuntu-only Python matrix for `pylint` and `test_types`
 - `smoke`: install/import matrix across Python and OS
 - `nox`: provider and core test matrix, sharded through `py/scripts/nox-matrix.py`
 - `adk-py`: reusable workflow for ADK coverage
 - `langchain-py`: reusable workflow for LangChain coverage
 - `upload-wheel`: build wheel sanity check

-The most common failure source is the `nox` matrix job.
+The most common failure source is still the `nox` matrix job, but `pylint` and `test_types` failures now surface through `static_checks`, not through `nox`.

 ## Standard Workflow

 1. Identify the failing PR, run, or job.
 2. Inspect the failing job logs with `gh`.
 3. Determine which workflow branch failed:
    - `lint`
+   - `static_checks`
    - `smoke`
    - `nox`
    - reusable workflow (`adk-py`, `langchain-py`)
    - `upload-wheel`
 4. For `nox` failures, map the matrix job to the exact nox session and pinned provider version from the logs.
-5. Reproduce the narrowest failing command locally.
-6. Fix the bug.
-7. Re-run the narrowest failing command first.
-8. Expand only if shared code changed.
+5. For `static_checks` failures, identify whether `pylint` or `test_types` failed under the reported Python version.
+6. Reproduce the narrowest failing command locally.
+7. Fix the bug.
+8. Re-run the narrowest failing command first.
+9. Expand only if shared code changed.

 Do not start by running the whole suite locally unless the failure genuinely spans many sessions.

@@ -94,25 +97,29 @@ gh api repos/braintrustdata/braintrust-sdk-python/actions/jobs/<job-id>/logs
 Job names look like this:

 ```text
-nox (3.10, ubuntu-latest, 0)
+nox (3.10, ubuntu-24.04, 0)
 ```

 That means:

 - Python `3.10`
-- OS `ubuntu-latest`
+- OS `ubuntu-24.04`
 - shard `0` out of 4
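
As a rough illustration of the mapping above, the following Python sketch splits a job name of that shape into its parts. The names `NoxJob` and `parse_nox_job_name` are hypothetical and not part of the repository, and the regular expression assumes the `nox (<python>, <os>, <shard>)` format shown here. The default shard count of 4 is taken from the workflow command.

```python
import re
from typing import NamedTuple


class NoxJob(NamedTuple):
    python_version: str
    os_name: str
    shard: int
    num_shards: int


# Assumes the "nox (<python>, <os>, <shard>)" shape shown above.
_JOB_NAME = re.compile(r"^nox \(([^,]+), ([^,]+), (\d+)\)$")


def parse_nox_job_name(name: str, num_shards: int = 4) -> NoxJob:
    """Split a CI job name into Python version, OS, and shard index."""
    match = _JOB_NAME.match(name.strip())
    if match is None:
        raise ValueError(f"not a nox matrix job name: {name!r}")
    python_version, os_name, shard = match.groups()
    return NoxJob(python_version, os_name, int(shard), num_shards)


job = parse_nox_job_name("nox (3.10, ubuntu-24.04, 0)")
print(job.python_version, job.os_name, job.shard, job.num_shards)
```

The parsed fields correspond to the `<python-version>` and `<shard>` placeholders in the workflow command.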

 The workflow runs:

 ```bash
-mise exec python@<python-version> -- python ./py/scripts/nox-matrix.py <shard> 4
+mise exec python@<python-version> -- python ./py/scripts/nox-matrix.py <shard> 4 \
+  --exclude-session pylint \
+  --exclude-session test_types
 ```

 Use a dry run first to see which sessions belong to the shard:

 ```bash
-mise exec python@3.10 -- python ./py/scripts/nox-matrix.py 0 4 --dry-run
+mise exec python@3.10 -- python ./py/scripts/nox-matrix.py 0 4 --dry-run \
+  --exclude-session pylint \
+  --exclude-session test_types
 ```

 Then inspect the failing logs to find the exact session name, for example:
@@ -161,6 +168,23 @@ make lint
 make pylint
 ```

+### `static_checks`
+
+The `static_checks` job is an Ubuntu-only Python matrix that runs `pylint` and `test_types` together for each configured Python version.
+
+Local equivalents:
+
+```bash
+mise exec python@3.10 -- nox -f ./py/noxfile.py -s pylint test_types
+```
+
+If only one of the two sessions failed in CI, narrow locally to that specific session:
+
+```bash
+mise exec python@3.10 -- nox -f ./py/noxfile.py -s pylint
+mise exec python@3.10 -- nox -f ./py/noxfile.py -s test_types
+```
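
The narrowing rule above could be sketched as a small helper. `static_checks_command` is a hypothetical name, not something the repository provides. It only assembles the same `mise exec ... nox` command line shown in the examples and does not run anything.

```python
STATIC_SESSIONS = ("pylint", "test_types")


def static_checks_command(python_version: str, failed: list[str]) -> list[str]:
    """Build the nox command for only the static sessions that failed in CI."""
    unknown = [s for s in failed if s not in STATIC_SESSIONS]
    if unknown:
        raise ValueError(f"not a static_checks session: {unknown}")
    # Keep the CI order; fall back to both sessions if none is singled out.
    sessions = [s for s in STATIC_SESSIONS if s in failed] or list(STATIC_SESSIONS)
    return ["mise", "exec", f"python@{python_version}", "--",
            "nox", "-f", "./py/noxfile.py", "-s", *sessions]


print(" ".join(static_checks_command("3.10", ["test_types"])))
```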

 ### `smoke`

 The smoke job validates install + import across OS and Python versions.
@@ -276,7 +300,9 @@ Preferred progression:

 ```bash
 # 1. Inspect the failing shard
-mise exec python@3.10 -- python ./py/scripts/nox-matrix.py 0 4 --dry-run
+mise exec python@3.10 -- python ./py/scripts/nox-matrix.py 0 4 --dry-run \
+  --exclude-session pylint \
+  --exclude-session test_types

 # 2. Reproduce the exact session
 cd py
@@ -299,7 +325,7 @@ When answering a CI-triage question, report:
 Good example structure:

 ```text
-The failing job is `nox (3.10, ubuntu-latest, 0)`.
+The failing job is `nox (3.10, ubuntu-24.04, 0)`.
 Within that shard, the failing session is `test_google_genai(1.30.0)`.
 The root cause is that the tests import a symbol that does not exist in google-genai 1.30.0, even though it exists in newer versions.
 You can reproduce it locally with `cd py && nox -s "test_google_genai(1.30.0)"`.
@@ -311,6 +337,7 @@ The fix is to gate the behavior for older versions or stop assuming the newer AP
 Avoid these common mistakes:

 - guessing the session from the provider name without checking `py/noxfile.py`
+- forgetting that CI excludes `pylint` and `test_types` from the sharded `nox` job
 - reproducing with `latest` when CI failed on an older pinned version
 - running from repo root when the real SDK command belongs in `py/`
 - fixing the symptom in tests without understanding the provider-version contract
2 changes: 1 addition & 1 deletion .github/workflows/adk-py-test.yaml
@@ -9,7 +9,7 @@ on:

 jobs:
   test:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-24.04
     timeout-minutes: 15

     steps:
44 changes: 33 additions & 11 deletions .github/workflows/checks.yaml
@@ -10,7 +10,7 @@ permissions:

 jobs:
   lint:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-24.04
     timeout-minutes: 10
     steps:
       - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4.3.1
@@ -26,13 +26,31 @@ jobs:
           mise exec -- pre-commit run --from-ref origin/${{ github.base_ref || 'main' }} --to-ref HEAD

   ensure-pinned-actions:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-24.04
     timeout-minutes: 5
     steps:
       - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4.3.1
       - name: Ensure SHA pinned actions
         uses: zgosalvez/github-actions-ensure-sha-pinned-actions@70c4af2ed5282c51ba40566d026d6647852ffa3e # v5.0.1

+  static_checks:
+    runs-on: ubuntu-24.04
+    timeout-minutes: 20
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.10", "3.11", "3.12", "3.13", "3.14"]
+    steps:
+      - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4.3.1
+      - name: Setup Python environment
+        uses: ./.github/actions/setup-python-env
+        with:
+          python-version: ${{ matrix.python-version }}
+      - name: Run pylint and type tests
+        shell: bash
+        run: |
+          mise exec python@${{ matrix.python-version }} -- nox -f ./py/noxfile.py -s pylint test_types
+
   smoke:
     runs-on: ${{ matrix.os }}
     timeout-minutes: 20
@@ -41,7 +59,7 @@ jobs:
       fail-fast: false
       matrix:
         python-version: ["3.10", "3.11", "3.12", "3.13", "3.14"]
-        os: [ubuntu-latest, windows-latest]
+        os: [ubuntu-24.04, windows-2025]

     steps:
       - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4.3.1
@@ -66,7 +84,7 @@ jobs:
       fail-fast: false
       matrix:
         python-version: ["3.10", "3.11", "3.12", "3.13", "3.14"]
-        os: [ubuntu-latest, windows-latest]
+        os: [ubuntu-24.04, windows-2025]
         shard: [0, 1, 2, 3]

     steps:
@@ -78,7 +96,9 @@ jobs:
       - name: Run nox tests (shard ${{ matrix.shard }}/4)
         shell: bash
         run: |
-          mise exec python@${{ matrix.python-version }} -- python ./py/scripts/nox-matrix.py ${{ matrix.shard }} 4
+          mise exec python@${{ matrix.python-version }} -- python ./py/scripts/nox-matrix.py ${{ matrix.shard }} 4 \
+            --exclude-session pylint \
+            --exclude-session test_types

   adk-py:
     uses: ./.github/workflows/adk-py-test.yaml
@@ -90,7 +110,7 @@ jobs:
     needs:
       - smoke
       - nox
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-24.04
     timeout-minutes: 10
     steps:
       - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4.3.1
@@ -114,12 +134,13 @@ jobs:
     needs:
       - lint
       - ensure-pinned-actions
+      - static_checks
       - smoke
       - nox
       - adk-py
       - langchain-py
       - upload-wheel
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-24.04
     timeout-minutes: 5
     if: always()
     steps:
@@ -138,12 +159,13 @@ jobs:
           }

           check_result "lint" "${{ needs.lint.result }}"
-          check_result "ensure-pinned-actions" "${{ needs.ensure-pinned-actions.result }}"
+          check_result "ensure-pinned-actions" "${{ needs['ensure-pinned-actions'].result }}"
+          check_result "static_checks" "${{ needs.static_checks.result }}"
           check_result "smoke" "${{ needs.smoke.result }}"
           check_result "nox" "${{ needs.nox.result }}"
-          check_result "adk-py" "${{ needs.adk-py.result }}"
-          check_result "langchain-py" "${{ needs.langchain-py.result }}"
-          check_result "upload-wheel" "${{ needs.upload-wheel.result }}"
+          check_result "adk-py" "${{ needs['adk-py'].result }}"
+          check_result "langchain-py" "${{ needs['langchain-py'].result }}"
+          check_result "upload-wheel" "${{ needs['upload-wheel'].result }}"

           if [ "$FAILED" -ne 0 ]; then
             echo "One or more required checks failed"
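
The aggregation this gate step performs with `check_result` can be pictured with a short Python sketch. This is an illustration only: the body of `check_result` is not visible in this diff, so the rule used here (only `success` passes, and a missing job counts as failed) is an assumption, and `failed_checks` is a hypothetical name.

```python
# Job names taken from the `needs:` list of the gate job after this change.
REQUIRED_JOBS = [
    "lint", "ensure-pinned-actions", "static_checks", "smoke",
    "nox", "adk-py", "langchain-py", "upload-wheel",
]


def failed_checks(results: dict[str, str]) -> list[str]:
    """Return required jobs whose result is not "success".

    Assumption: only "success" passes. The real check_result may treat
    other results (for example "skipped") differently.
    """
    return [job for job in REQUIRED_JOBS if results.get(job) != "success"]


results = {job: "success" for job in REQUIRED_JOBS}
results["static_checks"] = "failure"
print(failed_checks(results))
```

Under that assumption, a failing `pylint` or `test_types` run now shows up in the gate as `static_checks` and no longer as `nox`.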
2 changes: 1 addition & 1 deletion .github/workflows/langchain-py-test.yaml
@@ -5,7 +5,7 @@ on:

 jobs:
   test:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-24.04
     timeout-minutes: 15

     steps:
7 changes: 5 additions & 2 deletions AGENTS.md
@@ -15,7 +15,8 @@ Use this file as the default playbook for work in this repository.
 2. **Use `mise` as the source of truth for tools and environment.**

 3. **Do not guess test commands or version coverage.**
-   - `py/noxfile.py` is the source of truth for nox session names, provider/version matrices, and CI coverage.
+   - `py/noxfile.py` is the source of truth for nox session names, provider/version matrices, and local reproduction commands.
+   - `.github/workflows/checks.yaml` is the source of truth for which sessions run in CI, on which Python versions, and outside vs. inside the nox shard matrix.
    - For provider and integration work, also check `py/src/braintrust/integrations/versioning.py`.

 4. **Keep changes narrow and validate with the smallest relevant test first.**
@@ -116,7 +117,7 @@ Do not guess:
 - supported provider versions
 - which tests a provider session runs

-Check `py/noxfile.py` and reproduce with the exact local session CI uses.
+Check `py/noxfile.py` and `.github/workflows/checks.yaml`, then reproduce with the exact local session CI uses.

 ### Run the smallest relevant test first

@@ -143,6 +144,8 @@ Before changing provider/integration behavior:

 - `test_core` runs without optional vendor packages.
 - `test_types` runs pyright, mypy, and pytest on `py/src/braintrust/type_tests/`.
+- CI runs `pylint` and `test_types` via the dedicated `static_checks` workflow job on Ubuntu across the configured Python matrix, not inside the sharded `nox` job.
+- The sharded `nox` workflow excludes `pylint` and `test_types`; use `py/scripts/nox-matrix.py --exclude-session ...` when reproducing shard membership locally.
 - wrapper coverage is split across dedicated nox sessions by provider/version.
 - `test-wheel` is a wheel sanity check and requires a built wheel first.
10 changes: 9 additions & 1 deletion py/scripts/nox-matrix.py
@@ -6,7 +6,7 @@
 by weight descending and greedily assigns each to the lightest shard.

 Usage:
-    python nox-matrix.py <shard_index> <number_of_shards> [--dry-run]
+    python nox-matrix.py <shard_index> <number_of_shards> [--dry-run] [--exclude-session <name> ...]
 """

 import argparse
@@ -80,6 +80,12 @@ def main() -> None:
     parser.add_argument("shard_index", type=int, help="Zero-based shard index")
     parser.add_argument("num_shards", type=int, help="Total number of shards")
     parser.add_argument("--dry-run", action="store_true", help="Print assignment without running nox")
+    parser.add_argument(
+        "--exclude-session",
+        action="append",
+        default=[],
+        help="Exclude a nox session from shard assignment. May be passed multiple times.",
+    )
     parser.add_argument(
         "--output-durations",
         type=Path,
@@ -108,6 +114,8 @@ def main() -> None:
     weights_file = root_dir / "py" / "scripts" / "session-weights.json"

     all_sessions = get_nox_sessions(noxfile)
+    excluded_sessions = set(args.exclude_session)
+    all_sessions = [session for session in all_sessions if session not in excluded_sessions]
     weights, default_weight = load_weights(weights_file)
     shard_assignments = assign_shards(all_sessions, args.num_shards, weights, default_weight)
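
To make the effect of filtering before assignment concrete, here is a minimal standalone sketch of the strategy the module docstring describes (sort by weight descending, then greedily place each session on the lightest shard). It is not the script's actual `assign_shards` implementation: `greedy_shards` is a hypothetical name, and the weights below are invented for illustration.

```python
def greedy_shards(
    sessions: list[str],
    num_shards: int,
    weights: dict[str, float],
    default_weight: float,
) -> list[list[str]]:
    """Heaviest session first, each placed on the currently lightest shard."""
    shards: list[list[str]] = [[] for _ in range(num_shards)]
    loads = [0.0] * num_shards
    ordered = sorted(sessions, key=lambda s: weights.get(s, default_weight), reverse=True)
    for session in ordered:
        lightest = loads.index(min(loads))  # ties go to the lowest shard index
        shards[lightest].append(session)
        loads[lightest] += weights.get(session, default_weight)
    return shards


# Invented weights; the real values live in session-weights.json.
weights = {"pylint": 40.0, "test_types": 40.0, "test_core": 30.0,
           "test_google_genai(1.30.0)": 20.0}
all_sessions = ["pylint", "test_core", "test_google_genai(1.30.0)",
                "test_types", "test-wheel"]

# Same filtering step as in the diff: drop excluded sessions first.
excluded = {"pylint", "test_types"}
remaining = [s for s in all_sessions if s not in excluded]

print(greedy_shards(remaining, 2, weights, default_weight=10.0))
```

Because the excluded sessions are removed before weights are balanced, a local dry run has to pass the same `--exclude-session` flags as CI for its shard membership to be comparable.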
