Epic: Testing Infrastructure & Strategy Overhaul #726

@planetf1

Epic: Testing Infrastructure & Strategy Overhaul

Agreed outcomes from Discussion #711 and the 2026-03-23 planning call with @ajbozarth, @planetf1, @jakelorocco, and @avinash2692. cc @psschwei for further planning, @avinash2692 regarding Bluevela nightlies.

Key Decisions

Two-dimensional marker taxonomy: granularity (unit, integration, e2e, qualitative) × backend (ollama, huggingface, vllm, openai, watsonx, litellm, etc.), plus resource markers (requires_gpu, requires_heavy_ram, requires_gpu_isolation).
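As a rough illustration of how the taxonomy could be registered with pytest, here is a minimal `conftest.py` sketch. The marker names come from the list above. The description strings and the helper `marker_lines()` are assumptions made for the example, not agreed wording.

```python
# conftest.py (illustrative sketch, not the project's actual configuration)

# Granularity axis. Descriptions are assumed wording for this example.
GRANULARITY_MARKERS = {
    "unit": "fast test of a single component",
    "integration": "several components exercised together",
    "e2e": "exercises a real backend end to end",
    "qualitative": "checks the quality of model output",
}

# Backend axis (the source list ends with "etc.", so this is not exhaustive).
BACKEND_MARKERS = ("ollama", "huggingface", "vllm", "openai", "watsonx", "litellm")

# Resource markers.
RESOURCE_MARKERS = ("requires_gpu", "requires_heavy_ram", "requires_gpu_isolation")


def marker_lines():
    """Build one 'name: description' line per marker (hypothetical helper)."""
    lines = [f"{name}: {desc}" for name, desc in GRANULARITY_MARKERS.items()]
    lines += [f"{name}: test targets the {name} backend" for name in BACKEND_MARKERS]
    lines += [f"{name}: resource requirement" for name in RESOURCE_MARKERS]
    return lines


def pytest_configure(config):
    """Standard pytest hook: register every marker so --strict-markers passes."""
    for line in marker_lines():
        config.addinivalue_line("markers", line)
```

A test would then carry one marker from each relevant axis, for example `@pytest.mark.e2e` together with `@pytest.mark.ollama`. Running with `--strict-markers` makes any unregistered marker an error.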

| Tier | Trigger | Budget | What runs |
|---|---|---|---|
| Pre-commit | Every commit | <60s | Lint + type checking only |
| Local dev | Ad-hoc | <5 min | All tests matching available backends/resources |
| PR CI | Every push | <15 min | Unit + integration + Ollama e2e |
| Nightly CI | Scheduled | ~60 min | Every test, no exceptions (Bluevela, full GPU) |
| Pre-release | Manual | ~90 min | Manual trigger of nightly suite |
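One way to picture the tiers is as pytest `-m` marker expressions. The sketch below is an assumption about how that mapping might look. The tier keys and the function `marker_expression()` are hypothetical and only the marker names are taken from this issue.

```python
# Hypothetical mapping from tier to a pytest "-m" expression (sketch only).

BACKENDS = ("ollama", "huggingface", "vllm", "openai", "watsonx", "litellm")
RESOURCES = ("requires_gpu", "requires_heavy_ram", "requires_gpu_isolation")


def marker_expression(tier, available=()):
    """Return a marker expression for a tier, or None if no tests run.

    `available` lists the backend/resource markers usable on this machine;
    it only matters for the local tier.
    """
    if tier == "pre-commit":
        return None  # lint + type checking only, pytest is not invoked
    if tier == "pr-ci":
        return "unit or integration or (e2e and ollama)"
    if tier in ("nightly", "pre-release"):
        return ""  # empty expression: every test, no exceptions
    if tier == "local":
        # Deselect anything whose backend or resource is not available.
        missing = [m for m in BACKENDS + RESOURCES if m not in available]
        return " and ".join(f"not {m}" for m in missing)
    raise ValueError(f"unknown tier: {tier!r}")
```

For instance, `pytest -m "unit or integration or (e2e and ollama)"` would correspond to the PR CI row. The local expression is built by exclusion so that tests with no backend marker still run.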

Principles: split e2e into integration + e2e pairs (don't just downgrade); parametrise across backends; fix root causes over workarounds; catalog minimal default models with overrides; scope covers both tests and examples; docs updated with every change.
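The "catalog minimal default models with overrides" principle could be sketched roughly as follows. Everything here is illustrative: the model names are placeholders, and the `TEST_MODEL_<BACKEND>` environment variable convention and the `resolve_model()` helper are assumptions rather than anything agreed in the discussion.

```python
import os

# Placeholder catalog: one small default model per backend (names are not real).
DEFAULT_MODELS = {
    "ollama": "placeholder-small-ollama-model",
    "huggingface": "placeholder-small-hf-model",
}


def resolve_model(backend, env=None):
    """Return the model to test against for `backend`.

    An environment variable named TEST_MODEL_<BACKEND> (hypothetical
    convention) overrides the catalogued default.
    """
    env = os.environ if env is None else env
    override = env.get(f"TEST_MODEL_{backend.upper()}")
    if override:
        return override
    try:
        return DEFAULT_MODELS[backend]
    except KeyError:
        raise KeyError(f"no default model catalogued for backend {backend!r}") from None
```

Keeping the defaults in one place means a backend-parametrised test can ask for `resolve_model(backend)` instead of hard-coding a model, while a developer with different hardware can still swap the model without editing tests.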

Work Items

| # | Issue | Summary |
|---|---|---|
| 1a | #727 | Granularity marker taxonomy and tiered timeouts |
| 1b | #728 | Backend & resource marker audit (children: #622, #539, #629, #634) |
| 2a | #729 | Split e2e tests into integration + e2e pairs |
| 2b | #730 | Parametrise and consolidate backend-specific tests |
| 3a | #731 | Environment diagnostic, pre-flight checks & reporting (children: #574, #349) |
| 3b | #732 | Model consolidation and flexibility (children: #359) |
| 4 | #733 | CI parallelisation and dynamic test selection (see also #451) |
| 5 | #734 | On-demand nightly test runs for PRs |
| 6 | #735 | Semantic assertions & recording for qualitative tests (children: #692) |
| 7 | #736 | Backend resource cleanup post-PR #721 |
| 8 | #737 | Test results & coverage reporting |
| 9 | #738 | Notebook testing (children: #89) |
| 10 | #739 | Pre-commit & type checking (children: #456) |

Related Issues

Expected to close with PR #721 (cleanup_gpu_backend()): #630, #625, #620, #699. Residual cleanup tracked in #736.

Flaky tests — addressed by #735 (semantic assertions): #398, #384, #628, #684, #121.

Not in scope: #691, #496, #347, #267 — remain standalone.
