Epic: Testing Infrastructure & Strategy Overhaul
Agreed outcomes from Discussion #711 and the 2026-03-23 planning call with @ajbozarth, @planetf1, @jakelorocco, and @avinash2692. cc @psschwei for further planning, @avinash2692 regarding Bluevela nightlies.
Key Decisions
Two-dimensional marker taxonomy: granularity (unit, integration, e2e, qualitative) × backend (ollama, huggingface, vllm, openai, watsonx, litellm, etc.), plus resource markers (requires_gpu, requires_heavy_ram, requires_gpu_isolation).
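A minimal sketch of how the two-dimensional taxonomy could drive test selection. It assumes each test carries exactly one granularity marker plus zero or more backend and resource markers. The marker names come from the taxonomy above; `CollectedTest`, `validate`, and `should_run` are hypothetical helpers. In a real `conftest.py` this logic would live in pytest's `pytest_collection_modifyitems` hook, reading marker names from `item.iter_markers()`.

```python
from dataclasses import dataclass, field

# Marker names taken from the taxonomy above.
GRANULARITY = {"unit", "integration", "e2e", "qualitative"}
BACKENDS = {"ollama", "huggingface", "vllm", "openai", "watsonx", "litellm"}
RESOURCES = {"requires_gpu", "requires_heavy_ram", "requires_gpu_isolation"}


@dataclass
class CollectedTest:
    """Hypothetical stand-in for a collected pytest item."""

    name: str
    markers: set[str] = field(default_factory=set)


def validate(test: CollectedTest) -> None:
    """Raise if the test does not carry exactly one granularity marker."""
    found = test.markers & GRANULARITY
    if len(found) != 1:
        raise ValueError(
            f"{test.name}: expected one granularity marker, got {sorted(found)}"
        )


def should_run(test: CollectedTest, available: set[str]) -> bool:
    """Run only if every backend/resource marker on the test is available here."""
    validate(test)
    needed = test.markers & (BACKENDS | RESOURCES)
    return needed <= available


local = {"ollama"}  # e.g. a laptop with only Ollama running
print(should_run(CollectedTest("test_chat", {"integration", "ollama"}), local))  # True
print(should_run(CollectedTest("test_hf", {"e2e", "huggingface", "requires_gpu"}), local))  # False
```

Markers outside the taxonomy (a typo, say) are simply ignored by `should_run`; pytest's `--strict-markers` flag is the usual way to reject unregistered markers at collection time.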
| Tier | Trigger | Budget | What runs |
|---|---|---|---|
| Pre-commit | Every commit | <60s | Lint + type checking only |
| Local dev | Ad-hoc | <5 min | All tests matching available backends/resources |
| PR CI | Every push | <15 min | Unit + integration + Ollama e2e |
| Nightly CI | Scheduled | ~60 min | Every test, no exceptions (Bluevela, full GPU) |
| Pre-release | Manual | ~90 min | Manual trigger of nightly suite |
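One way the tiers could map onto pytest's `-m` marker expressions, as a sketch only. The tier keys and the `pytest_command` helper are hypothetical, and the PR expression assumes "Ollama e2e" means tests marked both `e2e` and `ollama`. Tiers that run everything pass no marker filter and rely on skipping unavailable backends at collection time.

```python
from __future__ import annotations

import shlex

# Hypothetical tier -> pytest `-m` expression mapping, following the table above.
# None: the tier runs no pytest at all; "": run pytest with no marker filter.
TIER_MARKEXPR = {
    "pre-commit": None,   # lint + type checking only
    "local": "",          # all tests; unavailable backends/resources are skipped
    "pr": "unit or integration or (e2e and ollama)",
    "nightly": "",        # every test, no exceptions
    "pre-release": "",    # manual trigger of the nightly suite
}


def pytest_command(tier: str) -> list[str] | None:
    """Build the pytest argv for a tier, or None if the tier runs no tests."""
    expr = TIER_MARKEXPR[tier]  # KeyError for unknown tiers
    if expr is None:
        return None
    cmd = ["pytest"]
    if expr:
        cmd += ["-m", expr]
    return cmd


print(shlex.join(pytest_command("pr")))
# pytest -m 'unit or integration or (e2e and ollama)'
```

Keeping the expressions in one table means CI workflows and local scripts select the same tests for a given tier.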
Principles: split e2e tests into integration + e2e pairs (don't just downgrade them); parametrise tests across backends; fix root causes rather than adding workarounds; maintain a catalog of minimal default models, with overrides; cover both tests and examples; update docs with every change.
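A minimal sketch of the "catalog of minimal default models, with overrides" principle. The model names are placeholders, not the project's actual choices, and the `TEST_MODEL_<BACKEND>` environment-variable convention is a hypothetical override mechanism.

```python
import os
from collections.abc import Mapping

# Placeholder defaults: NOT the project's real model picks.
DEFAULT_MODELS = {
    "ollama": "placeholder-ollama-model",
    "huggingface": "placeholder-hf-model",
    "openai": "placeholder-openai-model",
}


def resolve_model(backend: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the catalog default unless TEST_MODEL_<BACKEND> (hypothetical) overrides it."""
    override = env.get(f"TEST_MODEL_{backend.upper()}")
    if override:
        return override
    try:
        return DEFAULT_MODELS[backend]
    except KeyError:
        raise KeyError(f"no default model catalogued for backend {backend!r}") from None
```

For example, `TEST_MODEL_OLLAMA=my-model pytest` would swap the Ollama model for one run while every other backend keeps its catalog default. Passing `env` explicitly keeps the resolver easy to test.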
Work Items
| # | Issue | Summary |
|---|---|---|
| 1a | #727 | Granularity marker taxonomy and tiered timeouts |
| 1b | #728 | Backend & resource marker audit (children: #622, #539, #629, #634) |
| 2a | #729 | Split e2e tests into integration + e2e pairs |
| 2b | #730 | Parametrise and consolidate backend-specific tests |
| 3a | #731 | Environment diagnostic, pre-flight checks & reporting (children: #574, #349) |
| 3b | #732 | Model consolidation and flexibility (children: #359) |
| 4 | #733 | CI parallelisation and dynamic test selection (see also #451) |
| 5 | #734 | On-demand nightly test runs for PRs |
| 6 | #735 | Semantic assertions & recording for qualitative tests (children: #692) |
| 7 | #736 | Backend resource cleanup post-PR #721 |
| 8 | #737 | Test results & coverage reporting |
| 9 | #738 | Notebook testing (children: #89) |
| 10 | #739 | Pre-commit & type checking (children: #456) |
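For the environment diagnostic and pre-flight checks (#731), a rough sketch of what a probe could look like. The checks are heuristics chosen for illustration, not the planned design: Ollama is probed on its default port 11434 on localhost, OpenAI availability is inferred from `OPENAI_API_KEY`, and GPU presence from an `nvidia-smi` binary on `PATH` (NVIDIA only). `port_open` and `preflight` are hypothetical names.

```python
from __future__ import annotations

import os
import shutil
import socket
from collections.abc import Mapping


def port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def preflight(env: Mapping[str, str] = os.environ) -> dict[str, bool]:
    """Report which backends/resources look usable. Heuristics, not guarantees."""
    return {
        "ollama": port_open("127.0.0.1", 11434),  # Ollama's default port
        "openai": bool(env.get("OPENAI_API_KEY")),
        "requires_gpu": shutil.which("nvidia-smi") is not None,  # NVIDIA-only heuristic
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name:15} {'available' if ok else 'missing'}")
```

Keys that reuse marker names make the result directly usable as the set of available backends and resources when deciding which tests to skip.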
Related Issues
Expected to be closed by PR #721 (`cleanup_gpu_backend()`): #630, #625, #620, #699. Residual cleanup is tracked in #736.
Flaky tests, to be addressed by #735 (semantic assertions): #398, #384, #628, #684, #121.