feat(ai): AI Testing Framework — consolidation staging branch [6/8 → master]#411

Draft
ianwhitedeveloper wants to merge 9 commits into master from ai-testing-framework-implementation-consolidation

ianwhitedeveloper commented Feb 19, 2026

Context

This is the staging branch for the structured consolidation of draft PR #394 — the Riteway AI Testing Framework. Per Eric's consolidation request, PR #394 (80+ commits, 104 files, ~21K lines, ~60% docs/planning) is being decomposed into 8 small, focused PRs — one module per PR, in dependency order — each with functional requirements and unit tests, ruthlessly reviewed before merging here.

Current status: 6 of 8 PRs merged. PR 7 is in review; PR 8 is draft.

This branch is NOT ready to merge to master until all 8 PRs are merged and a final review passes.


Epic

Enable riteway ai <promptfile> — a CLI command that reads SudoLang test files, delegates execution to AI agents, and outputs results in TAP format. Treats prompts as first-class testable units, supporting configurable runs, pass thresholds, parallel execution, and rich TAP markdown output.

Full requirements: tasks/2026-01-22-riteway-ai-testing-framework.md
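
To make the TAP output mentioned above concrete, here is a minimal sketch of a formatter that turns assertion results into a plain TAP stream. The `toTap` name and the `{ passed, requirement }` result shape are assumptions for illustration; the framework's own formatter produces richer output than this.

```javascript
// Hypothetical helper: renders a list of assertion results as a basic TAP stream.
// Each result is assumed to look like { passed: boolean, requirement: string }.
const toTap = (assertions) =>
  [
    'TAP version 13',
    ...assertions.map(
      ({ passed, requirement }, index) =>
        `${passed ? 'ok' : 'not ok'} ${index + 1} - ${requirement}`
    ),
    // The plan line tells TAP consumers how many test points to expect.
    `1..${assertions.length}`,
  ].join('\n');

console.log(
  toTap([
    { passed: true, requirement: 'Given a greeting prompt, should greet the user' },
    { passed: false, requirement: 'Given an empty prompt, should ask for input' },
  ])
);
```

Putting the plan line last is valid TAP and avoids having to know the assertion count before formatting starts.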


Why Not Cherry-Pick or Rebase PR #394?

  • 80+ commits interleave multiple modules — no clean per-module slices
  • Duplicate commits from prior rebases make cherry-pick impractical
  • ~60% of changed files are docs/planning that must stay out of production PRs
  • Circular dependency (ai-runner.js ↔ test-extractor.js) needed to be resolved first

Approach: Fresh branches from this consolidation base, copy files from the feature branch, fix WIP issues during consolidation, review each PR independently before merging here.


Dependency Graph (module architecture)

ai-errors.js  (leaf)       constants.js  (leaf)
    ↓                           ↓
agent-parser.js  ←  ai-errors      [note: debug-logger removed in PR 5 cleanup]
extraction-parser.js  ←  ai-errors
execute-agent.js  ←  ai-errors, agent-parser
aggregation.js  ←  ai-errors, constants
    ↓
agent-config.js  ←  ai-errors                 [PR 4]
validation.js  ←  ai-errors                   [PR 4]
    ↓
test-extractor.js  ←  execute-agent           [PR 5]
ai-runner.js  ←  all prior                    [PR 5]
    ↓
test-output.js                                [PR 6]
ai-command.js  ←  all prior                   [PR 6]
bin/riteway.js  (modifications)               [PR 6]
    ↓
e2e.test.js  +  fixtures  +  config           [PR 7]
    ↓
agent-config.js (outputFormat + registry)     [PR 8]
ai-init.js  ←  agent-config                   [PR 8]
bin/riteway.js  (ai init subcommand)          [PR 8]

No cycles. Every module has a colocated test file.
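
The no-cycles claim can be checked mechanically. The sketch below transcribes the explicit edges from the graph above by hand and runs a depth-first search over them. The `imports` map and the `findCycle` helper are illustrative, and the "all prior" entry for ai-runner.js is abbreviated to three imports as an assumption.

```javascript
// Module -> modules it imports, copied by hand from the dependency graph above.
// Assumption: "all prior" for ai-runner.js is abbreviated to three imports.
const imports = {
  'ai-errors.js': [],
  'constants.js': [],
  'agent-parser.js': ['ai-errors.js'],
  'extraction-parser.js': ['ai-errors.js'],
  'execute-agent.js': ['ai-errors.js', 'agent-parser.js'],
  'aggregation.js': ['ai-errors.js', 'constants.js'],
  'agent-config.js': ['ai-errors.js'],
  'validation.js': ['ai-errors.js'],
  'test-extractor.js': ['execute-agent.js'],
  'ai-runner.js': ['test-extractor.js', 'execute-agent.js', 'aggregation.js'],
};

// Depth-first search. Returns the walked path ending at the first repeated
// module if a cycle exists, or null when the graph is acyclic.
const findCycle = (graph) => {
  const state = new Map(); // module -> 'visiting' | 'done'
  const visit = (node, path) => {
    if (state.get(node) === 'done') return null;
    if (state.get(node) === 'visiting') return [...path, node];
    state.set(node, 'visiting');
    for (const dep of graph[node] ?? []) {
      const cycle = visit(dep, [...path, node]);
      if (cycle) return cycle;
    }
    state.set(node, 'done');
    return null;
  };
  for (const node of Object.keys(graph)) {
    const cycle = visit(node, []);
    if (cycle) return cycle;
  }
  return null;
};

console.log(findCycle(imports)); // null
```

Adding the old edge from test-extractor.js back to ai-runner.js would make `findCycle` return a path instead of null, which is the situation execute-agent.js was extracted to avoid.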


8-PR Progress

| # | PR | Files | Status |
|---|---|---|---|
| 1 | Foundation — Error Types + Constants | ai-errors.js, constants.js + tests | ✅ Merged (#407) |
| 2 | Utilities — Concurrency Limiter + TAP YAML | limit-concurrency.js, tap-yaml.js + tests (debug-logger added then removed in PR 5 cleanup) | ✅ Merged (#408) |
| 3 | Parsers + Execute Agent | agent-parser, extraction-parser, aggregation, execute-agent + tests | ✅ Merged (#409) |
| 4 | Config + Validation | agent-config, validation + tests + fixtures | ✅ Merged (#410) |
| 5 | Test Extractor + Core Runner | test-extractor, ai-runner + tests; debug-logger removed across all modules | ✅ Merged (#416) |
| 6 | Test Output + CLI Integration | test-output, ai-command, bin/riteway + tests | ✅ Merged (#420) |
| 7 | E2E Tests + Fixtures + Config | e2e.test.js, fixtures, vitest config; test-extractor.js post-consolidation fixups | 🔍 In review (#421) |
| 8 | outputFormat strategy + riteway ai init | execute-agent, agent-config (outputFormat + registry), ai-init.js (new), bin/riteway (init subcommand), README | 📝 Draft (#423) |

Current test count: 190 passing (PRs 1–6 merged). PR 7 adds 6 E2E tests (npm run test:e2e); PR 8 brings unit tests to 211.


WIP Issues From Original PR (13 total — all resolved)

| # | Issue | Status |
|---|---|---|
| 1 | for (const loops in tests | ✅ Zero instances — resolved |
| 2 | agent-config schema comment verbose | ✅ Resolved in PR 4 |
| 3 | Fixtures README outdated | ✅ Resolved in PRs 7 & 8 |
| 4 | formatMedia dead code | ✅ Removed in PR 6 |
| 5 | test-output.js dead call | ✅ Removed with #4 in PR 6 |
| 6 | Redundant test comments | ✅ None found in PRs 1–6 |
| 7 | Try(() => fn(args)) syntax | ✅ Valid — no change |
| 8 | ai-runner logger coupling | ✅ debug-logger removed entirely in PR 5 cleanup |
| 9 | unwrapRawEnvelope duplication | ✅ Resolved in PR 3 (shared unwrapEnvelope) |
| 10 | Cursor agent --trust flag | ✅ Resolved in PR 4 |
| 11 | Hardcoded defaults in tests | ✅ Explicit per TDD rules |
| 12 | Error handling/Zod placement | ✅ z.prettifyError inline in PR 5; AgentConfig* errors in PR 5 |
| 13 | Re-exports in test-extractor.js | ✅ Removed in PR 5 |

Architectural Questions (surfaced in PR 4 — both resolved in PR 8)

1. Built-in agent configs hardcode third-party CLI flags

Resolved: riteway ai init (PR 8) writes all built-in configs to riteway.agent-config.json. Teams who want stability own their config file. Library built-ins remain for first-run convenience.

2. parseOutput function can't live in a JSON config file

Resolved: PR 8 replaces parseOutput: fn with declarative outputFormat: 'json' | 'ndjson' | 'text' string in all agent configs and the agentConfigFileSchema. execute-agent.js maps format names to parsers via a lookup table. Config is now fully serializable.
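
A rough sketch of the lookup-table idea described above: a format name stored as a string selects a parser at execution time. The parser bodies, the `parseAgentOutput` name, and the sample config are assumptions for illustration; they are not the parsers in execute-agent.js or agent-parser.js.

```javascript
// Hypothetical parsers keyed by the three format names from the PR description.
const parsers = {
  json: (stdout) => JSON.parse(stdout),
  ndjson: (stdout) =>
    stdout
      .split('\n')
      .filter((line) => line.trim() !== '')
      .map((line) => JSON.parse(line)),
  text: (stdout) => stdout.trim(),
};

// Select the parser by name. Unknown formats fail loudly rather than
// falling through to some default.
const parseAgentOutput = ({ outputFormat, stdout }) => {
  if (!Object.hasOwn(parsers, outputFormat)) {
    throw new Error(`Unsupported outputFormat: ${outputFormat}`);
  }
  return parsers[outputFormat](stdout);
};

// Because the config holds only plain data, it survives a JSON round trip,
// which a function-valued parseOutput field would not.
const agentConfig = { command: 'example-agent', args: [], outputFormat: 'ndjson' };
const restored = JSON.parse(JSON.stringify(agentConfig));

console.log(
  parseAgentOutput({ outputFormat: restored.outputFormat, stdout: '{"a":1}\n{"b":2}\n' })
);
```

The `Object.hasOwn` check keeps inherited keys such as `constructor` from being treated as format names.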


Merge Plan

  1. Each topic PR targets this branch (not master)
  2. Agent + human review before each merge
  3. When all 8 are merged here and tests are green: final review, then PR this → master

@ericelliott

I'm okay with the strategies here.

ianwhitedeveloper and others added 6 commits February 25, 2026 14:39
* feat(ai): add error types, constants, and Zod schemas (PR 1/7)

Foundation layer for the AI testing framework. Introduces structured
error handling via error-causes and runtime-validated configuration
constants via Zod schemas. Updates eslint ecmaVersion to 2022 to
support numeric separators and optional chaining used throughout
the framework source.

Files:
- source/ai-errors.js — named error types (ParseError, ValidationError, etc.)
- source/ai-errors.test.js — full coverage for error descriptors and createError
- source/constants.js — defaults, constraints, and Zod schemas
- source/constants.test.js — 26 tests covering all schemas and boundaries
- eslint.config.js — bump ecmaVersion 2017 → 2022 (prerequisite)
- package.json — add error-causes and zod to production dependencies

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(config): bring working configs from feature branch

Adds vitest.config.js e2e exclusion (source/e2e.test.js uses Riteway/Tape,
not Vitest) alongside the eslint ecmaVersion 2022 bump already in place.
Both changes are sourced from the working feature branch.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(ai): address PR 1 review findings

- constants.js: lazy process.cwd() default (z.string().default(() => process.cwd()))
  prevents stale value when cwd changes after module load
- constants.js: add concurrencyMax (50) to constraints + enforce in concurrencySchema
- constants.js: remove JSDoc from internal constants (not public API)
- constants.test.js: add full aiTestOptionsSchema coverage (valid input, missing
  filePath, empty filePath, invalid agent, lazy cwd default, optional agentConfigPath)
- constants.test.js: add concurrencySchema upper-bound tests
- ai-errors.test.js: replace for..of loops with test.each (one named test per case)
- ai-errors.test.js: expand createError integration to cover two error types
- ai-errors.test.js: replace typeof handleAIErrors check with behavioral routing tests
- ai-errors.js: remove forward-reference comment (extraction-parser.js not yet in scope)
- eslint.config.js: Object.assign -> spread operator

Co-authored-by: Cursor <cursoragent@cursor.com>
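
The lazy cwd default in the commit above relies on Zod accepting a function as a default value. The same idea can be shown in plain JavaScript without the dependency. The `eagerDefaults`, `lazyDefaults`, and `resolveCwd` names are hypothetical and exist only for this sketch.

```javascript
// Eager: the value is captured once, when this module is first evaluated.
const eagerDefaults = { cwd: process.cwd() };

// Lazy: the function is called at use time, so it sees later chdir() calls.
const lazyDefaults = { cwd: () => process.cwd() };

// Hypothetical resolver: an explicit cwd wins, otherwise ask the lazy default.
const resolveCwd = ({ cwd } = {}) => cwd ?? lazyDefaults.cwd();

// Simulate the working directory changing after module load.
process.chdir('..');

// The eager value is now stale (unless the process started at the filesystem root).
console.log(eagerDefaults.cwd === process.cwd());
// The lazy value tracks the current directory.
console.log(resolveCwd() === process.cwd());
```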

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
… 2/7] (#408)

* feat(ai): Utilities — Debug Logger, Concurrency Limiter, TAP YAML [PR 2/7]

- Add createDebugLogger: console + file logging with buffer/flush
- Add limitConcurrency: sliding-window async concurrency limiter
- Add parseTAPYAML: parse judge agent TAP YAML diagnostic blocks
- Add limit-concurrency.test.js (missing from PR #394)
- Apply js.mdc cleanup: flush loop → single write, for-of → reduce pipeline
- Replace @paralleldrive/cuid2 (not in deps) with mkdtempSync in debug-logger.test.js

Co-authored-by: Cursor <cursoragent@cursor.com>
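
For readers unfamiliar with the pattern, a sliding-window concurrency limiter can be sketched as below. This is not the code from limit-concurrency.js: the `(tasks, limit)` signature is an assumption, and the RangeError guard mirrors a later review fix mentioned further down this thread.

```javascript
// Runs task factories with at most `limit` in flight at once.
// Results are returned in input order, regardless of completion order.
const limitConcurrency = async (tasks, limit) => {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError(`limit must be a positive integer, got ${limit}`);
  }
  const results = new Array(tasks.length);
  let next = 0;
  // Each worker claims the next unstarted task as soon as its current one
  // settles, so the window slides instead of waiting for a whole batch.
  const worker = async () => {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  };
  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
};

// Usage: three timed tasks, two at a time.
const delay = (ms, value) => () =>
  new Promise((resolve) => setTimeout(() => resolve(value), ms));

limitConcurrency([delay(30, 'a'), delay(10, 'b'), delay(20, 'c')], 2).then(
  (results) => console.log(results) // resolves to ['a', 'b', 'c']
);
```

In this sketch a rejected task rejects the returned promise immediately, but tasks already in flight are not cancelled and the other workers keep pulling work.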

* refactor(ai): apply PR 2 review suggestions

- Collapse formatMessage to concise arrow expression
- Add comment to limit-concurrency for-of loop (justified async pattern)
- Add flush no-op test when logFile is not configured
- Use vi.useFakeTimers() in concurrency-cap test for determinism

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Add agent-parser, extraction-parser, aggregation, and execute-agent
modules with full unit test coverage.

- agent-parser: parseStringResult, parseOpenCodeNDJSON, unwrapEnvelope
  (new shared export), unwrapAgentResult. Shared unwrapEnvelope breaks
  duplication between agent-parser and execute-agent (WIP fix #9).
- extraction-parser: parseExtractionResult with multi-strategy JSON
  parsing (direct, markdown fence, pre-parsed object), and
  resolveImportPaths for prompt file resolution.
- aggregation: normalizeJudgment, calculateRequiredPasses,
  aggregatePerAssertionResults with Zod validation.
- execute-agent: extracted from ai-runner.js to break the circular
  dependency (ai-runner ↔ test-extractor). Logger injected at
  executeAgent call site rather than created inside spawnProcess
  (WIP fix #8). Uses shared unwrapEnvelope from agent-parser.
- Test files use test.each for all table-driven cases per convention.

164 tests pass, 0 lint errors, TypeScript checks pass.

Co-authored-by: Cursor <cursoragent@cursor.com>
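
The multi-strategy JSON parsing mentioned for extraction-parser (direct, markdown fence, pre-parsed object) might look roughly like the sketch below. The `parseLooseJson` and `tryParse` names are hypothetical, and the real parseExtractionResult wraps failures in the framework's own error types rather than throwing a bare SyntaxError.

```javascript
// Attempt JSON.parse without throwing; report success explicitly.
const tryParse = (text) => {
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch {
    return { ok: false };
  }
};

// Matches a markdown code fence with an optional "json" language tag.
// The fence string is built with repeat() so this example can itself
// sit inside a fenced block.
const fence = '`'.repeat(3);
const fencePattern = new RegExp(fence + '(?:json)?\\s*([\\s\\S]*?)' + fence);

const parseLooseJson = (output) => {
  // Strategy 1: an earlier step already parsed the output.
  if (output !== null && typeof output === 'object') return output;
  if (typeof output !== 'string') {
    throw new TypeError('Expected a string or an object');
  }
  // Strategy 2: the whole string is valid JSON.
  const direct = tryParse(output);
  if (direct.ok) return direct.value;
  // Strategy 3: the JSON is wrapped in a markdown code fence.
  const match = output.match(fencePattern);
  const fenced = match ? tryParse(match[1]) : { ok: false };
  if (fenced.ok) return fenced.value;
  throw new SyntaxError('No parseable JSON found in agent output');
};

const reply = 'Here are the tests:\n' + fence + 'json\n[{"id": 1}]\n' + fence;
console.log(parseLooseJson(reply));
```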

* fix(ai): address PR 3 code review findings

- aggregation.js: validate once in aggregatePerAssertionResults — capture
  the Zod-validated result and compute Math.ceil inline, eliminating the
  redundant second schema parse inside calculateRequiredPasses
- aggregation.js: remove misleading optional chaining (raw?.passed etc.)
  after the null-guard throw; use plain property access
- agent-parser.js: replace acc.push() with [...acc, text] in reduce
  accumulator to prefer immutability per JS style guide
- agent-parser.test.js: drop redundant "parsed object:" prefix from
  unwrapEnvelope test.each given fields; remove duplicate standalone
  "no result key" test that overlapped with test.each row
- aggregation.test.js: remove redundant export-existence assertion for
  normalizeJudgment; add empty perAssertionResults edge case (vacuous
  truth — every() on [] returns true)
- execute-agent.test.js: strengthen parseOutput test to verify stdout
  and logger are threaded through as expected (documents WIP fix #8)

Co-authored-by: Cursor <cursoragent@cursor.com>
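
The threshold arithmetic and the empty-input edge case from this commit can be sketched as follows. Two things here are assumptions and not taken from aggregation.js: that the threshold is a percentage from 0 to 100, and the `{ requirement, runResults }` input shape. The Zod validation step is omitted.

```javascript
// Assumption: threshold is a percentage and each assertion carries one
// { passed } entry per run.
const aggregate = ({ perAssertionResults, runs, threshold }) => {
  // Round up so that, for example, 75% of 3 runs requires all 3 to pass.
  const requiredPasses = Math.ceil((runs * threshold) / 100);
  const assertions = perAssertionResults.map(({ requirement, runResults }) => {
    const passCount = runResults.filter(({ passed }) => passed).length;
    return { requirement, passCount, passed: passCount >= requiredPasses };
  });
  return {
    requiredPasses,
    // every() on an empty array is true, so zero assertions counts as a pass.
    passed: assertions.every(({ passed }) => passed),
    assertions,
  };
};

console.log(
  aggregate({
    runs: 4,
    threshold: 75,
    perAssertionResults: [
      {
        requirement: 'should greet the user',
        runResults: [{ passed: true }, { passed: true }, { passed: false }, { passed: true }],
      },
    ],
  })
);
```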

* fix(ai): address PR 3 author review findings

- aggregation.js: rename `raw` param to `judgeResponse` and
  fold into single options object for normalizeJudgment; removes
  the two-argument signature (breaking change, callers updated)
- aggregation.js: remove calculateRequiredPasses — math is inlined
  in aggregatePerAssertionResults, eliminating double schema parse
- aggregation.test.js: remove calculateRequiredPasses describe block;
  fix Try() usage (direct fn ref, not arrow wrapper); update all
  normalizeJudgment call sites to new single-options signature
- execute-agent.js: extract magic number 500 to maxOutputPreviewLength
  constant (camelCase per javascript.mdc); applied to all 3 truncation sites
- execute-agent.test.js: replace try/catch antipatterns with await Try();
  add Try import from riteway.js
- extraction-parser.test.js: strengthen weak typeof assertions to check
  specific fields; strengthen cause !== undefined to cause.name === SyntaxError

151 tests pass, 0 lint errors, TypeScript clean.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(ai): address PR 3 follow-up review findings

- constants.js: rename calculateRequiredPassesSchema to
  aggregationParamsSchema — name now reflects what the schema
  validates (aggregation input params) rather than the deleted
  calculateRequiredPasses function; update all import sites
- aggregation.test.js: add 6 missing Zod validation edge cases
  for aggregatePerAssertionResults (zero runs, negative runs,
  non-integer runs, NaN runs, negative threshold, NaN threshold)
  — coverage gap introduced when calculateRequiredPasses and its
  tests were removed; all cases now exercised via
  aggregatePerAssertionResults test.each

157 tests pass, 0 lint errors, TypeScript clean.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(test): complete PR review remediation

🐛 - Remove weak instanceof Error assertions

🔄 - Add threshold calculation verification tests

Tests now verify threshold-based pass/fail logic directly

164 tests passing, 0 lint errors, TypeScript clean

Co-authored-by: Ian White <ian.white.developer@gmail.com>

* fix(ai): remove implementation detail from test

- execute-agent.test.js: remove logger type assertion from
  parseOutput test — typeof checks violate tdd.mdc:64 and
  logger threading is an implementation detail; the three
  remaining assertions (call count, stdout arg, parsed result)
  collectively verify correct integration

164 tests pass, 0 lint errors, TypeScript clean.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
- test(ai-errors): remove error-causes API tests; keep only
  handleAIErrors behavioral routing (ericelliott/janhesters #407)
- test(constants): remove defaults/constraints value-only blocks;
  replace tautological expected: defaults.X with literals (ericelliott #407)
- fix(debug-logger): rename writeToFile→bufferEntry, process→logProcess
  export; add logFile type guard; circular ref safety in formatMessage;
  command() rest params; improved JSDoc (janhesters #408)
- test(debug-logger): onTestFinished for all teardown; add circular
  ref and logFile TypeError tests; flush no-op debug:false (janhesters #408)
- fix(limit-concurrency): guard non-positive limit with RangeError;
  onTestFinished for fake timer teardown; document fail-fast (janhesters #408)
- test(agent-parser): replace partial assertions with full expected
  values including ndjsonLength (janhesters #409)
- test(extraction-parser): replace 4x multi-assert blocks with single
  full-object assertions (janhesters #409)

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(agent-parser): use full expected values

- Replace JSON.stringify comparisons with direct object assertions
- Collapse 4 partial error.cause assertions into single full-object
  assert in parseOpenCodeNDJSON error test
- Expand partial error.cause?.name assertion to full cause object
  in unwrapAgentResult error test

Addresses Jan's PR #409 comment: deterministic functions should assert
the complete expected value, not individual properties.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(ai): replace partial assertions with full expected values

Per Jan's review: deterministic functions should assert the complete
expected value, not individual properties.

- extraction-parser: collapse ExtractionParseError and
  ExtractionValidationError cause assertions to full objects;
  comment .name usage (SyntaxError sets it as own property)
- tap-yaml: consolidate per-property result asserts to full objects;
  collapse error cause to single full-object assert; remove redundant
  typeof score check
- execute-agent: collapse AgentProcessError, TimeoutError, ParseError
  cause assertions to full objects; comment 3-deep .cause chain
- aggregation: collapse ValidationError and ParseError test.each cause
  assertions to full objects; comment .constructor.name (ZodError
  does not set .name as own property); remove standalone
  normalizeJudgment ParseError test made redundant by test.each

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
- agent-config: getAgentConfig() for claude/opencode/cursor agents
- agent-config: loadAgentConfig() reads + validates JSON config files
- validation: validateFilePath() guards against path traversal
- validation: verifyAgentAuthentication() smoke-tests agent availability
- fixtures: test-agent-config.json, invalid-agent-config.txt, no-command-agent-config.json

WIP fixes applied:
- #2: replace verbose schema JSDoc with single-line YAGNI comment
- #10: add --trust flag to cursor agent args for non-interactive execution

182 tests passing (19 new: 10 agent-config + 9 validation).

Co-authored-by: Cursor <cursoragent@cursor.com>
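
A path traversal guard of the kind validateFilePath() provides can be sketched with Node's path module. The `(filePath, baseDir)` signature and the plain Error are assumptions; the real function presumably raises one of the framework's named error types.

```javascript
const path = require('node:path');

// Resolve the candidate against the base directory, then reject anything
// whose resolved location falls outside that directory.
const validateFilePath = (filePath, baseDir) => {
  const root = path.resolve(baseDir);
  const resolved = path.resolve(root, filePath);
  const relative = path.relative(root, resolved);
  const escapes =
    relative === '..' ||
    relative.startsWith('..' + path.sep) ||
    path.isAbsolute(relative); // e.g. a different drive on Windows
  if (escapes) {
    throw new Error(`Path escapes base directory: ${filePath}`);
  }
  return resolved;
};

console.log(validateFilePath('tests/example.sudo', '/project'));
```

Comparing on the relative path instead of a string prefix avoids accepting sibling directories such as /project-other, and checking for `'..' + path.sep` avoids rejecting a legitimate file whose name merely begins with two dots.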

* fix(ai): address PR 4 code review findings

- add direct unit tests for formatZodError (4 cases, both code paths)
- simplify parseJson: remove unnecessary currying → plain two-arg fn
- remove spurious await on synchronous parseJson call
- convert multi-line string concat to template literal in validation.js
- rename misleading test: 'uses default timeout' → 'succeeds without explicit timeout argument'

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
- ai-errors: export allNoop helper for exhaustive handleAIErrors tests
- extraction-parser: spread ...ValidationError instead of bare name string
- tap-yaml: spread ...ParseError instead of bare name string
- constants.test: replace safeParse leakage with parse()+Try(); trim to behavioral tests only
- aggregation.test: full-object assertions; fix duplicate test.each labels
- extraction-parser.test: add resolveImportPaths success/error tests; non-object branch test
- execute-agent.test: add spawn failure, malformed JSON fallback, ParseError envelope tests
- ai-errors.test.js: removed; handleAIErrors routing
  already covered by agent-parser, aggregation,
  extraction-parser, and tap-yaml test suites

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
ianwhitedeveloper force-pushed the ai-testing-framework-implementation-consolidation branch from a96a49d to 6c81837 on February 25, 2026 20:39
ianwhitedeveloper and others added 2 commits February 26, 2026 11:57
debug/logFile/logger params were never in the formal requirements, never
exposed via the CLI, and never tested end-to-end. logFile was UAT
scaffolding already broken in two places. Removing the abstraction
simplifies every public signature and eliminates logger threading.

- Delete debug-logger.js and debug-logger.test.js (−417 lines)
- Drop debug/logFile/logger params from execute-agent, agent-parser,
  aggregation, extraction-parser, validation public signatures
- Convert user-visible progress messages to console.log/console.warn
- Delete internal diagnostic noise throughout
- Remove debug/debugLog fields from constants.js defaults and schema
- Extract truncateOutput helper in execute-agent.js (eliminates duplication)
- Convert resolveImportPaths to named params { importPaths, projectRoot }
- Replace manual zodError.issues mapping with z.prettifyError in aggregation
- Full expected-value assertions and allNoop spread pattern in test files

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
- Add AgentConfigReadError, AgentConfigParseError, AgentConfigValidationError to ai-errors.js
- Update agent-config.js to use specific AgentConfig* error types; z.prettifyError() inline
- Update agent-config.test.js to use handleAIErrors routing pattern throughout
- Add test-extractor.js: buildExtractionPrompt, buildResultPrompt, buildJudgePrompt, extractTests
- Add ai-runner.js: runAITests, verifyAgentAuthentication
- Add @paralleldrive/cuid2 dependency for hermetic test temp-dir naming

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
- Add source/test-output.js: formatTAP, recordTestOutput,
  openInBrowser with open package dependency
- Add source/ai-command.js: parseAIArgs, runAICommand,
  formatAssertionReport; remove debug/debugLog params,
  use z.prettifyError instead of formatZodError
- Update bin/riteway.js: add riteway ai <file> subcommand
  with exhaustive handleAIErrors (all 12 error types)
- Add tests for all new modules (42 new tests)
- Fix tap-yaml.js TS: JSDoc cast on reduce initial value
- Remove formatMedia dead code (WIP #4/#5)
- Remove generateLogFilePath (debug-logger removed in PR 5)

Made-with: Cursor
ianwhitedeveloper changed the title from "feat(ai): AI Testing Framework — consolidation staging branch [0/7 → master]" to "feat(ai): AI Testing Framework — consolidation staging branch [6/8 → master]" on Mar 4, 2026