[#985][feat][eval] Add per-phase time tracking and planning timeout #987

Open
ayazhankadessova wants to merge 12 commits into main from issue-985

@ayazhankadessova
Contributor

Summary

  • Add planning_time and impl_time per-task fields to eval harness results
  • Add --planning-timeout CLI flag (default 600s) replacing hardcoded timeout // 2
  • New planning_timeout status when planning exceeds the configured limit
  • Extend aggregate_metrics with planning_time_total/mean, impl_time_total/mean, planning_timeouts
  • Phase timing breakdown in _print_summary output
  • New python/agentize/eval/README.md folder description
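
A minimal sketch of how the new aggregate fields could be computed from per-task results. This is not the harness's actual aggregate_metrics: the function name aggregate_phase_timing is hypothetical, the output key names are inferred from the summary above, and the sketch assumes each result is a dict whose timing fields may be absent (treated as 0.0).

```python
def aggregate_phase_timing(results):
    """Aggregate per-phase timing over a list of per-task result dicts.

    Hypothetical helper: key names follow the PR summary, and missing
    timing fields are assumed to count as 0.0 (e.g. raw/impl modes).
    """
    planning = [float(r.get("planning_time", 0.0)) for r in results]
    impl = [float(r.get("impl_time", 0.0)) for r in results]
    n = len(results)
    return {
        "planning_time_total": sum(planning),
        "planning_time_mean": sum(planning) / n if n else 0.0,
        "impl_time_total": sum(impl),
        "impl_time_mean": sum(impl) / n if n else 0.0,
        # Count tasks whose planning phase exceeded the configured limit.
        "planning_timeouts": sum(
            1 for r in results if r.get("status") == "planning_timeout"
        ),
    }
```

Means here are taken over all tasks, including those with no planning phase; whether the real harness averages over all tasks or only planned ones is not stated in this PR.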

Test plan

  • 68/68 Python tests pass
  • New tests: timing field defaults, planning_timeout status, phase timing aggregation
  • Updated existing nlcmd timeout test for new planning_timeout status
  • Backward compatible: raw/impl modes have planning_time=0.0

Closes #985

🤖 Generated with Claude Code

Ayazhan Kadessova added 3 commits April 6, 2026 20:14
…EADME

- python/agentize/eval/eval_harness.md: Document --planning-timeout flag
  and planning_timeout status for full/nlcmd modes
- python/agentize/eval/README.md: New folder README describing eval contents
- python/tests/test_eval_harness.md: Expand scope to include planning-timeout
  and timing fields

Related: #985
- TestMakeResult: Assert planning_time/impl_time defaults
- TestAggregateMetricsPhaseTimming: planning/impl time aggregation,
  planning_timeouts counter, zero timing when fields absent
- TestNlcmdImpl: planning_timeout status, planning_time recording

Tests expected to fail until implementation is complete.

Related: #985
- eval_harness.py: Add planning_time/impl_time to _make_result, record
  phase boundaries in run_nlcmd_impl (planning_start/impl_start) and
  run_full_impl (timing_bucket). Add --planning-timeout CLI flag (default
  600s) replacing hardcoded timeout//2. New planning_timeout status when
  planning exceeds limit. Extend aggregate_metrics with planning_time_*,
  impl_time_*, and planning_timeouts. Update _print_summary to show
  phase timing breakdown.
- test_eval_harness.py: Add TestMakeResult timing fields, planning_timeout
  status test, planning_time recording test, aggregate metrics phase timing
  tests. Update existing timeout test for new planning_timeout status.

All 68 tests pass.

Related: #985
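
For illustration, the --planning-timeout flag described in this commit could be declared with argparse roughly as follows. This is a sketch under assumptions: build_parser is a hypothetical name, the int type is a guess, and the harness's other flags are omitted.

```python
import argparse


def build_parser():
    """Build a parser with only the flag discussed here (illustrative)."""
    parser = argparse.ArgumentParser(description="eval harness (sketch)")
    parser.add_argument(
        "--planning-timeout",
        type=int,       # assumption: whole seconds
        default=600,    # default stated in the PR
        help="Seconds allowed for the planning phase before it is cut off",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # Previously the planning limit was derived as `timeout // 2`;
    # with the flag it is an independent, explicit value.
    print(f"planning timeout: {args.planning_timeout}s")
```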
ayazhankadessova added the agentize:pr label (PR created by agentize) on Apr 6, 2026
Ayazhan Kadessova added 9 commits April 7, 2026 03:25
- Pass timing_bucket list into _run_full_impl_body to record planning_time
  at the split point (after run_planning_phase, before FSM)
- Extract planning_time from bucket after thread completes, derive impl_time
- Show per-task phase breakdown in output: "Time: 400s (plan: 300s, impl: 100s)"
- Planning/impl timing now visible in both per-task output and summary

Related: #985
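
The timing_bucket approach in the commit above can be sketched as a mutable list shared between the caller and the worker thread. Only timing_bucket and the output format come from the commit message; _body, run_with_phase_timing, and format_time are hypothetical names, and plan and implement stand in for the real planning phase and FSM.

```python
import threading
import time


def _body(timing_bucket, plan, implement):
    """Worker body: record planning time at the split point, then implement."""
    planning_start = time.monotonic()
    plan()
    # Split point: after planning, before the implementation phase.
    timing_bucket.append(time.monotonic() - planning_start)
    implement()


def run_with_phase_timing(plan, implement):
    """Run both phases in a thread and return (total, planning, impl) seconds."""
    timing_bucket = []
    start = time.monotonic()
    worker = threading.Thread(target=_body, args=(timing_bucket, plan, implement))
    worker.start()
    worker.join()
    total = time.monotonic() - start
    # Extract planning time from the bucket; derive impl time from the rest.
    planning_time = timing_bucket[0] if timing_bucket else 0.0
    impl_time = max(total - planning_time, 0.0)
    return total, planning_time, impl_time


def format_time(total, planning_time, impl_time):
    """Render the per-task breakdown in the format quoted in the commit."""
    return f"Time: {total:.0f}s (plan: {planning_time:.0f}s, impl: {impl_time:.0f}s)"
```

Passing a list rather than returning a value lets the caller still read the planning time if the worker is abandoned after an overall timeout, which is presumably why a bucket is used here.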
- Wrap run_planning_phase() in a daemon sub-thread inside
  _run_full_impl_body with planning_timeout join deadline
- If planning exceeds the limit, fall back to raw problem statement
  and continue to impl phase (still records planning_time)
- Pass planning_timeout through run_full_impl → _run_full_impl_body
- Previously --planning-timeout only worked for nlcmd mode

Related: #985
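
A minimal sketch of the daemon sub-thread pattern this commit describes: planning runs in a daemon thread, the caller joins with a deadline, and on timeout the raw problem statement is used instead of a plan. The name plan_with_timeout and its signature are hypothetical; run_planning stands in for run_planning_phase.

```python
import threading
import time


def plan_with_timeout(run_planning, problem_statement, planning_timeout):
    """Return (plan_text, planning_time, timed_out).

    If planning does not finish within `planning_timeout` seconds, fall
    back to the raw problem statement and still report the elapsed time.
    """
    holder = {}

    def _target():
        holder["plan"] = run_planning(problem_statement)

    start = time.monotonic()
    # Daemon thread: a stuck planner cannot keep the process alive at exit.
    thread = threading.Thread(target=_target, daemon=True)
    thread.start()
    thread.join(timeout=planning_timeout)
    planning_time = time.monotonic() - start

    timed_out = thread.is_alive()
    if timed_out:
        return problem_statement, planning_time, True
    # Planner finished (or raised): use its plan if one was produced.
    return holder.get("plan", problem_statement), planning_time, False
```

Note that a Python thread cannot be forcibly killed, so a timed-out planner keeps running in the background until it returns or the process exits; the join deadline only bounds how long the caller waits.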
- Write full result dicts (including planning_time, impl_time, cost_usd)
  to results.jsonl alongside predictions.jsonl and metrics.json
- Incremental save after each task (survives crashes)
- Enables post-run analysis of per-task phase timing without re-running

Related: #985
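
The incremental save described above can be sketched as an append-per-task write to a JSON Lines file. The helper names append_result and load_results are hypothetical, and the example result keys beyond planning_time, impl_time, and cost_usd are assumptions.

```python
import json
from pathlib import Path


def append_result(path, result):
    """Append one task's full result dict as a single JSON line."""
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(result) + "\n")
        f.flush()  # push to the OS promptly so a later crash loses little


def load_results(path):
    """Read all results back for post-run analysis."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because each task is written as its own line immediately after it finishes, a crash mid-run leaves every completed task's record readable, and per-task phase timing can be analysed without re-running the evaluation.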
- Per-task planning vs impl time table (5 tasks, --planning-timeout 600)
- Planning accounts for 51-93% of total time (mean 73%)
- 2/5 tasks hit 600s planning timeout
- Per-agent timing ranges from logs

Related: #985
… nlcmd and codex

- Add raw opus timing breakdown (0% planning, all impl)
- Restructure as 4 sections: raw opus, full, nlcmd (TBD), codex impl (TBD)
- Key observations section comparing across modes

Related: #985
- nlcmd: 2/5 tasks, all hit 600s planning timeout (partial plans found)
- codex impl: 2/5 tasks, 1-2 iterations (completion marker ~70% reliable)
- Note probabilistic Codex completion behavior

Related: #985
- Raw opus: $16.77, 23 min, 10/20 resolved (same rate as sonnet, pending scoring)

Related: #985
Labels

agentize:pr PR created by agentize

Development

Successfully merging this pull request may close these issues.

[plan][eval] Add per-phase time tracking and planning timeout to eval harness

2 participants