[#985][feat][eval] Add per-phase time tracking and planning timeout by ayazhankadessova · Pull Request #987 · Synthesys-Lab/agentize

ayazhankadessova · 2026-04-06T20:19:36Z

Summary

Add planning_time and impl_time per-task fields to eval harness results
Add --planning-timeout CLI flag (default 600s) replacing hardcoded timeout // 2
New planning_timeout status when planning exceeds the configured limit
Extend aggregate_metrics with planning_time_total/mean, impl_time_total/mean, planning_timeouts
Phase timing breakdown in _print_summary output
New python/agentize/eval/README.md folder description

Test plan

68/68 Python tests pass
New tests: timing field defaults, planning_timeout status, phase timing aggregation
Updated existing nlcmd timeout test for new planning_timeout status
Backward compatible: raw/impl modes have planning_time=0.0

Closes #985

🤖 Generated with Claude Code

…EADME - python/agentize/eval/eval_harness.md: Document --planning-timeout flag and planning_timeout status for full/nlcmd modes - python/agentize/eval/README.md: New folder README describing eval contents - python/tests/test_eval_harness.md: Expand scope to include planning-timeout and timing fields Related: #985

- TestMakeResult: Assert planning_time/impl_time defaults - TestAggregateMetricsPhaseTimming: planning/impl time aggregation, planning_timeouts counter, zero timing when fields absent - TestNlcmdImpl: planning_timeout status, planning_time recording Tests expected to fail until implementation is complete. Related: #985

- eval_harness.py: Add planning_time/impl_time to _make_result, record phase boundaries in run_nlcmd_impl (planning_start/impl_start) and run_full_impl (timing_bucket). Add --planning-timeout CLI flag (default 600s) replacing hardcoded timeout//2. New planning_timeout status when planning exceeds limit. Extend aggregate_metrics with planning_time_*, impl_time_*, and planning_timeouts. Update _print_summary to show phase timing breakdown. - test_eval_harness.py: Add TestMakeResult timing fields, planning_timeout status test, planning_time recording test, aggregate metrics phase timing tests. Update existing timeout test for new planning_timeout status. All 68 tests pass. Related: #985

- Pass timing_bucket list into _run_full_impl_body to record planning_time at the split point (after run_planning_phase, before FSM) - Extract planning_time from bucket after thread completes, derive impl_time - Show per-task phase breakdown in output: "Time: 400s (plan: 300s, impl: 100s)" - Planning/impl timing now visible in both per-task output and summary Related: #985

- Wrap run_planning_phase() in a daemon sub-thread inside _run_full_impl_body with planning_timeout join deadline - If planning exceeds the limit, fall back to raw problem statement and continue to impl phase (still records planning_time) - Pass planning_timeout through run_full_impl → _run_full_impl_body - Previously --planning-timeout only worked for nlcmd mode Related: #985

- Write full result dicts (including planning_time, impl_time, cost_usd) to results.jsonl alongside predictions.jsonl and metrics.json - Incremental save after each task (survives crashes) - Enables post-run analysis of per-task phase timing without re-running Related: #985

- Per-task planning vs impl time table (5 tasks, --planning-timeout 600) - Planning accounts for 51-93% of total time (mean 73%) - 2/5 tasks hit 600s planning timeout - Per-agent timing ranges from logs Related: #985

… nlcmd and codex - Add raw opus timing breakdown (0% planning, all impl) - Restructure as 4 sections: raw opus, full, nlcmd (TBD), codex impl (TBD) - Key observations section comparing across modes Related: #985

- nlcmd: 2/5 tasks, all hit 600s planning timeout (partial plans found) - codex impl: 2/5 tasks, 1-2 iterations (completion marker ~70% reliable) - Note probabilistic Codex completion behavior Related: #985

- Raw opus: $16.77, 23 min, 10/20 resolved (same rate as sonnet, pending scoring) Related: #985

…ch scoring Related: #985

…dex API crashes Related: #985

Ayazhan Kadessova added 3 commits April 6, 2026 20:14

ayazhankadessova added the agentize:pr PR created by agentize label Apr 6, 2026

Ayazhan Kadessova added 9 commits April 7, 2026 03:25

[docs][eval] Add phase timing breakdown from 5-task sample run

44c2f08

- Per-task planning vs impl time table (5 tasks, --planning-timeout 600) - Planning accounts for 51-93% of total time (mean 73%) - 2/5 tasks hit 600s planning timeout - Per-agent timing ranges from logs Related: #985

[docs][eval] Fill nlcmd and codex impl phase timing from partial runs

cec0399

- nlcmd: 2/5 tasks, all hit 600s planning timeout (partial plans found) - codex impl: 2/5 tasks, 1-2 iterations (completion marker ~70% reliable) - Note probabilistic Codex completion behavior Related: #985

[docs][eval] Add raw opus row to resolve rates table

16fbb5d

- Raw opus: $16.77, 23 min, 10/20 resolved (same rate as sonnet, pending scoring) Related: #985

[docs][eval] Update raw opus resolve rate to 11/20 (55%) from SWE-ben…

cd4511d

…ch scoring Related: #985

[docs][eval] Update codex impl 20-task results: 8/20 completed, 12 Co…

40c0448

…dex API crashes Related: #985

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[#985][feat][eval] Add per-phase time tracking and planning timeout#987

[#985][feat][eval] Add per-phase time tracking and planning timeout#987
ayazhankadessova wants to merge 12 commits intomainfrom
issue-985

ayazhankadessova commented Apr 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ayazhankadessova commented Apr 6, 2026

Summary

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants