[#985][feat][eval] Add per-phase time tracking and planning timeout #987
Open
ayazhankadessova wants to merge 12 commits into main
added 3 commits
April 6, 2026 20:14
…EADME
- python/agentize/eval/eval_harness.md: Document --planning-timeout flag and planning_timeout status for full/nlcmd modes
- python/agentize/eval/README.md: New folder README describing eval contents
- python/tests/test_eval_harness.md: Expand scope to include planning-timeout and timing fields
Related: #985
- TestMakeResult: Assert planning_time/impl_time defaults
- TestAggregateMetricsPhaseTimming: planning/impl time aggregation, planning_timeouts counter, zero timing when fields absent
- TestNlcmdImpl: planning_timeout status, planning_time recording
Tests expected to fail until implementation is complete.
Related: #985
- eval_harness.py: Add planning_time/impl_time to _make_result, record phase boundaries in run_nlcmd_impl (planning_start/impl_start) and run_full_impl (timing_bucket). Add --planning-timeout CLI flag (default 600s) replacing hardcoded timeout//2. New planning_timeout status when planning exceeds limit. Extend aggregate_metrics with planning_time_*, impl_time_*, and planning_timeouts. Update _print_summary to show phase timing breakdown.
- test_eval_harness.py: Add TestMakeResult timing fields, planning_timeout status test, planning_time recording test, aggregate metrics phase timing tests. Update existing timeout test for new planning_timeout status.
All 68 tests pass.
Related: #985
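For readers skimming the timeline, here is a minimal sketch of the aggregation these commits describe. It is not the code from eval_harness.py: the function name aggregate_phase_timing, the "completed" status, and the sample values are assumptions for illustration. Only the metric names (planning_time_total/mean, impl_time_total/mean, planning_timeouts) and the "zero timing when fields absent" behaviour come from the commit messages.

```python
from statistics import mean


def aggregate_phase_timing(results):
    """Summarize per-phase timing; results without timing fields count as zero."""
    planning = [r.get("planning_time", 0.0) for r in results]
    impl = [r.get("impl_time", 0.0) for r in results]
    return {
        "planning_time_total": sum(planning),
        "planning_time_mean": mean(planning) if planning else 0.0,
        "impl_time_total": sum(impl),
        "impl_time_mean": mean(impl) if impl else 0.0,
        # Count tasks whose planning phase exceeded the configured limit.
        "planning_timeouts": sum(
            1 for r in results if r.get("status") == "planning_timeout"
        ),
    }


# Made-up sample results (illustrative values only).
results = [
    {"task_id": "task-a", "status": "completed", "planning_time": 300.0, "impl_time": 100.0},
    {"task_id": "task-b", "status": "planning_timeout", "planning_time": 600.0, "impl_time": 0.0},
    {"task_id": "task-c", "status": "completed"},  # older result without timing fields
]
summary = aggregate_phase_timing(results)
```

Reading the fields with .get() and a 0.0 default lets results written before this change aggregate without errors, at the cost of pulling the means down when such results are mixed in.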
added 9 commits
April 7, 2026 03:25
- Pass timing_bucket list into _run_full_impl_body to record planning_time at the split point (after run_planning_phase, before FSM)
- Extract planning_time from bucket after thread completes, derive impl_time
- Show per-task phase breakdown in output: "Time: 400s (plan: 300s, impl: 100s)"
- Planning/impl timing now visible in both per-task output and summary
Related: #985
- Wrap run_planning_phase() in a daemon sub-thread inside _run_full_impl_body with planning_timeout join deadline
- If planning exceeds the limit, fall back to raw problem statement and continue to impl phase (still records planning_time)
- Pass planning_timeout through run_full_impl → _run_full_impl_body
- Previously --planning-timeout only worked for nlcmd mode
Related: #985
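The pattern in this commit (a daemon sub-thread joined with a deadline, falling back to the raw problem statement) can be sketched roughly as below. This is an illustration under assumptions, not the PR's code: plan_with_deadline, slow_plan, and fast_plan are hypothetical names, and the real run_planning_phase() signature is not shown in this thread.

```python
import threading
import time


def plan_with_deadline(plan_fn, problem_statement, planning_timeout):
    """Run plan_fn in a daemon thread and wait at most planning_timeout seconds.

    Returns (prompt, planning_time, timed_out). On timeout (or if planning
    produced nothing) the raw problem statement is used as the prompt.
    """
    bucket = []  # the worker appends its plan here if it finishes

    def worker():
        bucket.append(plan_fn(problem_statement))

    start = time.monotonic()
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout=planning_timeout)  # returns early if planning finishes
    planning_time = time.monotonic() - start  # recorded even on timeout

    timed_out = thread.is_alive()
    if timed_out or not bucket:
        return problem_statement, planning_time, timed_out
    return bucket[0], planning_time, False


def slow_plan(problem):  # hypothetical planner that overruns the deadline
    time.sleep(0.5)
    return "plan for " + problem


def fast_plan(problem):  # hypothetical planner that finishes immediately
    return "plan for " + problem


slow_prompt, slow_time, slow_timed_out = plan_with_deadline(slow_plan, "fix bug", 0.1)
fast_prompt, fast_time, fast_timed_out = plan_with_deadline(fast_plan, "fix bug", 5.0)
```

One caveat with this approach: Python cannot forcibly stop a thread, so after the deadline the planning thread keeps running in the background until it returns or the process exits. Marking it as a daemon only ensures it does not block interpreter shutdown.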
- Write full result dicts (including planning_time, impl_time, cost_usd) to results.jsonl alongside predictions.jsonl and metrics.json
- Incremental save after each task (survives crashes)
- Enables post-run analysis of per-task phase timing without re-running
Related: #985
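A minimal sketch of the incremental-save idea, assuming one JSON object per line appended after each task. The helper names append_result and load_results, the temporary directory, and the sample values are hypothetical; only the results.jsonl file name and the planning_time, impl_time, and cost_usd fields come from the commit message.

```python
import json
import os
import tempfile


def append_result(path, result):
    """Append one result dict as a JSON line and push it to disk right away."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(result) + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # ask the OS to persist the line before moving on


def load_results(path):
    """Read results back for post-run analysis, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


out_dir = tempfile.mkdtemp()
results_path = os.path.join(out_dir, "results.jsonl")

# Made-up sample results (illustrative values only).
for result in (
    {"task_id": "task-a", "planning_time": 300.0, "impl_time": 100.0, "cost_usd": 1.25},
    {"task_id": "task-b", "planning_time": 600.0, "impl_time": 40.0, "cost_usd": 0.8},
):
    append_result(results_path, result)  # saved after each task, not at the end

loaded = load_results(results_path)
```

Because each task is appended as its own line, a crash partway through a run leaves every previously completed task readable in the file.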
- Per-task planning vs impl time table (5 tasks, --planning-timeout 600)
- Planning accounts for 51-93% of total time (mean 73%)
- 2/5 tasks hit 600s planning timeout
- Per-agent timing ranges from logs
Related: #985
… nlcmd and codex
- Add raw opus timing breakdown (0% planning, all impl)
- Restructure as 4 sections: raw opus, full, nlcmd (TBD), codex impl (TBD)
- Key observations section comparing across modes
Related: #985
- nlcmd: 2/5 tasks, all hit 600s planning timeout (partial plans found)
- codex impl: 2/5 tasks, 1-2 iterations (completion marker ~70% reliable)
- Note probabilistic Codex completion behavior
Related: #985
- Raw opus: $16.77, 23 min, 10/20 resolved (same rate as sonnet, pending scoring)
Related: #985
…ch scoring Related: #985
…dex API crashes Related: #985
Summary
- Add planning_time and impl_time per-task fields to eval harness results
- Add --planning-timeout CLI flag (default 600s) replacing hardcoded timeout // 2
- New planning_timeout status when planning exceeds the configured limit
- Extend aggregate_metrics with planning_time_total/mean, impl_time_total/mean, planning_timeouts
- Phase timing breakdown in _print_summary output
- New python/agentize/eval/README.md folder description
Test plan
Closes #985
🤖 Generated with Claude Code