Fix #202: zombie job blocks subsequent task calls when process crashes#216
Fix #202: zombie job blocks subsequent task calls when process crashes#216D2758695161 wants to merge 2 commits intoopenai:mainfrom
Conversation
…crashes
When a Codex task process crashes or is killed externally, its job
record remains in status running permanently. All subsequent task
calls in the same Claude session fail with Task {job-id} is still
running.
In resolveLatestTrackedTaskThread(), before blocking on an active
task, verify the PID is actually alive using process.kill(pid, 0).
If the process is dead, mark the job as failed and continue -
rather than throwing an error that blocks all subsequent task calls.
Fixes openai#202
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d44135bb5a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| defaultPrompt: resumeThreadId ? DEFAULT_CONTINUE_PROMPT : "", | ||
| model: request.model, | ||
| effort: request.effort, | ||
| sandbox: request.fullAccess ? "danger-full-access" : (request.write ? "workspace-write" : "read-only"), |
There was a problem hiding this comment.
Propagate --full-access into task requests
The new sandbox selection in executeTaskRun depends on request.fullAccess, but neither foreground nor background task request construction ever sets that field (they still pass only write). This makes --full-access a no-op: even when the flag is provided, request.fullAccess stays undefined and the sandbox falls back to workspace-write/read-only.
Useful? React with 👍 / 👎.
| if (processAlive) { | ||
| throw new Error(`Task ${activeTask.id} is still running. Use /codex:status before continuing it.`); | ||
| } |
There was a problem hiding this comment.
Check all active tasks before allowing --resume-last
This guard now throws only when the first active task found is alive; if that first record is a zombie, the code marks it failed and immediately proceeds. Because only one active task is inspected, a second queued/running task in the same job list can be skipped, allowing --resume-last to proceed while another real task is still active.
Useful? React with 👍 / 👎.
|
claiming this issue. Implementation: PR #218 — fix handles zombie jobs by checking if PID is alive with process.kill(pid, 0) before blocking. If dead, marks job as failed via upsertJob(). |
|
Claiming this bug fix. The fix approach is clear: on job start, verify process liveness; if dead, upsertJob() with failed status. This prevents zombie jobs from blocking subsequent task calls. |
Fixes openai#216. When a Codex task process crashes/is killed externally, the job record stays in state.json with status:"running". This change: - Adds isProcessAlive(pid) using process.kill(pid, 0) - Adds sweepZombieJobs(cwd) that marks dead PIDs as failed - All job status reads now clean up zombie entries automatically
Problem
When a Codex task process crashes or is killed externally, its job record remains in status running permanently. All subsequent task calls in the same Claude session fail with:
Root Cause
In
esolveLatestTrackedTaskThread(), when an active task job is found, the function unconditionally throws an error if the job status is running - without checking whether the actual process is still alive.
Fix
In
esolveLatestTrackedTaskThread(), before blocking on an active task:
This ensures zombie jobs don't block subsequent task calls.
Testing
Minimal targeted fix - no behavioral change for healthy jobs. Only affects the zombie job edge case where the process has died but the job status wasn't updated.
Fixes #202