Skip to content

Improve process tree kill logic#487

Open
scuffi wants to merge 15 commits intomainfrom
fix/process-tree-kill-regression
Open

Improve process tree kill logic#487
scuffi wants to merge 15 commits intomainfrom
fix/process-tree-kill-regression

Conversation

@scuffi
Copy link
Copy Markdown
Contributor

@scuffi scuffi commented Mar 11, 2026

Summary

When a command spawns child processes (e.g. a shell script that forks subshells or long-running background jobs), killing that command previously only signalled the root PID. Child processes were left alive as orphans. This PR fixes that by walking the process tree bottom-up before delivering signals, children are killed before their parents, preventing zombie accumulation and ensuring the full tree is torn down.

Changes

  • session.tskillCommand() now enumerates the full process tree via /proc, signals leaf processes first (SIGTERM → SIGKILL escalation after 5s), and writes an atomic exit code file so execStream() doesn't race against the kill path. Session destroy() uses a fast-path that fires SIGKILL immediately without waiting.
  • process-manager.ts — Exit codes 137 and 143 (signal-style) are now interpreted as 'killed' rather than 'failed'.
  • E2E tests — New tests covering: 3-level deep process tree kill, concurrent bulk kill via kill-all, and process cleanup on session destroy.
  • Unit tests — Coverage for the updated signal exit code interpretation.

Open with Devin

@changeset-bot
Copy link
Copy Markdown

changeset-bot bot commented Mar 11, 2026

🦋 Changeset detected

Latest commit: 4b7dae4

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@cloudflare/sandbox Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@pkg-pr-new
Copy link
Copy Markdown

pkg-pr-new bot commented Mar 11, 2026

Open in StackBlitz

npm i https://pkg.pr.new/cloudflare/sandbox-sdk/@cloudflare/sandbox@487

commit: 4b7dae4

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 11, 2026

🐳 Docker Images Published

Variant Image
Default cloudflare/sandbox:0.0.0-pr-487-4b7dae4
Python cloudflare/sandbox:0.0.0-pr-487-4b7dae4-python
OpenCode cloudflare/sandbox:0.0.0-pr-487-4b7dae4-opencode
Musl cloudflare/sandbox:0.0.0-pr-487-4b7dae4-musl
Desktop cloudflare/sandbox:0.0.0-pr-487-4b7dae4-desktop

Usage:

FROM cloudflare/sandbox:0.0.0-pr-487-4b7dae4

Version: 0.0.0-pr-487-4b7dae4


📦 Standalone Binary

For arbitrary Dockerfiles:

COPY --from=cloudflare/sandbox:0.0.0-pr-487-4b7dae4 /container-server/sandbox /sandbox
ENTRYPOINT ["/sandbox"]

Download via GitHub CLI:

gh run download 23647879167 -n sandbox-binary

Extract from Docker:

docker run --rm cloudflare/sandbox:0.0.0-pr-487-4b7dae4 cat /container-server/sandbox > sandbox && chmod +x sandbox

ask-bonk[bot]

This comment was marked as outdated.

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk bot commented Mar 11, 2026

@scuffi Bonk workflow failed. Check the logs for details.

View workflow run · To retry, trigger Bonk again.

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk bot commented Mar 11, 2026

@scuffi
Copy link
Copy Markdown
Contributor Author

scuffi commented Mar 11, 2026

/bonk re-review this PR

ask-bonk[bot]
ask-bonk bot previously requested changes Mar 11, 2026
Copy link
Copy Markdown
Contributor

@ask-bonk ask-bonk bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Good improvement over the previous single-PID SIGTERM. Tree-killing and zombie detection are the right things to add here. The E2E tests are thorough and well-structured. There are a few issues to resolve before this is mergeable.

1. The exit-code-file mechanism is bypassed on kill — worth documenting

The earlier bonk review called out that waitForExitCode waits for the exit code file, not process exit. That's been addressed — killCommand now polls processExists(pid) instead, which is correct.

However, when SIGTERM kills the tree leaves-first, the command subshell is killed before it writes the exit code file. This means the execStream generator's exit-code-file polling loop will never resolve normally. This doesn't affect killCommand's correctness directly since ProcessService.killProcess() updates the store status to killed regardless. But it's worth a comment in the code noting that the exit-code-file mechanism is bypassed when a process is killed externally.

2. return false from post-SIGKILL verification is too aggressive

After SIGKILL + 5-second wait, getProcessTreePids is called. If it finds any PIDs, killCommand returns false. The processExists method correctly handles zombies now (checking /proc/{pid}/status for Z state), so that part of the bonk review has been addressed.

But returning false causes SessionManager.killCommand to return { success: false }, which causes ProcessService.killProcess to skip updating the store (lines 365-370 of process-service.ts only update on success). The process stays as running in the store even though SIGKILL was sent and it's effectively dead. After SIGKILL, the process is guaranteed to die — return true with a warning log instead.

3. getProcessTreePids double-checks liveness

Descendants are already filtered through processExists at the recursive entry (line 837). Filtering again at line 849 adds redundant /proc reads. A process that was alive at line 837 could die by line 849, but that's benign. Remove the redundant filter or add a comment explaining the intent.

4. SessionManager.killCommand doesn't pass waitForExit

SessionManager.killCommand at line 847 calls session.killCommand(commandId) without forwarding waitForExit. This works because the default is true, but SessionManager is the public API for process killing — any future caller needing waitForExit: false would need plumbing through SessionManager too. Fine for now, just noting it.

5. Missing changeset

This PR changes user-visible behavior (process kill now terminates the full tree instead of only the root PID). This warrants a changeset referencing @cloudflare/sandbox per repo conventions.

Tests

The E2E tests are solid — good use of unique tokens, proper finally blocks, testing multiple tree depths and kill-all scenarios.

  • The "session destroy kills processes" test calls waitForChildPid(pidFile) using readExecStdout with default session headers, while the process was started in sessionHeaders. This works because the pid file is on the shared filesystem, but it's subtle — a brief comment would help.

  • The deep tree test's inner script quoting ('"'"') is correct but hard to read. Consider using a heredoc or double-quoting approach for clarity.

);
this.runningCommands.delete(commandId);
return false;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After SIGKILL, the process is guaranteed by the kernel to die — it's just a matter of scheduling. Returning false here causes ProcessService.killProcess() to skip the store update (it only updates on success: true at process-service.ts:365-370), leaving the process as running in the store even though SIGKILL was sent.

Consider returning true with the warning log:

const pidsAlive = this.getProcessTreePids(pid);
if (pidsAlive.length > 0) {
  this.logger.warn(
    'killCommand: some PIDs still visible after SIGKILL (may be in-flight reap)',
    { commandId, pid, remainingPids: pidsAlive }
  );
}
this.runningCommands.delete(commandId);
return true;


return [
pid,
...descendants.filter((childPid) => this.processExists(childPid))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Descendants returned from recursive getProcessTreePids calls have already passed the processExists check at line 837. This filter adds redundant /proc reads for each descendant. Either remove it or add a comment explaining why the double-check is intentional (e.g., "recheck because process may have exited between recursive call and return").

* @param targetPid - Root process ID
* @param signal - Signal to send
*/
private terminateTree(targetPid: number, signal: NodeJS.Signals): void {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: the JSDoc mentions the TOCTOU limitation clearly — good. One addition worth making: note that readFileSync is intentional here to avoid yielding the event loop mid-tree-walk, which could allow the tree to mutate between reading children of sibling nodes. The synchronous read keeps the snapshot as consistent as possible.

return false;
}
} else {
this.terminateTree(pid, 'SIGKILL');
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When waitForExit is false, only SIGKILL is sent with no verification. This is correct for destroy() (the session is being torn down anyway), but it means runningCommands.delete() and return true happen before the process is confirmed dead. Fine for the destroy path, but worth a comment noting that the fire-and-forget semantics are intentional here.

@cloudflare cloudflare deleted a comment from ask-bonk bot Mar 11, 2026
@scuffi
Copy link
Copy Markdown
Contributor Author

scuffi commented Mar 11, 2026

/bonk re-review this PR

ask-bonk[bot]

This comment was marked as outdated.

ask-bonk[bot]

This comment was marked as resolved.

@ask-bonk

This comment was marked as outdated.

@scuffi scuffi force-pushed the fix/process-tree-kill-regression branch from 3f08a21 to f162392 Compare March 18, 2026 16:22
@scuffi
Copy link
Copy Markdown
Contributor Author

scuffi commented Mar 18, 2026

/bonk review this PR

ask-bonk[bot]

This comment was marked as outdated.

ask-bonk[bot]

This comment was marked as resolved.

@ask-bonk

This comment was marked as outdated.

@scuffi
Copy link
Copy Markdown
Contributor Author

scuffi commented Mar 18, 2026

/bonk re-review this PR

Copy link
Copy Markdown
Contributor

@ask-bonk ask-bonk bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

The tree-killing approach is correct and well-implemented. Leaf-first signal delivery, zombie detection via /proc/{pid}/status, and writeExitCodeIfMissing to bridge the kill path with the execStream polling loop are all sound choices. The E2E test coverage is thorough — single tree, deep tree, kill-all, and session destroy scenarios all exercise the important paths.

I agree with the outstanding items from the latest bonk review. Here's my assessment of what still needs attention, ordered by priority:

1. Stale comment at session.ts:491-495 — needs update

The comment says the exit-code-file loop "will never resolve normally" when a process is killed, but writeExitCodeIfMissing() now synthesizes the file so it resolves via the normal file-check path. The bonk suggestion is accurate — update the comment to describe the current behavior.

2. readFileSync in getProcessChildren deserves a comment (session.ts:821)

The synchronous /proc read is the right call — async reads would yield the event loop mid-tree-walk, allowing the tree to mutate between siblings. This is a deliberate design decision that looks like a mistake to anyone unfamiliar with the rationale. A one-line comment would prevent future "helpful" refactors to async.

3. waitForExit=false semantics at session.ts:752-755

The fire-and-forget path sends SIGKILL and returns true immediately. This is only used from destroy() where the shell itself is killed next, so it's correct. But the semantics (process death not confirmed on return) should be documented for future callers.

4. Missing changeset

Tree-killing changes user-visible behavior — a kill now terminates child processes that previously survived. This warrants a changeset referencing @cloudflare/sandbox.

Items from bonk review that are already resolved

  • interpretExitCode >= 128 concern: The bonk review body says this "treats all codes >= 128 as killed" but the actual code already uses the narrow check (=== 137 || === 143). The code is correct; the review comment was inaccurate on this point. The unit tests at process-manager.test.ts:121-124 confirm exit 130 and exit 255 correctly return failed.

  • Dual cleanup of runningCommands: The existing comment at session.ts:758-759 already explains this. The bonk review suggested a slightly different wording, but the current comment ("execStream() also calls untrackCommand()...") conveys the same information. No change needed unless the author prefers the alternative phrasing.

One additional observation

writeExitCodeIfMissing uses open(path, 'wx') for exclusive create, which is correct. There's a subtle interaction with the bash wrapper's .tmpmv (atomic rename) pattern: mv will overwrite the synthetic file if the wrapper races to completion. In practice this is benign — if waitForProcessExit confirmed the root PID is dead, the wrapper subshell (which runs in the same process tree) is also dead and won't execute the mv. For the SIGTERM path where the wrapper may still run, the real exit code overwriting the synthetic one is actually preferable. No action needed, just noting for completeness.

Overall this is solid work. The four items above (stale comment, readFileSync rationale, waitForExit=false docs, changeset) are the remaining actionable items.

// External kills via killCommand() can terminate the bash wrapper before it writes
// the exit code file, so killCommand() synthesizes one to unblock this loop.
// If neither the wrapper nor killCommand() writes the file and the shell dies,
// SessionDestroyedError still surfaces through the readiness check below.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment here is now stale. writeExitCodeIfMissing() in killCommand writes a synthetic exit code file, so this loop resolves normally by finding it. Suggested update:

// Wait until exit code file exists, checking for shell death on each iteration.
// If the process is killed externally via killCommand(), the bash wrapper is
// terminated before it writes the exit code file. killCommand() writes a
// synthetic exit code (137 for SIGKILL, 143 for SIGTERM) via writeExitCodeIfMissing(),
// so this loop resolves normally by finding that file.

private getProcessChildren(pid: number): number[] {
try {
const childrenFile = `/proc/${pid}/task/${pid}/children`;
const children = readFileSync(childrenFile, 'utf8')
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

readFileSync is intentional here — using async reads would yield the event loop between sibling reads, allowing the process tree to mutate mid-walk. Worth a brief comment so this doesn't get "fixed" to async later:

// Uses readFileSync intentionally: yielding the event loop mid-tree-walk
// would allow the process tree to mutate between sibling reads.

);
}
} else {
this.terminateTree(pid, 'SIGKILL');
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When waitForExit=false, SIGKILL is sent but process death is not confirmed before returning true. This is correct for destroy() (the shell itself is killed next), but the fire-and-forget semantics should be documented for future callers:

} else {
  // Fire-and-forget: SIGKILL is sent but process death is not awaited.
  // Only used from destroy() where the session is being torn down.
  this.terminateTree(pid, 'SIGKILL');
  await this.writeExitCodeIfMissing(handle.exitCodeFile, 137);
}

@ask-bonk

This comment was marked as outdated.

@scuffi
Copy link
Copy Markdown
Contributor Author

scuffi commented Mar 19, 2026

/bonk re-review this PR

@scuffi
Copy link
Copy Markdown
Contributor Author

scuffi commented Mar 19, 2026

/bonk review this PR

ask-bonk[bot]

This comment was marked as outdated.

ask-bonk[bot]

This comment was marked as resolved.

@ask-bonk

This comment was marked as outdated.

@scuffi
Copy link
Copy Markdown
Contributor Author

scuffi commented Mar 19, 2026

/bonk re-review this PR

ask-bonk[bot]

This comment was marked as outdated.

ask-bonk[bot]

This comment was marked as resolved.

@cloudflare cloudflare deleted a comment from ask-bonk bot Mar 19, 2026
@scuffi
Copy link
Copy Markdown
Contributor Author

scuffi commented Mar 19, 2026

/bonk re-review this PR

Copy link
Copy Markdown
Contributor

@ask-bonk ask-bonk bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

The overall design is sound: leaf-first tree killing, zombie detection via /proc/{pid}/status, writeExitCodeIfMissing to bridge the kill path with the execStream polling loop, and the narrow interpretExitCode check (=== 137 || === 143). The E2E tests are thorough — good coverage of single tree, deep tree, kill-all, and session destroy scenarios. Prior review feedback has been addressed well across the revision history.

Two items remain — one is a real bug in the kill-to-store-update flow, the other is a minor type cleanup.

1. killCommand returning false when PID file is gone prevents store update (bug, non-blocking)

At session.ts:766-767, when the PID file is already gone (command completed naturally between detection and kill), killCommand returns false. This flows through SessionManager.killCommand (session-manager.ts:859) which wraps false as { success: false }, then to ProcessService.killProcess (process-service.ts:365) which only updates the store to 'killed' on result.success === true. The process remains as running in the store even though it's dead.

This is a pre-existing issue (the old single-PID code had the same return false), so it's not a regression introduced by this PR. Worth tracking as a follow-up — the fix would be to return true when the PID file is gone (the process already exited, so the kill "succeeded" in the sense that nothing is running).

2. syntheticExitCode type is wider than necessary (session.ts:709)

syntheticExitCode is declared as let syntheticExitCode: number (good, this was already narrowed from number | null in an earlier revision — disregard if this was the latest bonk suggestion and already applied). The current code is correct.


The implementation is solid. No blocking issues from my review. The changeset correctly references @cloudflare/sandbox, the test-results/ directory is properly gitignored, and CI is passing. Ship it.

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk bot commented Mar 19, 2026

Review posted on PR #487. Here's a summary:

No blocking issues. The implementation is solid:

  • Tree-killing approach is correct — leaf-first signal delivery, zombie detection via /proc/{pid}/status, and writeExitCodeIfMissing to bridge the kill/execStream paths.
  • interpretExitCode correctly uses the narrow === 137 || === 143 check.
  • Changeset is present and correctly references @cloudflare/sandbox.
  • E2E tests are thorough with good patterns (unique tokens, finally cleanup, polling for async death).
  • CI is passing.

I flagged one non-blocking bug (pre-existing, not a regression): when the PID file is already gone at kill time, killCommand returns false, which prevents the store from updating the process status. Worth a follow-up but not introduced by this PR.

github run

@scuffi scuffi marked this pull request as ready for review March 20, 2026 15:41
devin-ai-integration[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

@scuffi
Copy link
Copy Markdown
Contributor Author

scuffi commented Mar 23, 2026

/bonk review this PR

ask-bonk[bot]

This comment was marked as resolved.

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk bot commented Mar 23, 2026

Review posted successfully on PR #487. Here's a summary of what I found:

Main review covers 3 items:

  1. Bug: .gitignore missing trailing newline — The file ends without a final \n after test-results/. This was inherited from the pre-existing file but should be fixed in this PR. An inline comment was left on the specific line.

  2. Minor: waitForExit not plumbed through SessionManagerSessionManager.killCommand doesn't forward the waitForExit parameter. Works fine now (default true is correct for the ProcessService path), but noted for awareness.

  3. Minor: pre-existing killCommand returning false when PID file is gone — When the command completes naturally between detection and kill, killCommand returns false, which causes ProcessService to skip the store update. Not a regression from this PR, but worth a follow-up.

1 inline comment on .gitignore:166 flagging the missing trailing newline.

The implementation is solid overall — the tree-killing approach, zombie detection, atomic writeExitCodeIfMissing via link(), and E2E test coverage are all well done. CI is passing.

github run

killCommand() waited only for the root PID after SIGTERM, so children
were immediately SIGKILL'd when the root bash wrapper exited quickly.
Now waits for the full tree, does a fresh /proc walk after SIGKILL
escalation to catch late-spawned descendants, and handles the
empty-tree race by writing a synthetic exit code before returning.

E2E tests now capture leaf sleep PIDs (not just bash wrappers), assert
status=killed unconditionally, and include a SIGTERM-resistant fixture
that verifies the 5-second grace period and SIGKILL escalation.
@ghostwriternr ghostwriternr enabled auto-merge (squash) March 27, 2026 13:12
@ghostwriternr ghostwriternr disabled auto-merge March 27, 2026 13:18
@whoiskatrin whoiskatrin dismissed ask-bonk[bot]’s stale review March 27, 2026 13:41

Not relevant anymore

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants