Improve process tree kill logic by scuffi · Pull Request #487 · cloudflare/sandbox-sdk

scuffi · 2026-03-11T16:32:39Z

Summary

When a command spawns child processes (e.g. a shell script that forks subshells or long-running background jobs), killing that command previously only signalled the root PID. Child processes were left alive as orphans. This PR fixes that by walking the process tree bottom-up before delivering signals, children are killed before their parents, preventing zombie accumulation and ensuring the full tree is torn down.

Changes

session.ts — killCommand() now enumerates the full process tree via /proc, signals leaf processes first (SIGTERM → SIGKILL escalation after 5s), and writes an atomic exit code file so execStream() doesn't race against the kill path. Session destroy() uses a fast-path that fires SIGKILL immediately without waiting.
process-manager.ts — Exit codes 137 and 143 (signal-style) are now interpreted as 'killed' rather than 'failed'.
E2E tests — New tests covering: 3-level deep process tree kill, concurrent bulk kill via kill-all, and process cleanup on session destroy.
Unit tests — Coverage for the updated signal exit code interpretation.

changeset-bot · 2026-03-11T16:39:20Z

🦋 Changeset detected

Latest commit: 4b7dae4

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package

Name	Type
@cloudflare/sandbox	Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

pkg-pr-new · 2026-03-11T16:43:21Z

Open in StackBlitz

npm i https://pkg.pr.new/cloudflare/sandbox-sdk/@cloudflare/sandbox@487

commit: 4b7dae4

github-actions · 2026-03-11T16:43:23Z

🐳 Docker Images Published

Variant	Image
Default	`cloudflare/sandbox:0.0.0-pr-487-4b7dae4`
Python	`cloudflare/sandbox:0.0.0-pr-487-4b7dae4-python`
OpenCode	`cloudflare/sandbox:0.0.0-pr-487-4b7dae4-opencode`
Musl	`cloudflare/sandbox:0.0.0-pr-487-4b7dae4-musl`
Desktop	`cloudflare/sandbox:0.0.0-pr-487-4b7dae4-desktop`

Usage:

FROM cloudflare/sandbox:0.0.0-pr-487-4b7dae4

Version: 0.0.0-pr-487-4b7dae4

📦 Standalone Binary

For arbitrary Dockerfiles:

COPY --from=cloudflare/sandbox:0.0.0-pr-487-4b7dae4 /container-server/sandbox /sandbox
ENTRYPOINT ["/sandbox"]

Download via GitHub CLI:

gh run download 23647879167 -n sandbox-binary

Extract from Docker:

docker run --rm cloudflare/sandbox:0.0.0-pr-487-4b7dae4 cat /container-server/sandbox > sandbox && chmod +x sandbox

ask-bonk · 2026-03-11T17:04:22Z

@scuffi Bonk workflow failed. Check the logs for details.

View workflow run · To retry, trigger Bonk again.

ask-bonk · 2026-03-11T17:04:41Z

Not Found - https://docs.github.com/rest/reactions/reactions#create-reaction-for-an-issue-comment

github run

scuffi · 2026-03-11T17:05:32Z

/bonk re-review this PR

ask-bonk

Review

Good improvement over the previous single-PID SIGTERM. Tree-killing and zombie detection are the right things to add here. The E2E tests are thorough and well-structured. There are a few issues to resolve before this is mergeable.

1. The exit-code-file mechanism is bypassed on kill — worth documenting

The earlier bonk review called out that waitForExitCode waits for the exit code file, not process exit. That's been addressed — killCommand now polls processExists(pid) instead, which is correct.

However, when SIGTERM kills the tree leaves-first, the command subshell is killed before it writes the exit code file. This means the execStream generator's exit-code-file polling loop will never resolve normally. This doesn't affect killCommand's correctness directly since ProcessService.killProcess() updates the store status to killed regardless. But it's worth a comment in the code noting that the exit-code-file mechanism is bypassed when a process is killed externally.

2. `return false` from post-SIGKILL verification is too aggressive

After SIGKILL + 5-second wait, getProcessTreePids is called. If it finds any PIDs, killCommand returns false. The processExists method correctly handles zombies now (checking /proc/{pid}/status for Z state), so that part of the bonk review has been addressed.

But returning false causes SessionManager.killCommand to return { success: false }, which causes ProcessService.killProcess to skip updating the store (lines 365-370 of process-service.ts only update on success). The process stays as running in the store even though SIGKILL was sent and it's effectively dead. After SIGKILL, the process is guaranteed to die — return true with a warning log instead.

3. `getProcessTreePids` double-checks liveness

Descendants are already filtered through processExists at the recursive entry (line 837). Filtering again at line 849 adds redundant /proc reads. A process that was alive at line 837 could die by line 849, but that's benign. Remove the redundant filter or add a comment explaining the intent.

4. `SessionManager.killCommand` doesn't pass `waitForExit`

SessionManager.killCommand at line 847 calls session.killCommand(commandId) without forwarding waitForExit. This works because the default is true, but SessionManager is the public API for process killing — any future caller needing waitForExit: false would need plumbing through SessionManager too. Fine for now, just noting it.

5. Missing changeset

This PR changes user-visible behavior (process kill now terminates the full tree instead of only the root PID). This warrants a changeset referencing @cloudflare/sandbox per repo conventions.

Tests

The E2E tests are solid — good use of unique tokens, proper finally blocks, testing multiple tree depths and kill-all scenarios.

The "session destroy kills processes" test calls waitForChildPid(pidFile) using readExecStdout with default session headers, while the process was started in sessionHeaders. This works because the pid file is on the shared filesystem, but it's subtle — a brief comment would help.
The deep tree test's inner script quoting ('"'"') is correct but hard to read. Consider using a heredoc or double-quoting approach for clarity.

ask-bonk · 2026-03-11T17:10:06Z

packages/sandbox-container/src/session.ts

+              );
+              this.runningCommands.delete(commandId);
+              return false;
+            }


After SIGKILL, the process is guaranteed by the kernel to die — it's just a matter of scheduling. Returning false here causes ProcessService.killProcess() to skip the store update (it only updates on success: true at process-service.ts:365-370), leaving the process as running in the store even though SIGKILL was sent.

Consider returning true with the warning log:

const pidsAlive = this.getProcessTreePids(pid); if (pidsAlive.length > 0) { this.logger.warn( 'killCommand: some PIDs still visible after SIGKILL (may be in-flight reap)', { commandId, pid, remainingPids: pidsAlive } ); } this.runningCommands.delete(commandId); return true;

ask-bonk · 2026-03-11T17:10:06Z

packages/sandbox-container/src/session.ts

+
+    return [
+      pid,
+      ...descendants.filter((childPid) => this.processExists(childPid))


Descendants returned from recursive getProcessTreePids calls have already passed the processExists check at line 837. This filter adds redundant /proc reads for each descendant. Either remove it or add a comment explaining why the double-check is intentional (e.g., "recheck because process may have exited between recursive call and return").

ask-bonk · 2026-03-11T17:10:06Z

packages/sandbox-container/src/session.ts

+   * @param targetPid - Root process ID
+   * @param signal - Signal to send
+   */
+  private terminateTree(targetPid: number, signal: NodeJS.Signals): void {


Minor: the JSDoc mentions the TOCTOU limitation clearly — good. One addition worth making: note that readFileSync is intentional here to avoid yielding the event loop mid-tree-walk, which could allow the tree to mutate between reading children of sibling nodes. The synchronous read keeps the snapshot as consistent as possible.

ask-bonk · 2026-03-11T17:10:06Z

packages/sandbox-container/src/session.ts

+              return false;
+            }
+          } else {
+            this.terminateTree(pid, 'SIGKILL');


When waitForExit is false, only SIGKILL is sent with no verification. This is correct for destroy() (the session is being torn down anyway), but it means runningCommands.delete() and return true happen before the process is confirmed dead. Fine for the destroy path, but worth a comment noting that the fire-and-forget semantics are intentional here.

scuffi · 2026-03-11T17:14:52Z

/bonk re-review this PR

scuffi · 2026-03-18T17:36:23Z

/bonk review this PR

scuffi · 2026-03-18T17:53:11Z

/bonk re-review this PR

ask-bonk

Review

The tree-killing approach is correct and well-implemented. Leaf-first signal delivery, zombie detection via /proc/{pid}/status, and writeExitCodeIfMissing to bridge the kill path with the execStream polling loop are all sound choices. The E2E test coverage is thorough — single tree, deep tree, kill-all, and session destroy scenarios all exercise the important paths.

I agree with the outstanding items from the latest bonk review. Here's my assessment of what still needs attention, ordered by priority:

1. Stale comment at `session.ts:491-495` — needs update

The comment says the exit-code-file loop "will never resolve normally" when a process is killed, but writeExitCodeIfMissing() now synthesizes the file so it resolves via the normal file-check path. The bonk suggestion is accurate — update the comment to describe the current behavior.

2. `readFileSync` in `getProcessChildren` deserves a comment (`session.ts:821`)

The synchronous /proc read is the right call — async reads would yield the event loop mid-tree-walk, allowing the tree to mutate between siblings. This is a deliberate design decision that looks like a mistake to anyone unfamiliar with the rationale. A one-line comment would prevent future "helpful" refactors to async.

3. `waitForExit=false` semantics at `session.ts:752-755`

The fire-and-forget path sends SIGKILL and returns true immediately. This is only used from destroy() where the shell itself is killed next, so it's correct. But the semantics (process death not confirmed on return) should be documented for future callers.

4. Missing changeset

Tree-killing changes user-visible behavior — a kill now terminates child processes that previously survived. This warrants a changeset referencing @cloudflare/sandbox.

Items from bonk review that are already resolved

interpretExitCode >= 128 concern: The bonk review body says this "treats all codes >= 128 as killed" but the actual code already uses the narrow check (=== 137 || === 143). The code is correct; the review comment was inaccurate on this point. The unit tests at process-manager.test.ts:121-124 confirm exit 130 and exit 255 correctly return failed.
Dual cleanup of runningCommands: The existing comment at session.ts:758-759 already explains this. The bonk review suggested a slightly different wording, but the current comment ("execStream() also calls untrackCommand()...") conveys the same information. No change needed unless the author prefers the alternative phrasing.

One additional observation

writeExitCodeIfMissing uses open(path, 'wx') for exclusive create, which is correct. There's a subtle interaction with the bash wrapper's .tmp → mv (atomic rename) pattern: mv will overwrite the synthetic file if the wrapper races to completion. In practice this is benign — if waitForProcessExit confirmed the root PID is dead, the wrapper subshell (which runs in the same process tree) is also dead and won't execute the mv. For the SIGTERM path where the wrapper may still run, the real exit code overwriting the synthetic one is actually preferable. No action needed, just noting for completeness.

Overall this is solid work. The four items above (stale comment, readFileSync rationale, waitForExit=false docs, changeset) are the remaining actionable items.

ask-bonk · 2026-03-18T17:56:19Z

packages/sandbox-container/src/session.ts

+      // External kills via killCommand() can terminate the bash wrapper before it writes
+      // the exit code file, so killCommand() synthesizes one to unblock this loop.
+      // If neither the wrapper nor killCommand() writes the file and the shell dies,
+      // SessionDestroyedError still surfaces through the readiness check below.


The comment here is now stale. writeExitCodeIfMissing() in killCommand writes a synthetic exit code file, so this loop resolves normally by finding it. Suggested update:

// Wait until exit code file exists, checking for shell death on each iteration. // If the process is killed externally via killCommand(), the bash wrapper is // terminated before it writes the exit code file. killCommand() writes a // synthetic exit code (137 for SIGKILL, 143 for SIGTERM) via writeExitCodeIfMissing(), // so this loop resolves normally by finding that file.

ask-bonk · 2026-03-18T17:56:19Z

packages/sandbox-container/src/session.ts

+  private getProcessChildren(pid: number): number[] {
+    try {
+      const childrenFile = `/proc/${pid}/task/${pid}/children`;
+      const children = readFileSync(childrenFile, 'utf8')


readFileSync is intentional here — using async reads would yield the event loop between sibling reads, allowing the process tree to mutate mid-walk. Worth a brief comment so this doesn't get "fixed" to async later:

// Uses readFileSync intentionally: yielding the event loop mid-tree-walk // would allow the process tree to mutate between sibling reads.

ask-bonk · 2026-03-18T17:56:19Z

packages/sandbox-container/src/session.ts

+              );
+            }
+          } else {
+            this.terminateTree(pid, 'SIGKILL');


When waitForExit=false, SIGKILL is sent but process death is not confirmed before returning true. This is correct for destroy() (the shell itself is killed next), but the fire-and-forget semantics should be documented for future callers:

} else { // Fire-and-forget: SIGKILL is sent but process death is not awaited. // Only used from destroy() where the session is being torn down. this.terminateTree(pid, 'SIGKILL'); await this.writeExitCodeIfMissing(handle.exitCodeFile, 137); }

scuffi · 2026-03-19T14:05:04Z

/bonk re-review this PR

scuffi · 2026-03-19T14:26:05Z

/bonk review this PR

scuffi · 2026-03-19T14:45:24Z

/bonk re-review this PR

scuffi · 2026-03-19T14:51:20Z

/bonk re-review this PR

ask-bonk

Review

The overall design is sound: leaf-first tree killing, zombie detection via /proc/{pid}/status, writeExitCodeIfMissing to bridge the kill path with the execStream polling loop, and the narrow interpretExitCode check (=== 137 || === 143). The E2E tests are thorough — good coverage of single tree, deep tree, kill-all, and session destroy scenarios. Prior review feedback has been addressed well across the revision history.

Two items remain — one is a real bug in the kill-to-store-update flow, the other is a minor type cleanup.

1. `killCommand` returning `false` when PID file is gone prevents store update (bug, non-blocking)

At session.ts:766-767, when the PID file is already gone (command completed naturally between detection and kill), killCommand returns false. This flows through SessionManager.killCommand (session-manager.ts:859) which wraps false as { success: false }, then to ProcessService.killProcess (process-service.ts:365) which only updates the store to 'killed' on result.success === true. The process remains as running in the store even though it's dead.

This is a pre-existing issue (the old single-PID code had the same return false), so it's not a regression introduced by this PR. Worth tracking as a follow-up — the fix would be to return true when the PID file is gone (the process already exited, so the kill "succeeded" in the sense that nothing is running).

2. `syntheticExitCode` type is wider than necessary (`session.ts:709`)

syntheticExitCode is declared as let syntheticExitCode: number (good, this was already narrowed from number | null in an earlier revision — disregard if this was the latest bonk suggestion and already applied). The current code is correct.

The implementation is solid. No blocking issues from my review. The changeset correctly references @cloudflare/sandbox, the test-results/ directory is properly gitignored, and CI is passing. Ship it.

ask-bonk · 2026-03-19T14:53:21Z

Review posted on PR #487. Here's a summary:

No blocking issues. The implementation is solid:

Tree-killing approach is correct — leaf-first signal delivery, zombie detection via /proc/{pid}/status, and writeExitCodeIfMissing to bridge the kill/execStream paths.
interpretExitCode correctly uses the narrow === 137 || === 143 check.
Changeset is present and correctly references @cloudflare/sandbox.
E2E tests are thorough with good patterns (unique tokens, finally cleanup, polling for async death).
CI is passing.

I flagged one non-blocking bug (pre-existing, not a regression): when the PID file is already gone at kill time, killCommand returns false, which prevents the store from updating the process status. Worth a follow-up but not introduced by this PR.

github run

scuffi · 2026-03-23T18:21:41Z

/bonk review this PR

ask-bonk · 2026-03-23T18:23:52Z

Review posted successfully on PR #487. Here's a summary of what I found:

Main review covers 3 items:

Bug: .gitignore missing trailing newline — The file ends without a final \n after test-results/. This was inherited from the pre-existing file but should be fixed in this PR. An inline comment was left on the specific line.
Minor: waitForExit not plumbed through SessionManager — SessionManager.killCommand doesn't forward the waitForExit parameter. Works fine now (default true is correct for the ProcessService path), but noted for awareness.
Minor: pre-existing killCommand returning false when PID file is gone — When the command completes naturally between detection and kill, killCommand returns false, which causes ProcessService to skip the store update. Not a regression from this PR, but worth a follow-up.

1 inline comment on .gitignore:166 flagging the missing trailing newline.

The implementation is solid overall — the tree-killing approach, zombie detection, atomic writeExitCodeIfMissing via link(), and E2E test coverage are all well done. CI is passing.

github run

killCommand() waited only for the root PID after SIGTERM, so children were immediately SIGKILL'd when the root bash wrapper exited quickly. Now waits for the full tree, does a fresh /proc walk after SIGKILL escalation to catch late-spawned descendants, and handles the empty-tree race by writing a synthetic exit code before returning. E2E tests now capture leaf sleep PIDs (not just bash wrappers), assert status=killed unconditionally, and include a SIGTERM-resistant fixture that verifies the 5-second grace period and SIGKILL escalation.

Not relevant anymore

This comment was marked as outdated.

Sign in to view

ask-bonk bot previously requested changes Mar 11, 2026

View reviewed changes

cloudflare deleted a comment from ask-bonk bot Mar 11, 2026

This comment was marked as outdated.

Sign in to view

This comment was marked as resolved.

Sign in to view

This comment was marked as outdated.

Sign in to view

kchro3 and others added 4 commits March 18, 2026 14:49

fix: terminate process trees on kill

fe03a79

fix killCommand bugs

0f69bb4

update zombie process issue

eb80dec

address bonk issues

f162392

scuffi force-pushed the fix/process-tree-kill-regression branch from 3f08a21 to f162392 Compare March 18, 2026 16:22

scuffi added 2 commits March 18, 2026 16:56

patch helper

2cd8bbf

fix execStream issue

c4ee087

This comment was marked as outdated.

Sign in to view

This comment was marked as resolved.

Sign in to view

This comment was marked as outdated.

Sign in to view

bonk fixes

fa821a0

ask-bonk bot reviewed Mar 18, 2026

View reviewed changes

This comment was marked as outdated.

Sign in to view

Update some comments + changeset

3b4d764

This comment was marked as outdated.

Sign in to view

This comment was marked as resolved.

Sign in to view

This comment was marked as outdated.

Sign in to view

Update type

6883805

This comment was marked as outdated.

Sign in to view

This comment was marked as resolved.

Sign in to view

Ignore test-results

28fe98b

cloudflare deleted a comment from ask-bonk bot Mar 19, 2026

ask-bonk bot reviewed Mar 19, 2026

View reviewed changes

scuffi marked this pull request as ready for review March 20, 2026 15:41

scuffi requested review from aron-cf, ghostwriternr and whoiskatrin as code owners March 20, 2026 15:41

This comment was marked as resolved.

Sign in to view

atomic exit code writing

dc53592

This comment was marked as resolved.

Sign in to view

sigterm patch

320f983

This comment was marked as resolved.

Sign in to view

ghostwriternr approved these changes Mar 27, 2026

View reviewed changes

ghostwriternr enabled auto-merge (squash) March 27, 2026 13:12

Merge branch 'main' into fix/process-tree-kill-regression

4b7dae4

ghostwriternr disabled auto-merge March 27, 2026 13:18

Conversation

scuffi commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Uh oh!

changeset-bot bot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🦋 Changeset detected

Uh oh!

pkg-pr-new bot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🐳 Docker Images Published

📦 Standalone Binary

Uh oh!

This comment was marked as outdated.

Uh oh!

ask-bonk bot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ask-bonk bot commented Mar 11, 2026

Uh oh!

scuffi commented Mar 11, 2026

Uh oh!

ask-bonk bot left a comment

Choose a reason for hiding this comment

Review

1. The exit-code-file mechanism is bypassed on kill — worth documenting

2. return false from post-SIGKILL verification is too aggressive

3. getProcessTreePids double-checks liveness

4. SessionManager.killCommand doesn't pass waitForExit

5. Missing changeset

Tests

Uh oh!

ask-bonk bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

ask-bonk bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

ask-bonk bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

ask-bonk bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

scuffi commented Mar 11, 2026

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as outdated.

scuffi commented Mar 18, 2026

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as outdated.

scuffi commented Mar 18, 2026

Uh oh!

ask-bonk bot left a comment

Choose a reason for hiding this comment

Review

1. Stale comment at session.ts:491-495 — needs update

2. readFileSync in getProcessChildren deserves a comment (session.ts:821)

3. waitForExit=false semantics at session.ts:752-755

4. Missing changeset

Items from bonk review that are already resolved

One additional observation

Uh oh!

ask-bonk bot Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

ask-bonk bot Mar 18, 2026

Choose a reason for hiding this comment

scuffi commented Mar 11, 2026 •

edited

Loading

changeset-bot bot commented Mar 11, 2026 •

edited

Loading

pkg-pr-new bot commented Mar 11, 2026 •

edited

Loading

github-actions bot commented Mar 11, 2026 •

edited

Loading

ask-bonk bot commented Mar 11, 2026 •

edited

Loading

2. `return false` from post-SIGKILL verification is too aggressive

3. `getProcessTreePids` double-checks liveness

4. `SessionManager.killCommand` doesn't pass `waitForExit`

1. Stale comment at `session.ts:491-495` — needs update

2. `readFileSync` in `getProcessChildren` deserves a comment (`session.ts:821`)

3. `waitForExit=false` semantics at `session.ts:752-755`

1. `killCommand` returning `false` when PID file is gone prevents store update (bug, non-blocking)

2. `syntheticExitCode` type is wider than necessary (`session.ts:709`)