Conversation
🦋 Changeset detectedLatest commit: 4b7dae4 The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
commit: |
🐳 Docker Images Published
Usage: FROM cloudflare/sandbox:0.0.0-pr-487-4b7dae4Version: 📦 Standalone BinaryFor arbitrary Dockerfiles: COPY --from=cloudflare/sandbox:0.0.0-pr-487-4b7dae4 /container-server/sandbox /sandbox
ENTRYPOINT ["/sandbox"]Download via GitHub CLI: gh run download 23647879167 -n sandbox-binaryExtract from Docker: docker run --rm cloudflare/sandbox:0.0.0-pr-487-4b7dae4 cat /container-server/sandbox > sandbox && chmod +x sandbox |
|
@scuffi Bonk workflow failed. Check the logs for details. View workflow run · To retry, trigger Bonk again. |
|
/bonk re-review this PR |
There was a problem hiding this comment.
Review
Good improvement over the previous single-PID SIGTERM. Tree-killing and zombie detection are the right things to add here. The E2E tests are thorough and well-structured. There are a few issues to resolve before this is mergeable.
1. The exit-code-file mechanism is bypassed on kill — worth documenting
The earlier bonk review called out that waitForExitCode waits for the exit code file, not process exit. That's been addressed — killCommand now polls processExists(pid) instead, which is correct.
However, when SIGTERM kills the tree leaves-first, the command subshell is killed before it writes the exit code file. This means the execStream generator's exit-code-file polling loop will never resolve normally. This doesn't affect killCommand's correctness directly since ProcessService.killProcess() updates the store status to killed regardless. But it's worth a comment in the code noting that the exit-code-file mechanism is bypassed when a process is killed externally.
2. return false from post-SIGKILL verification is too aggressive
After SIGKILL + 5-second wait, getProcessTreePids is called. If it finds any PIDs, killCommand returns false. The processExists method correctly handles zombies now (checking /proc/{pid}/status for Z state), so that part of the bonk review has been addressed.
But returning false causes SessionManager.killCommand to return { success: false }, which causes ProcessService.killProcess to skip updating the store (lines 365-370 of process-service.ts only update on success). The process stays as running in the store even though SIGKILL was sent and it's effectively dead. After SIGKILL, the process is guaranteed to die — return true with a warning log instead.
3. getProcessTreePids double-checks liveness
Descendants are already filtered through processExists at the recursive entry (line 837). Filtering again at line 849 adds redundant /proc reads. A process that was alive at line 837 could die by line 849, but that's benign. Remove the redundant filter or add a comment explaining the intent.
4. SessionManager.killCommand doesn't pass waitForExit
SessionManager.killCommand at line 847 calls session.killCommand(commandId) without forwarding waitForExit. This works because the default is true, but SessionManager is the public API for process killing — any future caller needing waitForExit: false would need plumbing through SessionManager too. Fine for now, just noting it.
5. Missing changeset
This PR changes user-visible behavior (process kill now terminates the full tree instead of only the root PID). This warrants a changeset referencing @cloudflare/sandbox per repo conventions.
Tests
The E2E tests are solid — good use of unique tokens, proper finally blocks, testing multiple tree depths and kill-all scenarios.
-
The "session destroy kills processes" test calls
waitForChildPid(pidFile)usingreadExecStdoutwith default sessionheaders, while the process was started insessionHeaders. This works because the pid file is on the shared filesystem, but it's subtle — a brief comment would help. -
The deep tree test's inner script quoting (
'"'"') is correct but hard to read. Consider using a heredoc or double-quoting approach for clarity.
| ); | ||
| this.runningCommands.delete(commandId); | ||
| return false; | ||
| } |
There was a problem hiding this comment.
After SIGKILL, the process is guaranteed by the kernel to die — it's just a matter of scheduling. Returning false here causes ProcessService.killProcess() to skip the store update (it only updates on success: true at process-service.ts:365-370), leaving the process as running in the store even though SIGKILL was sent.
Consider returning true with the warning log:
const pidsAlive = this.getProcessTreePids(pid);
if (pidsAlive.length > 0) {
this.logger.warn(
'killCommand: some PIDs still visible after SIGKILL (may be in-flight reap)',
{ commandId, pid, remainingPids: pidsAlive }
);
}
this.runningCommands.delete(commandId);
return true;|
|
||
| return [ | ||
| pid, | ||
| ...descendants.filter((childPid) => this.processExists(childPid)) |
There was a problem hiding this comment.
Descendants returned from recursive getProcessTreePids calls have already passed the processExists check at line 837. This filter adds redundant /proc reads for each descendant. Either remove it or add a comment explaining why the double-check is intentional (e.g., "recheck because process may have exited between recursive call and return").
| * @param targetPid - Root process ID | ||
| * @param signal - Signal to send | ||
| */ | ||
| private terminateTree(targetPid: number, signal: NodeJS.Signals): void { |
There was a problem hiding this comment.
Minor: the JSDoc mentions the TOCTOU limitation clearly — good. One addition worth making: note that readFileSync is intentional here to avoid yielding the event loop mid-tree-walk, which could allow the tree to mutate between reading children of sibling nodes. The synchronous read keeps the snapshot as consistent as possible.
| return false; | ||
| } | ||
| } else { | ||
| this.terminateTree(pid, 'SIGKILL'); |
There was a problem hiding this comment.
When waitForExit is false, only SIGKILL is sent with no verification. This is correct for destroy() (the session is being torn down anyway), but it means runningCommands.delete() and return true happen before the process is confirmed dead. Fine for the destroy path, but worth a comment noting that the fire-and-forget semantics are intentional here.
|
/bonk re-review this PR |
This comment was marked as outdated.
This comment was marked as outdated.
3f08a21 to
f162392
Compare
|
/bonk review this PR |
This comment was marked as outdated.
This comment was marked as outdated.
|
/bonk re-review this PR |
There was a problem hiding this comment.
Review
The tree-killing approach is correct and well-implemented. Leaf-first signal delivery, zombie detection via /proc/{pid}/status, and writeExitCodeIfMissing to bridge the kill path with the execStream polling loop are all sound choices. The E2E test coverage is thorough — single tree, deep tree, kill-all, and session destroy scenarios all exercise the important paths.
I agree with the outstanding items from the latest bonk review. Here's my assessment of what still needs attention, ordered by priority:
1. Stale comment at session.ts:491-495 — needs update
The comment says the exit-code-file loop "will never resolve normally" when a process is killed, but writeExitCodeIfMissing() now synthesizes the file so it resolves via the normal file-check path. The bonk suggestion is accurate — update the comment to describe the current behavior.
2. readFileSync in getProcessChildren deserves a comment (session.ts:821)
The synchronous /proc read is the right call — async reads would yield the event loop mid-tree-walk, allowing the tree to mutate between siblings. This is a deliberate design decision that looks like a mistake to anyone unfamiliar with the rationale. A one-line comment would prevent future "helpful" refactors to async.
3. waitForExit=false semantics at session.ts:752-755
The fire-and-forget path sends SIGKILL and returns true immediately. This is only used from destroy() where the shell itself is killed next, so it's correct. But the semantics (process death not confirmed on return) should be documented for future callers.
4. Missing changeset
Tree-killing changes user-visible behavior — a kill now terminates child processes that previously survived. This warrants a changeset referencing @cloudflare/sandbox.
Items from bonk review that are already resolved
-
interpretExitCode>= 128 concern: The bonk review body says this "treats all codes >= 128 as killed" but the actual code already uses the narrow check (=== 137 || === 143). The code is correct; the review comment was inaccurate on this point. The unit tests atprocess-manager.test.ts:121-124confirmexit 130andexit 255correctly returnfailed. -
Dual cleanup of
runningCommands: The existing comment atsession.ts:758-759already explains this. The bonk review suggested a slightly different wording, but the current comment ("execStream() also calls untrackCommand()...") conveys the same information. No change needed unless the author prefers the alternative phrasing.
One additional observation
writeExitCodeIfMissing uses open(path, 'wx') for exclusive create, which is correct. There's a subtle interaction with the bash wrapper's .tmp → mv (atomic rename) pattern: mv will overwrite the synthetic file if the wrapper races to completion. In practice this is benign — if waitForProcessExit confirmed the root PID is dead, the wrapper subshell (which runs in the same process tree) is also dead and won't execute the mv. For the SIGTERM path where the wrapper may still run, the real exit code overwriting the synthetic one is actually preferable. No action needed, just noting for completeness.
Overall this is solid work. The four items above (stale comment, readFileSync rationale, waitForExit=false docs, changeset) are the remaining actionable items.
| // External kills via killCommand() can terminate the bash wrapper before it writes | ||
| // the exit code file, so killCommand() synthesizes one to unblock this loop. | ||
| // If neither the wrapper nor killCommand() writes the file and the shell dies, | ||
| // SessionDestroyedError still surfaces through the readiness check below. |
There was a problem hiding this comment.
The comment here is now stale. writeExitCodeIfMissing() in killCommand writes a synthetic exit code file, so this loop resolves normally by finding it. Suggested update:
// Wait until exit code file exists, checking for shell death on each iteration.
// If the process is killed externally via killCommand(), the bash wrapper is
// terminated before it writes the exit code file. killCommand() writes a
// synthetic exit code (137 for SIGKILL, 143 for SIGTERM) via writeExitCodeIfMissing(),
// so this loop resolves normally by finding that file.
| private getProcessChildren(pid: number): number[] { | ||
| try { | ||
| const childrenFile = `/proc/${pid}/task/${pid}/children`; | ||
| const children = readFileSync(childrenFile, 'utf8') |
There was a problem hiding this comment.
readFileSync is intentional here — using async reads would yield the event loop between sibling reads, allowing the process tree to mutate mid-walk. Worth a brief comment so this doesn't get "fixed" to async later:
// Uses readFileSync intentionally: yielding the event loop mid-tree-walk
// would allow the process tree to mutate between sibling reads.| ); | ||
| } | ||
| } else { | ||
| this.terminateTree(pid, 'SIGKILL'); |
There was a problem hiding this comment.
When waitForExit=false, SIGKILL is sent but process death is not confirmed before returning true. This is correct for destroy() (the shell itself is killed next), but the fire-and-forget semantics should be documented for future callers:
} else {
// Fire-and-forget: SIGKILL is sent but process death is not awaited.
// Only used from destroy() where the session is being torn down.
this.terminateTree(pid, 'SIGKILL');
await this.writeExitCodeIfMissing(handle.exitCodeFile, 137);
}
This comment was marked as outdated.
This comment was marked as outdated.
|
/bonk re-review this PR |
|
/bonk review this PR |
This comment was marked as outdated.
This comment was marked as outdated.
|
/bonk re-review this PR |
|
/bonk re-review this PR |
There was a problem hiding this comment.
Review
The overall design is sound: leaf-first tree killing, zombie detection via /proc/{pid}/status, writeExitCodeIfMissing to bridge the kill path with the execStream polling loop, and the narrow interpretExitCode check (=== 137 || === 143). The E2E tests are thorough — good coverage of single tree, deep tree, kill-all, and session destroy scenarios. Prior review feedback has been addressed well across the revision history.
Two items remain — one is a real bug in the kill-to-store-update flow, the other is a minor type cleanup.
1. killCommand returning false when PID file is gone prevents store update (bug, non-blocking)
At session.ts:766-767, when the PID file is already gone (command completed naturally between detection and kill), killCommand returns false. This flows through SessionManager.killCommand (session-manager.ts:859) which wraps false as { success: false }, then to ProcessService.killProcess (process-service.ts:365) which only updates the store to 'killed' on result.success === true. The process remains as running in the store even though it's dead.
This is a pre-existing issue (the old single-PID code had the same return false), so it's not a regression introduced by this PR. Worth tracking as a follow-up — the fix would be to return true when the PID file is gone (the process already exited, so the kill "succeeded" in the sense that nothing is running).
2. syntheticExitCode type is wider than necessary (session.ts:709)
syntheticExitCode is declared as let syntheticExitCode: number (good, this was already narrowed from number | null in an earlier revision — disregard if this was the latest bonk suggestion and already applied). The current code is correct.
The implementation is solid. No blocking issues from my review. The changeset correctly references @cloudflare/sandbox, the test-results/ directory is properly gitignored, and CI is passing. Ship it.
|
Review posted on PR #487. Here's a summary: No blocking issues. The implementation is solid:
I flagged one non-blocking bug (pre-existing, not a regression): when the PID file is already gone at kill time, |
|
/bonk review this PR |
|
Review posted successfully on PR #487. Here's a summary of what I found: Main review covers 3 items:
1 inline comment on The implementation is solid overall — the tree-killing approach, zombie detection, atomic |
killCommand() waited only for the root PID after SIGTERM, so children were immediately SIGKILL'd when the root bash wrapper exited quickly. Now waits for the full tree, does a fresh /proc walk after SIGKILL escalation to catch late-spawned descendants, and handles the empty-tree race by writing a synthetic exit code before returning. E2E tests now capture leaf sleep PIDs (not just bash wrappers), assert status=killed unconditionally, and include a SIGTERM-resistant fixture that verifies the 5-second grace period and SIGKILL escalation.
Summary
When a command spawns child processes (e.g. a shell script that forks subshells or long-running background jobs), killing that command previously only signalled the root PID. Child processes were left alive as orphans. This PR fixes that by walking the process tree bottom-up before delivering signals, children are killed before their parents, preventing zombie accumulation and ensuring the full tree is torn down.
Changes
session.ts—killCommand()now enumerates the full process tree via/proc, signals leaf processes first (SIGTERM → SIGKILL escalation after 5s), and writes an atomic exit code file soexecStream()doesn't race against the kill path. Sessiondestroy()uses a fast-path that fires SIGKILL immediately without waiting.process-manager.ts— Exit codes 137 and 143 (signal-style) are now interpreted as'killed'rather than'failed'.kill-all, and process cleanup on session destroy.