Skip to content

fix: repair TCP port-remapping slow path#3

Merged
congwang-mk merged 6 commits intomultikernel:mainfrom
ghazariann:main
Mar 26, 2026
Merged

fix: repair TCP port-remapping slow path#3
congwang-mk merged 6 commits intomultikernel:mainfrom
ghazariann:main

Conversation

@ghazariann
Copy link
Copy Markdown
Contributor

Summary

Fixes #2. Four bugs in the TCP port-remapping slow path. The slow path (triggered when the virtual port is already taken) had not been tested (all existing tests ran sequentially, so the port was always free by the time the second sandbox started).

Bugs fixed

Bug 1 - _allocate_real_port held socket blocked the child's bind

The parent kept the allocated real port socket open to prevent stealing, then
rewrote the child's sockaddr to that port and sent CONTINUE. The child's bind() hit EADDRINUSE because the parent was already holding the same
address:port (only released later in PortMap.close()).
Fixed by closing it immediately, accepting the same tiny race window already mentioned in _try_reserve_port.

Bug 2 - os.pidfd_open not available in Python build

fixup_getsockname used os.pidfd_open which raises AttributeError on some builds.
Fixed by using _libc.syscall(_NR_PIDFD_OPEN) directly, consistent with the
rest of the codebase. In the future, it makes sense to separate all syscalls into
a different file like _syscall.py, since an identical _NR_PIDFD_OPEN was
found in _context.py.

Bug 3 - fixup_getsockname leaked the duplicated fd

s.detach() does not call os.close(), so local_fd was never closed on the
success path. The leaked fd kept the child's socket alive after the child
exited, holding the port bound in the parent. Fixed by moving\ os.close(local_fd) into the finally block.

Bug 4 - NotifSupervisor.stop() never called port_map.close()

Proxy listener sockets were left running after sandbox teardown. Fixed by
calling port_map.close() at the end of stop().

Tests

Two new tests added to TestPortRemap:

  • test_slow_path_host_holds_virtual_port: parent process holds the virtual
    port, forcing the slow path without threads or timing dependencies.
  • test_slow_path_two_concurrent_sandboxes: two sandboxes run concurrently; sandbox 1 holds the port open while sandbox 2 must remap.

Copy link
Copy Markdown
Contributor

@congwang-mk congwang-mk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your PR. Some comments below

_NR_PIDFD_GETFD = 438 # x86_64 and aarch64 (asm-generic)


def _pidfd_open(pid: int) -> int:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicated with the one in _context.py ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I refrain from importing it from _context, should we separate all syscall wrappers into a different file, like _syscalls.py

t1 = threading.Thread(target=run, args=(0, code_hold))
t1.start()
import time
time.sleep(1)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be flaky, could it be improved?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 6e68e2a

@congwang-mk
Copy link
Copy Markdown
Contributor

Re: the _pidfd_open duplication — no need for a _syscalls.py refactor in this PR. Just remove the _pidfd_open function and the _NR_PIDFD_OPEN constant you added to _port_remap.py. Your fixup_getsockname already does from ._context import _pidfd_open, so the local copy is dead code.

@ghazariann
Copy link
Copy Markdown
Contributor Author

Re: the _pidfd_open duplication — no need for a _syscalls.py refactor in this PR. Just remove the _pidfd_open function and the _NR_PIDFD_OPEN constant you added to _port_remap.py. Your fixup_getsockname already does from ._context import _pidfd_open, so the local copy is dead code.

fixed in 2201c42

@congwang-mk congwang-mk merged commit bed191e into multikernel:main Mar 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TCP port remapping fails with EADDRINUSE when virtual port is already in use

2 participants