Bug Description
TestSystemSymmetricConfigurationRoutesQuorumCalls/All fails intermittently with:
--- FAIL: TestSystemSymmetricConfigurationRoutesQuorumCalls/All (2.00s)
system_test.go:333: quorum call error: quorum call error: incomplete call (errors: 0)
The 2.00s duration matches the TestContext(t, 2*time.Second) timeout exactly. The errors: 0 means Threshold(3) returned ErrIncomplete without any node-level errors — fewer than 3 successful responses arrived before the context deadline.
The failure was previously observed at ~1-2% rate with -count=100, but has not been reproduced recently (-count=10000 passed). The bug may have been partially mitigated by the requeuePendingMsgs deadlock fix, or it may require specific scheduling conditions that are more common on loaded CI machines.
Test Setup
The test creates 3 systems (bufconn, no real TCP) and runs 3 sequential subtests — Majority, First, All — sharing the same cfg := systems[0].OutboundConfig(). Each subtest creates a fresh context with a 2-second timeout. The All subtest calls Threshold(3), requiring a successful response from all 3 nodes. One of those nodes is the self-node (system 0 talking to itself), dispatched via dispatchLocalRequest in internal/stream/channel.go.
Relevant Files
- internal/stream/channel.go: dispatchLocalRequest, sender, Enqueue
- client_interceptor.go: defaultResponseSeq, send
- responses.go: Threshold
- system_test.go: TestSystemSymmetricConfigurationRoutesQuorumCalls
Debugging Approaches
1. Write a targeted stress test
The current test runs 3 sequential subtests. Write a new test that isolates the All path and runs many iterations with tight timeouts and runtime.Gosched() calls to increase scheduling variability. Consider using GOMAXPROCS=1 to serialize goroutine execution and amplify contention.
2. Use -race flag
go test -race -count=1000 -run TestSystemSymmetricConfigurationRoutesQuorumCalls .
A data race on localMu, stream, or the pending map might explain the non-determinism.
3. Use runtime/pprof profiling
Collect block, mutex, or goroutine profiles during test runs:
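import ("os"; "runtime"; "runtime/pprof")
runtime.SetMutexProfileFraction(1) // the mutex profile stays empty unless sampling is enabled
runtime.SetBlockProfileRate(1) // the block profile stays empty unless a rate is set
pprof.Lookup("goroutine").WriteTo(os.Stderr, 1) // dump goroutine stacks
pprof.Lookup("mutex").WriteTo(os.Stderr, 1) // mutex contention
pprof.Lookup("block").WriteTo(os.Stderr, 1) // blocking profile
The goroutine leak profile may require GOEXPERIMENT=goroutineleakprofile.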
4. Use runtime/trace
Capture an execution trace during a high-iteration test run:
go test -trace=trace.out -run TestSystemSymmetricConfigurationRoutesQuorumCalls -count=1000 .
go tool trace trace.out
This shows goroutine scheduling, blocking events, and synchronization points, which should reveal whether send() or dispatchLocalRequest stalls.
5. Add per-node response timestamps
Modify the test to record when each response arrives (local vs. remote) to determine which node's response is missing when the call fails.
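One way to do this is a small recorder that the test calls whenever a response is observed. The sketch below is an assumption about how that could look; `arrivalLog` and its methods are hypothetical and not part of the existing code, and how the test hooks into per-node responses is left open.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// arrivalLog records when the first response from each node was observed,
// measured from the moment the log was created.
type arrivalLog struct {
	mu    sync.Mutex
	start time.Time
	seen  map[uint32]time.Duration
}

func newArrivalLog() *arrivalLog {
	return &arrivalLog{start: time.Now(), seen: make(map[uint32]time.Duration)}
}

// record stores the elapsed time of the first response from nodeID.
// Later responses from the same node are ignored.
func (a *arrivalLog) record(nodeID uint32) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if _, ok := a.seen[nodeID]; !ok {
		a.seen[nodeID] = time.Since(a.start)
	}
}

// missing returns the expected node IDs that never responded, in ascending order.
func (a *arrivalLog) missing(expected []uint32) []uint32 {
	a.mu.Lock()
	defer a.mu.Unlock()
	var out []uint32
	for _, id := range expected {
		if _, ok := a.seen[id]; !ok {
			out = append(out, id)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

func main() {
	al := newArrivalLog()
	al.record(2)
	al.record(3)
	// Node 1 never responded; in the real test this would be logged on failure.
	fmt.Println("missing:", al.missing([]uint32{1, 2, 3}))
}
```

If the missing node is consistently the self-node, that points at dispatchLocalRequest; if it is a remote node, the outbound send path is the more likely suspect.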
6. Make Enqueue context-aware
Currently, Enqueue only watches connCtx in the outbound path. Adding req.Ctx to the select would prevent send() from blocking past the call's deadline:
select {
case <-c.connCtx.Done():
c.replyError(req, ErrNodeClosed)
case <-req.Ctx.Done():
c.replyError(req, req.Ctx.Err())
case c.sendQ <- req:
}