bug: Flaky TestSystemSymmetricConfigurationRoutesQuorumCalls/All #291

@meling

Bug Description

TestSystemSymmetricConfigurationRoutesQuorumCalls/All fails intermittently with:

--- FAIL: TestSystemSymmetricConfigurationRoutesQuorumCalls/All (2.00s)
    system_test.go:333: quorum call error: quorum call error: incomplete call (errors: 0)

The 2.00s duration matches the TestContext(t, 2*time.Second) timeout exactly, so the call ran until its deadline rather than failing early. The errors: 0 means Threshold(3) returned ErrIncomplete without any node-level errors: fewer than 3 successful responses arrived before the context deadline, and no node reported a failure.

The failure was previously observed at a rate of roughly 1-2% with -count=100, but has not been reproduced recently (a run with -count=10000 passed). The bug may have been partially mitigated by the requeuePendingMsgs deadlock fix, or it may require specific scheduling conditions that are more common on loaded CI machines.

Test Setup

The test creates 3 systems (bufconn, no real TCP) and runs 3 sequential subtests — Majority, First, All — sharing the same cfg := systems[0].OutboundConfig(). Each subtest creates a fresh context with a 2-second timeout. The All subtest calls Threshold(3), requiring a successful response from all 3 nodes. One of those nodes is the self-node (system 0 talking to itself), dispatched via dispatchLocalRequest in internal/stream/channel.go.

Relevant Files

  • internal/stream/channel.go: dispatchLocalRequest, sender, Enqueue
  • client_interceptor.go: defaultResponseSeq, send
  • responses.go: Threshold
  • system_test.go: TestSystemSymmetricConfigurationRoutesQuorumCalls

Debugging Approaches

1. Write a targeted stress test

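The current test runs 3 sequential subtests. Write a new test that isolates the All path and runs many iterations with tight timeouts and runtime.Gosched() calls to increase scheduling variability. Also consider running with GOMAXPROCS=1 (or go test -cpu 1,2,4): a single P serializes goroutine execution, which changes interleavings and makes it easier to provoke a goroutine that is starved or not scheduled before the deadline.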

2. Use -race flag

go test -race -count=1000 -run TestSystemSymmetricConfigurationRoutesQuorumCalls .

A data race on localMu, stream, or the pending map might explain the non-determinism.

3. Use runtime/pprof profiling

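Collect block, mutex, or goroutine profiles during test runs. The block and mutex profiles stay empty unless sampling is enabled first with runtime.SetBlockProfileRate and runtime.SetMutexProfileFraction (or with the go test flags -blockprofile and -mutexprofile):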

import "runtime/pprof"

pprof.Lookup("goroutine").WriteTo(os.Stderr, 1)   // dump goroutine stacks
pprof.Lookup("mutex").WriteTo(os.Stderr, 1)        // mutex contention
pprof.Lookup("block").WriteTo(os.Stderr, 1)        // blocking profile

The goroutine leak profile may require GOEXPERIMENT=goroutineleakprofile.

4. Use runtime/trace

Capture an execution trace during a high-iteration test run:

go test -trace=trace.out -run TestSystemSymmetricConfigurationRoutesQuorumCalls -count=1000 .
go tool trace trace.out

This shows goroutine scheduling, blocking events, and synchronization points, which should reveal whether send() or dispatchLocalRequest stalls.

5. Add per-node response timestamps

Modify the test to record when each response arrives (local vs. remote) to determine which node's response is missing when the call fails.

6. Make Enqueue context-aware

Currently, Enqueue only watches connCtx in the outbound path. Adding req.Ctx to the select would prevent send() from blocking past the call's deadline:

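select {
case <-c.connCtx.Done():
    // The connection is shutting down.
    c.replyError(req, ErrNodeClosed)
case <-req.Ctx.Done():
    // The call's own deadline or cancellation fired while waiting for queue space.
    c.replyError(req, req.Ctx.Err())
case c.sendQ <- req:
    // Handed off to the sender goroutine.
}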
