
Fix background database account refresh stopping in multi-writer accounts #48758

Draft
jeet1995 wants to merge 4 commits into Azure:main from jeet1995:fix/background-refresh-multi-writer

jeet1995 commented Apr 10, 2026

Problem

The GlobalEndpointManager background refresh timer silently stops in multi-writer accounts, preventing the SDK from detecting topology changes (e.g., multi-write to single-write transitions).

Root Cause

In refreshLocationPrivateAsync(), when LocationCache.shouldRefreshEndpoints() returns false, the timer is never restarted:

} else {
    logger.debug("shouldRefreshEndpoints: false, nothing to do.");
    this.isRefreshing.set(false);
    return Mono.empty(); // timer dies here
}

For multi-writer accounts, shouldRefreshEndpoints() returns false when the preferred write endpoint matches the current primary, which is the steady-state condition. Once that happens, no further background refreshes occur for the lifetime of the client. The bug has existed since PR #6139 (Nov 2019; see point #4 in that PR's description).

Behavioral Difference with .NET SDK

The .NET SDK (azure-cosmos-dotnet-v3) handles this correctly in StartLocationBackgroundRefreshLoop():

if (!this.locationCache.ShouldRefreshEndpoints(out bool canRefreshInBackground))
{
    if (!canRefreshInBackground)
    {
        this.isBackgroundAccountRefreshActive = false;
        return; // only stops when canRefreshInBackground is explicitly false
    }
}
// otherwise: delay -> refresh -> loop again

The .NET loop only terminates when canRefreshInBackground is explicitly false -- it continues even when ShouldRefreshEndpoints() returns false.
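
The difference between the two loops can be reduced to a pair of predicates. The sketch below is illustrative only: the class and method names are hypothetical, and the Java predicate models the unpatched behavior as described in this PR rather than quoting SDK code.

```java
/** Hypothetical truth-table model of when each background loop keeps running. */
final class LoopTerminationSketch {

    /** Unpatched Java SDK, as described above: the timer is only rescheduled on the refresh path. */
    static boolean javaUnpatchedKeepsRunning(boolean shouldRefreshEndpoints) {
        return shouldRefreshEndpoints;
    }

    /**
     * .NET loop, per the snippet above: it returns early only when
     * ShouldRefreshEndpoints() is false AND canRefreshInBackground is false.
     */
    static boolean dotnetKeepsRunning(boolean shouldRefreshEndpoints, boolean canRefreshInBackground) {
        return shouldRefreshEndpoints || canRefreshInBackground;
    }

    public static void main(String[] args) {
        // Steady state, assuming canRefreshInBackground is true there:
        // shouldRefreshEndpoints is false, so the two loops diverge.
        System.out.println("java (unpatched): " + javaUnpatchedKeepsRunning(false));
        System.out.println(".NET:             " + dotnetKeepsRunning(false, true));
    }
}
```

The two predicates disagree only when shouldRefreshEndpoints is false and canRefreshInBackground is true, which is the case this PR targets.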

Fix

Add a call to startRefreshLocationTimerAsync() in the else branch of refreshLocationPrivateAsync(), so the timer is rescheduled even when no endpoint refresh is needed:

} else {
    logger.debug("shouldRefreshEndpoints: false, nothing to do.");
    if (!this.refreshInBackground.get()) {
        this.startRefreshLocationTimerAsync();
    }
    this.isRefreshing.set(false);
    return Mono.empty();
}
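
To illustrate the effect of the change, here is a minimal, deterministic model of a one-shot timer that must be rescheduled after each firing. It is a sketch under simplifying assumptions: `RefreshTimerSketch` and its members are hypothetical, there is no Reactor or real delay, and the steady state is modeled by shouldRefreshEndpoints always returning false.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical, simplified model of the refresh timer chain; not the SDK's code. */
final class RefreshTimerSketch {
    private final boolean patched;
    private final BooleanSupplier shouldRefreshEndpoints;
    private boolean timerScheduled = true; // the first tick is scheduled at client init
    private int ticksFired = 0;

    RefreshTimerSketch(boolean patched, BooleanSupplier shouldRefreshEndpoints) {
        this.patched = patched;
        this.shouldRefreshEndpoints = shouldRefreshEndpoints;
    }

    /** One timer firing; mirrors the if/else shape of refreshLocationPrivateAsync(). */
    private void onTimerFired() {
        timerScheduled = false; // a one-shot timer is consumed when it fires
        ticksFired++;
        if (shouldRefreshEndpoints.getAsBoolean()) {
            timerScheduled = true; // refresh path: already rescheduled before this PR
        } else if (patched) {
            timerScheduled = true; // the fix: reschedule in the else branch too
        }
        // Unpatched else branch: nothing reschedules, so the chain ends here.
    }

    /** Fires ticks until the chain dies or maxTicks is reached; returns ticks fired. */
    int run(int maxTicks) {
        while (timerScheduled && ticksFired < maxTicks) {
            onTimerFired();
        }
        return ticksFired;
    }

    /** Steady-state multi-writer case: shouldRefreshEndpoints is always false. */
    static int simulate(boolean patched, int maxTicks) {
        return new RefreshTimerSketch(patched, () -> false).run(maxTicks);
    }

    public static void main(String[] args) {
        System.out.println("unpatched ticks: " + simulate(false, 10));
        System.out.println("patched ticks:   " + simulate(true, 10));
    }
}
```

In this model the unpatched chain fires once and stops, while the patched chain keeps firing until the tick budget runs out.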

Unit Tests

6/6 pass:

  • backgroundRefreshForMultiMaster: Updated assertion from assertFalse to assertTrue -- timer must keep running
  • backgroundRefreshDetectsTopologyChangeForMultiMaster: New -- simulates MW to SW transition via mock, verifies writableLocations update from 2 to 1
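
The idea behind the new test can be sketched without the SDK or a mocking library. The code below is not the actual test: `TopologyChangeSketch`, `observe`, and `demo` are hypothetical names, and the account reader is a plain `Supplier` standing in for the mocked database account call. It only shows that a loop which keeps refreshing picks up a change in writableLocations from 2 to 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for the test scenario: a multi-write to single-write transition. */
final class TopologyChangeSketch {

    /** Records the number of writable locations seen after each background refresh. */
    static List<Integer> observe(Supplier<List<String>> accountReader, int refreshes) {
        List<String> cachedWritableLocations;
        List<Integer> sizes = new ArrayList<>();
        for (int i = 0; i < refreshes; i++) {
            // Each refresh re-reads account metadata and replaces the cached view.
            cachedWritableLocations = accountReader.get();
            sizes.add(cachedWritableLocations.size());
        }
        return sizes;
    }

    /** The first two reads report two write regions; later reads report one. */
    static List<Integer> demo() {
        int[] calls = {0};
        Supplier<List<String>> reader = () -> calls[0]++ < 2
            ? Arrays.asList("East US", "West US")
            : Arrays.asList("East US");
        return observe(reader, 4);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

If the refresh loop stopped after the first read, as in the unpatched model, the cached view would stay at two writable locations indefinitely.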

Live DR Drill Validation

Test Configuration

| Setting | Value |
| --- | --- |
| Account | bgrefresh-mw-test-440 (ephemeral tenant) |
| Regions | East US (hub) + West US |
| Endpoint | Global (bgrefresh-mw-test-440.documents.azure.com) |
| Preferred region | West US |
| Connection mode | DIRECT (RNTBD) |
| Background refresh | 30s (overridden via COSMOS.UNAVAILABLE_LOCATIONS_EXPIRATION_TIME_IN_SECONDS=30, default: 300s/5min) |
| SDK | azure-cosmos 4.80.0-beta.1 (patched) |
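
For anyone reproducing the drill, the interval override might be applied as shown below. This is a sketch under an assumption: that the SDK reads this key as a JVM system property before the client is built. The helper class is hypothetical and only demonstrates the override-with-default lookup, not the SDK's own configuration code.

```java
/** Hypothetical helper showing an override-with-default lookup for the refresh interval. */
final class RefreshIntervalConfigSketch {
    // Key and default value as quoted in the test configuration above.
    static final String KEY = "COSMOS.UNAVAILABLE_LOCATIONS_EXPIRATION_TIME_IN_SECONDS";
    static final int DEFAULT_SECONDS = 300;

    /** Returns the system-property override if present and numeric, else the default. */
    static int effectiveIntervalSeconds() {
        return Integer.getInteger(KEY, DEFAULT_SECONDS);
    }

    public static void main(String[] args) {
        // Assumption: the property must be set before the Cosmos client is constructed.
        System.setProperty(KEY, "30");
        System.out.println("refresh interval (s): " + effectiveIntervalSeconds());
    }
}
```

The same value could instead be passed on the command line with `-DCOSMOS.UNAVAILABLE_LOCATIONS_EXPIRATION_TIME_IN_SECONDS=30`, again assuming the SDK reads it as a system property.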

Patched vs Unpatched

| Metric | Patched | Unpatched |
| --- | --- | --- |
| Background refreshes (51 min) | 103 | 0 |
| Metadata transitions detected | 9/9 | 0/9 |
| Reads | 305 | ~20 |
| Errors | 0 | 0 (but topology changes never detected) |
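
As a sanity check on the refresh count, a 30s interval over a 51-minute window gives roughly 102 expected firings, which is consistent with the 103 observed. The arithmetic is sketched below; the class name is hypothetical, and the calculation ignores scheduling jitter and the exact start and end of the window.

```java
/** Hypothetical back-of-the-envelope check of the expected background refresh count. */
final class RefreshCountCheck {

    /** Whole refresh intervals that fit into the observation window. */
    static long expectedRefreshes(long windowMinutes, long intervalSeconds) {
        return windowMinutes * 60 / intervalSeconds;
    }

    public static void main(String[] args) {
        // 51 min * 60 s/min / 30 s per refresh = 102
        System.out.println(expectedRefreshes(51, 30));
    }
}
```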

Drill 1: MW to SW to MW Transition

[Chart: Drill 1]

Both transitions were detected. writableLocations changed from [EUS, WUS] to [EUS] and back to [EUS, WUS], with metadata propagation lags of ~3 min and ~2 min.

Drill 2: Region Removal + Re-add

[Chart: Drill 2]

Detected. readableLocations shrank from [EUS, WUS] to [EUS] then restored. DIRECT mode note: reads continued via existing RNTBD connections even after region metadata removal (by design).

Drill 3: Failover Priority Change

[Chart: Drill 3]

All transitions detected. failover-priority-change propagated fastest (~1 min vs ~5 min for region removal).

Summary

[Chart: Summary]

Changes

  • 1 file changed, 10 insertions (GlobalEndpointManager.java)
  • 1 file changed, 50 insertions, 1 deletion (GlobalEndPointManagerTest.java)

…unts

In multi-writer accounts, refreshLocationPrivateAsync() stops the background
refresh timer when shouldRefreshEndpoints() returns false. This means topology
changes (e.g., multi-write to single-write transitions) go undetected until
the next explicit refresh trigger.

The .NET SDK (azure-cosmos-dotnet-v3) correctly continues the background
refresh loop unconditionally - the loop only stops when canRefreshInBackground
is explicitly false, not when shouldRefreshEndpoints returns false.

This fix adds startRefreshLocationTimerAsync() to the else-branch of
refreshLocationPrivateAsync(), ensuring the background timer always reschedules
itself regardless of whether endpoints currently need refreshing.

Without this fix, after a multi-write -> single-write -> multi-write transition,
reads remain stuck on the primary region because the SDK never re-reads account
metadata to learn about the restored multi-write topology.

Related: PR Azure#6139 (point #4 in description acknowledged this bug)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
jeet1995 and others added 2 commits April 9, 2026 21:27
- backgroundRefreshForMultiMaster: changed assertion from assertFalse to
  assertTrue - the timer must keep running even when shouldRefreshEndpoints
  returns false
- backgroundRefreshDetectsTopologyChangeForMultiMaster: new test that proves
  the fix works by simulating a multi-write to single-write transition and
  verifying the background refresh detects the topology change

All 6 GlobalEndPointManagerTest tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Charts and report from live DR drill validation of the background
refresh fix on account bgrefresh-mw-test-440 (East US + West US).

3 drills over 51 minutes, 103 background refreshes, 305 reads, 0 errors:
- Drill 1: MW→SW→MW transition (both detected)
- Drill 2: Region removal + re-add (detected, DIRECT mode connections persist)
- Drill 3: Failover priority change (fastest propagation ~1 min)

Config: Global endpoint, preferred=West US, DIRECT mode, 30s refresh
(overridden via COSMOS.UNAVAILABLE_LOCATIONS_EXPIRATION_TIME_IN_SECONDS).
Unpatched SDK: 0 refreshes (timer stops after init).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995

DR Drill Test Results

Test Configuration

| Setting | Value |
| --- | --- |
| Account | bgrefresh-mw-test-440 (ephemeral tenant) |
| Regions | East US (hub) + West US |
| Endpoint | Global (bgrefresh-mw-test-440.documents.azure.com) |
| Preferred region | West US |
| Connection mode | DIRECT (RNTBD) |
| Background refresh | 30s (overridden via COSMOS.UNAVAILABLE_LOCATIONS_EXPIRATION_TIME_IN_SECONDS=30, default: 300s) |
| SDK | azure-cosmos 4.80.0-beta.1 (patched with this fix) |

Summary

| Metric | Patched | Unpatched |
| --- | --- | --- |
| Background refreshes (51 min) | 103 | 0 |
| Metadata transitions detected | 9/9 | 0/9 |
| Read errors | 0 | 0 (but topology changes never detected) |
| Read availability | 100% | 100% (stuck on stale routing) |

Per-Drill Results

Drill 1: MW→SW→MW Transition — Both transitions detected (3 min / 2 min metadata propagation lag)

[Chart: Drill 1]

Drill 2: Region Removal + Re-add — Detected. DIRECT mode note: reads continue via existing RNTBD connections even after metadata removal.

[Chart: Drill 2]

Drill 3: Failover Priority Change — All transitions detected. Fastest propagation (~1 min vs ~5 min for region removal).

[Chart: Drill 3]

Summary

[Chart: Summary]

Full report: dr-drill-report.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>