Fix background database account refresh stopping in multi-writer accounts by jeet1995 · Pull Request #48758 · Azure/azure-sdk-for-java

jeet1995 · 2026-04-10T00:51:02Z

Problem

The GlobalEndpointManager background refresh timer silently stops in multi-writer accounts, preventing the SDK from detecting topology changes (e.g., multi-write to single-write transitions).

Root Cause

In refreshLocationPrivateAsync(), when LocationCache.shouldRefreshEndpoints() returns false, the timer is never restarted:

} else {
    logger.debug("shouldRefreshEndpoints: false, nothing to do.");
    this.isRefreshing.set(false);
    return Mono.empty(); // timer dies here
}

For multi-writer accounts, shouldRefreshEndpoints() returns false when the preferred write endpoint matches the current primary, which is the steady-state condition. Once that happens, no further background refreshes occur for the lifetime of the client. The bug has existed since PR #6139 (Nov 2019), which acknowledged it in point #4 of its description.
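To make the failure mode concrete, here is a minimal, framework-free sketch of a one-shot timer that has to be re-armed by its own callback. It is a toy model, not the SDK's code: `OneShotRefreshModel`, `onTimerFired`, and the `timerPending` flag are hypothetical names, and the real implementation is built on Reactor. It shows that once the callback takes the branch that does not re-arm, every later expiry is a no-op.

```java
/** Toy model of the unpatched behavior; names are hypothetical, not SDK APIs. */
class OneShotRefreshModel {
    private boolean timerPending = true; // armed once during client initialization
    private int refreshCount = 0;

    /**
     * Simulates one timer expiry. The callback re-arms the timer only when
     * endpoints need refreshing, mirroring the unpatched else-branch.
     */
    void onTimerFired(boolean shouldRefreshEndpoints) {
        if (!timerPending) {
            return; // nothing is scheduled: the chain is already broken
        }
        timerPending = false; // a one-shot timer is consumed when it fires
        refreshCount++;
        if (shouldRefreshEndpoints) {
            timerPending = true; // re-arm for the next interval
        }
        // else: no re-arm, so no further refresh will ever run
    }

    int refreshCount() {
        return refreshCount;
    }

    boolean isTimerPending() {
        return timerPending;
    }

    public static void main(String[] args) {
        OneShotRefreshModel model = new OneShotRefreshModel();
        // Steady state for a multi-writer account: shouldRefreshEndpoints is false.
        for (int tick = 0; tick < 5; tick++) {
            model.onTimerFired(false);
        }
        System.out.println("refreshes=" + model.refreshCount()
                + " timerPending=" + model.isTimerPending());
    }
}
```

In this model only the first expiry does any work; the remaining four find no pending timer.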

Behavioral Difference with .NET SDK

The .NET SDK (azure-cosmos-dotnet-v3) handles this correctly in StartLocationBackgroundRefreshLoop():

if (!this.locationCache.ShouldRefreshEndpoints(out bool canRefreshInBackground))
{
    if (!canRefreshInBackground)
    {
        this.isBackgroundAccountRefreshActive = false;
        return; // only stops when canRefreshInBackground is explicitly false
    }
}
// otherwise: delay -> refresh -> loop again

The .NET loop only terminates when canRefreshInBackground is explicitly false -- it continues even when ShouldRefreshEndpoints() returns false.
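The termination rule in the .NET snippet above reduces to a single boolean expression. The Java sketch below restates it as a pure function; `LoopDecision` and `keepLooping` are illustrative names and not part of either SDK.

```java
/** Restates the .NET loop's stop condition as a pure function (illustrative only). */
class LoopDecision {
    /**
     * The loop stops only when ShouldRefreshEndpoints is false AND
     * canRefreshInBackground is false; in every other case it continues.
     */
    static boolean keepLooping(boolean shouldRefreshEndpoints, boolean canRefreshInBackground) {
        return shouldRefreshEndpoints || canRefreshInBackground;
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        for (boolean shouldRefresh : values) {
            for (boolean canRefresh : values) {
                System.out.printf(
                        "shouldRefreshEndpoints=%b canRefreshInBackground=%b -> keepLooping=%b%n",
                        shouldRefresh, canRefresh, keepLooping(shouldRefresh, canRefresh));
            }
        }
    }
}
```

The case relevant to this PR is `shouldRefreshEndpoints=false` with `canRefreshInBackground=true`, for which the loop continues.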

Fix

Add startRefreshLocationTimerAsync() to the else branch of refreshLocationPrivateAsync():

} else {
    logger.debug("shouldRefreshEndpoints: false, nothing to do.");
    if (!this.refreshInBackground.get()) {
        this.startRefreshLocationTimerAsync();
    }
    this.isRefreshing.set(false);
    return Mono.empty();
}
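The guard around the new call keeps the else-branch from scheduling a second timer if one is already pending. The sketch below models that guard with a plain `AtomicBoolean`. It assumes `refreshInBackground` means "a background timer is already scheduled", which the diff implies but does not show; `GuardedTimerModel`, `scheduleTimer`, `onNothingToRefresh`, and `timersScheduled` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Toy model of the guarded re-arm; assumes the flag means "timer already scheduled". */
class GuardedTimerModel {
    private final AtomicBoolean refreshInBackground = new AtomicBoolean(false);
    private int timersScheduled = 0;

    /** Stand-in for startRefreshLocationTimerAsync(): marks a timer as pending. */
    private void scheduleTimer() {
        refreshInBackground.set(true);
        timersScheduled++;
    }

    /** Patched else-branch: re-arm only if no timer is pending. */
    void onNothingToRefresh() {
        if (!refreshInBackground.get()) {
            scheduleTimer();
        }
    }

    /** Simulates the pending timer firing, which clears the flag. */
    void onTimerFired() {
        refreshInBackground.set(false);
    }

    int timersScheduled() {
        return timersScheduled;
    }

    public static void main(String[] args) {
        GuardedTimerModel model = new GuardedTimerModel();
        model.onNothingToRefresh(); // schedules timer #1
        model.onNothingToRefresh(); // no-op: timer #1 is still pending
        model.onTimerFired();       // timer #1 fires and clears the flag
        model.onNothingToRefresh(); // schedules timer #2
        System.out.println("timersScheduled=" + model.timersScheduled());
    }
}
```

The sketch mirrors the diff's `get()` check, which is a check-then-act sequence; whether the SDK needs a stricter `compareAndSet` depends on how its callers are serialized, which this PR does not discuss.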

Unit Tests

All 6 tests in GlobalEndPointManagerTest pass. Two are affected by this change:

backgroundRefreshForMultiMaster: Updated assertion from assertFalse to assertTrue -- timer must keep running
backgroundRefreshDetectsTopologyChangeForMultiMaster: New -- simulates MW to SW transition via mock, verifies writableLocations update from 2 to 1
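The shape of the new test can be sketched without the SDK or a mocking library. In the toy below, a queue of canned account snapshots stands in for the mocked account read, and each simulated background refresh copies the latest snapshot into a cache. All names (`TopologyChangeSketch`, `backgroundRefresh`, `writableLocations`) are hypothetical; the real test exercises GlobalEndpointManager against a mock.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Toy stand-in for the mocked topology-change test; not the actual test code. */
class TopologyChangeSketch {
    private final Deque<List<String>> accountReads = new ArrayDeque<>();
    private List<String> writableLocations = List.of();

    TopologyChangeSketch(List<List<String>> cannedReads) {
        accountReads.addAll(cannedReads);
    }

    /** One background refresh: read the account and update the cached topology. */
    void backgroundRefresh() {
        List<String> next = accountReads.poll();
        if (next != null) {
            writableLocations = next;
        }
    }

    List<String> writableLocations() {
        return writableLocations;
    }

    public static void main(String[] args) {
        TopologyChangeSketch sketch = new TopologyChangeSketch(List.of(
                List.of("East US", "West US"), // multi-write
                List.of("East US")));          // after the MW to SW transition
        sketch.backgroundRefresh();
        System.out.println("writable before: " + sketch.writableLocations().size());
        sketch.backgroundRefresh();
        System.out.println("writable after: " + sketch.writableLocations().size());
    }
}
```

The second refresh only happens if the timer is still running, which is exactly what the updated assertion in backgroundRefreshForMultiMaster checks.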

Live DR Drill Validation

Test Configuration

| Setting | Value |
|---|---|
| Account | `bgrefresh-mw-test-440` (ephemeral tenant) |
| Regions | East US (hub) + West US |
| Endpoint | Global (`bgrefresh-mw-test-440.documents.azure.com`) |
| Preferred region | West US |
| Connection mode | DIRECT (RNTBD) |
| Background refresh | 30s (overridden via `COSMOS.UNAVAILABLE_LOCATIONS_EXPIRATION_TIME_IN_SECONDS=30`, default: 300s/5min) |
| SDK | azure-cosmos 4.80.0-beta.1 (patched) |
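For reference, one way to apply the refresh-interval override is sketched below. It assumes the SDK reads `COSMOS.UNAVAILABLE_LOCATIONS_EXPIRATION_TIME_IN_SECONDS` as a JVM system property and that it must be set before the client is built; the PR gives only the name and value, so treat the mechanism as an assumption. `RefreshIntervalOverride` is a hypothetical helper.

```java
/** Sets the override used in the drill; assumes it is read as a JVM system property. */
class RefreshIntervalOverride {
    static final String PROPERTY = "COSMOS.UNAVAILABLE_LOCATIONS_EXPIRATION_TIME_IN_SECONDS";

    /** Applies the override and returns the value now visible to the JVM. */
    static int apply(int seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("seconds must be positive");
        }
        System.setProperty(PROPERTY, Integer.toString(seconds));
        return Integer.getInteger(PROPERTY);
    }

    public static void main(String[] args) {
        // Assumption: this must run before any Cosmos client is constructed.
        System.out.println(PROPERTY + "=" + apply(30));
    }
}
```

Under the same assumption, passing `-DCOSMOS.UNAVAILABLE_LOCATIONS_EXPIRATION_TIME_IN_SECONDS=30` on the `java` command line would have the same effect.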

Patched vs Unpatched

| Metric | Patched | Unpatched |
|---|---|---|
| Background refreshes (51 min) | 103 | 0 |
| Metadata transitions detected | 9/9 | 0/9 |
| Reads | 305 | ~20 |
| Errors | 0 | 0 (but topology changes never detected) |
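As a quick sanity check on the refresh count, a 30 s interval over a 51-minute run should yield roughly 51 × 60 / 30 = 102 refreshes, which is in line with the 103 observed on the patched SDK. The sketch below does the arithmetic; `ExpectedRefreshes` is an illustrative name, and the one-refresh difference is presumably down to timing at the run boundaries.

```java
import java.time.Duration;

/** Back-of-the-envelope check of the patched refresh count (illustrative only). */
class ExpectedRefreshes {
    /** Number of whole refresh intervals that fit into the run. */
    static long expected(Duration runLength, Duration interval) {
        return runLength.getSeconds() / interval.getSeconds();
    }

    public static void main(String[] args) {
        long expected = expected(Duration.ofMinutes(51), Duration.ofSeconds(30));
        long observed = 103; // patched SDK, from the drill results
        System.out.println("expected=" + expected + " observed=" + observed);
    }
}
```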

Drill 1: MW to SW to MW Transition

Both transitions detected. writableLocations changed from [EUS, WUS] to [EUS] and back to [EUS, WUS], with metadata propagation lags of ~3 min and ~2 min respectively.

Drill 2: Region Removal + Re-add

Detected. readableLocations shrank from [EUS, WUS] to [EUS] then restored. DIRECT mode note: reads continued via existing RNTBD connections even after region metadata removal (by design).

Drill 3: Failover Priority Change

All transitions detected. failover-priority-change propagated fastest (~1 min vs ~5 min for region removal).

Summary

Changes

1 file changed, 10 insertions (GlobalEndpointManager.java)
1 file changed, 50 insertions, 1 deletion (GlobalEndPointManagerTest.java)

…unts

In multi-writer accounts, refreshLocationPrivateAsync() stops the background refresh timer when shouldRefreshEndpoints() returns false. This means topology changes (e.g., multi-write to single-write transitions) go undetected until the next explicit refresh trigger.

The .NET SDK (azure-cosmos-dotnet-v3) correctly continues the background refresh loop unconditionally - the loop only stops when canRefreshInBackground is explicitly false, not when shouldRefreshEndpoints returns false.

This fix adds startRefreshLocationTimerAsync() to the else-branch of refreshLocationPrivateAsync(), ensuring the background timer always reschedules itself regardless of whether endpoints currently need refreshing.

Without this fix, after a multi-write -> single-write -> multi-write transition, reads remain stuck on the primary region because the SDK never re-reads account metadata to learn about the restored multi-write topology.

Related: PR Azure#6139 (point #4 in description acknowledged this bug)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- backgroundRefreshForMultiMaster: changed assertion from assertFalse to assertTrue - the timer must keep running even when shouldRefreshEndpoints returns false
- backgroundRefreshDetectsTopologyChangeForMultiMaster: new test that proves the fix works by simulating a multi-write to single-write transition and verifying the background refresh detects the topology change

All 6 GlobalEndPointManagerTest tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Charts and report from live DR drill validation of the background refresh fix on account bgrefresh-mw-test-440 (East US + West US).

3 drills over 51 minutes, 103 background refreshes, 305 reads, 0 errors:

- Drill 1: MW→SW→MW transition (both detected)
- Drill 2: Region removal + re-add (detected, DIRECT mode connections persist)
- Drill 3: Failover priority change (fastest propagation ~1 min)

Config: Global endpoint, preferred=West US, DIRECT mode, 30s refresh (overridden via COSMOS.UNAVAILABLE_LOCATIONS_EXPIRATION_TIME_IN_SECONDS).

Unpatched SDK: 0 refreshes (timer stops after init).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jeet1995 · 2026-04-10T14:37:07Z

DR Drill Test Results

Test Configuration

| Setting | Value |
|---|---|
| Account | `bgrefresh-mw-test-440` (ephemeral tenant) |
| Regions | East US (hub) + West US |
| Endpoint | Global (`bgrefresh-mw-test-440.documents.azure.com`) |
| Preferred region | West US |
| Connection mode | DIRECT (RNTBD) |
| Background refresh | 30s (overridden via `COSMOS.UNAVAILABLE_LOCATIONS_EXPIRATION_TIME_IN_SECONDS=30`, default: 300s) |
| SDK | azure-cosmos 4.80.0-beta.1 (patched with this fix) |

Summary

| Metric | Patched | Unpatched |
|---|---|---|
| Background refreshes (51 min) | 103 | 0 |
| Metadata transitions detected | 9/9 | 0/9 |
| Read errors | 0 | 0 (but topology changes never detected) |
| Read availability | 100% | 100% (stuck on stale routing) |

Per-Drill Results

Drill 1: MW→SW→MW Transition — Both transitions detected (3 min / 2 min metadata propagation lag)

Drill 2: Region Removal + Re-add — Detected. DIRECT mode note: reads continue via existing RNTBD connections even after metadata removal.

Drill 3: Failover Priority Change — All transitions detected. Fastest propagation (~1 min vs ~5 min for region removal).

Summary

Full report: `dr-drill-report.md`

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jeet1995 added the Cosmos label Apr 10, 2026

jeet1995 and others added 2 commits April 9, 2026 21:27

Remove test artifacts from PR branch

c95fb7b

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jeet1995 wants to merge 4 commits into Azure:main from jeet1995:fix/background-refresh-multi-writer
