Fix background database account refresh stopping in multi-writer accounts #48758
Draft
jeet1995 wants to merge 4 commits into Azure:main from
…unts

In multi-writer accounts, refreshLocationPrivateAsync() stops the background refresh timer when shouldRefreshEndpoints() returns false. This means topology changes (e.g., multi-write to single-write transitions) go undetected until the next explicit refresh trigger.

The .NET SDK (azure-cosmos-dotnet-v3) correctly continues the background refresh loop unconditionally - the loop only stops when canRefreshInBackground is explicitly false, not when shouldRefreshEndpoints returns false.

This fix adds startRefreshLocationTimerAsync() to the else-branch of refreshLocationPrivateAsync(), ensuring the background timer always reschedules itself regardless of whether endpoints currently need refreshing.

Without this fix, after a multi-write -> single-write -> multi-write transition, reads remain stuck on the primary region because the SDK never re-reads account metadata to learn about the restored multi-write topology.

Related: PR Azure#6139 (point #4 in description acknowledged this bug)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- backgroundRefreshForMultiMaster: changed assertion from assertFalse to assertTrue - the timer must keep running even when shouldRefreshEndpoints returns false
- backgroundRefreshDetectsTopologyChangeForMultiMaster: new test that proves the fix works by simulating a multi-write to single-write transition and verifying the background refresh detects the topology change

All 6 GlobalEndPointManagerTest tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Charts and report from live DR drill validation of the background refresh fix on account bgrefresh-mw-test-440 (East US + West US).

3 drills over 51 minutes, 103 background refreshes, 305 reads, 0 errors:
- Drill 1: MW→SW→MW transition (both detected)
- Drill 2: Region removal + re-add (detected, DIRECT mode connections persist)
- Drill 3: Failover priority change (fastest propagation ~1 min)

Config: Global endpoint, preferred=West US, DIRECT mode, 30s refresh (overridden via COSMOS.UNAVAILABLE_LOCATIONS_EXPIRATION_TIME_IN_SECONDS).

Unpatched SDK: 0 refreshes (timer stops after init).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
DR Drill Test Results

Test Configuration

Summary

Per-Drill Results

Drill 1: MW→SW→MW Transition - Both transitions detected (3 min / 2 min metadata propagation lag).
Drill 2: Region Removal + Re-add - Detected. DIRECT mode note: reads continue via existing RNTBD connections even after metadata removal.
Drill 3: Failover Priority Change - All transitions detected. Fastest propagation (~1 min vs ~5 min for region removal).

Summary

Full report:
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>




Problem
The GlobalEndpointManager background refresh timer silently stops in multi-writer accounts, preventing the SDK from detecting topology changes (e.g., multi-write to single-write transitions).

Root Cause
In refreshLocationPrivateAsync(), when LocationCache.shouldRefreshEndpoints() returns false, the timer is never restarted.

For multi-writer accounts, shouldRefreshEndpoints() returns false when the preferred write endpoint matches the current primary -- a steady-state condition. Once that happens, no further background refreshes occur for the lifetime of the client. The bug has existed since PR #6139 (Nov 2019, point #4 in description).

Behavioral Difference with .NET SDK
The .NET SDK (azure-cosmos-dotnet-v3) handles this correctly in StartLocationBackgroundRefreshLoop().

The .NET loop only terminates when canRefreshInBackground is explicitly false -- it continues even when ShouldRefreshEndpoints() returns false.

Fix
Add startRefreshLocationTimerAsync() to the else branch of refreshLocationPrivateAsync().

Unit Tests
6/6 pass:
- backgroundRefreshForMultiMaster: Updated assertion from assertFalse to assertTrue -- timer must keep running
- backgroundRefreshDetectsTopologyChangeForMultiMaster: New -- simulates MW to SW transition via mock, verifies writableLocations update from 2 to 1

Live DR Drill Validation
Test Configuration
- bgrefresh-mw-test-440 (ephemeral tenant)
- … (bgrefresh-mw-test-440.documents.azure.com)
- … (COSMOS.UNAVAILABLE_LOCATIONS_EXPIRATION_TIME_IN_SECONDS=30, default: 300s/5min)

Patched vs Unpatched
Drill 1: MW to SW to MW Transition
Both transitions detected. writableLocations changed from [EUS, WUS] to [EUS] and back to [EUS, WUS], with ~3 min / ~2 min metadata propagation lag.

Drill 2: Region Removal + Re-add
Detected. readableLocations shrank from [EUS, WUS] to [EUS], then was restored. DIRECT mode note: reads continued via existing RNTBD connections even after region metadata removal (by design).

Drill 3: Failover Priority Change
All transitions detected. failover-priority-change propagated fastest (~1 min vs ~5 min for region removal).

Summary
Changes
- … (GlobalEndpointManager.java)
- … (GlobalEndPointManagerTest.java)
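
Illustrative sketch

To make the Root Cause and Fix sections above concrete, here is a minimal, self-contained Java model of the rescheduling logic. It is a sketch under stated assumptions, not the SDK source: EndpointManagerModel, tick(), simulate() and the rescheduleWhenIdle flag are hypothetical names, the one-shot timer is modelled with a boolean rather than a real scheduler, and shouldRefreshEndpoints is stubbed to always return false to represent the multi-writer steady state.

```java
import java.util.function.BooleanSupplier;

public class BackgroundRefreshSketch {

    /** Hypothetical, simplified stand-in for the endpoint manager; not the SDK class. */
    static final class EndpointManagerModel {
        private final boolean rescheduleWhenIdle;            // true models the patched else branch
        private final BooleanSupplier shouldRefreshEndpoints; // stub for the location cache check
        private boolean timerArmed = false;                   // models a pending one-shot timer
        private int backgroundRefreshes = 0;

        EndpointManagerModel(boolean rescheduleWhenIdle, BooleanSupplier shouldRefreshEndpoints) {
            this.rescheduleWhenIdle = rescheduleWhenIdle;
            this.shouldRefreshEndpoints = shouldRefreshEndpoints;
        }

        /** Models the if/else shape of refreshLocationPrivateAsync() as described in this PR. */
        void refreshLocation() {
            if (shouldRefreshEndpoints.getAsBoolean()) {
                timerArmed = true;   // both variants reschedule when a refresh is needed
            } else if (rescheduleWhenIdle) {
                timerArmed = true;   // the fix: reschedule in the else branch as well
            }
            // Unpatched else branch: nothing is rescheduled, so the loop ends here.
        }

        /** Models one timer interval: if a timer is pending, re-read metadata and re-evaluate. */
        void tick() {
            if (!timerArmed) {
                return;              // no pending timer, so no background refresh happens
            }
            timerArmed = false;
            backgroundRefreshes++;
            refreshLocation();
        }

        int backgroundRefreshes() {
            return backgroundRefreshes;
        }
    }

    /** Counts background refreshes over a number of timer intervals in steady state. */
    static int simulate(boolean patched, int ticks) {
        EndpointManagerModel model = new EndpointManagerModel(patched, () -> false);
        model.refreshLocation();     // initial call made during client initialization
        for (int i = 0; i < ticks; i++) {
            model.tick();
        }
        return model.backgroundRefreshes();
    }

    public static void main(String[] args) {
        System.out.println("unpatched refreshes: " + simulate(false, 10));
        System.out.println("patched refreshes:   " + simulate(true, 10));
    }
}
```

In this model the unpatched variant performs 0 background refreshes over 10 intervals, while the patched variant performs one per interval. That mirrors the shape of the drill observation that the unpatched SDK's timer stops after init, though the real implementation is asynchronous and has more branches than shown here.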