Fix background database account refresh stopping in multi-writer accounts #48758
Merged
jeet1995 merged 7 commits into Azure:main on Apr 18, 2026
jeet1995 added a commit to jeet1995/azure-sdk-for-java that referenced this pull request on Apr 11, 2026:
…W switch, SW offline)
Kusto-backed evidence with charts for PR Azure#48758 validation.
Accounts: bgrefresh-mw-test-440 (multi-writer), bgrefresh-sw-test-440 (single-writer)
Branch: fix/background-refresh-multi-writer @ 2048abe
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
…unts

In multi-writer accounts, refreshLocationPrivateAsync() stops the background refresh timer when shouldRefreshEndpoints() returns false. This means topology changes (e.g., multi-write to single-write transitions) go undetected until the next explicit refresh trigger.

The .NET SDK (azure-cosmos-dotnet-v3) correctly continues the background refresh loop unconditionally: the loop only stops when canRefreshInBackground is explicitly false, not when shouldRefreshEndpoints returns false.

This fix adds startRefreshLocationTimerAsync() to the else-branch of refreshLocationPrivateAsync(), ensuring the background timer always reschedules itself regardless of whether endpoints currently need refreshing.

Without this fix, after a multi-write -> single-write -> multi-write transition, reads remain stuck on the primary region because the SDK never re-reads account metadata to learn about the restored multi-write topology.

Unit tests updated:
- backgroundRefreshForMultiMaster: assertTrue (timer must keep running)
- backgroundRefreshDetectsTopologyChangeForMultiMaster: new test proving MW->SW transition detection via mock

Related: PR Azure#6139 (point #4 in description acknowledged this bug)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
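The rescheduling behavior described in this commit message can be illustrated with a toy model. This is a minimal sketch, not the SDK's `GlobalEndpointManager`: the class `RefreshLoopSketch`, the `rescheduleWhenIdle` flag, and the synchronous `fireTimer()` method are all hypothetical names introduced for illustration, and the real code is asynchronous.

```java
import java.util.function.BooleanSupplier;

/** Toy model of a self-rescheduling refresh loop; NOT the SDK's GlobalEndpointManager. */
final class RefreshLoopSketch {
    // Stand-in for the shouldRefreshEndpoints() check (hypothetical wiring).
    private final BooleanSupplier shouldRefreshEndpoints;
    // false models the behavior before the fix, true models the behavior after it.
    private final boolean rescheduleWhenIdle;
    private boolean timerArmed = false;
    private int ticks = 0;

    RefreshLoopSketch(BooleanSupplier shouldRefreshEndpoints, boolean rescheduleWhenIdle) {
        this.shouldRefreshEndpoints = shouldRefreshEndpoints;
        this.rescheduleWhenIdle = rescheduleWhenIdle;
    }

    /** Arms the simulated timer (plays the role of startRefreshLocationTimerAsync()). */
    void startTimer() {
        timerArmed = true;
    }

    /** Simulates one timer expiry. Returns false if no timer was armed. */
    boolean fireTimer() {
        if (!timerArmed) {
            return false;
        }
        timerArmed = false;
        ticks++;
        refreshLocationPrivate();
        return true;
    }

    private void refreshLocationPrivate() {
        if (shouldRefreshEndpoints.getAsBoolean()) {
            // Endpoints need refreshing: refresh (elided here) and re-arm the timer.
            startTimer();
        } else if (rescheduleWhenIdle) {
            // The fix: re-arm the timer even when nothing needs refreshing right now.
            startTimer();
        }
        // Without the else-branch above, the loop ends here and never runs again.
    }

    int ticks() {
        return ticks;
    }

    boolean isTimerArmed() {
        return timerArmed;
    }
}
```

With `shouldRefreshEndpoints` permanently returning `false` (the multi-writer steady state), the model with `rescheduleWhenIdle = false` ticks once and goes quiet, while the model with `rescheduleWhenIdle = true` keeps ticking on every expiry.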
…tions, SW switch, SW offline)" This reverts commit c9fc5c4.
The forceRefresh=true path in refreshLocationAsync() updates the LocationCache but never restarts the background timer. After a MW→SW transition triggered by 403/3, the timer stays dead and the SDK never detects MW re-enablement — traffic stays pinned to the SW write region permanently. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
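One way to picture the force-refresh change is an idempotent "start the timer only if it is not running" guard. The sketch below assumes a simple running flag; `ForceRefreshSketch`, `ensureTimerRunning()`, and `markTimerStopped()` are hypothetical names, and the actual SDK change may track timer state differently.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model of restarting a stopped background timer after a force-refresh. */
final class ForceRefreshSketch {
    private final AtomicBoolean timerRunning = new AtomicBoolean(false);
    private final AtomicInteger timerStarts = new AtomicInteger(0);

    /** Stand-in for a force-refresh, e.g. one triggered by a 403/3 response. */
    void forceRefresh() {
        // Updating the location cache is elided in this sketch.
        ensureTimerRunning();
    }

    /** Starts the timer only when it is not already running, so repeated calls are harmless. */
    private void ensureTimerRunning() {
        if (timerRunning.compareAndSet(false, true)) {
            timerStarts.incrementAndGet();
        }
    }

    /** Simulates the timer having died for any reason. */
    void markTimerStopped() {
        timerRunning.set(false);
    }

    int timerStarts() {
        return timerStarts.get();
    }

    boolean isTimerRunning() {
        return timerRunning.get();
    }
}
```

The `compareAndSet` guard means a burst of force-refreshes starts at most one timer, while a force-refresh arriving after the timer has stopped brings it back.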
Member
Author
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
Contributor
Pull request overview
Fixes a long-standing issue in the Cosmos Java SDK where GlobalEndpointManager’s background database-account refresh could stop permanently for multi-writer accounts, preventing steady-state detection of topology changes (e.g., MW ↔ SW transitions).
Changes:
- Ensure the background refresh loop reschedules even when `shouldRefreshEndpoints()` returns `false`.
- Restart the background refresh timer after force-refresh (e.g., 403/3-driven) when it isn’t running.
- Add optional refresh jitter (default 0–15s) via `COSMOS.BACKGROUND_REFRESH_LOCATION_JITTER_MAX_IN_SECONDS` and update unit tests to disable jitter for determinism.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/GlobalEndpointManager.java | Keeps the background refresh timer alive in MW steady state; restarts timer after force-refresh; adds configurable scheduling jitter. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java | Introduces a new system-property-backed config for max jitter seconds (default 15). |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/directconnectivity/GlobalEndPointManagerTest.java | Updates/extends unit tests to validate timer continuity and topology-change detection; disables jitter for stable assertions. |
…ng herd

Configurable via COSMOS.BACKGROUND_REFRESH_LOCATION_JITTER_MAX_IN_SECONDS (default 15). Spreads refresh calls from many CosmosClient instances to avoid overwhelming the compute gateway. Jitter is skipped during initialization (zero delay for first refresh). Tests set jitter to 0 for deterministic behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
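A jittered delay of this kind could be computed roughly as follows. This is a sketch under stated assumptions, not the code in `Configs.java` or `GlobalEndpointManager.java`: only the property name and the default of 15 come from the commit message, while `RefreshJitterSketch`, `nextDelay`, the millisecond granularity, and the fallback to the default on an unparsable value are assumptions made for illustration.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

/** Toy jitter calculation; names and parsing rules here are assumptions, not SDK code. */
final class RefreshJitterSketch {
    static final String JITTER_PROPERTY = "COSMOS.BACKGROUND_REFRESH_LOCATION_JITTER_MAX_IN_SECONDS";
    static final int DEFAULT_MAX_JITTER_SECONDS = 15;

    /** Reads the max jitter from the system property, falling back to the default. */
    static int maxJitterSeconds() {
        String raw = System.getProperty(JITTER_PROPERTY);
        if (raw == null) {
            return DEFAULT_MAX_JITTER_SECONDS;
        }
        try {
            // Assumption: negative values are clamped to 0 (jitter disabled).
            return Math.max(0, Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            // Assumption: an unparsable value falls back to the default.
            return DEFAULT_MAX_JITTER_SECONDS;
        }
    }

    /** Returns the delay before the next refresh: zero during initialization, else base plus jitter. */
    static Duration nextDelay(Duration baseInterval, boolean initialization) {
        if (initialization) {
            return Duration.ZERO; // first refresh runs immediately, without jitter
        }
        int maxSeconds = maxJitterSeconds();
        long jitterMillis = maxSeconds == 0
            ? 0L
            : ThreadLocalRandom.current().nextLong(maxSeconds * 1000L + 1); // uniform in [0, max]
        return baseInterval.plusMillis(jitterMillis);
    }
}
```

Setting the property to `0`, as the tests are described as doing, makes `nextDelay` return exactly the base interval, which is what makes timing assertions deterministic.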
Azure Pipelines successfully started running 1 pipeline(s).
…rTest

The background refresh jitter (0-15s) added to prevent thundering herd causes the refresh interval to exceed the 2-second sleep windows used by this test. Disable jitter so the background refresh fires predictably.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
Member
@sdkReviewAgent
Member
✅ Review complete (40:42) Posted 3 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage
Member
Author
/check-enforcer override
Problem
The `GlobalEndpointManager` background refresh timer silently stops in multi-writer accounts, preventing the SDK from detecting account topology changes in steady state without a `CosmosClient` restart.

Root Cause: The background refresh loop in `refreshLocationPrivateAsync()` has two branches. When `shouldRefreshEndpoints()` returns `true`, it refreshes and restarts the timer. When it returns `false`, it does nothing and, crucially, never restarts the timer.

For MW accounts, `shouldRefreshEndpoints()` returns `false` once the preferred write endpoint matches the current hub, which is the immediate steady state. The timer fires once after init, enters this branch, and stops forever. The bug has existed since PR #6139 (Nov 2019).

Fix
- … `refreshLocationPrivateAsync()`. Aligns with .NET SDK.
- … `COSMOS.BACKGROUND_REFRESH_LOCATION_JITTER_MAX_IN_SECONDS` (default 15). Prevents thundering herd.

DR Drill Validation
Hub region failover priority change on two MW accounts (routing GW + compute GW). Direct + Gateway modes, 25 min each. DR at T+10 min.
DR Drill Steps
- `mvn install` azure-cosmos from PR branch, then `mvn package -Ppackage-assembly` the benchmark JAR
- `az login` to the test tenant and set subscription
- … `-fe` = compute GW `ComputeRequest5M`, else `Request5M`)
- … `dr-drill-db` with read + write containers at 400 RU
- … `maxRunningTimeDuration: PT25M`, distinct user agents per mode/op
- … `T_BASELINE`
- `az cosmosdb failover-priority-change` to shift hub region, record `T_DR_START`
- … `T_BENCHMARK_END`
- … `render timechart` on `BackendEndRequest5M` + `Request5M`/`ComputeRequest5M`, verify region shift

Reproduce: Drill 1 (Routing GW, DR at 16:51Z)
Reproduce: Drill 2 (Compute GW, DR at 17:50Z)