Fix background database account refresh stopping in multi-writer accounts #48758

Merged
jeet1995 merged 7 commits into Azure:main from jeet1995:fix/background-refresh-multi-writer
Apr 18, 2026

Member

@jeet1995 jeet1995 commented Apr 10, 2026

Problem

The GlobalEndpointManager background refresh timer silently stops in multi-writer accounts, preventing the SDK from detecting account topology changes in steady state without a CosmosClient restart.

Root Cause: The background refresh loop in refreshLocationPrivateAsync() has two branches. When shouldRefreshEndpoints() returns true, it refreshes and restarts the timer. When it returns false, it does nothing and, crucially, never restarts the timer:

} else {
    logger.debug("shouldRefreshEndpoints: false, nothing to do.");
    this.isRefreshing.set(false);
    return Mono.empty(); // timer dies here and is never rescheduled
}

For MW accounts, shouldRefreshEndpoints() returns false once the preferred write endpoint matches the current hub, which is the case immediately in steady state. The timer fires once after init, enters this branch, and stops forever. The bug has existed since PR #6139 (Nov 2019).
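
To make the failure mode concrete, here is a minimal sketch of a one-shot timer loop with and without the else-branch reschedule. It is not the SDK code: the real loop is built on Reactor, whereas this model tracks the pending timer with a plain boolean. RefreshLoopSketch, fire(), and rescheduleWhenIdle are hypothetical names introduced only for illustration.

```java
import java.util.function.BooleanSupplier;

// Hypothetical, simplified model of a one-shot background refresh timer.
final class RefreshLoopSketch {
    private final BooleanSupplier shouldRefreshEndpoints;
    private final boolean rescheduleWhenIdle; // false = pre-fix behavior, true = fixed behavior
    private boolean timerPending;
    private int firedCount;

    RefreshLoopSketch(BooleanSupplier shouldRefreshEndpoints, boolean rescheduleWhenIdle) {
        this.shouldRefreshEndpoints = shouldRefreshEndpoints;
        this.rescheduleWhenIdle = rescheduleWhenIdle;
    }

    // Arms the timer for the first time (client initialization).
    void start() {
        timerPending = true;
    }

    // Simulates the timer elapsing once. Returns false if no timer was armed.
    boolean fire() {
        if (!timerPending) {
            return false;
        }
        timerPending = false; // a one-shot timer is consumed when it fires
        firedCount++;
        if (shouldRefreshEndpoints.getAsBoolean()) {
            // Refresh path: re-read account metadata (omitted), then re-arm.
            timerPending = true;
        } else if (rescheduleWhenIdle) {
            // Fixed else-branch: nothing to refresh, but keep the loop alive.
            timerPending = true;
        }
        // Pre-fix else-branch: fall through without re-arming, so the loop ends.
        return true;
    }

    int firedCount() {
        return firedCount;
    }
}
```

With shouldRefreshEndpoints fixed at false (the MW steady state), the pre-fix variant fires exactly once no matter how long the client runs, while the fixed variant fires on every interval.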

Fix

  1. Keep timer alive - restart in else branch of refreshLocationPrivateAsync(). Aligns with .NET SDK.
  2. Restart on force-refresh - after 403/3-driven refresh, restart timer if not running.
  3. Jitter (0-15s) - COSMOS.BACKGROUND_REFRESH_LOCATION_JITTER_MAX_IN_SECONDS (default 15). Prevents thundering herd.
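
The jitter in item 3 can be sketched as follows. This is an illustration under assumptions, not the code in Configs.java or GlobalEndpointManager.java: the class and method names are hypothetical, the base interval is passed in by the caller rather than taken from the SDK, and falling back to the default on an unparsable value is a choice made here. Only the property name, the default of 15, and the zero delay for the first refresh come from this PR.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of a system-property-backed jitter for the refresh delay.
final class RefreshJitterSketch {
    static final String JITTER_PROPERTY =
        "COSMOS.BACKGROUND_REFRESH_LOCATION_JITTER_MAX_IN_SECONDS";
    static final int DEFAULT_JITTER_MAX_SECONDS = 15;

    // Reads the max jitter in seconds; negative values are clamped to 0.
    static int jitterMaxSeconds() {
        String raw = System.getProperty(JITTER_PROPERTY);
        if (raw == null) {
            return DEFAULT_JITTER_MAX_SECONDS;
        }
        try {
            return Math.max(0, Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            // Assumption: fall back to the default on bad input.
            return DEFAULT_JITTER_MAX_SECONDS;
        }
    }

    // Delay before the next background refresh.
    static Duration nextDelay(Duration baseInterval, boolean initializing) {
        if (initializing) {
            return Duration.ZERO; // first refresh runs without delay or jitter
        }
        long maxJitterMillis = jitterMaxSeconds() * 1000L;
        // Uniform in [0, maxJitterMillis]; the bound argument is exclusive, hence +1.
        long jitterMillis = ThreadLocalRandom.current().nextLong(maxJitterMillis + 1);
        return baseInterval.plusMillis(jitterMillis);
    }
}
```

Setting the property to 0 makes nextDelay return the base interval unchanged, which is how a test can keep its timing deterministic.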

DR Drill Validation

Hub region failover priority change on two MW accounts (routing GW + compute GW). Direct + Gateway modes, 25 min each. DR at T+10 min.

DR Drill Steps

  1. Build - mvn install azure-cosmos from the PR branch, then mvn package -Ppackage-assembly to build the benchmark JAR
  2. Auth - az login to the test tenant and set subscription
  3. Verify account - Confirm multi-writer enabled, note regions and failover priorities
  4. nslookup - Resolve account DNS to determine gateway type (-fe = compute GW, so query ComputeRequest5M; otherwise query Request5M)
  5. Create DB/containers - dr-drill-db with read + write containers at 400 RU
  6. Create workload configs - Direct + Gateway JSONs with maxRunningTimeDuration: PT25M, distinct user agents per mode/op
  7. Start benchmarks - Launch Direct + Gateway processes simultaneously, record T_BASELINE
  8. Wait 10 min - Baseline stabilization
  9. Execute DR - az cosmosdb failover-priority-change to shift hub region, record T_DR_START
  10. Wait 15 min - SDK background refresh detects topology change
  11. Benchmarks auto-stop - Record T_BENCHMARK_END
  12. Run Kusto queries - render timechart on BackendEndRequest5M + Request5M/ComputeRequest5M, verify region shift
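
The timing in steps 7 to 11 can be worked out with java.time. This is only a sketch: DrillTimelineSketch and its methods are hypothetical names, and taking 16:41Z as T_BASELINE for Drill 1 is an inference from the DR time (16:51Z) minus the 10-minute baseline, not a recorded value.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper that derives the drill checkpoints from T_BASELINE.
final class DrillTimelineSketch {
    // Matches maxRunningTimeDuration: PT25M in the workload configs.
    static final Duration RUN_DURATION = Duration.parse("PT25M");
    // Baseline stabilization before the failover priority change.
    static final Duration BASELINE_WAIT = Duration.ofMinutes(10);

    static Instant drStart(Instant tBaseline) {
        return tBaseline.plus(BASELINE_WAIT);
    }

    static Instant benchmarkEnd(Instant tBaseline) {
        return tBaseline.plus(RUN_DURATION);
    }

    // Time left for the background refresh to pick up the topology change.
    static Duration detectionWindow() {
        return RUN_DURATION.minus(BASELINE_WAIT);
    }
}
```

For a baseline of 2026-04-15T16:41:00Z this gives a DR start of 16:51Z, a benchmark end of 17:06Z, and a 15-minute detection window, which is consistent with the wait in step 10.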

Reproduce — Drill 1 (Routing GW, DR at 16:51Z)

// Direct mode — BackendEndRequest5M
BackendEndRequest5M
| where TIMESTAMP between (datetime(2026-04-15T16:41:00Z) .. datetime(2026-04-15T17:10:00Z))
| where GlobalDatabaseAccountName == '<routing-gw-mw-account>'
| where UserAgent has 'dr-bgrefresh-direct'
| where ResourceType == 2
| extend Series = strcat(Region, " | ", extract('(dr-bgrefresh-direct-[a-z]+)', 1, UserAgent))
| summarize Requests = sum(SampleCount) by bin(TIMESTAMP, 1m), Series
| render timechart
// Gateway mode — Request5M
Request5M
| where TIMESTAMP between (datetime(2026-04-15T16:41:00Z) .. datetime(2026-04-15T17:10:00Z))
| where globalDatabaseAccountName == '<routing-gw-mw-account>'
| where userAgent has 'dr-bgrefresh-gw'
| extend Series = strcat(region, " | ", extract('(dr-bgrefresh-gw-[a-z]+)', 1, userAgent))
| summarize Requests = sum(SampleCount) by bin(TIMESTAMP, 1m), Series
| render timechart

Reproduce — Drill 2 (Compute GW, DR at 17:50Z)

// Direct mode — BackendEndRequest5M
BackendEndRequest5M
| where TIMESTAMP between (datetime(2026-04-15T17:40:00Z) .. datetime(2026-04-15T18:10:00Z))
| where GlobalDatabaseAccountName == '<compute-gw-mw-account>'
| where UserAgent has 'dr-bgrefresh-fe-direct'
| where ResourceType == 2
| extend Series = strcat(Region, " | ", extract('(dr-bgrefresh-fe-direct-[a-z]+)', 1, UserAgent))
| summarize Requests = sum(SampleCount) by bin(TIMESTAMP, 1m), Series
| render timechart
// Gateway mode — ComputeRequest5M
ComputeRequest5M
| where TIMESTAMP between (datetime(2026-04-15T17:40:00Z) .. datetime(2026-04-15T18:10:00Z))
| where GlobalDatabaseAccountName == '<compute-gw-mw-account>'
| where UserAgent has 'dr-bgrefresh-fe-gw'
| extend Series = strcat(Region, " | ", extract('(dr-bgrefresh-fe-gw-[a-z]+)', 1, UserAgent))
| summarize Requests = sum(SampleCount) by bin(TIMESTAMP, 1m), Series
| render timechart

@jeet1995 jeet1995 force-pushed the fix/background-refresh-multi-writer branch from c95fb7b to 2048abe April 10, 2026 20:51
jeet1995 added a commit to jeet1995/azure-sdk-for-java that referenced this pull request Apr 11, 2026
…W switch, SW offline)

Kusto-backed evidence with charts for PR Azure#48758 validation.
Accounts: bgrefresh-mw-test-440 (multi-writer), bgrefresh-sw-test-440 (single-writer)
Branch: fix/background-refresh-multi-writer @ 2048abe

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

jeet1995 and others added 4 commits April 15, 2026 10:35
…unts

In multi-writer accounts, refreshLocationPrivateAsync() stops the background
refresh timer when shouldRefreshEndpoints() returns false. This means topology
changes (e.g., multi-write to single-write transitions) go undetected until
the next explicit refresh trigger.

The .NET SDK (azure-cosmos-dotnet-v3) correctly continues the background
refresh loop unconditionally - the loop only stops when canRefreshInBackground
is explicitly false, not when shouldRefreshEndpoints returns false.

This fix adds startRefreshLocationTimerAsync() to the else-branch of
refreshLocationPrivateAsync(), ensuring the background timer always reschedules
itself regardless of whether endpoints currently need refreshing.

Without this fix, after a multi-write -> single-write -> multi-write transition,
reads remain stuck on the primary region because the SDK never re-reads account
metadata to learn about the restored multi-write topology.

Unit tests updated:
- backgroundRefreshForMultiMaster: assertTrue (timer must keep running)
- backgroundRefreshDetectsTopologyChangeForMultiMaster: new test proving
  MW->SW transition detection via mock

Related: PR Azure#6139 (point #4 in description acknowledged this bug)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…W switch, SW offline)

Kusto-backed evidence with charts for PR Azure#48758 validation.
Accounts: bgrefresh-mw-test-440 (multi-writer), bgrefresh-sw-test-440 (single-writer)
Branch: fix/background-refresh-multi-writer @ 2048abe

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tions, SW switch, SW offline)"

This reverts commit c9fc5c4.
The forceRefresh=true path in refreshLocationAsync() updates the
LocationCache but never restarts the background timer. After a
MW→SW transition triggered by 403/3, the timer stays dead and the
SDK never detects MW re-enablement — traffic stays pinned to the
SW write region permanently.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
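
The force-refresh gap described in the commit message above comes down to an idempotent "restart if not running" guard. A minimal sketch, assuming a single atomic flag tracks whether the timer is alive; ForceRefreshSketch and its members are hypothetical names, and the actual change in GlobalEndpointManager may track timer state differently.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model of restarting a background timer after a forced refresh.
final class ForceRefreshSketch {
    private final AtomicBoolean timerRunning = new AtomicBoolean(false);
    private int timerStarts;

    // Called on a forced refresh, e.g. one triggered by a 403/3 response.
    void onForceRefresh() {
        // Updating the location cache is omitted here.
        // compareAndSet ensures only one caller starts the timer,
        // and that a timer that is already running is left alone.
        if (timerRunning.compareAndSet(false, true)) {
            timerStarts++; // stands in for actually scheduling the timer
        }
    }

    // Called when the timer loop ends for any reason.
    void onTimerStopped() {
        timerRunning.set(false);
    }

    int timerStarts() {
        return timerStarts;
    }
}
```

Repeated forced refreshes while the timer is alive do not stack extra timers; a forced refresh after the timer has stopped starts exactly one new one.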
@jeet1995 jeet1995 force-pushed the fix/background-refresh-multi-writer branch 2 times, most recently from dd36930 to 4b32867 April 15, 2026 20:51
@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995 jeet1995 marked this pull request as ready for review April 15, 2026 21:35
@jeet1995 jeet1995 requested review from a team and kirankumarkolli as code owners April 15, 2026 21:35
Copilot AI review requested due to automatic review settings April 15, 2026 21:35
Contributor

Copilot AI left a comment

Pull request overview

Fixes a long-standing issue in the Cosmos Java SDK where GlobalEndpointManager’s background database-account refresh could stop permanently for multi-writer accounts, preventing steady-state detection of topology changes (e.g., MW ↔ SW transitions).

Changes:

  • Ensure the background refresh loop reschedules even when shouldRefreshEndpoints() returns false.
  • Restart the background refresh timer after force-refresh (e.g., 403/3-driven) when it isn’t running.
  • Add optional refresh jitter (default 0–15s) via COSMOS.BACKGROUND_REFRESH_LOCATION_JITTER_MAX_IN_SECONDS and update unit tests to disable jitter for determinism.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/GlobalEndpointManager.java | Keeps the background refresh timer alive in MW steady state; restarts timer after force-refresh; adds configurable scheduling jitter. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java | Introduces a new system-property-backed config for max jitter seconds (default 15). |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/directconnectivity/GlobalEndPointManagerTest.java | Updates/extends unit tests to validate timer continuity and topology-change detection; disables jitter for stable assertions. |

@jeet1995 jeet1995 force-pushed the fix/background-refresh-multi-writer branch 2 times, most recently from fbb56eb to a09826f April 15, 2026 21:55
…ng herd

Configurable via COSMOS.BACKGROUND_REFRESH_LOCATION_JITTER_MAX_IN_SECONDS
(default 15). Spreads refresh calls from many CosmosClient instances to
avoid overwhelming the compute gateway.

Jitter is skipped during initialization (zero delay for first refresh).
Tests set jitter to 0 for deterministic behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995 jeet1995 force-pushed the fix/background-refresh-multi-writer branch from a09826f to c4e5f36 April 15, 2026 22:02
@Azure Azure deleted a comment from azure-pipelines bot Apr 16, 2026
@Azure Azure deleted a comment from azure-pipelines bot Apr 16, 2026
@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

…rTest

The background refresh jitter (0-15s) added to prevent thundering herd
causes the refresh interval to exceed the 2-second sleep windows used
by this test. Disable jitter so the background refresh fires predictably.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Member

@xinlian12 xinlian12 left a comment

LGTM, thanks

@xinlian12
Member

@sdkReviewAgent

@xinlian12
Member

Review complete (40:42)

Posted 3 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@jeet1995
Member Author

/check-enforcer override

@jeet1995 jeet1995 enabled auto-merge (squash) April 18, 2026 01:19
@jeet1995 jeet1995 merged commit 08021c9 into Azure:main Apr 18, 2026
88 of 90 checks passed

3 participants