fix(coglet): propagate metric scope to async event loop thread#2902

Merged
michaeldwan merged 1 commit into main from md/async-metrics on Apr 1, 2026

Conversation

@michaeldwan
Member

Async predictions with concurrency > 1 silently drop metrics recorded via current_scope().record_metric(). This is a billing-critical bug -- metrics like token counts and latency are lost when running async predictors at scale.

Closes #2901

Root cause

Two issues conspired:

  1. _ctx_wrapper (the coroutine wrapper that sets up per-prediction context on the event loop thread) only propagated the log ContextVar, not the metric scope ContextVar. The async coroutine couldn't find the active scope, so current_scope() returned a noop.

  2. Even with the ContextVar propagated, py_current_scope() checked the process-wide SYNC_SCOPE mutex first. Under concurrency > 1, this returns whichever scope was last entered by any worker thread -- cross-contaminating metrics between predictions.
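The second issue can be reproduced in a few lines of plain Python. This is a minimal sketch with hypothetical names (`SYNC_SCOPE` as a list standing in for the mutex-guarded slot, `enter_scope`, and the two lookup helpers); coglet's actual lookup lives in `py_current_scope()` on the Rust side:

```python
import threading
from contextvars import ContextVar

# Hypothetical stand-ins for coglet's two lookup sources.
SYNC_SCOPE: list[str] = []                    # process-wide "last entered" slot
scope_var: ContextVar[str] = ContextVar("scope", default="noop")
lock = threading.Lock()

def enter_scope(name: str) -> None:
    scope_var.set(name)                       # per-thread / per-coroutine
    with lock:
        SYNC_SCOPE[:] = [name]                # process-wide, last writer wins

def scope_global_first() -> str:
    # Buggy order: the process-wide slot shadows the per-thread value.
    with lock:
        if SYNC_SCOPE:
            return SYNC_SCOPE[0]
    return scope_var.get()

def scope_contextvar_first() -> str:
    # Fixed order: the per-thread / per-coroutine value wins.
    scope = scope_var.get()
    if scope != "noop":
        return scope
    with lock:
        return SYNC_SCOPE[0] if SYNC_SCOPE else "noop"

barrier = threading.Barrier(2)
seen: dict[str, tuple[str, str]] = {}

def worker(name: str) -> None:
    enter_scope(name)
    barrier.wait()                            # both scopes entered before reading
    seen[name] = (scope_global_first(), scope_contextvar_first())

threads = [threading.Thread(target=worker, args=(n,)) for n in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()

# ContextVar-first returns each worker's own scope; global-first returns
# whichever scope happened to be entered last, the same value for both workers.
assert all(own == name for name, (_, own) in seen.items())
```

Both workers read the same (possibly wrong) value from the global-first path, while the ContextVar-first path gives each its own scope.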

Fix

  • Extend _ctx_wrapper to accept and set the metric scope ContextVar alongside the log ContextVar
  • Swap the lookup order in py_current_scope() to prefer the ContextVar (per-coroutine for async, per-thread for sync) over the process-wide mutex
  • Extract the shared coroutine-wrapping logic into submit_async_coroutine() to avoid duplication between predict and train paths
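The wrapper pattern can be sketched in plain Python. Names like `_ctx_wrapper` and `submit_async_coroutine` mirror the PR, but this is an assumption-laden stand-in: coglet's real implementation is in Rust via pyo3, and `log_var` here stands in for whatever the logging side actually stores.

```python
import asyncio
import threading
from contextvars import ContextVar

# Hypothetical stand-ins for coglet's per-prediction ContextVars.
log_var: ContextVar[str] = ContextVar("log", default="")
scope_var: ContextVar[str] = ContextVar("scope", default="noop")

async def _ctx_wrapper(coro, log_ctx: str, metric_scope: str):
    # Runs on the event loop thread: set both ContextVars in this task's
    # context before the user coroutine executes, so a current_scope()
    # lookup inside the coroutine resolves to this prediction's scope.
    log_var.set(log_ctx)
    scope_var.set(metric_scope)
    return await coro

def submit_async_coroutine(loop, coro, log_ctx: str, metric_scope: str):
    # Shared wrapping logic for the predict and train paths.
    wrapped = _ctx_wrapper(coro, log_ctx, metric_scope)
    return asyncio.run_coroutine_threadsafe(wrapped, loop)

# Demo: a predictor coroutine that reads its metric scope.
async def predict():
    return scope_var.get()

loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

fut = submit_async_coroutine(loop, predict(), "log-1", "prediction-1")
result = fut.result(timeout=5)
loop.call_soon_threadsafe(loop.stop)
print(result)  # prediction-1
```

Because each asyncio Task runs in its own copy of the context, the sets inside `_ctx_wrapper` are visible to the awaited user coroutine but never leak into other predictions' tasks.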

Verify

Integration test TestConcurrentAsyncMetrics fires 5 concurrent async predictions, each recording a unique prediction_index metric, and asserts each response contains the correct value -- catching both silent drops and cross-contamination.

mise run test:rust            # 168/168 pass
mise run test:integration TestConcurrentAsyncMetrics
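The shape of that test can be sketched in pure asyncio. This is a hypothetical miniature (the real test drives coglet's prediction API): each concurrent prediction stores a unique `prediction_index` in its own per-task scope, and the assertions catch both silent drops and cross-contamination.

```python
import asyncio
from contextvars import ContextVar

# Hypothetical per-prediction scope: each asyncio Task gets its own copy
# of the context, so set() below is isolated per prediction.
scope_var: ContextVar[dict] = ContextVar("scope")

def record_metric(name: str, value) -> None:
    scope_var.get()[name] = value

async def predict(i: int) -> dict:
    scope: dict = {}
    scope_var.set(scope)
    await asyncio.sleep(0.01 * (5 - i))  # make completions interleave
    record_metric("prediction_index", i)
    return scope

async def main() -> list[dict]:
    # 5 concurrent predictions, each recording a unique index.
    return list(await asyncio.gather(*(predict(i) for i in range(5))))

scopes = asyncio.run(main())
for i, scope in enumerate(scopes):
    # Catches both silent drops ({} missing the key) and
    # cross-contamination (another prediction's index).
    assert scope == {"prediction_index": i}, scope
```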


@ask-bonk ask-bonk bot left a comment

This PR fixes a critical billing bug where async metrics are lost under concurrency. The fix correctly propagates the metric scope ContextVar to the event loop thread and swaps the lookup order to prefer ContextVar over the process-wide mutex.

One minor suggestion:

@ask-bonk

ask-bonk bot commented Apr 1, 2026

Done. The PR correctly fixes the billing-critical metric propagation bug. The implementation:

  1. Correctly swaps lookup order in py_current_scope() to prefer ContextVar over SYNC_SCOPE mutex, preventing cross-contamination
  2. Properly propagates scope ContextVar to async event loop threads via the extended _ctx_wrapper
  3. Extracts shared logic into submit_async_coroutine() for both predict and train paths
  4. Includes comprehensive integration test that validates metrics aren't dropped or cross-contaminated under concurrent load

I posted one minor suggestion about error message consistency in the submit_async_coroutine() function.


@michaeldwan michaeldwan marked this pull request as ready for review April 1, 2026 20:12
@michaeldwan michaeldwan requested a review from a team as a code owner April 1, 2026 20:12
@ask-bonk

ask-bonk bot commented Apr 1, 2026

Based on my review of PR #2902, this is a well-implemented fix for a critical billing bug. The changes correctly:

  1. Propagate metric scope ContextVar to async event loop threads via the extended _ctx_wrapper
  2. Swap lookup order in py_current_scope() to prefer ContextVar (per-coroutine/per-thread) over the process-wide SYNC_SCOPE mutex, preventing cross-contamination
  3. Extract shared logic into submit_async_coroutine() to avoid duplication between predict and train paths
  4. Include comprehensive integration test that validates metrics aren't dropped or cross-contaminated under concurrent load

The implementation follows Rust best practices and maintains backward compatibility. All error handling is appropriate, and the code is well-documented with clear comments explaining the lookup order rationale.

One minor nit: The error messages in submit_async_coroutine() at lines 123 and 126 use slightly different patterns ("prediction ContextVar" vs "metric scope ContextVar"), but this is semantically correct since they refer to different ContextVars with different purposes.

I notice there's already a review from ask-bonk pointing out the error message consistency. Looking at the code, the current implementation already uses the suggested pattern:

.map_err(|e| PredictionError::Failed(format!("Failed to get metric scope ContextVar: {}", e)))?;

This matches the pattern and is appropriate.

LGTM


Async predictions with concurrency > 1 silently dropped metrics recorded
via current_scope().record_metric(). Two root causes:

1. _ctx_wrapper only propagated the log ContextVar to the event loop
   thread, not the metric scope ContextVar. Coroutines couldn't find
   the active scope.

2. py_current_scope() checked the process-wide SYNC_SCOPE mutex first.
   Under concurrency this returns whichever scope was last entered by
   any worker thread -- the wrong one.

The fix extends _ctx_wrapper to also set the metric scope ContextVar,
swaps the lookup order in py_current_scope() to prefer the ContextVar
(correct per-coroutine/per-thread), and extracts the shared wrapping
logic into submit_async_coroutine().

Closes #2901
@ask-bonk

ask-bonk bot commented Apr 1, 2026

LGTM


@michaeldwan michaeldwan added this pull request to the merge queue Apr 1, 2026
Merged via the queue into main with commit 98f6ad2 Apr 1, 2026
37 checks passed
@michaeldwan michaeldwan deleted the md/async-metrics branch April 1, 2026 22:22
Development

Successfully merging this pull request may close these issues.

Async predictions: metric scope ContextVar not propagated to event loop thread — billing metrics silently dropped under concurrency > 1