fix(coglet): propagate metric scope to async event loop thread #2902
michaeldwan merged 1 commit into main
Conversation
This PR fixes a critical billing bug where async metrics are lost under concurrency. The fix correctly propagates the metric scope ContextVar to the event loop thread and swaps the lookup order to prefer ContextVar over the process-wide mutex.
One minor suggestion:
Done. The PR correctly fixes the billing-critical metric propagation bug. I posted one minor suggestion about error message consistency.
Based on my review of PR #2902, this is a well-implemented fix for a critical billing bug. The changes correctly propagate the metric scope ContextVar to the event loop thread and prefer it over the process-wide mutex.

The implementation follows Rust best practices and maintains backward compatibility. All error handling is appropriate, and the code is well-documented with clear comments explaining the lookup order rationale. I notice there's already a review from ask-bonk pointing out the error message consistency. Looking at the code, the current implementation already uses the suggested pattern:

```rust
.map_err(|e| PredictionError::Failed(format!("Failed to get metric scope ContextVar: {}", e)))?;
```

This matches the pattern and is appropriate. LGTM
Async predictions with concurrency > 1 silently dropped metrics recorded via `current_scope().record_metric()`. Two root causes:

1. `_ctx_wrapper` only propagated the log ContextVar to the event loop thread, not the metric scope ContextVar. Coroutines couldn't find the active scope.
2. `py_current_scope()` checked the process-wide `SYNC_SCOPE` mutex first. Under concurrency this returns whichever scope was last entered by any worker thread -- the wrong one.

The fix extends `_ctx_wrapper` to also set the metric scope ContextVar, swaps the lookup order in `py_current_scope()` to prefer the ContextVar (correct per-coroutine/per-thread), and extracts the shared wrapping logic into `submit_async_coroutine()`.

Closes #2901
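The propagation fix described above can be sketched in plain Python. This is a hypothetical illustration of the pattern, not coglet's actual code: the names `metric_scope`, `ctx_wrapper`, and `user_predict` are invented for the example. The worker thread captures the ContextVar's current value and the wrapper re-sets it on the event loop thread before awaiting the user coroutine.

```python
# Sketch of propagating a ContextVar across a thread hop (hypothetical names).
import asyncio
import contextvars
import threading

metric_scope: contextvars.ContextVar[str] = contextvars.ContextVar("metric_scope")

async def ctx_wrapper(coro, scope_value):
    # Without this explicit set(), the coroutine would see the event loop
    # thread's empty context instead of the worker thread's scope.
    metric_scope.set(scope_value)
    return await coro

async def user_predict():
    # The scope this prediction would record metrics under.
    return metric_scope.get("noop")

loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

metric_scope.set("prediction-42")  # set on the worker thread
fut = asyncio.run_coroutine_threadsafe(
    ctx_wrapper(user_predict(), metric_scope.get()), loop
)
print(fut.result())  # "prediction-42" -- the scope survived the thread hop
loop.call_soon_threadsafe(loop.stop)
```

Without the wrapper's explicit `set()`, `fut.result()` would return the `"noop"` default, which is the silent-drop behavior the PR fixes.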
Force-pushed from d82a151 to 56a4472
LGTM
Async predictions with concurrency > 1 silently drop metrics recorded via `current_scope().record_metric()`. This is a billing-critical bug -- metrics like token counts and latency are lost when running async predictors at scale.

Closes #2901
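The cross-contamination class of bug can be reproduced with a minimal sketch (all names here are illustrative, not coglet's): when scope lookup goes through process-wide state, concurrent predictions read whichever scope was entered last, while a ContextVar stays correct per thread.

```python
# Minimal repro: process-wide "current scope" vs. per-thread ContextVar.
import contextvars
import threading

process_wide_scope = None                    # analogous to a global mutex slot
scope_var = contextvars.ContextVar("scope")  # per-thread / per-coroutine

global_results, ctxvar_results = {}, {}
barrier = threading.Barrier(2)

def predict(name):
    global process_wide_scope
    process_wide_scope = name   # last writer wins, shared by all threads
    scope_var.set(name)         # visible only to this thread's context
    barrier.wait()              # force both predictions to overlap
    global_results[name] = process_wide_scope
    ctxvar_results[name] = scope_var.get()

threads = [threading.Thread(target=predict, args=(n,)) for n in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(ctxvar_results)  # {'a': 'a', 'b': 'b'} -- always correct
print(global_results)  # both entries equal: one prediction sees the other's scope
```

Because both threads write before the barrier and read after it, the process-wide slot deterministically hands one prediction the other's scope, which is exactly why the lookup order had to prefer the ContextVar.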
Root cause
Two issues conspired:
1. `_ctx_wrapper` (the coroutine wrapper that sets up per-prediction context on the event loop thread) only propagated the log ContextVar, not the metric scope ContextVar. The async coroutine couldn't find the active scope, so `current_scope()` returned a noop.
2. Even with the ContextVar propagated, `py_current_scope()` checked the process-wide `SYNC_SCOPE` mutex first. Under concurrency > 1, this returns whichever scope was last entered by any worker thread -- cross-contaminating metrics between predictions.

Fix
- Extend `_ctx_wrapper` to accept and set the metric scope ContextVar alongside the log ContextVar
- Change `py_current_scope()` to prefer the ContextVar (per-coroutine for async, per-thread for sync) over the process-wide mutex
- Extract `submit_async_coroutine()` to avoid duplication between predict and train paths

Verify
Integration test `TestConcurrentAsyncMetrics` fires 5 concurrent async predictions, each recording a unique `prediction_index` metric, and asserts each response contains the correct value -- catching both silent drops and cross-contamination.
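The shape of such a test can be sketched in plain Python (hypothetical names, not coglet's actual test code): five concurrent coroutines each set their own scope, record their index, and the assertions catch both a dropped metric and one that leaked from a sibling prediction.

```python
# Sketch of the concurrent-metrics test shape (hypothetical names).
import asyncio
import contextvars

metric_scope: contextvars.ContextVar[dict] = contextvars.ContextVar("metric_scope")

def record_metric(name, value):
    # Fails loudly (LookupError) if no scope is active -- no silent noop.
    metric_scope.get()[name] = value

async def predict(i):
    metric_scope.set({})                 # each task gets its own scope
    await asyncio.sleep(0.01 * (5 - i))  # interleave the tasks
    record_metric("prediction_index", i)
    return metric_scope.get()

async def main():
    return await asyncio.gather(*(predict(i) for i in range(5)))

scopes = asyncio.run(main())
for i, scope in enumerate(scopes):
    # Catches both silent drops (missing key) and cross-contamination
    # (another prediction's index leaking into this scope).
    assert scope == {"prediction_index": i}
print("ok")
```

Each asyncio task copies the current context at creation, so the `set({})` inside one task is invisible to the others -- the per-coroutine isolation the fixed lookup order relies on.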