fix(coglet): propagate metric scope to async event loop thread #2902
michaeldwan merged 1 commit into main
Conversation
This PR fixes a critical billing bug where async metrics are lost under concurrency. The fix correctly propagates the metric scope ContextVar to the event loop thread and swaps the lookup order to prefer ContextVar over the process-wide mutex.
One minor suggestion:
Done. The PR correctly fixes the billing-critical metric propagation bug. I posted one minor suggestion about error message consistency.
Based on my review of PR #2902, this is a well-implemented fix for a critical billing bug. The changes correctly propagate the metric scope ContextVar to the event loop thread and prefer it over the process-wide mutex.

The implementation follows Rust best practices and maintains backward compatibility. All error handling is appropriate, and the code is well-documented with clear comments explaining the lookup order rationale. I notice there's already a review from ask-bonk pointing out the error message consistency. Looking at the code, the current implementation already uses the suggested pattern:

```rust
.map_err(|e| PredictionError::Failed(format!("Failed to get metric scope ContextVar: {}", e)))?;
```

This matches the pattern and is appropriate. LGTM
Async predictions with concurrency > 1 silently dropped metrics recorded via `current_scope().record_metric()`. Two root causes:

1. `_ctx_wrapper` only propagated the log ContextVar to the event loop thread, not the metric scope ContextVar. Coroutines couldn't find the active scope.
2. `py_current_scope()` checked the process-wide `SYNC_SCOPE` mutex first. Under concurrency this returns whichever scope was last entered by any worker thread -- the wrong one.

The fix extends `_ctx_wrapper` to also set the metric scope ContextVar, swaps the lookup order in `py_current_scope()` to prefer the ContextVar (correct per-coroutine/per-thread), and extracts the shared wrapping logic into `submit_async_coroutine()`.

Closes #2901
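The propagation fix described above can be sketched in plain Python. This is a hypothetical illustration of the pattern, not coglet's actual code: the names `metric_scope`, `ctx_wrapper`, and `user_predict` are invented for the example. The worker thread captures the ContextVar's current value and the wrapper re-sets it on the event loop thread before awaiting the user coroutine.

```python
# Sketch of propagating a ContextVar across a thread hop (hypothetical names).
import asyncio
import contextvars
import threading

metric_scope: contextvars.ContextVar[str] = contextvars.ContextVar("metric_scope")

async def ctx_wrapper(coro, scope_value):
    # Without this explicit set(), the coroutine would see the event loop
    # thread's empty context instead of the worker thread's scope.
    metric_scope.set(scope_value)
    return await coro

async def user_predict():
    # The scope this prediction would record metrics under.
    return metric_scope.get("noop")

loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

metric_scope.set("prediction-42")  # set on the worker thread
fut = asyncio.run_coroutine_threadsafe(
    ctx_wrapper(user_predict(), metric_scope.get()), loop
)
print(fut.result())  # "prediction-42" -- the scope survived the thread hop
loop.call_soon_threadsafe(loop.stop)
```

Without the wrapper's explicit `set()`, `fut.result()` would return the `"noop"` default, which is the silent-drop behavior the PR fixes.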
Force-pushed from d82a151 to 56a4472
LGTM
Async predictions with concurrency > 1 silently drop metrics recorded via `current_scope().record_metric()`. This is a billing-critical bug -- metrics like token counts and latency are lost when running async predictors at scale.

Closes #2901
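The cross-contamination class of bug can be reproduced with a minimal sketch (all names here are illustrative, not coglet's): when scope lookup goes through process-wide state, concurrent predictions read whichever scope was entered last, while a ContextVar stays correct per thread.

```python
# Minimal repro: process-wide "current scope" vs. per-thread ContextVar.
import contextvars
import threading

process_wide_scope = None                    # analogous to a global mutex slot
scope_var = contextvars.ContextVar("scope")  # per-thread / per-coroutine

global_results, ctxvar_results = {}, {}
barrier = threading.Barrier(2)

def predict(name):
    global process_wide_scope
    process_wide_scope = name   # last writer wins, shared by all threads
    scope_var.set(name)         # visible only to this thread's context
    barrier.wait()              # force both predictions to overlap
    global_results[name] = process_wide_scope
    ctxvar_results[name] = scope_var.get()

threads = [threading.Thread(target=predict, args=(n,)) for n in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(ctxvar_results)  # {'a': 'a', 'b': 'b'} -- always correct
print(global_results)  # both entries equal: one prediction sees the other's scope
```

Because both threads write before the barrier and read after it, the process-wide slot deterministically hands one prediction the other's scope, which is exactly why the lookup order had to prefer the ContextVar.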
Root cause
Two issues conspired:
1. `_ctx_wrapper` (the coroutine wrapper that sets up per-prediction context on the event loop thread) only propagated the log ContextVar, not the metric scope ContextVar. The async coroutine couldn't find the active scope, so `current_scope()` returned a noop.
2. Even with the ContextVar propagated, `py_current_scope()` checked the process-wide `SYNC_SCOPE` mutex first. Under concurrency > 1, this returns whichever scope was last entered by any worker thread -- cross-contaminating metrics between predictions.

Fix
- Extend `_ctx_wrapper` to accept and set the metric scope ContextVar alongside the log ContextVar
- Change `py_current_scope()` to prefer the ContextVar (per-coroutine for async, per-thread for sync) over the process-wide mutex
- Extract `submit_async_coroutine()` to avoid duplication between predict and train paths

Verify
Integration test `TestConcurrentAsyncMetrics` fires 5 concurrent async predictions, each recording a unique `prediction_index` metric, and asserts each response contains the correct value -- catching both silent drops and cross-contamination.
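The shape of such a test can be sketched in plain Python (hypothetical names, not coglet's actual test code): five concurrent coroutines each set their own scope, record their index, and the assertions catch both a dropped metric and one that leaked from a sibling prediction.

```python
# Sketch of the concurrent-metrics test shape (hypothetical names).
import asyncio
import contextvars

metric_scope: contextvars.ContextVar[dict] = contextvars.ContextVar("metric_scope")

def record_metric(name, value):
    # Fails loudly (LookupError) if no scope is active -- no silent noop.
    metric_scope.get()[name] = value

async def predict(i):
    metric_scope.set({})                 # each task gets its own scope
    await asyncio.sleep(0.01 * (5 - i))  # interleave the tasks
    record_metric("prediction_index", i)
    return metric_scope.get()

async def main():
    return await asyncio.gather(*(predict(i) for i in range(5)))

scopes = asyncio.run(main())
for i, scope in enumerate(scopes):
    # Catches both silent drops (missing key) and cross-contamination
    # (another prediction's index leaking into this scope).
    assert scope == {"prediction_index": i}
print("ok")
```

Each asyncio task copies the current context at creation, so the `set({})` inside one task is invisible to the others -- the per-coroutine isolation the fixed lookup order relies on.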