refactor(sysMain): atomic last-good cache + exponential backoff on CoinGecko 429 #15
Conversation
The sysMain aggregator was a single setInterval tick that hit CoinGecko first and then issued a series of RPC calls, writing each field directly to the shared dataStore as it went. When CoinGecko 429'd the host (which happens regularly on shared public cloud IPs), the top-level try/catch swallowed the rejection and every subsequent RPC call was skipped. On a cold start this left dataStore at its initialised zero defaults, so /mnstats served "mn_enabled: 0", "next_superblock: SB0 - 0" and, more consequentially for the governance wizard, "superblock_next_epoch_sec: 0" - permanently disabling the new "Prepare proposal" flow that relies on a live next-SB anchor. Retrying at the fixed 20s interval also meant the host kept hammering CoinGecko through the 429 window instead of letting the rate-limit bucket refill.

This commit restructures the tick so:

1. Atomic last-good commit. Each tick assembles its values into a local `next` object; Object.assign(data, next) fires only once the whole payload has resolved. A failed tick leaves dataStore untouched - either at the previous good snapshot or at its initial defaults on cold start - so a CoinGecko outage can never produce a dataStore with fresh price data wired up to stale/zero chain data, or vice versa.
2. Per-sb budget isolation. The projected sb1..sb5 getSuperblockBudget calls are resolved via Promise.allSettled with per-sb "To be determined" fallbacks, preserving the prior fire-and-forget .catch semantics while keeping the resolution inside the atomic tick boundary.
3. Exponential backoff. setInterval is replaced with a recursive setTimeout driven by a per-tick currentTickMs that doubles on failure up to a 5-minute ceiling and snaps back to the 20s base on the first successful commit. The timer is .unref()'d so it doesn't keep the event loop alive in test / CLI contexts.
4. Testability. Auto-start is suppressed under NODE_ENV=test (Jest's default), and a small `__resetForTests` hook is exported so unit tests can drive the loop deterministically without jest.resetModules tearing mock bindings loose. Also exported: tick, start, stop, getDiagnostics, BASE_TICK_MS, MAX_BACKOFF_MS.

Business logic is unchanged: same RPC calls in the same order, same fields written with the same formulas, same response shape out of /mnstats. What changes is strictly the error-boundary and commit-point semantics.

New unit tests (services/sysMain.test.js) cover: happy-path field population, backoff reset on success, cold-start 429 leaves defaults intact, good-snapshot preservation across subsequent 429s, mid-tick RPC failure produces no partial writes, per-sb budget failure falls back without aborting the tick, and exponential backoff capping at MAX_BACKOFF_MS.

Full suite: 864/864 passing.

Made-with: Cursor
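The commit-point and backoff shape described in points 1 and 3 can be sketched roughly as follows. This is a minimal model, not the module's actual code: `fetchMarket`/`fetchChain` and the field names are illustrative stand-ins for the real CoinGecko/RPC reads.

```javascript
// Minimal sketch of the atomic last-good commit + exponential backoff pattern.
// Only BASE_TICK_MS / MAX_BACKOFF_MS mirror the exported constants; everything
// else here is an illustrative stand-in.
const BASE_TICK_MS = 20_000;
const MAX_BACKOFF_MS = 5 * 60_000;

const data = { price_usd: "0.0000", current_block: "0" }; // cold-start defaults
let currentTickMs = BASE_TICK_MS;

async function tick(fetchMarket, fetchChain) {
  const next = {};                        // staging object: nothing published yet
  next.price_usd = await fetchMarket();   // a throw here leaves `data` untouched
  next.current_block = await fetchChain();
  Object.assign(data, next);              // single atomic commit point
}

function start(fetchMarket, fetchChain) {
  (async function runAndReschedule() {
    try {
      await tick(fetchMarket, fetchChain);
      currentTickMs = BASE_TICK_MS;       // snap back to base on success
    } catch {
      // double on failure, capped at the 5-minute ceiling
      currentTickMs = Math.min(currentTickMs * 2, MAX_BACKOFF_MS);
    }
    // recursive setTimeout instead of setInterval; unref'd so it doesn't
    // keep the event loop alive in test / CLI contexts
    setTimeout(runAndReschedule, currentTickMs).unref?.();
  })();
}
```

A failed tick never reaches the `Object.assign`, so readers of `data` only ever see the defaults or the last fully resolved snapshot.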
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a5b52311b7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
    async function runAndReschedule() {
      await tick();
      scheduleNext();
Reschedule polling even when a tick stalls
runAndReschedule waits for tick() to resolve before scheduling the next timeout, so a single hung external call (e.g., CoinGecko or RPC never resolving) permanently stops the loop and freezes /api/mnstats at stale data. This is a regression from the previous setInterval behavior, which kept firing even if one iteration got stuck. In production this can happen during network stalls because none of the awaited calls in tick() are wrapped with a timeout/guard, so scheduleNext() is never reached.
Fixed in 6232489.
The rescheduling path now doesn't depend on tick() ever resolving:
- runAndReschedule() allocates the tick's generation up front and races the tick() promise against a TICK_WATCHDOG_MS (60s) timer via a one-shot rescheduleOnce() helper. Whichever fires first schedules the next tick; a stuck tick can no longer freeze the loop.
- When the watchdog wins it marks the tick's gen as completed (lastCompletedGen = gen) and advances the backoff, so a stalled tick that eventually resolves much later sees gen <= lastCompletedGen and no-ops - no late overwrite of dataStore, no late rollback of a newer commit's BASE_TICK_MS reset, no double-apply of backoff by a late rejection.
- For defence-in-depth the CoinGecko axios call also gets { timeout: 10_000 } so the HTTP layer surfaces an error long before the watchdog has to. (The SyscoinRpcClient doesn't expose a per-call timeout config knob; the watchdog is the universal safety net there.)
Covered by three new tests in services/sysMain.test.js: a stalled tick that eventually resolves is discarded…, a late-rejecting stalled tick does NOT undo backoff…, and watchdog reschedules the next tick even if the current one never resolves (fake-timer driven). Full backend suite green at 868/868.
Codex round 1 P1 on PR #15: rescheduling the next tick only from `await tick()` meant that a hung external call (CoinGecko or any RPC with no configured socket timeout) would freeze the loop permanently - a regression vs the prior `setInterval`, which kept firing every 20s regardless of whether an iteration was stuck.

Three coupled changes address this:

1. Per-tick watchdog. `runAndReschedule` now allocates the tick's generation up front and races the tick against a TICK_WATCHDOG_MS (60s) timer. Whichever fires first schedules the next tick via a one-shot `rescheduleOnce()`. If the watchdog wins, it also marks the tick's gen as completed and advances the backoff so the stuck tick's eventual resolution (if any) is a no-op.
2. Monotonic `lastCompletedGen` guard. Both the commit path and the reject path in `tick()` now check `gen <= lastCompletedGen` before publishing a result. A late-arriving stalled tick can therefore neither overwrite dataStore with a stale payload nor double-apply backoff on top of the fresh tick that already succeeded/failed.
3. CoinGecko axios timeout. `{ timeout: 10_000 }` is passed to the HTTP call as defence-in-depth, so the HTTP layer itself errors out in 10s rather than hanging and the watchdog is rarely the primary failure signal.

New tests exercise the failure modes:
- CoinGecko call is issued with a timeout option
- A stalled tick that eventually resolves is discarded when a newer tick has already committed (stale-gen guard, success path)
- A late-rejecting stalled tick does NOT undo backoff nor touch dataStore (stale-gen guard, reject path)
- Watchdog reschedules the next tick even if the current one never resolves (fake-timer driven)

Full backend suite: 868/868 passing.

Made-with: Cursor
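The watchdog race and stale-generation guard can be modelled in a few lines. This is a simplified sketch under stated assumptions: `tick` and `scheduleNext` are hypothetical injected callbacks, and the real module also couples this with the backoff and dataStore state.

```javascript
// Sketch of racing a tick against a watchdog with a monotonic generation guard.
// `tick` / `scheduleNext` are hypothetical callbacks for illustration.
const TICK_WATCHDOG_MS = 60_000;
let tickGen = 0;
let lastCompletedGen = 0;

function runAndReschedule(tick, scheduleNext) {
  const gen = ++tickGen;                 // allocate this tick's generation up front
  let rescheduled = false;
  const rescheduleOnce = () => {         // one-shot: whichever side fires first wins
    if (rescheduled) return;
    rescheduled = true;
    scheduleNext();
  };
  const watchdog = setTimeout(() => {
    // abandon the stuck tick: its eventual settle will see gen <= lastCompletedGen
    lastCompletedGen = Math.max(lastCompletedGen, gen);
    rescheduleOnce();
  }, TICK_WATCHDOG_MS);
  tick(gen)
    .catch(() => {})                     // a rejection advances backoff in the real module
    .finally(() => {
      clearTimeout(watchdog);
      rescheduleOnce();
    });
}

// Inside tick(), both the commit and reject paths would check the guard
// before publishing anything:
//   if (gen <= lastCompletedGen) return;  // stale generation: no-op
```

Because `rescheduleOnce` is one-shot, a tick that settles after the watchdog has already fired cannot queue a second overlapping schedule.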
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6232489c76
    const budgetResults = await Promise.allSettled(
      projections.map((p) =>
        rpcServices(client.callRpc).getSuperblockBudget(p.block).call()
      )
    );
Remove projected budget RPCs from commit-critical await path
fetchProjectedSuperblocks now awaits Promise.allSettled(...) for all five projected getSuperblockBudget RPCs before tick() can commit. If any one of those calls hangs (the RPC layer here has no per-call timeout), the tick never reaches the atomic Object.assign(data, next) and gets watchdog-abandoned; if the same projection keeps hanging, every generation is abandoned and /mnstats stops refreshing despite successful market/chain/governance reads. Add per-call timeouts or decouple projected budget resolution from the main commit path.
Fixed in e14a3fe.
Each projected-sb budget RPC is now raced against SB_BUDGET_TIMEOUT_MS (5s) via a small withTimeout(promise, ms, label) helper. Within Promise.allSettled the race rejection lands in the existing "To be determined" fallback branch, so an individual stuck projection never blocks commit and the rest of the atomic payload still lands. The outer 60s watchdog stays as the universal safety net; this bounds the inner layer so the watchdog isn't the primary failure signal for budget hangs.
Covered by the new test "a single projected-sb budget RPC that hangs is bounded by SB_BUDGET_TIMEOUT_MS and does not prevent commit" — fake-timer driven, one projection is a never-resolving Promise, the others succeed, and we assert the commit lands with that sb's budget as "To be determined" and market/chain/gov/mn fields populated.
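A plausible shape for the helper described above, assuming it is a plain Promise.race wrapper; the `resolveBudgets` wrapper, its parameters, and the default-timeout override are illustrative, not the module's actual API.

```javascript
// Sketch: bound each projected-sb budget RPC with a race against a timer,
// letting a timeout land in the existing "To be determined" fallback branch.
const SB_BUDGET_TIMEOUT_MS = 5_000;

function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // clear the timer whichever side wins, so it can't hold the event loop open
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Illustrative consumer: one stuck projection falls back without blocking the rest.
async function resolveBudgets(projections, getBudget, ms = SB_BUDGET_TIMEOUT_MS) {
  const results = await Promise.allSettled(
    projections.map((p) => withTimeout(getBudget(p.block), ms, `sb budget @${p.block}`))
  );
  return results.map((r) => (r.status === "fulfilled" ? r.value : "To be determined"));
}
```

Because the timeout surfaces as an ordinary rejection inside `Promise.allSettled`, the commit path never has to know whether a budget failed fast or timed out.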
    hashOneDayAgo = await rpcServices(client.callRpc).getBlockHash(oneDayAgoBlock).call();
    break;
    } catch {
      data.budget = "To be determined";
      block--;
    }
Preserve true chain head when retrying one-day-ago block hash
This retry loop decrements block on failures while repeatedly querying the same fixed oneDayAgoBlock, and the refactor now returns currentBlock: block later in fetchChainHead. A transient failure in this loop therefore lowers currentBlock even when the head hash was already found, which skews superBlockNextDate, superBlockNextEpochSec, and voting-deadline calculations for the committed snapshot. Use a separate retry variable (or avoid mutating block in this loop) so currentBlock remains accurate.
Fixed in e14a3fe.
Good catch — this was an unintended semantic shift in the refactor. The pre-refactor code wrote data.currentBlock = block immediately after the tip-block lookup and before the one-day-ago retry loop that also decrements block, so the one-day-ago loop's decrements could never pollute currentBlock. Moving the return to the end of the function broke that invariant.
Now currentBlock is pinned into a dedicated const at the original pin-point, and the one-day-ago retry uses a separate cursor variable for its decrements. The returned currentBlock is guaranteed to equal the head height regardless of how many transient failures the one-day-ago loop absorbed. Business logic matches the pre-refactor original exactly.
Covered by the new test "currentBlock is pinned BEFORE the one-day-ago retry loop" — injects a transient failure on the one-day-ago getBlockHash call, asserts data.currentBlock stays at the tip and that the derived sb1..sb5 values match.
…ound 2)

Two Codex findings on commit a5b5231:

P1 - projected-sb budget RPCs were in the commit-critical await path. `fetchProjectedSuperblocks` awaits Promise.allSettled over five getSuperblockBudget(height) calls before `tick()` can reach Object.assign(data, next). The SyscoinRpcClient has no per-call socket timeout, so a single persistently-hanging projection would keep every tick from committing until the outer 60s watchdog fired and abandoned it - and since the hang is height-deterministic, every subsequent tick would hit the same wall and /mnstats would stop refreshing entirely.

Fix: race each projected-sb budget RPC against SB_BUDGET_TIMEOUT_MS (5s) via a withTimeout helper. A stuck call rejects into the existing "To be determined" fallback path and the rest of the payload still commits atomically. The atomicity guarantee and the per-sb fallback semantics are both preserved.

P2 - the refactor accidentally let one-day-ago retry decrements pollute currentBlock. The original pre-refactor code wrote data.currentBlock = block right after the tip-block lookup and BEFORE the one-day-ago getBlockHash retry loop that also decrements `block`, so any transient failure in that loop could never lower currentBlock in the committed snapshot. Returning `currentBlock: block` at the end of the function (as the refactor did) broke that invariant: a single retry in the one-day-ago loop would skew currentBlock and cascade into superBlockNextEpochSec / voting-deadline math.

Fix: capture currentBlock into a dedicated const at the original pin-point, and use a separate `cursor` variable for the one-day-ago retry's decrements. This restores the original semantics exactly.

Tests added:
- "currentBlock is pinned BEFORE the one-day-ago retry loop" - forces a transient one-day-ago getBlockHash failure, asserts currentBlock and projected sb1..sb5 remain correct.
- "a single projected-sb budget RPC that hangs is bounded by SB_BUDGET_TIMEOUT_MS and does not prevent commit" - uses fake timers to advance past the per-sb timeout while one projection is a never-resolving Promise, asserts commit lands with that sb's budget falling back to "To be determined" and the rest populated.

Full backend suite: 870/870 passing across three consecutive runs.

Made-with: Cursor
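The P2 pin-point fix can be sketched like this. It is a simplified model of the described invariant, with hypothetical `getBlockCount`/`getBlockHash` callbacks; the 1440-block "one day" is an illustrative constant, not necessarily the module's actual value.

```javascript
// Sketch: pin the head height into a const BEFORE the one-day-ago retry loop,
// and let the loop decrement its own cursor so retries cannot lower the head.
async function fetchChainHead(getBlockCount, getBlockHash) {
  const currentBlock = await getBlockCount(); // pinned: retries below can't touch it
  const oneDayAgoBlock = currentBlock - 1440; // illustrative "one day" of blocks
  let hashOneDayAgo = null;
  let cursor = oneDayAgoBlock;                // separate retry cursor
  while (cursor > 0) {
    try {
      hashOneDayAgo = await getBlockHash(cursor);
      break;
    } catch {
      cursor--;                               // only the cursor moves on failure
    }
  }
  return { currentBlock, hashOneDayAgo };
}
```

The pre-refactor bug was equivalent to decrementing `currentBlock` itself inside the catch, which skewed every derived superblock calculation after a transient miss.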
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e14a3fe058
    if (tickTimer) return;
    stopped = false;
    // Kick off immediately; scheduleNext() will then use currentTickMs.
    runAndReschedule();
Make start() idempotent across immediate repeated calls
start() is guarded only by tickTimer, but runAndReschedule() is invoked immediately and does not set tickTimer until later, so two near-simultaneous start() calls can both pass the guard and spawn independent scheduler loops. In that scenario each loop runs its own watchdog/reschedule chain, causing duplicate RPC/CoinGecko traffic and racing updates/backoff state; this contradicts the module comment that start() is idempotent and can surface in tests or any runtime path that calls start() more than once.
Fixed in 7aee505.
Replaced the tickTimer-based guard (which as you correctly noted is not synchronously set — runAndReschedule only sets it after the in-flight tick finishes or the watchdog fires) with a dedicated started flag that's flipped before runAndReschedule() is invoked. stop() clears it so a restart is still possible, and __resetForTests() resets it for test isolation.
New regression test "start() is idempotent even across near-simultaneous calls (no duplicate scheduler loops — Codex round 3 P2)": pins the tick on an unresolved CoinGecko promise and fires start() three times back-to-back, asserting exactly one generation was allocated and exactly one axios.get was issued. Without the fix this test fails with tickGen === 3 and three parallel scheduler loops.
Full backend suite: 871/871 passing.
Codex round 3 P2 on commit e14a3fe: `start()` guarded on `tickTimer`, but `runAndReschedule()` only sets `tickTimer` after the in-flight tick finishes or the watchdog fires - so two `start()` calls in quick succession could both pass the guard and spawn independent scheduler loops, each with its own watchdog and backoff chain, doubling CoinGecko and RPC traffic and racing dataStore updates.

Replaced the tickTimer-based guard with a synchronous `started` flag that is set before `runAndReschedule()` is invoked, so repeated `start()` calls are truly idempotent. `stop()` clears the flag so a restart is still possible; `__resetForTests()` resets it too.

New test `"start() is idempotent even across near-simultaneous calls"` pins the tick on an unresolved CoinGecko promise and then fires three `start()` calls back-to-back, asserting exactly one generation was allocated and exactly one axios call was issued. Without the fix this test fails with tickGen=3 and three parallel scheduler loops.

Full backend suite: 871/871 passing.

Made-with: Cursor
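The guard reduces to a few lines; this sketch uses a hypothetical injected `runAndReschedule` callback to stay self-contained, and omits the test-suppression and diagnostics that surround it in the real module.

```javascript
// Sketch: a synchronous `started` flag makes start() idempotent, because the
// flag flips before any async work begins - unlike a guard on a timer handle
// that is only set later.
let started = false;

function start(runAndReschedule) {
  if (started) return;   // second near-simultaneous call bails out here
  started = true;        // set synchronously, before the loop is kicked off
  runAndReschedule();
}

function stop() {
  started = false;       // allows a deliberate restart later
}
```

The key property is that the check-and-set happens in one synchronous frame, so there is no window in which two callers can both pass the guard.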
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7aee50502e
    if (rescheduled || stopped) return;
    rescheduled = true;
    scheduleNext();
    }
Prevent pre-stop loops from rescheduling after restart
If stop() is called while a tick is still in flight and start() is called again before that tick settles, the older run loop can still execute this rescheduleOnce() closure. Because the guard only checks the shared stopped flag, a restart flips stopped back to false and lets the pre-stop loop schedule an extra timer, creating duplicate scheduler loops and duplicated CoinGecko/RPC traffic. This should be tied to a per-run token (or explicit watchdog/finally cancellation) so callbacks created before stop() cannot reschedule after a restart.
Fixed in 903dbb3.
Added a per-run-loop token (activeRunToken) captured at the top of each runAndReschedule() invocation and cleared by stop(). Both rescheduleOnce() and the watchdog now short-circuit when activeRunToken !== myToken, so callbacks belonging to a run loop that was stopped (and possibly restarted) cannot schedule timers on top of the new loop, and the stale watchdog cannot mutate shared backoff state for a defunct run.
Regression test asserts that when stop() is called while the first tick's CoinGecko fetch is still hung and start() is called again (with the second tick committing normally), releasing the stale fetch afterwards does not add a second scheduler loop — advancing BASE_TICK_MS yields exactly one new tick generation and one new CoinGecko call (pre-fix would yield two of each).
…odex round 4)
Codex round 4 P2: if stop() was called while a tick was still in flight
and start() was called again before the tick settled, the pre-stop run
loop's finally{} (and its still-armed watchdog) would observe the
restored stopped=false flag and call scheduleNext() on top of the fresh
run loop, silently spawning duplicate scheduler loops (and duplicate
CoinGecko / RPC traffic).
Attach a per-runAndReschedule() token (activeRunToken, refreshed on each
invocation and cleared by stop()). rescheduleOnce() and the watchdog now
compare against this token and no-op when they belong to a defunct run
loop, while the new loop owns scheduling.
Added regression test: drive stop() with a hung gecko fetch, restart,
settle the stale fetch, advance BASE_TICK_MS, assert exactly one tick
generation advances (pre-fix would advance by 2).
Made-with: Cursor
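A minimal model of the per-run token check described above. The callbacks are hypothetical, and the real module couples this token with the watchdog and generation counters rather than using it in isolation.

```javascript
// Sketch: each run loop captures a fresh token; stop() invalidates it, so a
// callback created before stop() can no longer reschedule after a restart.
let activeRunToken = null;

function runAndReschedule(tick, scheduleNext) {
  const myToken = Symbol("run");          // fresh token owned by this run loop
  activeRunToken = myToken;
  let rescheduled = false;
  const rescheduleOnce = () => {
    // a defunct run loop (stopped, possibly restarted since) must no-op,
    // even though `stopped` may have been flipped back to false
    if (rescheduled || activeRunToken !== myToken) return;
    rescheduled = true;
    scheduleNext();
  };
  tick().finally(rescheduleOnce);
}

function stop() {
  activeRunToken = null;                  // orphans any in-flight run loop
}
```

Comparing against the current token (rather than the shared `stopped` flag alone) is what prevents the pre-stop loop from piggybacking on a restart.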
@codex review
Codex Review: Didn't find any major issues. Nice work!
Why
On the live server, /api/mnstats currently returns zeros for every field - mn_enabled: "0", next_superblock: "SB0 - 0", superblock_next_epoch_sec: 0, current_block: "0", price_usd: "0.0000" - even though syscoind is healthy and reachable.

Root cause is in services/sysMain.js:
- Each field is written directly to the shared dataStore as it resolves, inside a single top-level try/catch.
- CoinGecko is rate-limiting the host (curl https://api.coingecko.com/api/v3/coins/syscoin from the server returns HTTP 429 instantly).
- The first await throws, the catch logs, and every subsequent RPC is skipped. On a cold start nothing has ever populated, so dataStore stays at its initialised defaults - and the frontend's new window-derivation flow (PR feat(mnStats): expose superblock_next_epoch_sec #13, PR Reminder dispatcher: key on next superblock, not end_epoch #14 consumers) can never arm because superblock_next_epoch_sec is permanently 0.
- setInterval keeps retrying at a fixed 20s through the 429 window, which doesn't help CoinGecko's rate-limit bucket drain.
This PR restructures the tick without changing any business logic:
- Atomic last-good commit: each tick assembles its values into a local `next` object and publishes via a single Object.assign(data, next) only once the whole payload has resolved.
- Per-sb budget isolation: projected sb1..sb5 budgets resolve via Promise.allSettled with per-sb "To be determined" fallbacks.
- Exponential backoff: a recursive setTimeout doubles the interval on failure up to a 5-minute ceiling and snaps back to the 20s base on success.
- Testability: auto-start is suppressed under NODE_ENV=test, and hooks are exported so unit tests can drive the loop deterministically.
Business logic is identical: same RPC calls, same order, same fields, same formulas, same `/mnstats` response shape. What changes is strictly the error-boundary and commit-point semantics.
Tests
New `services/sysMain.test.js` — 8 tests covering:
Full backend suite: 864/864 passing locally.
Test plan
🤖 @codex review
Made with Cursor