fix(server): invoke plugin shutdown hooks and close connections during graceful shutdown#441
Open
MarioCadenas wants to merge 3 commits into
Open
fix(server): invoke plugin shutdown hooks and close connections during graceful shutdown#441MarioCadenas wants to merge 3 commits into
MarioCadenas wants to merge 3 commits into
Conversation
…g graceful shutdown Plugin shutdown() implementations (analytics, genie, files, agents, serving, vector-search) were dead code: nothing in src ever called them, so the files plugin's 10s in-flight-write drain never ran. The declared "shutdown" lifecycle event was likewise never emitted. And because server.close(cb) never completes while keep-alive/SSE sockets exist, any connected browser forced every shutdown to burn the full 15s timeout and exit(1), which orchestrators record as a crash on routine deploys. Changes: - Declare shutdown?(): Promise<void> | void on BasePlugin (shared) so the contract is typed. - _gracefulShutdown now runs every plugin's shutdown() concurrently with a 10s per-plugin timeout (errors logged, never thrown), then emits the "shutdown" lifecycle event via PluginContext. - Call server.closeIdleConnections() right after server.close() and server.closeAllConnections() after plugin hooks complete so close() can finish instead of hanging on keep-alive/SSE sockets. - Switch signal handlers to process.once plus an isShuttingDown guard to prevent re-entrant shutdown. - Exit 0 when the shutdown deadline forces exit (deliberate shutdown, not a crash); exit 1 only on unexpected shutdown errors. - Make TelemetryManager.shutdown() public and idempotent, and flush it inside the orchestrated shutdown instead of racing the competing SIGTERM handler against process.exit. The 15s overall budget is unchanged; no new config knobs. Co-authored-by: Isaac Signed-off-by: MarioCadenas <MarioCadenas@users.noreply.github.com>
…hutdown phases - Move Lakebase SP/OBO pool teardown out of abortActiveOperations() (shutdown phase 1) into an awaited shutdown() hook (phase 3) so other plugins' shutdown hooks can still drain state through the database. - Close the CacheManager's storage during shutdown; its persistent Lakebase pool was previously never closed. - Bound the shutdown lifecycle emit and the telemetry flush with a 2s per-phase timeout so the worst case (10s plugin hooks + 2s + 2s) stays inside the 15s force-exit budget; document the arithmetic. - Attach a no-op rejection handler to the racing branch in the shared timeout helper so a hook that rejects after its timeout already won cannot surface as an unhandledRejection. - Make the server the single owner of the telemetry flush: it calls TelemetryManager.disownSignalHandlers() at start so the standalone SIGTERM/SIGINT handlers (kept for server-less usage) cannot start the flush before plugin hooks have run. - Keep exit code 0 on the force-exit path (deploy safety) but log at error level with the phase in flight; record the tradeoff in a comment. - Document the shutdown lifecycle event semantics, the synchronous isShuttingDown guard, and cancellation-only abortActiveOperations. - Tests: phase ordering, hanging telemetry flush, late rejection suppression, and lakebase pool teardown via shutdown(). Co-authored-by: Isaac Signed-off-by: MarioCadenas <MarioCadenas@users.noreply.github.com>
…y with telemetry flush - Bound the cache storage close with raceWithTimeout(PHASE_SHUTDOWN_TIMEOUT_MS) so a stuck pool drain cannot blow the graceful budget and hit the 15s force-exit; close failures/timeouts are now logged instead of silently swallowed (a never-initialized cache stays a silent no-op) - Run the cache close concurrently with the telemetry flush (they are independent), keeping the worst case at 10s + 2s + max(2s, 2s) = 14s - Update the shutdown budget doc comment with the real phase arithmetic and document that the serverClosed await is unbounded by design (it follows closeAllConnections() and the force-exit timer is the backstop) - Tests: pin the cache-close phase in the ordering test (after closeAll, before exit, unordered relative to the concurrent flush) and add a hanging-cache-close test mirroring the hanging-flush one Co-authored-by: Isaac Signed-off-by: MarioCadenas <MarioCadenas@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem (reliability finding R-4, High)
Graceful shutdown was broken in three independent ways:
shutdown()was dead code. Six plugins (analytics, genie, files, agents, serving, vector-search) implementshutdown(), but nothing insrcever called it —_gracefulShutdown()only invokedabortActiveOperations(). Most notably, the files plugin's 10-second in-flight-write drain never ran, so writes could be killed mid-flight on every deploy. The"shutdown"lifecycle event declared inPluginContextwas likewise never emitted, andBasePlugindidn't even declare the method.server.close(cb)never completes while keep-alive/SSE sockets exist, so the close callback (the onlyexit(0)path) never fired and the force-shutdown timer calledprocess.exit(1)— orchestrators record a crash on routine deploys.process.on(notonce), allowing re-entrant shutdown, andTelemetryManagerhad its own competing SIGTERM flush that raced the server'sprocess.exit.Changes
packages/shared/src/plugin.ts— declareshutdown?(): Promise<void> | voidonBasePluginso the contract is typed.packages/appkit/src/plugins/server/index.ts— rework_gracefulShutdown():shutdown()concurrently, each bounded by a 10s per-plugin timeout (sized to cover the files plugin's 10s write drain; errors and timeouts are logged, never thrown)."shutdown"lifecycle event viaPluginContext.emitLifecycle().server.closeIdleConnections()right afterserver.close(), andserver.closeAllConnections()once plugin hooks finish, soclose()can actually complete (both exist on Node ≥18.2http.Server).process.onceand add anisShuttingDownguard.packages/appkit/src/telemetry/telemetry-manager.ts— makeTelemetryManager.shutdown()public and idempotent (SDK reference cleared synchronously; repeated/concurrent calls share one flush). The server now awaits the telemetry flush inside the orchestrated shutdown instead of racing the competing SIGTERM handler againstprocess.exit. The existingprocess.oncehandler remains as a fallback for apps without the server plugin and is now a safe no-op when the server flushes first.packages/appkit/src/core/plugin-context.ts— doc update:emitLifecycleis called by core (setup:complete) and the server plugin (shutdown).The overall 15s budget is unchanged and no new config knobs were added.
Tests
Extended
packages/appkit/src/plugins/server/tests/server.test.ts(_gracefulShutdownblock, following the existing mockedprocess.exitpattern):shutdown()hooks are called during graceful shutdown (and skipped for plugins without one)"shutdown"lifecycle event firescloseIdleConnections()/closeAllConnections()are called soclose()completesVerification
pnpm install✓pnpm build✓pnpm -r typecheck✓pnpm check:fix✓ (no diagnostics on changed files)This pull request and its description were written by Isaac.