
fix(server): invoke plugin shutdown hooks and close connections during graceful shutdown#441

Open
MarioCadenas wants to merge 3 commits into main from fix/plugin-shutdown-lifecycle

@MarioCadenas

Collaborator

Problem (reliability finding R-4, High)

Graceful shutdown was broken in three independent ways:

  1. Plugin shutdown() was dead code. Six plugins (analytics, genie, files, agents, serving, vector-search) implement shutdown(), but nothing in src ever called it — _gracefulShutdown() only invoked abortActiveOperations(). Most notably, the files plugin's 10-second in-flight-write drain never ran, so writes could be killed mid-flight on every deploy. The "shutdown" lifecycle event declared in PluginContext was likewise never emitted, and BasePlugin didn't even declare the method.
  2. Every shutdown with a connected browser burned the full 15s timeout and exited 1. server.close(cb) never completes while keep-alive/SSE sockets exist, so the close callback (the only exit(0) path) never fired and the force-shutdown timer called process.exit(1) — orchestrators record a crash on routine deploys.
  3. Races and re-entrancy. Signal handlers used process.on (not once), allowing re-entrant shutdown, and TelemetryManager had its own competing SIGTERM flush that raced the server's process.exit.

Changes

  • packages/shared/src/plugin.ts — declare shutdown?(): Promise<void> | void on BasePlugin so the contract is typed.
  • packages/appkit/src/plugins/server/index.ts — rework _gracefulShutdown():
    • After aborting active operations, run every plugin's shutdown() concurrently, each bounded by a 10s per-plugin timeout (sized to cover the files plugin's 10s write drain; errors and timeouts are logged, never thrown).
    • Emit the "shutdown" lifecycle event via PluginContext.emitLifecycle().
    • Call server.closeIdleConnections() right after server.close(), and server.closeAllConnections() once plugin hooks finish, so close() can actually complete (both methods are available on http.Server since Node 18.2).
    • Switch signal handlers to process.once and add an isShuttingDown guard.
    • The 15s force-exit now exits 0 (it's a deliberate shutdown, not a crash); exit 1 is reserved for unexpected errors thrown during shutdown.
  • packages/appkit/src/telemetry/telemetry-manager.ts — make TelemetryManager.shutdown() public and idempotent (SDK reference cleared synchronously; repeated/concurrent calls share one flush). The server now awaits the telemetry flush inside the orchestrated shutdown instead of racing the competing SIGTERM handler against process.exit. The existing process.once handler remains as a fallback for apps without the server plugin and is now a safe no-op when the server flushes first.
  • packages/appkit/src/core/plugin-context.ts — doc update: emitLifecycle is called by core (setup:complete) and the server plugin (shutdown).
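
For reviewers who want the hook phase in one place, here is a minimal sketch of running every plugin's shutdown() concurrently under a per-plugin timeout. It is not the code in this PR: ShutdownCapable, HookOutcome, runHook, runShutdownHooks and the constant name are hypothetical, and only the optional shutdown() member and the 10s bound come from the description above.

```typescript
// Hypothetical plugin shape; only the optional shutdown() member comes from the PR.
interface ShutdownCapable {
  name: string;
  shutdown?(): Promise<void> | void;
}

type HookOutcome = "ok" | "failed" | "timeout" | "skipped";

// The per-plugin bound named in the PR; the constant name is an assumption.
const PLUGIN_SHUTDOWN_TIMEOUT_MS = 10_000;

// Runs one hook. Always resolves with an outcome and never rejects.
async function runHook(plugin: ShutdownCapable, timeoutMs: number): Promise<HookOutcome> {
  const hook = plugin.shutdown;
  if (typeof hook !== "function") return "skipped";

  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<HookOutcome>((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });

  // Starting from a resolved promise turns a synchronous throw into a rejection.
  const work: Promise<HookOutcome> = Promise.resolve()
    .then(() => hook.call(plugin))
    .then(
      (): HookOutcome => "ok",
      (err: unknown): HookOutcome => {
        console.error(`[shutdown] ${plugin.name} failed:`, err);
        return "failed";
      },
    );

  try {
    const outcome = await Promise.race([work, timeout]);
    if (outcome === "timeout") {
      console.error(`[shutdown] ${plugin.name} exceeded ${timeoutMs}ms`);
    }
    return outcome;
  } finally {
    clearTimeout(timer);
  }
}

// Hooks run concurrently, so the phase lasts as long as the slowest hook, capped at timeoutMs.
async function runShutdownHooks(
  plugins: ShutdownCapable[],
  timeoutMs: number = PLUGIN_SHUTDOWN_TIMEOUT_MS,
): Promise<Map<string, HookOutcome>> {
  const entries = await Promise.all(
    plugins.map(
      async (p): Promise<[string, HookOutcome]> => [p.name, await runHook(p, timeoutMs)],
    ),
  );
  return new Map(entries);
}
```

Because each hook is mapped to an outcome instead of being allowed to reject, one failing or hanging plugin cannot stop the others from draining.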

The overall 15s budget is unchanged and no new config knobs were added.
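
The connection-closing order described in the changes above can be sketched as follows. This is an illustration under assumptions, not the server plugin's code: ClosableServer and closeServer are hypothetical names, and drain stands in for whatever runs between the two calls (the plugin hooks, in this PR). The three methods are the real http.Server ones.

```typescript
// Structural subset of node:http's Server, so the sketch also accepts a test double.
interface ClosableServer {
  close(callback?: (err?: Error) => void): unknown;
  closeIdleConnections(): void;
  closeAllConnections(): void;
}

async function closeServer(
  server: ClosableServer,
  drain: () => Promise<void>,
): Promise<void> {
  // close() stops accepting new connections. Its callback fires only once
  // every existing socket has ended, which keep-alive and SSE sockets delay.
  const serverClosed = new Promise<void>((resolve) => {
    server.close(() => resolve());
  });

  // Idle keep-alive sockets carry no in-flight request, so drop them right away.
  server.closeIdleConnections();

  // Give in-flight work a chance to finish.
  await drain();

  // Destroy whatever is still open so the close() callback can fire.
  server.closeAllConnections();
  await serverClosed;
}
```

An instance returned by http.createServer() satisfies ClosableServer structurally, so it could be passed in directly.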

Tests

Extended packages/appkit/src/plugins/server/tests/server.test.ts (_gracefulShutdown block, following the existing mocked process.exit pattern):

  • plugin shutdown() hooks are called during graceful shutdown (and skipped for plugins without one)
  • a failing hook is logged and doesn't abort shutdown
  • a hanging hook doesn't block past its 10s timeout (fake timers)
  • the "shutdown" lifecycle event fires
  • closeIdleConnections()/closeAllConnections() are called so close() completes
  • shutdown is not re-entrant (second signal is a no-op)
  • exits 0 with and without a server instance
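
The re-entrancy case in the list above boils down to a guard like the one below. This is a minimal sketch, not the test or the server code: installShutdownHandlers is a hypothetical name, and it takes any EventEmitter so it can be exercised without sending real signals.

```typescript
import { EventEmitter } from "node:events";

// `target` would be `process` in real use; any EventEmitter works for the sketch.
function installShutdownHandlers(
  target: EventEmitter,
  shutdown: (signal: string) => Promise<void>,
): void {
  let isShuttingDown = false;

  for (const signal of ["SIGTERM", "SIGINT"]) {
    // once(): the listener removes itself after its first invocation.
    target.once(signal, () => {
      // The flag flips synchronously, before any await, so a different
      // signal arriving while shutdown is in progress does nothing.
      if (isShuttingDown) return;
      isShuttingDown = true;
      shutdown(signal).catch((err: unknown) => {
        console.error("[shutdown] unexpected error:", err);
      });
    });
  }
}
```

The once() listeners alone are not enough, because SIGTERM followed by SIGINT would still start two shutdowns; the flag covers that case.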

Verification

  • pnpm install
  • pnpm build
  • pnpm -r typecheck
  • pnpm check:fix ✓ (no diagnostics on changed files)
  • Full appkit suite: 119 files, 2146 tests passed (includes server, telemetry, plugin-context, files/serving/vector-search shutdown tests)

This pull request and its description were written by Isaac.

…g graceful shutdown

Plugin shutdown() implementations (analytics, genie, files, agents,
serving, vector-search) were dead code: nothing in src ever called them,
so the files plugin's 10s in-flight-write drain never ran. The declared
"shutdown" lifecycle event was likewise never emitted. And because
server.close(cb) never completes while keep-alive/SSE sockets exist,
any connected browser forced every shutdown to burn the full 15s
timeout and exit(1), which orchestrators record as a crash on routine
deploys.

Changes:
- Declare shutdown?(): Promise<void> | void on BasePlugin (shared) so
  the contract is typed.
- _gracefulShutdown now runs every plugin's shutdown() concurrently
  with a 10s per-plugin timeout (errors logged, never thrown), then
  emits the "shutdown" lifecycle event via PluginContext.
- Call server.closeIdleConnections() right after server.close() and
  server.closeAllConnections() after plugin hooks complete so close()
  can finish instead of hanging on keep-alive/SSE sockets.
- Switch signal handlers to process.once plus an isShuttingDown guard
  to prevent re-entrant shutdown.
- Exit 0 when the shutdown deadline forces exit (deliberate shutdown,
  not a crash); exit 1 only on unexpected shutdown errors.
- Make TelemetryManager.shutdown() public and idempotent, and flush it
  inside the orchestrated shutdown instead of racing the competing
  SIGTERM handler against process.exit.

The 15s overall budget is unchanged; no new config knobs.

Co-authored-by: Isaac
Signed-off-by: MarioCadenas <MarioCadenas@users.noreply.github.com>
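
A note on the idempotent telemetry shutdown in the commit above: one way to get "cleared synchronously, one shared flush" is sketched below. TelemetrySdk and TelemetryManagerSketch are hypothetical stand-ins, not the real TelemetryManager or SDK types.

```typescript
// Hypothetical minimal SDK surface; the real SDK type is not shown in this PR.
interface TelemetrySdk {
  shutdown(): Promise<void>;
}

class TelemetryManagerSketch {
  private sdk: TelemetrySdk | undefined;
  private flushPromise: Promise<void> | undefined;

  constructor(sdk: TelemetrySdk) {
    this.sdk = sdk;
  }

  shutdown(): Promise<void> {
    // Repeated or concurrent callers get the flush that is already in flight.
    if (this.flushPromise) return this.flushPromise;

    // Clear the reference synchronously so no second flush can start.
    const sdk = this.sdk;
    this.sdk = undefined;

    this.flushPromise = sdk
      ? sdk.shutdown().catch((err: unknown) => {
          console.error("[telemetry] flush failed:", err);
        })
      : Promise.resolve();
    return this.flushPromise;
  }
}
```

With this shape, a fallback signal handler that calls shutdown() after the server has already flushed simply receives the settled promise.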
MarioCadenas requested a review from a team as a code owner on June 11, 2026 at 16:30
…hutdown phases

- Move Lakebase SP/OBO pool teardown out of abortActiveOperations()
  (shutdown phase 1) into an awaited shutdown() hook (phase 3) so other
  plugins' shutdown hooks can still drain state through the database.
- Close the CacheManager's storage during shutdown; its persistent
  Lakebase pool was previously never closed.
- Bound the shutdown lifecycle emit and the telemetry flush with a 2s
  per-phase timeout so the worst case (10s plugin hooks + 2s + 2s)
  stays inside the 15s force-exit budget; document the arithmetic.
- Attach a no-op rejection handler to the racing branch in the shared
  timeout helper so a hook that rejects after its timeout already won
  cannot surface as an unhandledRejection.
- Make the server the single owner of the telemetry flush: it calls
  TelemetryManager.disownSignalHandlers() at start so the standalone
  SIGTERM/SIGINT handlers (kept for server-less usage) cannot start
  the flush before plugin hooks have run.
- Keep exit code 0 on the force-exit path (deploy safety) but log at
  error level with the phase in flight; record the tradeoff in a comment.
- Document the shutdown lifecycle event semantics, the synchronous
  isShuttingDown guard, and cancellation-only abortActiveOperations.
- Tests: phase ordering, hanging telemetry flush, late rejection
  suppression, and lakebase pool teardown via shutdown().

Co-authored-by: Isaac
Signed-off-by: MarioCadenas <MarioCadenas@users.noreply.github.com>
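
The split between cancellation (phase 1) and awaited teardown (phase 3) in the commit above can be illustrated with a small sketch. Pool and PoolOwningPlugin are hypothetical; only the method names abortActiveOperations() and shutdown() come from the PR.

```typescript
// Hypothetical stand-in for a database pool; only end() is used here.
interface Pool {
  end(): Promise<void>;
}

class PoolOwningPlugin {
  private readonly pools: Pool[];
  private readonly controllers = new Set<AbortController>();

  constructor(pools: Pool[]) {
    this.pools = pools;
  }

  // Registers an in-flight operation and returns the signal it should observe.
  track(): AbortSignal {
    const controller = new AbortController();
    this.controllers.add(controller);
    return controller.signal;
  }

  // Phase 1: cancellation only. Synchronous, and the pools stay open so
  // other plugins' shutdown hooks can still reach the database.
  abortActiveOperations(): void {
    for (const controller of this.controllers) controller.abort();
    this.controllers.clear();
  }

  // Phase 3: awaited teardown. allSettled lets one failing pool close
  // not prevent the others from closing.
  async shutdown(): Promise<void> {
    await Promise.allSettled(this.pools.map((pool) => pool.end()));
  }
}
```

Keeping phase 1 free of teardown is what lets every plugin still use shared resources while its own hook drains.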
…y with telemetry flush

- Bound the cache storage close with raceWithTimeout(PHASE_SHUTDOWN_TIMEOUT_MS)
  so a stuck pool drain cannot blow the graceful budget and hit the 15s
  force-exit; close failures/timeouts are now logged instead of silently
  swallowed (a never-initialized cache stays a silent no-op)
- Run the cache close concurrently with the telemetry flush (they are
  independent), keeping the worst case at 10s + 2s + max(2s, 2s) = 14s
- Update the shutdown budget doc comment with the real phase arithmetic and
  document that the serverClosed await is unbounded by design (it follows
  closeAllConnections() and the force-exit timer is the backstop)
- Tests: pin the cache-close phase in the ordering test (after closeAll,
  before exit, unordered relative to the concurrent flush) and add a
  hanging-cache-close test mirroring the hanging-flush one

Co-authored-by: Isaac
Signed-off-by: MarioCadenas <MarioCadenas@users.noreply.github.com>
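
The budget arithmetic in the commit above can be checked with a short sketch. raceWithTimeout and PHASE_SHUTDOWN_TIMEOUT_MS are named in the commit, but the signature used here is an assumption, and the other constant names, PhaseOutcome and closeCacheAndFlushTelemetry are hypothetical.

```typescript
// Timeouts quoted in the commit messages. PHASE_SHUTDOWN_TIMEOUT_MS is named
// there; the other two constant names are assumptions.
const PLUGIN_SHUTDOWN_TIMEOUT_MS = 10_000;
const PHASE_SHUTDOWN_TIMEOUT_MS = 2_000;
const FORCE_EXIT_TIMEOUT_MS = 15_000;

type PhaseOutcome = "done" | "failed" | "timeout";

// Assumed signature: resolves with an outcome and never rejects.
function raceWithTimeout(
  work: Promise<unknown>,
  timeoutMs: number,
  label: string,
): Promise<PhaseOutcome> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<PhaseOutcome>((resolve) => {
    timer = setTimeout(() => {
      console.error(`[shutdown] ${label} timed out after ${timeoutMs}ms`);
      resolve("timeout");
    }, timeoutMs);
  });
  const settled = work.then(
    (): PhaseOutcome => "done",
    (err: unknown): PhaseOutcome => {
      console.error(`[shutdown] ${label} failed:`, err);
      return "failed";
    },
  );
  // Clearing the timer keeps a fast phase from logging a spurious timeout.
  return Promise.race([settled, timeout]).finally(() => clearTimeout(timer));
}

// The two steps are independent, so running them together costs max(a, b), not a + b.
async function closeCacheAndFlushTelemetry(
  closeCache: () => Promise<void>,
  flushTelemetry: () => Promise<void>,
  timeoutMs: number = PHASE_SHUTDOWN_TIMEOUT_MS,
): Promise<{ cache: PhaseOutcome; telemetry: PhaseOutcome }> {
  const [cache, telemetry] = await Promise.all([
    raceWithTimeout(closeCache(), timeoutMs, "cache close"),
    raceWithTimeout(flushTelemetry(), timeoutMs, "telemetry flush"),
  ]);
  return { cache, telemetry };
}

// Worst case: plugin hooks + lifecycle emit + the concurrent final phase.
const worstCaseMs =
  PLUGIN_SHUTDOWN_TIMEOUT_MS +
  PHASE_SHUTDOWN_TIMEOUT_MS +
  Math.max(PHASE_SHUTDOWN_TIMEOUT_MS, PHASE_SHUTDOWN_TIMEOUT_MS);
```

Here worstCaseMs evaluates to 14000, which leaves one second of headroom under the 15s force-exit timer.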