fix(db-part-1): eliminate pool self-deadlock from nested checkouts inside transactions #4975

Merged
icecrasher321 merged 3 commits into staging from improvement/db-pattern
Jun 12, 2026

Conversation

@icecrasher321

Collaborator

Summary

  • Fix 10 paths where code inside db.transaction callbacks queried the global
    postgres-js pool instead of the tx handle — at saturation, every held
    connection waits on a second checkout and the pool deadlocks silently

  • Thread the tx executor where reads need transaction consistency (table
    upsert uniqueness, deploy validation, credential-ID migration, name dedup)

  • Hoist independent work pre-tx (auth checks, billing context, enterprise
    entitlement, embedding generation) and move credential-set webhook sync
    post-commit so external HTTP never runs on a held connection

  • Add a runtime tripwire in @sim/db: AsyncLocalStorage-instrumented client
    detects any global-pool query inside a tx callback at any call depth —
    throws in dev/test, rate-limit-logs in prod (DB_TX_TRIPWIRE to override),
    with runOutsideTransactionContext() as the deliberate escape hatch
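
To make the saturation argument concrete, here is a toy model of the failure mode. It is a sketch only: `ToyPool` and `countStuckCallbacks` are hypothetical names, the pool just counts free connections, and nothing here is postgres-js.

```typescript
// Toy model of a fixed-size connection pool. This is NOT postgres-js;
// it only counts free connections to show why nested checkouts can stall.
class ToyPool {
  private free: number

  constructor(size: number) {
    this.free = size
  }

  // Returns true if a connection was available, false if the caller would block.
  tryCheckout(): boolean {
    if (this.free === 0) return false
    this.free -= 1
    return true
  }

  release(): void {
    this.free += 1
  }
}

type QueryRoute = 'tx' | 'pool'

// Opens up to `txCount` transactions at once (each holds one connection),
// then lets every open callback issue one inner query via `route`.
// Returns how many callbacks end up waiting for a connection.
function countStuckCallbacks(poolSize: number, txCount: number, route: QueryRoute): number {
  const pool = new ToyPool(poolSize)

  // Phase 1: every transaction that can get a connection opens.
  let open = 0
  for (let i = 0; i < txCount; i++) {
    if (pool.tryCheckout()) open += 1
  }

  // Phase 2: each open callback runs its inner query.
  let stuck = 0
  for (let i = 0; i < open; i++) {
    if (route === 'tx') continue // reuses the connection it already holds
    if (pool.tryCheckout()) {
      pool.release() // a spare connection existed: the query runs and returns it
    } else {
      stuck += 1 // waits on a connection that only another stuck callback can free
    }
  }
  return stuck
}
```

With a pool of 3 and 3 concurrent transactions, routing the inner query through the pool leaves all 3 callbacks waiting on each other, while routing it through the tx handle leaves none waiting.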

Type of Change

  • Bug fix

Testing

Tested manually

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel

vercel Bot commented Jun 11, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| docs | Ready | Preview, Comment | Jun 12, 2026 12:15am |

@cursor

cursor Bot commented Jun 11, 2026

PR Summary

High Risk
Touches billing usage recording, invitations, credential-set webhooks, knowledge chunk updates, and deploy validation—critical paths where post-commit webhook sync or tripwire behavior could surface regressions under load.

Overview
This PR addresses Postgres connection-pool starvation when code inside db.transaction callbacks hits the global pool (billing lookups, OAuth/webhook HTTP, embedding APIs) instead of the transaction handle.

@sim/db tripwire: Primary and realtime pools are wrapped with instrumentPoolClient plus AsyncLocalStorage so global-pool queries inside a transaction callback are detected (throw in dev/CI, rate-limited logs in prod; DB_TX_TRIPWIRE override). runOutsideTransactionContext() is the escape hatch for intentional fire-and-forget work.
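
A minimal sketch of how such a tripwire could work, assuming a simplified client with only `query` and `begin`. `ToyClient`, `instrumentToyClient`, and `runOutsideToyTransactionContext` are hypothetical names for illustration; the PR's actual `instrumentPoolClient` wraps `unsafe` and `begin` on the postgres-js root client and may differ in detail.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks'

type TripwireMode = 'throw' | 'warn' | 'off'

// Hypothetical minimal client shape; the real postgres-js client is richer.
interface ToyClient {
  query(sql: string): Promise<string>
  begin<T>(callback: (tx: ToyClient) => Promise<T>): Promise<T>
}

// Marks "we are inside a transaction callback" for the current async call chain.
const txContext = new AsyncLocalStorage<{ inTransaction: boolean }>()

function instrumentToyClient(root: ToyClient, mode: TripwireMode, warnings: string[]): ToyClient {
  return {
    async query(sql: string): Promise<string> {
      const inTx = txContext.getStore()?.inTransaction === true
      if (inTx && mode !== 'off') {
        const message = `global-pool query inside transaction: ${sql}`
        if (mode === 'throw') throw new Error(message)
        warnings.push(message) // stand-in for a rate-limited production log
      }
      return root.query(sql)
    },
    begin<T>(callback: (tx: ToyClient) => Promise<T>): Promise<T> {
      // Everything awaited inside the callback inherits this store, at any call depth.
      // The tx handle itself is not instrumented, so queries on it never trip.
      return root.begin((tx) => txContext.run({ inTransaction: true }, () => callback(tx)))
    },
  }
}

// Escape hatch: run deliberately global-pool work with the marker cleared.
function runOutsideToyTransactionContext<T>(fn: () => Promise<T>): Promise<T> {
  return txContext.run({ inTransaction: false }, fn)
}
```

Because the marker lives in `AsyncLocalStorage` rather than in a function argument, a helper several calls deep that reaches for the global client is still caught, which is the property the summary describes as "at any call depth".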

Refactors across sim: Enterprise auto-add entitlement and billing context for recordUsage are resolved before opening transactions. Credential-set webhook sync runs after commit with errors logged, not rolled back. Chunk updates generate embeddings outside the transaction with row-lock retries. MCP sync requires preloaded state when called with tx. Deploy validation, unique checks, workflow dedup, and credential migration thread the tx executor where reads must see uncommitted writes.
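
The "thread the tx executor" part can be sketched as follows. This is an illustrative in-memory model, not the PR's code: `Executor` stands in for the `DbOrTx` type, `ToyDb`, `ToyTx`, and `isNameTaken` are hypothetical, and the model is synchronous where the real calls are async.

```typescript
// Hypothetical stand-in for DbOrTx: anything that can list existing names.
interface Executor {
  listNames(): string[]
}

class ToyDb implements Executor {
  committed: string[] = []

  listNames(): string[] {
    return [...this.committed]
  }

  // Runs the callback, then "commits" whatever the transaction wrote.
  transaction<T>(callback: (tx: ToyTx) => T): T {
    const tx = new ToyTx(this)
    const result = callback(tx)
    this.committed.push(...tx.pending)
    return result
  }
}

class ToyTx implements Executor {
  pending: string[] = []
  private db: ToyDb

  constructor(db: ToyDb) {
    this.db = db
  }

  insert(name: string): void {
    this.pending.push(name)
  }

  // A transaction sees committed rows plus its own uncommitted writes.
  listNames(): string[] {
    return [...this.db.listNames(), ...this.pending]
  }
}

const db = new ToyDb()

// Defaults to the global handle; callers inside a transaction pass `tx`.
function isNameTaken(name: string, executor: Executor = db): boolean {
  return executor.listNames().includes(name)
}
```

Inside a transaction that has just inserted `'report'`, `isNameTaken('report', tx)` sees the uncommitted row while `isNameTaken('report')` does not, so a check that omits the executor can miss a duplicate in addition to costing a second connection.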

Docs: Team/Enterprise workspace limits now state org-owned shared workspaces (Owners/Admins create; Members cannot) and clarify Enterprise vs Team seat behavior on invites.

Reviewed by Cursor Bugbot for commit a1ddd02.

@greptile-apps

greptile-apps Bot commented Jun 11, 2026

Contributor

Greptile Summary

This PR eliminates 10 identified paths where code inside db.transaction() callbacks issued queries against the global postgres-js pool, creating a second pooled-connection checkout that can deadlock the pool at saturation. The fix adds a runtime AsyncLocalStorage-based tripwire in @sim/db (throws in dev/CI, rate-limited error log in prod) and threads tx handles or hoists independent work pre-transaction across billing, knowledge, webhook, workflow, MCP, and credential-set code paths.

  • tx-tripwire.ts: New instrumentPoolClient wraps unsafe and begin on postgres-js root clients; runOutsideTransactionContext provides a deliberate escape hatch for fire-and-forget global-pool work; fully tested with 10 unit tests covering all modes and the lazy-thenable edge case.
  • 10 deadlock paths fixed: Billing context resolution, enterprise-plan entitlement, embedding generation, credential-ID migration, deploy validation, name deduplication, and uniqueness checks are all either hoisted pre-transaction or threaded with the tx executor.
  • Credential-set webhook sync: Moved post-commit (no longer atomic with the membership write) to keep external HTTP off a held connection; sync failures are caught and logged with documented eventual-consistency semantics.
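
The hoist / transact / post-commit ordering described in these bullets could look roughly like this. All names (`addMember`, `Transact`, the injected callbacks) are hypothetical; the sketch only shows the ordering and the isolated error handling, not the real route handler.

```typescript
// Hypothetical transaction handle and runner, injected so the sketch stays self-contained.
interface ToyTx {
  write(row: string): Promise<void>
}
type Transact = <T>(callback: (tx: ToyTx) => Promise<T>) => Promise<T>

async function addMember(
  transact: Transact,
  checkEntitlement: () => Promise<boolean>,
  syncWebhooks: () => Promise<void>,
  log: (message: string) => void
): Promise<'added' | 'denied'> {
  // 1. Hoisted: independent lookups run before any connection is held.
  const entitled = await checkEntitlement()
  if (!entitled) return 'denied'

  // 2. Transaction: only queries on the tx handle run while the connection is held.
  await transact(async (tx) => {
    await tx.write('membership')
  })

  // 3. Post-commit: external HTTP runs after the connection is released.
  //    A failure here is logged and does not undo the committed write.
  try {
    await syncWebhooks()
  } catch (error) {
    log(`webhook sync failed after commit: ${String(error)}`)
  }
  return 'added'
}
```

The tradeoff is the one noted above: the membership write and the webhook sync are no longer atomic, so a failed sync leaves the system eventually consistent rather than rolled back.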

Confidence Score: 4/5

Safe to merge. The core tripwire is well-tested, all 10 identified deadlock paths are addressed, and the design tradeoffs are clearly documented.

The structural changes are large but the individual diffs are mechanical. The most complex new piece — tx-tripwire.ts — is thoroughly exercised by its own test suite. The duplicate.ts authorization bypass when tx is provided relies on a documentation-only contract; the MCP tool sync loads workflow state outside the transaction leaving a narrow stale-state window; and the invitation enterprise-plan loop makes serial round-trips that could be parallelised.

apps/sim/lib/workflows/persistence/duplicate.ts (auth bypass via tx), apps/sim/lib/mcp/workflow-mcp-sync.ts (pre-tx state load), and apps/sim/lib/invitations/core.ts (serial entitlement lookups) deserve a second read.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/db/tx-tripwire.ts | New AsyncLocalStorage-based tripwire that wraps unsafe and begin on postgres-js root clients to detect nested pool checkouts inside transaction callbacks; well-tested with 10 unit tests covering throw/warn/off modes, nesting, and the runOutsideTransactionContext escape hatch. |
| packages/db/db.ts | Applies instrumentPoolClient to both db and dbReplica at initialisation; straightforward and correct. |
| apps/sim/lib/knowledge/chunks/service.ts | Content updates restructured to hoist embedding generation pre-transaction with a FOR UPDATE retry loop; logic is correct but can call the embedding API up to 3 times for a highly-contested chunk. |
| apps/sim/lib/workflows/persistence/duplicate.ts | Authorization moved pre-transaction to avoid global-pool checkouts; when tx is provided the auth block is skipped entirely with only a documentation contract; current callers are correct but the bypass is invisible to type-checking. |
| apps/sim/lib/mcp/workflow-mcp-sync.ts | Workflow state loading moved outside the transaction; discriminated-union SyncOptions correctly enforces state when tx is present, but the pre-tx load introduces a stale-state window for callers without a pinned version check. |
| apps/sim/lib/invitations/core.ts | Enterprise-plan entitlement pre-resolved per workspace before transaction; entitlement lookups are sequential (for...of await) rather than parallel; safe but slower for multi-workspace invitations. |
| apps/sim/lib/logs/execution/logger.ts | Billing context resolved before the advisory-locked transaction; new discriminated union in RecordUsageParams enforces that billingEntity/billingPeriod are present when tx is passed, eliminating the previous unguarded spread of billingContext ?? {}. |
| apps/sim/lib/billing/core/usage-log.ts | Type refined to a discriminated union requiring billingEntity/billingPeriod whenever tx is supplied; correctly encodes the pre-transaction resolution contract at the type level. |
| apps/sim/app/api/credential-sets/memberships/route.ts | Webhook sync moved post-commit with isolated error handling; intentional eventual-consistency tradeoff documented inline. |
| apps/sim/lib/table/validation.ts | checkUniqueConstraintsDb now accepts an optional executor (defaulting to db) so uniqueness checks inside an upsertRow transaction observe the transaction's own uncommitted rows. |
| apps/sim/lib/webhooks/deploy.ts | validateTriggerWebhookConfigForDeploy and credentialSetHasProviderCredential accept an executor: DbOrTx so deploy validation can run on the open transaction connection. |
| packages/db/tx-tripwire.test.ts | Comprehensive test suite covering throw/warn/off mode detection, nested transactions, the lazy-thenable escape hatch, and deduplication; well-structured with a reusable fake client. |
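
The discriminated-union contract mentioned for logger.ts and usage-log.ts can be sketched like this. The field names `billingEntity` and `billingPeriod` come from the summary above; the exact shapes, `ToyTx`, and `billingSource` are assumptions for illustration.

```typescript
// Hypothetical shapes; the real RecordUsageParams carries more fields.
interface ToyTx {
  readonly kind: 'tx'
}
interface BillingContext {
  billingEntity: string
  billingPeriod: string
}

// Either no tx (billing context may be looked up), or a tx plus pre-resolved context.
type RecordUsageParams =
  | { userId: string; tx?: undefined }
  | ({ userId: string; tx: ToyTx } & BillingContext)

function billingSource(params: RecordUsageParams): string {
  if (params.tx !== undefined) {
    // Narrowed to the tx branch: both billing fields are guaranteed to exist,
    // so no lookup against the global pool is needed while the tx is open.
    return `pre-resolved:${params.billingEntity}/${params.billingPeriod}`
  }
  return 'lookup'
}

// A call such as billingSource({ userId: 'u1', tx }) is rejected by the compiler,
// because the tx branch requires billingEntity and billingPeriod.
```

Encoding the rule in the type means a caller that passes `tx` without the pre-resolved context fails at compile time instead of silently falling back to a global-pool lookup inside the transaction.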

Sequence Diagram

sequenceDiagram
    participant Caller
    participant DB Pool
    participant TX Connection
    participant External

    note over Caller,External: Before this PR (deadlock-prone path)
    Caller->>DB Pool: db.transaction(callback)
    DB Pool->>TX Connection: checkout connection A
    TX Connection->>DB Pool: global pool query inside callback
    DB Pool--xTX Connection: waits for connection B (pool saturated, deadlock)

    note over Caller,External: After this PR (fixed paths)
    Caller->>External: 1. hoist: resolve billing context / generate embedding / check entitlement
    External-->>Caller: result
    Caller->>DB Pool: 2. db.transaction(callback)
    DB Pool->>TX Connection: checkout connection A
    TX Connection->>TX Connection: all queries via tx handle (same connection)
    TX Connection-->>DB Pool: commit/rollback, release A
    Caller->>External: 3. post-commit: webhook sync / fire-and-forget

    note over Caller,External: Tripwire (new)
    Caller->>DB Pool: instrumentPoolClient wraps unsafe+begin
    DB Pool->>DB Pool: AsyncLocalStorage marks tx context
    DB Pool--xDB Pool: report if unsafe called while context active (throw/warn/off)

Comments Outside Diff (1)

  1. apps/sim/lib/mcp/workflow-mcp-sync.ts, line 174-194 (link)

    P2 Stale-state window between pre-tx load and transaction open

    loadDeployedWorkflowState is now called outside the transaction. If the workflow is redeployed or undeployed between this call and the db.transaction(...) that starts on line 189, the MCP tools will be synced with the stale (old) state snapshot. syncMcpToolsIfStillActive in deployment-outbox.ts mitigates this via a version-ID check inside the transaction, but callers that go through the general syncMcpToolsForWorkflow path (without a pinned deploymentVersionId) have no equivalent guard.
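
    A guard of the kind described for syncMcpToolsIfStillActive might look like the sketch below. The function name `syncIfStillActive`, its parameters, and the `DeployedState` shape are hypothetical, and the model is synchronous for brevity; the point is only that the version read happens inside the transaction and is compared against the snapshot loaded before it.

```typescript
// Hypothetical snapshot loaded before the transaction opens.
interface DeployedState {
  versionId: string
  tools: string[]
}

// Re-checks, inside the transaction, that the pre-loaded snapshot is still the
// active deployment before applying it. `readActiveVersionInTx` is assumed to
// query through the tx handle.
function syncIfStillActive(
  loaded: DeployedState,
  readActiveVersionInTx: () => string | null,
  applyTools: (tools: string[]) => void
): 'synced' | 'skipped-stale' {
  const activeVersionId = readActiveVersionInTx()
  if (activeVersionId !== loaded.versionId) {
    // Redeployed or undeployed since the pre-tx load: do not apply old state.
    return 'skipped-stale'
  }
  applyTools(loaded.tools)
  return 'synced'
}
```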

Reviews (1): Last reviewed commit: "fix(db-part-1): eliminate pool self-dead..."

Comment thread apps/sim/lib/invitations/core.ts
Comment thread apps/sim/lib/invitations/core.ts
Comment thread apps/sim/lib/workflows/persistence/duplicate.ts
Comment on lines 344 to +380
if (updateData.content !== undefined && typeof updateData.content === 'string') {
return await db.transaction(async (tx) => {
// Get current chunk data for character count calculation and content comparison
const currentChunk = await tx
.select({
documentId: embedding.documentId,
content: embedding.content,
contentLength: embedding.contentLength,
tokenCount: embedding.tokenCount,
})
const content = updateData.content
const MAX_UPDATE_ATTEMPTS = 3

for (let attempt = 1; attempt <= MAX_UPDATE_ATTEMPTS; attempt++) {
const [preRead] = await db
.select({ documentId: embedding.documentId, content: embedding.content })
.from(embedding)
.where(eq(embedding.id, chunkId))
.limit(1)

if (currentChunk.length === 0) {
if (!preRead) {
throw new Error(`Chunk ${chunkId} not found`)
}

const oldContentLength = currentChunk[0].contentLength
const oldTokenCount = currentChunk[0].tokenCount
const content = updateData.content! // We know it's defined from the if check above
const newContentLength = content.length

// Only regenerate embedding if content actually changed
if (content !== currentChunk[0].content) {
logger.info(`[${requestId}] Content changed, regenerating embedding for chunk ${chunkId}`)

const kbRow = await tx
// The embedding is a function of the new content alone, so generating it
// outside the transaction is always valid.
let regenerated: { embedding: number[]; tokenCount: number } | null = null
if (content !== preRead.content) {
const kbRow = await db
.select({ embeddingModel: knowledgeBase.embeddingModel })
.from(knowledgeBase)
.innerJoin(document, eq(document.knowledgeBaseId, knowledgeBase.id))
.where(eq(document.id, currentChunk[0].documentId))
.where(eq(document.id, preRead.documentId))
.limit(1)
const chunkEmbeddingModel = kbRow[0]?.embeddingModel
if (!chunkEmbeddingModel) {
throw new Error('Knowledge base for chunk not found')
}

logger.info(`[${requestId}] Content changed, regenerating embedding for chunk ${chunkId}`)
const { embeddings } = await generateEmbeddings([content], chunkEmbeddingModel, workspaceId)
regenerated = {
embedding: embeddings[0],
tokenCount: estimateTokenCount(
content,
getEmbeddingModelInfo(chunkEmbeddingModel).tokenizerProvider

Contributor

P2 Up to MAX_UPDATE_ATTEMPTS embedding API calls per single user request

Each retry iteration may regenerate the embedding when content !== preRead.content. With MAX_UPDATE_ATTEMPTS = 3 and a highly-contested chunk, up to three external embedding API calls can be made for one user-initiated update. Each call generates tokens and incurs cost. Consider caching the embedding result across retries: if content hasn't changed between retries, reuse the previously generated regenerated value rather than discarding it.
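
A sketch of the caching idea, under the assumption that the embedding depends only on the new content (as the code comment in the diff states). `updateWithCachedEmbedding`, `embed`, and `tryWrite` are hypothetical names; `tryWrite` stands in for the locked transactional write and returns false when the attempt must be retried.

```typescript
interface Embedded {
  embedding: number[]
  tokenCount: number
}

// Retries the write up to `maxAttempts` times but calls the embedding API at most once.
// Returns the attempt number that succeeded.
async function updateWithCachedEmbedding(
  content: string,
  maxAttempts: number,
  embed: (content: string) => Promise<Embedded>,
  tryWrite: (embedded: Embedded) => Promise<boolean>
): Promise<number> {
  let cached: Embedded | null = null
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // The new content is fixed for the whole request, so the first result stays valid.
    if (cached === null) cached = await embed(content)
    if (await tryWrite(cached)) return attempt
  }
  throw new Error(`update did not succeed after ${maxAttempts} attempts`)
}
```

This simplification drops the `content !== preRead.content` check from the original loop, so it would embed even when the stored content already matches; a fuller version would keep that check and only consult the cache when regeneration is actually needed.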

Collaborator Author

known tradeoff

@icecrasher321

Collaborator Author

bugbot run

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit a1ddd02.

Comment thread apps/sim/lib/knowledge/chunks/service.ts
icecrasher321 merged commit f7b40fe into staging on Jun 12, 2026
15 checks passed
TheodoreSpeaks added a commit that referenced this pull request Jun 12, 2026
…row-deletes

Migration renumbered 0232 -> 0233 (staging took 0232 for BYOK keys);
snapshot regenerated, hand-written SQL preserved, zero drift.
checkUniqueConstraintsDb reconciles staging's executor param (pool
self-deadlock fix #4975) with the tenant-bounded planner flag: own
transaction only when given plain db, SET LOCAL on the caller's
transaction otherwise. process-contents test keeps relying on global
mocks (now incl. dbReplica). Route baseline 815 (+2 staging tools).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@waleedlatif1 waleedlatif1 deleted the improvement/db-pattern branch June 12, 2026 01:56