fix(db-part-1): eliminate pool self-deadlock from nested checkouts inside transactions#4975
Conversation
…side transactions
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
PR SummaryHigh Risk Overview
Refactors across sim: Enterprise auto-add entitlement and billing context for Docs: Team/Enterprise workspace limits now state org-owned shared workspaces (Owners/Admins create; Members cannot) and clarify Enterprise vs Team seat behavior on invites. Reviewed by Cursor Bugbot for commit a1ddd02. Configure here. |
Greptile SummaryThis PR eliminates 10 identified paths where code inside
Confidence Score: 4/5Safe to merge. The core tripwire is well-tested, all 10 identified deadlock paths are addressed, and the design tradeoffs are clearly documented. The structural changes are large but the individual diffs are mechanical. The most complex new piece — tx-tripwire.ts — is thoroughly exercised by its own test suite. The duplicate.ts authorization bypass when tx is provided relies on a documentation-only contract; the MCP tool sync loads workflow state outside the transaction leaving a narrow stale-state window; and the invitation enterprise-plan loop makes serial round-trips that could be parallelised. apps/sim/lib/workflows/persistence/duplicate.ts (auth bypass via tx), apps/sim/lib/mcp/workflow-mcp-sync.ts (pre-tx state load), and apps/sim/lib/invitations/core.ts (serial entitlement lookups) deserve a second read. Important Files Changed
Sequence DiagramsequenceDiagram
participant Caller
participant DB Pool
participant TX Connection
participant External
note over Caller,External: Before this PR (deadlock-prone path)
Caller->>DB Pool: db.transaction(callback)
DB Pool->>TX Connection: checkout connection A
TX Connection->>DB Pool: global pool query inside callback
DB Pool--xTX Connection: waits for connection B (pool saturated, deadlock)
note over Caller,External: After this PR (fixed paths)
Caller->>External: 1. hoist: resolve billing context / generate embedding / check entitlement
External-->>Caller: result
Caller->>DB Pool: 2. db.transaction(callback)
DB Pool->>TX Connection: checkout connection A
TX Connection->>TX Connection: all queries via tx handle (same connection)
TX Connection-->>DB Pool: commit/rollback, release A
Caller->>External: 3. post-commit: webhook sync / fire-and-forget
note over Caller,External: Tripwire (new)
Caller->>DB Pool: instrumentPoolClient wraps unsafe+begin
DB Pool->>DB Pool: AsyncLocalStorage marks tx context
DB Pool--xDB Pool: report if unsafe called while context active (throw/warn/off)
|
| if (updateData.content !== undefined && typeof updateData.content === 'string') { | ||
| return await db.transaction(async (tx) => { | ||
| // Get current chunk data for character count calculation and content comparison | ||
| const currentChunk = await tx | ||
| .select({ | ||
| documentId: embedding.documentId, | ||
| content: embedding.content, | ||
| contentLength: embedding.contentLength, | ||
| tokenCount: embedding.tokenCount, | ||
| }) | ||
| const content = updateData.content | ||
| const MAX_UPDATE_ATTEMPTS = 3 | ||
|
|
||
| for (let attempt = 1; attempt <= MAX_UPDATE_ATTEMPTS; attempt++) { | ||
| const [preRead] = await db | ||
| .select({ documentId: embedding.documentId, content: embedding.content }) | ||
| .from(embedding) | ||
| .where(eq(embedding.id, chunkId)) | ||
| .limit(1) | ||
|
|
||
| if (currentChunk.length === 0) { | ||
| if (!preRead) { | ||
| throw new Error(`Chunk ${chunkId} not found`) | ||
| } | ||
|
|
||
| const oldContentLength = currentChunk[0].contentLength | ||
| const oldTokenCount = currentChunk[0].tokenCount | ||
| const content = updateData.content! // We know it's defined from the if check above | ||
| const newContentLength = content.length | ||
|
|
||
| // Only regenerate embedding if content actually changed | ||
| if (content !== currentChunk[0].content) { | ||
| logger.info(`[${requestId}] Content changed, regenerating embedding for chunk ${chunkId}`) | ||
|
|
||
| const kbRow = await tx | ||
| // The embedding is a function of the new content alone, so generating it | ||
| // outside the transaction is always valid. | ||
| let regenerated: { embedding: number[]; tokenCount: number } | null = null | ||
| if (content !== preRead.content) { | ||
| const kbRow = await db | ||
| .select({ embeddingModel: knowledgeBase.embeddingModel }) | ||
| .from(knowledgeBase) | ||
| .innerJoin(document, eq(document.knowledgeBaseId, knowledgeBase.id)) | ||
| .where(eq(document.id, currentChunk[0].documentId)) | ||
| .where(eq(document.id, preRead.documentId)) | ||
| .limit(1) | ||
| const chunkEmbeddingModel = kbRow[0]?.embeddingModel | ||
| if (!chunkEmbeddingModel) { | ||
| throw new Error('Knowledge base for chunk not found') | ||
| } | ||
|
|
||
| logger.info(`[${requestId}] Content changed, regenerating embedding for chunk ${chunkId}`) | ||
| const { embeddings } = await generateEmbeddings([content], chunkEmbeddingModel, workspaceId) | ||
| regenerated = { | ||
| embedding: embeddings[0], | ||
| tokenCount: estimateTokenCount( | ||
| content, | ||
| getEmbeddingModelInfo(chunkEmbeddingModel).tokenizerProvider |
There was a problem hiding this comment.
Up to
MAX_UPDATE_ATTEMPTS embedding API calls per single user request
Each retry iteration may regenerate the embedding when content !== preRead.content. With MAX_UPDATE_ATTEMPTS = 3 and a highly-contested chunk, up to three external embedding API calls can be made for one user-initiated update. Each call generates tokens and incurs cost. Consider caching the embedding result across retries: if content hasn't changed between retries, reuse the previously generated regenerated value rather than discarding it.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
There was a problem hiding this comment.
known tradeoff
|
bugbot run |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit a1ddd02. Configure here.
…row-deletes Migration renumbered 0232 -> 0233 (staging took 0232 for BYOK keys); snapshot regenerated, hand-written SQL preserved, zero drift. checkUniqueConstraintsDb reconciles staging's executor param (pool self-deadlock fix #4975) with the tenant-bounded planner flag: own transaction only when given plain db, SET LOCAL on the caller's transaction otherwise. process-contents test keeps relying on global mocks (now incl. dbReplica). Route baseline 815 (+2 staging tools). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Summary
Fix 10 paths where code inside db.transaction callbacks queried the global
postgres-js pool instead of the tx handle — at saturation, every held
connection waits on a second checkout and the pool deadlocks silently
Thread the tx executor where reads need transaction consistency (table
upsert uniqueness, deploy validation, credential-ID migration, name dedup)
Hoist independent work pre-tx (auth checks, billing context, enterprise
entitlement, embedding generation) and move credential-set webhook sync
post-commit so external HTTP never runs on a held connection
Add a runtime tripwire in @sim/db: AsyncLocalStorage-instrumented client
detects any global-pool query inside a tx callback at any call depth —
throws in dev/test, rate-limit-logs in prod (DB_TX_TRIPWIRE to override),
with runOutsideTransactionContext() as the deliberate escape hatch
Type of Change
Testing
Tested manually
Checklist