Skip to content

improvement(knowledge): batch trigger dispatch, prune redundant DB roundtrips#4680

Merged
waleedlatif1 merged 5 commits into
stagingfrom
waleedlatif1/connector-pipeline-audit
May 20, 2026
Merged

improvement(knowledge): batch trigger dispatch, prune redundant DB roundtrips#4680
waleedlatif1 merged 5 commits into
stagingfrom
waleedlatif1/connector-pipeline-audit

Conversation

@waleedlatif1
Copy link
Copy Markdown
Collaborator

Summary

  • processDocumentsWithQueue now uses tasks.batchTrigger when Trigger.dev is available — collapses N HTTP roundtrips to ceil(N / 1000) per dispatch (a 100-doc connector sync drops from 100 calls to 1).
  • Adds idempotency keys (doc-process-${documentId}-${requestId}) so retried dispatches don't double-enqueue.
  • Per-batch isConnectorDeleted + isKnowledgeBaseDeleted checks merged into one checkSyncLiveness JOIN — one SELECT instead of two per sync batch.
  • Dropped pre-upload isKnowledgeBaseDeleted checks from addDocument / updateDocument: the batch-boundary liveness check already catches pre-batch deletions, and the in-tx SELECT … FOR UPDATE is the authoritative point for mid-batch races.
  • Deleted dead processDocumentsWithTrigger helper.

Type of Change

  • Performance improvement

Testing

  • All 208 KB unit tests pass (lib/knowledge/**, app/api/knowledge/**).
  • No behavioral change in direct (non-Trigger.dev) mode — that path still uses per-doc parallel dispatch via dispatchDocumentProcessingJob.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

…undtrips

Connector sync was dispatching Trigger.dev document-processing jobs one
HTTP roundtrip at a time. processDocumentsWithQueue now uses
tasks.batchTrigger when Trigger.dev is available, collapsing N roundtrips
to ceil(N/1000). Idempotency keys protect against duplicate runs on retry.

Also trims DB roundtrips inside the sync loop:
- Per-batch isConnectorDeleted + isKnowledgeBaseDeleted collapsed into a
  single checkSyncLiveness JOIN (one SELECT instead of two per batch).
- Dropped redundant pre-upload isKnowledgeBaseDeleted checks from
  addDocument/updateDocument: the batch-boundary liveness check already
  catches pre-batch deletions and the in-tx FOR UPDATE is authoritative
  for races during the batch.
- Removed dead processDocumentsWithTrigger helper (never called).
@vercel
Copy link
Copy Markdown

vercel Bot commented May 20, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped May 20, 2026 8:31pm

Request Review

@cursor
Copy link
Copy Markdown

cursor Bot commented May 20, 2026

PR Summary

Medium Risk
Moderate risk because it changes background job dispatch semantics (batching, idempotency, partial-failure handling) and alters sync-time deletion checks, which could affect processing reliability under failure or deletion races.

Overview
Reduces connector sync DB roundtrips and improves processing dispatch throughput. The per-batch isConnectorDeleted + isKnowledgeBaseDeleted checks are replaced with a single checkSyncLiveness JOIN query, and sync now uses that combined result at batch boundaries and before stuck-doc retries.

Batches Trigger.dev processing enqueues. processDocumentsWithQueue now uses tasks.batchTrigger (chunked to 1,000) with per-document idempotency keys and logs partial failures, falling back to in-process execution when Trigger.dev isn’t available; the older per-doc dispatch helper and the unused processDocumentsWithTrigger are removed.

Prunes redundant pre-upload KB deletion checks. addDocument/updateDocument no longer do a separate pre-check for deleted knowledge bases, relying on the transactional FOR UPDATE liveness check instead.

Reviewed by Cursor Bugbot for commit a2cf183. Configure here.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 20, 2026

Greptile Summary

This PR tightens the document-processing dispatch pipeline in two ways: it collapses per-document Trigger.dev tasks.trigger calls into chunked tasks.batchTrigger calls (cutting N HTTP roundtrips to ceil(N/1000)), and it merges the two per-batch DB liveness probes into a single JOIN query.

  • dispatchViaBatchTrigger replaces the old per-doc tasks.trigger fan-out with tasks.batchTrigger, adds idempotency keys scoped to (documentId, requestId), and collects batchIds for dashboard traceability.
  • checkSyncLiveness unifies the isConnectorDeleted + isKnowledgeBaseDeleted round-trips into one INNER JOIN query; both per-loop and post-loop call sites are updated to use the combined result.
  • Pre-upload isKnowledgeBaseDeleted guards in addDocument/updateDocument are dropped; the transactional SELECT … FOR UPDATE in isKnowledgeBaseActiveInTx remains and is the authoritative mid-batch guard.

Confidence Score: 5/5

Safe to merge — the dispatch and liveness-check refactors are logically equivalent to the paths they replace, and the transactional FOR UPDATE guard that matters for correctness is untouched.

Both changed code paths (batch dispatch and liveness JOIN) preserve the observable contract: the same exceptions are thrown on deletion, the same documents land in Trigger.dev, and the direct-mode fallback calls the identical processDocumentAsync signature. The idempotency keys and typed batchTrigger generics are additive improvements with no downside risk.

No files require special attention.

Important Files Changed

Filename Overview
apps/sim/lib/knowledge/documents/service.ts Replaces per-doc trigger fan-out with chunked batchTrigger; adds typed generics, idempotency keys, and batchId logging. Direct (non-Trigger) path refactored into dispatchInProcess with identical semantics to the old code.
apps/sim/lib/knowledge/connectors/sync-engine.ts Merges isConnectorDeleted + isKnowledgeBaseDeleted into a single JOIN query (checkSyncLiveness). Removes redundant pre-upload KB checks; transactional FOR UPDATE guard in addDocument/updateDocument is retained.

Reviews (2): Last reviewed commit: "improvement(knowledge): thread requestId..." | Re-trigger Greptile

Comment thread apps/sim/lib/knowledge/documents/service.ts Outdated
Comment thread apps/sim/lib/knowledge/documents/service.ts Outdated
- Use the canonical DocumentProcessingPayload from the task module instead
  of the duplicate DocumentJobData interface in service.ts
- Pass typeof processDocumentTask as a generic to tasks.batchTrigger so the
  payload shape is type-checked against the task definition
- Inline TRIGGER_BATCH_SIZE provenance (Trigger.dev SDK 4.3.1+ doc'd cap,
  we're on 4.4.3)
- Split direct vs trigger dispatch into dispatchInProcess and
  dispatchViaBatchTrigger; collapse the all-failed throw into a single
  check on the combined dispatched counter
- Remove dispatchDocumentProcessingJob — its trigger branch is no longer
  reachable now that batchTrigger handles the trigger path, and the direct
  branch is inlined
tasks.batchTrigger returns a batchId per call. Collecting and logging
them after dispatch makes it possible to look up or cancel batches in
the Trigger.dev dashboard when investigating stuck or missing documents.
@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@greptile

@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@cursor review

Symmetry polish: dispatchInProcess now includes [requestId] in its error
log so direct-mode failures are correlatable the same way trigger-mode
failures already are.
Tightens TSDoc on processDocumentsWithQueue, TRIGGER_BATCH_SIZE,
checkSyncLiveness, and the idempotency-key inline comment.
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit a2cf183. Configure here.

@waleedlatif1 waleedlatif1 merged commit a1b2130 into staging May 20, 2026
13 checks passed
@waleedlatif1 waleedlatif1 deleted the waleedlatif1/connector-pipeline-audit branch May 20, 2026 20:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant