improvement(cleanup): batchTrigger fan-out, chunked queries, batched S3, faster outlier drain #4688

Merged
waleedlatif1 merged 2 commits into staging from waleedlatif1/trigger-cleanup-larger-machine
May 21, 2026

@waleedlatif1
Collaborator

Summary

  • Fan cleanup-tasks/logs/soft-deletes out via tasks.batchTrigger (500 ws/chunk); bump to large-1x with concurrencyLimit: 5
  • Chunk bulk DELETEs (1000 IDs/stmt) and collectChatFiles JSONB SELECT (500 chats/stmt) to bound worker memory and lock duration
  • Replace per-key position() table scans with one LATERAL unnest scan per 200-key chunk
  • Route storage deletes through StorageService.deleteFiles (S3 DeleteObjects: 1000 keys/HTTP)
  • Raise per-run row cap to 100K so long-tail tenants (one prod workspace has 723K doomed rows) drain in days, not weeks
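
To make the chunk sizes and the drain estimate above concrete, here is a minimal sketch. `chunkArray` shares its name with the helper the PR adds in batch-delete.ts, but the body below is an assumption rather than the merged code, and `runsToDrain` is a hypothetical helper that assumes a run deletes at most the cap for a given tenant.

```typescript
// Hypothetical helper: split a list into fixed-size chunks (last chunk may be shorter).
function chunkArray<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new Error(`chunk size must be a positive integer, got ${size}`)
  }
  const chunks: T[][] = []
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size))
  }
  return chunks
}

// Hypothetical helper: minimum number of runs needed to drain a backlog
// when a single run deletes at most `cap` rows.
function runsToDrain(backlogRows: number, cap: number): number {
  return Math.ceil(backlogRows / cap)
}

// 2,500 IDs at 1000 IDs/stmt become three DELETE statements (1000, 1000, 500).
const ids = Array.from({ length: 2500 }, (_, i) => `row-${i}`)
const deleteChunks = chunkArray(ids, 1000)

// 723K doomed rows at a 100K cap need at least 8 runs.
const runs = runsToDrain(723_000, 100_000)
```

How many days that takes depends on how often the cleanup cron fires, which this page does not state.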

Type of Change

  • Improvement

Testing

  • Verified position-query SQL rewrite returns identical results to original against local Postgres with seeded data
  • tsc, biome check, check:api-validation all pass
  • 98 adjacent tests pass (uploads, snapshot service, billing core)
  • Trigger.dev batchTrigger usage validated against official docs (SDK 4.4.3, all options within documented caps)

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)


@cursor

cursor Bot commented May 21, 2026

PR Summary

Medium Risk
Touches retention job dispatching and bulk deletion paths (DB + storage), so misconfiguration or logic errors could lead to missed cleanups or excessive load, but changes are primarily batching/throughput controls.

Overview
Retention cleanup jobs are now dispatched and executed in fixed-size chunks: dispatchCleanupJobs pre-resolves workspaces/retention, fans out via tasks.batchTrigger (or queue fallback), and updates payloads to carry workspaceIds, retentionHours, and label (with a runGlobalHousekeeping flag for one-off plan-wide work).

Cleanup execution is tuned for scale and bounded resource usage: cleanup tasks run on large-1x with concurrencyLimit: 5, per-run DB delete capacity is increased (100K cap), explicit-ID deletes are chunked (1000/statement), chat file collection is chunked (500 chats/query), and large-value reference checks in cleanup-logs are rewritten to chunked unnest scans instead of per-key queries.

Storage deletion is batched: cleanup flows now group keys by storage context and call StorageService.deleteFiles, which adds provider-aware bulk deletion (S3 DeleteObjects in 1000-key requests, otherwise bounded-concurrency per-file) and surfaces per-key failures for logging.
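
The bounded-concurrency per-file path for providers without a bulk delete call can be sketched roughly as below. This is an illustration only: `deleteWithBoundedConcurrency`, `DeleteOne`, and the result shape are hypothetical names, and the per-file delete is injected so the example does not depend on any storage SDK.

```typescript
interface DeleteFailure {
  key: string
  error: string
}

interface DeleteResult {
  deleted: number
  failed: DeleteFailure[]
}

// Hypothetical per-file delete, injected so the sketch stays provider-agnostic.
type DeleteOne = (key: string) => Promise<void>

async function deleteWithBoundedConcurrency(
  keys: readonly string[],
  deleteOne: DeleteOne,
  concurrency: number
): Promise<DeleteResult> {
  const failed: DeleteFailure[] = []
  let deleted = 0
  let next = 0

  // Each worker claims the next unprocessed key until none remain,
  // so at most `concurrency` deletes are in flight at once.
  async function worker(): Promise<void> {
    while (next < keys.length) {
      const key = keys[next++]
      try {
        await deleteOne(key)
        deleted++
      } catch (err) {
        // Record the failure per key instead of aborting the whole batch.
        failed.push({ key, error: err instanceof Error ? err.message : String(err) })
      }
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, keys.length))
  await Promise.all(Array.from({ length: workerCount }, () => worker()))
  return { deleted, failed }
}
```

Collecting failures per key rather than rejecting on the first error is what lets the caller log exactly which objects were left behind.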

Reviewed by Cursor Bugbot for commit ddbdacb.

@greptile-apps
Contributor

greptile-apps Bot commented May 21, 2026

Greptile Summary

This PR overhauls the cleanup pipeline for scale: workspace chunks are fanned out via tasks.batchTrigger, bulk DELETEs and JSONB SELECTs are chunked to bound lock duration and memory, the per-key position() scan loop is replaced with a single LATERAL unnest pass per 200-key chunk, and S3 storage deletes are routed through the new StorageService.deleteFiles (DeleteObjects) batch API.

  • Dispatcher (cleanup-dispatcher.ts): replaces sequential jobQueue.enqueue calls with tasks.batchTrigger; adds 500-workspace chunks with per-chunk labels (free/1, free/2); runGlobalHousekeeping is pinned to the first matching chunk.
  • Batch-delete primitives (batch-delete.ts): new chunkArray, selectRowsByIdChunks, and deleteRowsById helpers; per-run row cap raised to 100K; delete chunks capped at 1000 IDs to bound FK-trigger queue length.
  • Storage (storage-service.ts, s3/client.ts): new deleteFiles uses DeleteObjects (1000 keys/HTTP) for S3 and falls back to a 25-worker concurrent loop for Blob/local.
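
A rough sketch of the dispatcher's chunk-building step, under stated assumptions: the payload field names come from the summaries on this page, but `PlanGroup`, the `housekeepingPlan` parameter, and the rule that a `/n` suffix is added only when a plan spans several chunks are guesses, not the merged `buildCleanupChunks`.

```typescript
// Payload fields named in the PR summaries; the exact types are assumptions.
interface CleanupPayload {
  workspaceIds: string[]
  retentionHours: number
  label: string
  runGlobalHousekeeping?: boolean
}

// Hypothetical input shape: workspaces already grouped by plan with resolved retention.
interface PlanGroup {
  plan: string
  retentionHours: number
  workspaceIds: string[]
}

const WORKSPACES_PER_CHUNK = 500

function buildCleanupChunks(
  groups: readonly PlanGroup[],
  housekeepingPlan: string // assumption: "matching" means matching this plan
): CleanupPayload[] {
  const payloads: CleanupPayload[] = []
  let housekeepingAssigned = false

  for (const group of groups) {
    const total = Math.ceil(group.workspaceIds.length / WORKSPACES_PER_CHUNK)
    for (let i = 0; i < total; i++) {
      const payload: CleanupPayload = {
        workspaceIds: group.workspaceIds.slice(
          i * WORKSPACES_PER_CHUNK,
          (i + 1) * WORKSPACES_PER_CHUNK
        ),
        retentionHours: group.retentionHours,
        // Assumed labeling rule: suffix only when the plan splits into several chunks.
        label: total > 1 ? `${group.plan}/${i + 1}` : group.plan,
      }
      // Pin the one-off housekeeping flag to the first matching chunk only.
      if (!housekeepingAssigned && group.plan === housekeepingPlan) {
        payload.runGlobalHousekeeping = true
        housekeepingAssigned = true
      }
      payloads.push(payload)
    }
  }
  return payloads
}
```

Each resulting payload would then become one item in the tasks.batchTrigger call described above.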

Confidence Score: 5/5

Safe to merge; all changed paths are background cleanup jobs with no user-facing data mutations, and the chunking logic is correct.

The LATERAL unnest SQL rewrite is logically equivalent to the original per-key loop and handles multi-batch cross-workspace key sharing correctly. The new deleteRowsById and selectRowsByIdChunks primitives are well-bounded. The two observations flagged are low-probability edge cases that do not affect correctness under normal operating conditions.
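
For readers unfamiliar with the pattern, the sketch below shows one way a per-chunk LATERAL unnest scan can replace a per-key loop. The table and column names (`logs`, `id`, `payload`), the `Query` signature, and the function name are all hypothetical; the PR's actual SQL is not shown on this page.

```typescript
// Hypothetical schema: a `logs` table with a text `id` and a JSON `payload` column.
// One statement scans the retained rows once and tests every key in the chunk
// against each row, instead of issuing one full scan per key.
const REFERENCED_KEYS_SQL = `
  SELECT DISTINCT k.key
  FROM logs AS l
  CROSS JOIN LATERAL unnest($1::text[]) AS k(key)
  WHERE l.id <> ALL($2::text[])
    AND position(k.key in l.payload::text) > 0
`

// Injected query function so the sketch does not assume a particular driver or ORM.
type Query = (sql: string, params: unknown[]) => Promise<{ key: string }[]>

async function findStillReferencedKeys(
  candidateKeys: readonly string[],
  deletedLogIds: readonly string[],
  query: Query,
  chunkSize = 200
): Promise<Set<string>> {
  const referenced = new Set<string>()
  for (let i = 0; i < candidateKeys.length; i += chunkSize) {
    const chunk = candidateKeys.slice(i, i + chunkSize)
    // Rows being deleted are excluded, so only retained rows count as references.
    const rows = await query(REFERENCED_KEYS_SQL, [chunk, deletedLogIds])
    for (const row of rows) referenced.add(row.key)
  }
  return referenced
}
```

Keys returned by this function are still referenced by retained rows and must be kept; only the remainder would be safe to delete from storage.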

cleanup-dispatcher.ts has a minor jobCount semantic change worth confirming with the monitoring team; cleanup-tasks.ts has a theoretical run-child ordering edge case only reachable with more than 100K eligible runs per workspace chunk.

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/sim/lib/billing/cleanup-dispatcher.ts | Rewrites dispatch to use tasks.batchTrigger with 500-workspace/chunk fan-out; jobCount in the return value now reflects the number of batchTrigger API calls (typically 1), not the number of task runs triggered |
| apps/sim/lib/cleanup/batch-delete.ts | Adds chunked ID-list DELETE (deleteRowsById), SELECT helper (selectRowsByIdChunks), and raises the per-run cap to 100K; well-guarded with accurate upper-bound failure semantics |
| apps/sim/background/cleanup-logs.ts | Replaces N per-key position() scans with a LATERAL unnest scan per 200-key chunk; correctness verified: deletedLogIds are excluded so only retained rows are checked for references |
| apps/sim/background/cleanup-tasks.ts | Pre-selects doomed chat IDs for both copilot backend cleanup and DB deletion; run children deleted before parent runs to respect FK ordering |
| apps/sim/lib/uploads/core/storage-service.ts | Adds deleteFiles() using S3 DeleteObjects for batch deletes; Blob path falls back to bounded-concurrency per-file loop; correctly exported via export * as StorageService |
| apps/sim/lib/uploads/providers/s3/client.ts | Adds deleteManyFromS3 with 1000-key chunking and Quiet: true; correctly collects per-key errors from response.Errors and network-level errors separately |
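
The separation between per-key errors and request-level errors described for deleteManyFromS3 can be illustrated as follows. This is a sketch, not the merged function: `deleteManyInBatches` and `DeleteBatch` are hypothetical names, and the bulk request is injected as a plain function so the example runs without the AWS SDK.

```typescript
interface KeyError {
  key: string
  message: string
}

// Stand-in for one bulk delete request (for S3, a DeleteObjects call in quiet mode).
// Resolves with the per-key errors the service reported; rejects on a request-level failure.
type DeleteBatch = (keys: string[]) => Promise<KeyError[]>

// S3 accepts at most 1000 keys per DeleteObjects request.
const MAX_KEYS_PER_REQUEST = 1000

async function deleteManyInBatches(
  keys: readonly string[],
  deleteBatch: DeleteBatch
): Promise<{ deleted: number; failed: KeyError[] }> {
  const failed: KeyError[] = []
  let deleted = 0

  for (let i = 0; i < keys.length; i += MAX_KEYS_PER_REQUEST) {
    const batch = keys.slice(i, i + MAX_KEYS_PER_REQUEST)
    try {
      // In quiet mode the response lists only the keys that failed.
      const errors = await deleteBatch(batch)
      failed.push(...errors)
      deleted += batch.length - errors.length
    } catch (err) {
      // The whole request failed (network, auth, throttling): every key in it is unconfirmed.
      const message = err instanceof Error ? err.message : String(err)
      for (const key of batch) failed.push({ key, message })
    }
  }
  return { deleted, failed }
}
```

A failed request does not stop later batches, so one transient error costs at most 1000 keys that the next cleanup run can retry.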

Sequence Diagram

sequenceDiagram
    participant Cron as Cron Route
    participant Dispatcher as cleanup-dispatcher
    participant Trigger as Trigger.dev batchTrigger
    participant Task as cleanup-* task (xN)
    participant DB as Postgres
    participant S3 as S3 DeleteObjects

    Cron->>Dispatcher: dispatchCleanupJobs(jobType)
    Dispatcher->>DB: listActiveWorkspaceCleanupScopeRows()
    Dispatcher->>DB: resolvePersonalPlanTypes / getOrgSubscription
    Dispatcher->>Dispatcher: buildCleanupChunks() 500 ws/chunk
    Dispatcher->>Trigger: tasks.batchTrigger up to 1000 payloads
    Trigger-->>Dispatcher: batchId
    Dispatcher-->>Cron: jobIds chunkCount workspaceCount

    Note over Task: Runs concurrently concurrencyLimit 5
    Task->>DB: selectRowsByIdChunks 50 batches x 2000 rows
    Task->>DB: chunkedBatchDelete onBatch filterLargeValueKeys LATERAL unnest 200 keys/chunk
    Task->>S3: StorageService.deleteFiles deleteManyFromS3 1000 keys/HTTP
    Task->>DB: DELETE WHERE id IN chunkIds 1000 IDs/stmt

Reviews (2): Last reviewed commit: "improvement(cleanup): chunk-index labels..."

Comment thread apps/sim/lib/billing/cleanup-dispatcher.ts Outdated
Comment thread apps/sim/lib/cleanup/batch-delete.ts
… counter

Addresses Greptile review feedback:
- Disambiguate downstream logs when a plan splits into multiple workspace chunks (e.g. 'free/1', 'free/2')
- Document that deleteRowsById's failed counter is an upper bound (chunk rolls back to 0 deletes on error)
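
One reading of the upper-bound note, sketched under assumptions: each chunk is a single statement that either commits or rolls back entirely, so a failed chunk adds its full length to `failed` even if some of those IDs no longer matched a row. `deleteRowsByIdSketch` and `DeleteChunk` are hypothetical names, not the code in batch-delete.ts.

```typescript
// Hypothetical executor: runs one `DELETE ... WHERE id IN (chunk)` and resolves with rows deleted.
type DeleteChunk = (ids: string[]) => Promise<number>

async function deleteRowsByIdSketch(
  ids: readonly string[],
  deleteChunk: DeleteChunk,
  chunkSize = 1000
): Promise<{ deleted: number; failed: number }> {
  let deleted = 0
  let failed = 0

  for (let i = 0; i < ids.length; i += chunkSize) {
    const chunk = ids.slice(i, i + chunkSize)
    try {
      // Count what the database reports, which can be less than chunk.length
      // when some IDs were already gone.
      deleted += await deleteChunk(chunk)
    } catch {
      // The statement rolled back, so nothing in this chunk was deleted.
      // chunk.length is an upper bound on the rows actually left behind.
      failed += chunk.length
    }
  }
  return { deleted, failed }
}
```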
@waleedlatif1
Collaborator Author

@greptile

@waleedlatif1
Collaborator Author

@cursor review

@cursor cursor Bot left a comment

✅ Bugbot reviewed your changes and found no new issues!


@waleedlatif1 waleedlatif1 merged commit 11ad891 into staging May 21, 2026
14 checks passed
@waleedlatif1 waleedlatif1 deleted the waleedlatif1/trigger-cleanup-larger-machine branch May 21, 2026 02:51