feat(webapp): plan-aware compute migration by nicktrn · Pull Request #3957 · triggerdotdev/trigger.dev

nicktrn · 2026-06-15T16:31:28Z

Adds an opt-in mechanism to route a configurable percentage of organizations onto the compute (MicroVM) backing of their region at trigger time, without changing their stored region settings.

Routing is gated by three global feature flags - computeMigrationEnabled, computeMigrationFreePercentage, computeMigrationPaidPercentage - plus a per-org computeMigrationEnabled override that wins in both directions. A region's compute backing is resolved from a new WorkerInstanceGroup.region column: a container group and its MicroVM group share one geo region, so the migration swaps the resolved worker queue to the backing group's queue. Orgs are bucketed deterministically by id, so ramping a percentage down keeps a strict subset rather than reshuffling, and a region with no compute backing is never touched. Everything is off by default - behaviour is unchanged unless the flags are set.
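A minimal sketch of the bucketing and enrollment decision described above, not the code in this PR. The flag names come from the description; the function bodies, the `PlanType` union, the bucket count of 100, and the assumption that the per-org override also beats the global kill switch are all illustrative.

```typescript
// Hypothetical sketch: flag names mirror the PR text, everything else is assumed.
type PlanType = "free" | "paid";

interface MigrationFlags {
  computeMigrationEnabled: boolean; // global kill switch
  computeMigrationFreePercentage: number; // 0-100
  computeMigrationPaidPercentage: number; // 0-100
}

// 32-bit FNV-1a over the id's UTF-16 code units (identical to the byte-wise
// hash for ASCII ids), reduced to a bucket in [0, 100).
function hashBucket(id: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < id.length; i++) {
    hash ^= id.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep unsigned
  }
  return hash % 100;
}

function isOrgMigrated(
  orgId: string,
  plan: PlanType,
  flags: MigrationFlags,
  orgOverride?: boolean
): boolean {
  // Assumption: the per-org override wins in both directions, even over the kill switch.
  if (orgOverride !== undefined) return orgOverride;
  if (!flags.computeMigrationEnabled) return false;
  const percentage =
    plan === "free"
      ? flags.computeMigrationFreePercentage
      : flags.computeMigrationPaidPercentage;
  // The bucket is fixed per org, so lowering the percentage only removes orgs
  // from the enrolled set and never reshuffles it.
  return hashBucket(orgId) < percentage;
}
```

Because an org's bucket never changes, the set enrolled at 10% is always contained in the set enrolled at 50%, which is the "strict subset" property mentioned above.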

The flags and the worker-region groups are read on the trigger hot path from in-memory snapshots rather than the database: a small createReloadingRegistry helper loads each at startup and refreshes them on an interval, so no per-trigger query is added and a percentage or kill-switch change propagates within the reload interval. A cold replica that hasn't loaded yet falls back to off (the container path). The same migration decision is consulted at deploy-time template creation so a migrated org still gets a compute template built, in shadow mode so it never fails the deploy.
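The snapshot pattern above could look roughly like the following. This is a sketch under assumptions: the real `createReloadingRegistry` signature is not shown in this thread, so the options object, the `get`/`reload`/`stop` methods, and the error handling are hypothetical. Metrics, startup retries, and the readiness timeout are left out.

```typescript
// Hypothetical sketch of an in-memory registry that refreshes on an interval.
interface ReloadingRegistry<T> {
  get(): T | undefined; // undefined until the first successful load
  reload(): Promise<void>;
  stop(): void;
}

function createReloadingRegistry<T>(opts: {
  load: () => Promise<T>;
  intervalMs: number;
}): ReloadingRegistry<T> {
  let snapshot: T | undefined;
  let nextSeq = 0; // sequence number handed to each reload as it starts
  let appliedSeq = 0; // sequence number of the reload whose result is live

  async function reload(): Promise<void> {
    const seq = ++nextSeq;
    try {
      const value = await opts.load();
      // Only apply the result if no later-started reload has already landed.
      if (seq > appliedSeq) {
        appliedSeq = seq;
        snapshot = value;
      }
    } catch {
      // Assumption: on a failed refresh, keep serving the previous snapshot.
    }
  }

  const timer = setInterval(() => void reload(), opts.intervalMs);

  return {
    get: () => snapshot,
    reload,
    stop: () => clearInterval(timer),
  };
}
```

A caller on the hot path would read `registry.get()` synchronously and treat `undefined` as "off", which matches the cold-replica fallback to the container path.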

Operators keep a record of which runs ran where, while customers only see geography: the run's actual worker queue is stored raw, and the geo region is stamped separately on TaskRun.region (and a new ClickHouse region column) at trigger time. Read surfaces (the dashboard, the API, and the Query/Logs page) show the geo region, falling back to the worker queue for runs written before the column existed.
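The read-side fallback can be sketched in a few lines. The `RunRegionFields` shape is an assumption that keeps only the two fields discussed above, and the helper body is illustrative rather than the code in this PR.

```typescript
// Hypothetical shape: only the two fields relevant to display.
interface RunRegionFields {
  region: string | null; // geo region stamped at trigger time; null on older runs
  workerQueue: string; // raw queue the run actually used
}

// Prefer the stamped geo region; fall back to the worker queue for runs
// written before the region column existed.
function regionForDisplay(run: RunRegionFields): string {
  return run.region ?? run.workerQueue;
}
```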

Minor follow-ups left out of scope: the percentage flags render as text inputs on the admin flags page (the catalog UI has no numeric control type yet), and createReloadingRegistry could later gain pub/sub for sub-second cross-replica propagation if the reload interval proves too slow.

changeset-bot · 2026-06-15T16:31:35Z

⚠️ No Changeset found

Latest commit: 188dac2

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

coderabbitai · 2026-06-15T16:31:47Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

This PR introduces a plan-aware compute migration system that routes organizations onto compute backing at task trigger time. It adds a generic createReloadingRegistry utility with Prometheus metrics, p-retry startup loading, and periodic refresh. A new workerRegionRegistry loads WorkerGroupRegionRow data from the database and exposes regionForQueue and backingForQueue helpers; the WorkerInstanceGroup table gains a nullable region TEXT column via migration. Three feature flags (computeMigrationEnabled, computeMigrationFreePercentage, computeMigrationPaidPercentage) and two new environment variables (GLOBAL_FLAGS_RELOAD_INTERVAL_MS, GLOBAL_FLAGS_READY_TIMEOUT_MS) are added. A globalFlagsRegistry singleton caches global flags from the database. An FNV-1a hashBucket function and isOrgMigrated/resolveComputeMigration functions implement the enrollment decision and queue rewrite logic. TaskRun gains a region column persisted by RunEngine.trigger. The triggerTask and computeTemplateCreation services are updated to evaluate migration at routing time and rewrite worker queues to compute backing when enrolled. Region derivation across presenters, routes, and the ClickHouse replication service is updated to use explicit region field when present. ClickHouse task_runs_v2 table gains a region column for analytics.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 53.85% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The PR title 'feat(webapp): plan-aware compute migration' is specific, concise, and accurately describes the main feature being added—a plan-aware mechanism for compute migration controlled by feature flags and percentage bucketing.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	PR description is detailed and comprehensive, covering implementation approach, feature flags, registry mechanism, and important follow-ups, though it lacks explicit testing documentation.

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

nicktrn · 2026-06-15T19:25:16Z

Addressed the review feedback, plus a few issues a deeper review pass turned up:

Replay of a migrated run would have silently produced no run: the stored backing queue (us-east-1-next) was read back as an explicit region override and rejected by the compute-access gate. Replay now reverse-maps the stored backing to its geo region and re-resolves, so migration re-applies with current flags (and an org that's since been excluded replays onto the container path).
Backing hidden on customer surfaces: a regionForBacking inverse of COMPUTE_BACKING_MAP is applied at the run API, run list, run detail, replay, and the ClickHouse worker_queue write, so the API / dashboard / Query feature all report the geo region. The raw backing stays on TaskRun.workerQueue in Postgres for internal use - no schema change.
Registry: reloads are now sequence-guarded so a slow older reload can't overwrite a newer snapshot (the kill switch can't silently revert), and waitUntilReady clears its timeout instead of leaking one per cold-start trigger.
Kill switch uses strict z.boolean() (coercion turned the string "false" into true); the reload interval is now bounded.
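To illustrate the kill-switch bug fixed above: coercion in the style of `Boolean(value)` treats every non-empty string as true, so a stored `"false"` reads as enabled. The sketch below reproduces both behaviours in plain TypeScript without zod. The helper names and the fallback to "off" on an invalid value are assumptions for illustration.

```typescript
// Coercion in the style of Boolean(value): any non-empty string is truthy.
function coerceBoolean(value: unknown): boolean {
  return Boolean(value);
}

// Strict parsing: accept only real booleans and reject everything else.
function strictBoolean(value: unknown): boolean {
  if (typeof value !== "boolean") {
    throw new TypeError(`expected boolean, got ${typeof value}`);
  }
  return value;
}

// Hypothetical reader: an invalid stored value falls back to "off".
function readKillSwitch(raw: unknown): boolean {
  try {
    return strictBoolean(raw);
  } catch {
    return false;
  }
}
```

With coercion, `coerceBoolean("false")` is `true`; with the strict reader, the same input is rejected and the switch stays off.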

Operational notes for rollout:

Billing should key off machine preset / actual execution, not hasComputeAccess - migrated orgs run on the backing without that flag.
The compute backing needs its own :scheduled consumer for scheduled runs.
The deprecated V3 batch path doesn't percentage-enroll (it passes skipChecks without a plan type); per-org overrides still apply there.

nicktrn · 2026-06-15T20:06:09Z

Follow-up: replaced the COMPUTE_BACKING_MAP env var with a region column on WorkerInstanceGroup, so region<->backing resolution comes from data instead of editable config (removes the "edit a config blob and silently break reverse-mapping for historical runs" footgun).

New nullable WorkerInstanceGroup.region (migration ..._add_worker_instance_group_region). Container and compute groups for one geo share the value - e.g. both us-east-1 and us-east-1-next get region = "us-east-1".
A workerRegionRegistry (same createReloadingRegistry pattern, in-memory snapshot) serves both directions off the hot path: forward (region -> its MICROVM backing) at trigger, reverse (a stored queue -> its geo region) at the presenters / replay / ClickHouse write.
COMPUTE_BACKING_MAP and computeBackingMap.server.ts deleted.
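A rough sketch of how the two lookup directions could be derived from the worker group rows. The `regionForQueue` and `backingForQueue` names come from the review walkthrough; the row fields (`queue`, `region`, `workloadType`) and the `buildRegionIndex` helper are hypothetical, since the actual `WorkerGroupRegionRow` shape is not listed here.

```typescript
type WorkloadType = "CONTAINER" | "MICROVM";

// Hypothetical row shape loaded into the in-memory snapshot.
interface WorkerGroupRegionRow {
  queue: string; // e.g. "us-east-1" or "us-east-1-next"
  region: string | null; // nullable WorkerInstanceGroup.region
  workloadType: WorkloadType;
}

function buildRegionIndex(rows: WorkerGroupRegionRow[]) {
  const regionByQueue = new Map<string, string>();
  const microvmQueueByRegion = new Map<string, string>();
  for (const row of rows) {
    if (row.region === null) continue; // unset region: this group never migrates
    regionByQueue.set(row.queue, row.region);
    if (row.workloadType === "MICROVM") {
      microvmQueueByRegion.set(row.region, row.queue);
    }
  }
  return {
    // Reverse: stored queue -> geo region, resolving to the queue itself when unset.
    regionForQueue(queue: string): string {
      return regionByQueue.get(queue) ?? queue;
    },
    // Forward: queue -> MicroVM backing queue of the same geo region, if one exists.
    backingForQueue(queue: string): string | undefined {
      const region = regionByQueue.get(queue);
      return region === undefined ? undefined : microvmQueueByRegion.get(region);
    },
  };
}
```

With rows for `us-east-1` (container) and `us-east-1-next` (MicroVM) both set to region `us-east-1`, the forward lookup maps `us-east-1` to `us-east-1-next` and the reverse lookup maps `us-east-1-next` back to `us-east-1`. A group with no region resolves to its own queue and has no backing.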

Rollout requirement: set region on the live worker groups before enabling migration. It's nullable - unset means that group never migrates and resolves to its own queue (safe no-op). Backfill the container + compute groups of each geo to the same region value.

Treat region as set-once while a group has run history: changing it re-breaks region resolution for existing runs. The durability win is that this is now one immutable data field rather than an editable config map.

pkg-pr-new · 2026-06-15T22:10:25Z

@trigger.dev/build

npm i https://pkg.pr.new/@trigger.dev/build@b75e18a

trigger.dev

npm i https://pkg.pr.new/trigger.dev@b75e18a

@trigger.dev/core

npm i https://pkg.pr.new/@trigger.dev/core@b75e18a

@trigger.dev/python

npm i https://pkg.pr.new/@trigger.dev/python@b75e18a

@trigger.dev/react-hooks

npm i https://pkg.pr.new/@trigger.dev/react-hooks@b75e18a

@trigger.dev/redis-worker

npm i https://pkg.pr.new/@trigger.dev/redis-worker@b75e18a

@trigger.dev/rsc

npm i https://pkg.pr.new/@trigger.dev/rsc@b75e18a

@trigger.dev/schema-to-json

npm i https://pkg.pr.new/@trigger.dev/schema-to-json@b75e18a

@trigger.dev/sdk

npm i https://pkg.pr.new/@trigger.dev/sdk@b75e18a

commit: b75e18a

nicktrn self-assigned this Jun 15, 2026

devin-ai-integration Bot reviewed Jun 15, 2026

0ski approved these changes Jun 15, 2026

nicktrn force-pushed the feat/compute-migration branch from 3cf484d to 697de03 Compare June 15, 2026 20:05

nicktrn added 17 commits June 15, 2026 23:06

feat(webapp): add compute migration feature flags

846c541

feat(webapp): add deterministic org hashBucket for rollout

8b12326

feat(webapp): add compute migration env config

eeca099

feat(webapp): add compute migration resolver

7e41492

feat(webapp): add createReloadingRegistry helper

b36b83c

feat(webapp): boot global flags registry

7067ac7

feat(webapp): route migrated orgs to the compute backing at trigger

266fc60

feat(webapp): build compute template for migrated orgs

58d8fe5

chore(webapp): server-changes note for compute migration

c4e0dcf

test(webapp): move hashBucket test into test/ so vitest includes it

65ce342

fix(webapp): hide compute backing on read surfaces and fix replay

4791b01

fix(webapp): store geo region in clickhouse to hide compute backing from query

8060760

fix(webapp): serialize registry reloads and clear readiness timeout

25e3fe1

fix(webapp): strict boolean kill switch, bound reload interval, cuid bucket test

18d5b11

feat(database,webapp): add WorkerInstanceGroup.region + worker-region registry

aefdf3a

refactor(webapp): resolve region<->backing from WorkerInstanceGroup.region, drop env map

53eaa7a

feat(webapp): stamp geo region on runs, keep worker_queue raw, test-safe registries

222d653

fix(webapp): fail-open entitlement lookup in migration mode, harden region fallback

b75e18a

nicktrn force-pushed the feat/compute-migration branch from a1a460e to b75e18a Compare June 15, 2026 22:07

nicktrn added 3 commits June 15, 2026 23:15

docs(webapp): trim server-changes note to one behavior-level line

ea9a071

refactor(webapp): centralize display-region fallback in regionForDisplay

e3524e9

fix(webapp): look up run's actual worker group, show geo region in span detail

126a01f

docs(webapp): trim hot-path comment, note region must stay whereTransform-free

188dac2

2 participants
