feat(webapp): plan-aware compute migration #3957
Walkthrough: This PR introduces a plan-aware compute migration system that routes organizations onto compute backing at task trigger time. It adds a generic
Pre-merge checks: 4 passed, 1 failed (1 warning).
Addressed the review feedback, plus a few issues a deeper review pass turned up:
Operational notes for rollout:
Force-pushed from 3cf484d to 697de03.
Follow-up: replaced the
Rollout requirement: set Treat
…egion, drop env map
Force-pushed from a1a460e to b75e18a.
Adds an opt-in mechanism to route a configurable percentage of organizations onto the compute (MicroVM) backing of their region at trigger time, without changing their stored region settings.
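A minimal sketch of how percentage routing with deterministic bucketing could work, for illustration only. The names `bucketForOrg`, `shouldMigrate`, and `MigrationFlags`, the FNV-1a hash, and the choice of 100 buckets are all assumptions and not taken from this PR; checking the per-org override before the global switch is also an assumption about what "wins in both directions" means.

```typescript
// Hypothetical sketch: names, hash choice, and bucket count are assumptions,
// not the implementation in this PR.
type Plan = "free" | "paid";

interface MigrationFlags {
  enabled: boolean; // global kill switch
  freePercentage: number; // 0-100
  paidPercentage: number; // 0-100
}

// 32-bit FNV-1a over the id's UTF-16 code units, reduced to a stable bucket 0..99.
function bucketForOrg(orgId: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < orgId.length; i++) {
    hash ^= orgId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the value unsigned
  }
  return hash % 100;
}

function shouldMigrate(
  orgId: string,
  plan: Plan,
  flags: MigrationFlags,
  orgOverride?: boolean
): boolean {
  // Assumption: an explicit per-org override is consulted first, in both directions.
  if (orgOverride !== undefined) return orgOverride;
  if (!flags.enabled) return false;
  const percentage = plan === "paid" ? flags.paidPercentage : flags.freePercentage;
  // An org is in the rollout when its fixed bucket falls below the percentage,
  // so lowering the percentage only removes orgs and never adds new ones.
  return bucketForOrg(orgId) < percentage;
}
```

Because each org's bucket never changes, the set selected at 10% is always contained in the set selected at 50%, which is the "strict subset rather than reshuffling" property.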
Routing is gated by three global feature flags - `computeMigrationEnabled`, `computeMigrationFreePercentage`, `computeMigrationPaidPercentage` - plus a per-org `computeMigrationEnabled` override that wins in both directions. A region's compute backing is resolved from a new `WorkerInstanceGroup.region` column: a container group and its MicroVM group share one geo region, so the migration swaps the resolved worker queue to the backing group's queue. Orgs are bucketed deterministically by id, so ramping a percentage down keeps a strict subset rather than reshuffling, and a region with no compute backing is never touched. Everything is off by default - behaviour is unchanged unless the flags are set.

The flags and the worker-region groups are read on the trigger hot path from in-memory snapshots rather than the database: a small
`createReloadingRegistry` helper loads each at startup and refreshes them on an interval, so no per-trigger query is added and a percentage or kill-switch change propagates within the reload interval. A cold replica that hasn't loaded yet falls back to off (the container path). The same migration decision is consulted at deploy-time template creation so a migrated org still gets a compute template built, in shadow mode so it never fails the deploy.

So operators keep "which runs ran where" while customers only see geography: the run's actual worker queue is stored raw, and the geo region is stamped separately on
`TaskRun.region` (and a new ClickHouse `region` column) at trigger time. Read surfaces - the dashboard, the API, and the Query/Logs page - show the geo region, falling back to the worker queue for runs written before the column existed.

Minor follow-ups left out of scope: the percentage flags render as text inputs on the admin flags page (the catalog UI has no numeric control type yet), and `createReloadingRegistry` could later gain pub/sub for sub-second cross-replica propagation if the reload interval proves too slow.
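For readers unfamiliar with the pattern, here is a rough sketch of an interval-reloaded in-memory snapshot with a safe fallback. It is not the PR's `createReloadingRegistry`: the function name `createReloadingRegistrySketch`, its options (`load`, `fallback`, `intervalMs`), and the returned shape are all hypothetical, chosen only to illustrate the idea of reading from memory on the hot path.

```typescript
// Hypothetical sketch of an interval-reloaded snapshot; the option names and
// return shape are assumptions, not the helper added in this PR.
interface ReloadingRegistry<T> {
  get(): T; // synchronous read, safe for a hot path
  reload(): Promise<void>; // force a refresh (also called on the interval)
  stop(): void; // cancel the refresh timer
}

function createReloadingRegistrySketch<T>(opts: {
  load: () => Promise<T>;
  fallback: T; // served until the first successful load (the "off" state)
  intervalMs: number;
}): ReloadingRegistry<T> {
  let snapshot: T = opts.fallback;

  const reload = async (): Promise<void> => {
    try {
      snapshot = await opts.load();
    } catch {
      // On failure keep serving the last good snapshot (or the fallback).
    }
  };

  void reload(); // initial load at startup, not awaited by callers
  const timer = setInterval(() => void reload(), opts.intervalMs);

  return {
    get: () => snapshot,
    reload,
    stop: () => clearInterval(timer),
  };
}
```

In this sketch a replica that has not finished its first load simply returns `fallback` from `get()`, which mirrors the "cold replica falls back to off" behaviour described above; a change in the backing store becomes visible after at most one reload interval.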