feat: fall back to the control plane when direct-to-VM routing hits a dead browser #116

Draft
rgarcia wants to merge 5 commits into next from raf/telemetry-cp-fallback
Contributor

@rgarcia rgarcia commented Jun 3, 2026

What

Reworks the direct-to-VM routing layer to fall back to the control plane only when a routed request hits an authoritatively gone browser, replacing the previous broad-trigger draft (which fell back on any 5xx for any idempotent GET).

Pairs with metro-api kernel#2317 (assumed to deploy): a routed request to a deleted/gone browser returns HTTP 404 with JSON body {"code":"browser_gone","message":"browser not found"}. There is no special response header — we key off the body code only. A transient/real upstream failure still returns 5xx; a live VM's own 404 has no browser_gone code.

New fallback semantics

Fallback fires iff all of these hold:

  1. the request was actually routed to the VM (allowlisted subresource + cached route),
  2. HTTP method is GET,
  3. the routed (subresource + suffix) path is in the per-endpoint fallback-eligible registry,
  4. the VM returns HTTP 404 whose JSON body has code == "browser_gone".
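The four conditions above can be read as a single pure predicate. The following is a minimal sketch, not the code in src/lib/browser-routing.ts: the RoutedAttempt shape and the function name shouldFallBackToControlPlane are hypothetical and exist only for illustration.

```typescript
// Hypothetical shape: the real routing layer's types are not shown in this PR.
interface RoutedAttempt {
  routedToVm: boolean; // condition 1: allowlisted subresource + cached route
  method: string; // condition 2: must be GET
  fallbackEligible: boolean; // condition 3: path is in the opt-in registry
  status: number; // condition 4: must be 404 ...
  bodyCode: string | undefined; // ... with body code "browser_gone"
}

// Returns true only when every condition holds; anything else propagates as-is.
function shouldFallBackToControlPlane(attempt: RoutedAttempt): boolean {
  return (
    attempt.routedToVm &&
    attempt.method.toUpperCase() === "GET" &&
    attempt.fallbackEligible &&
    attempt.status === 404 &&
    attempt.bodyCode === "browser_gone"
  );
}
```

Keeping the decision free of I/O makes each "no fallback" case (502, POST, ineligible path, plain 404) a one-line test.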

On fallback: evict the now-stale cached route, then re-issue the original request to the control plane exactly once (original CP URL, restore Authorization, drop the jwt query param). Return that response. Never loops.
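As a rough illustration of the re-issue step, the sketch below rebuilds the control-plane request from the original URL. The helper name buildControlPlaneRetry, the PlainRequest type, and the "Bearer ..." value used in the usage line are assumptions for the example; the PR does not show how the SDK stores or restores the Authorization header.

```typescript
// Hypothetical plain-data request; the SDK's real request type is not shown here.
interface PlainRequest {
  url: string;
  headers: Record<string, string>;
}

// Rebuild the control-plane request: drop the jwt query param and restore
// the Authorization header that was stripped for the direct-to-VM call.
function buildControlPlaneRetry(
  originalCpUrl: string,
  routedHeaders: Record<string, string>,
  authorization: string,
): PlainRequest {
  const url = new URL(originalCpUrl);
  url.searchParams.delete("jwt");

  const headers: Record<string, string> = {};
  for (const name of Object.keys(routedHeaders)) {
    // Skip any stale Authorization entry regardless of header-name casing.
    if (name.toLowerCase() !== "authorization") {
      headers[name] = routedHeaders[name];
    }
  }
  headers["Authorization"] = authorization;

  return { url: url.toString(), headers };
}

// Usage (values are made up):
const retry = buildControlPlaneRetry(
  "https://api.onkernel.com/browsers/abc/telemetry/events?jwt=tok&limit=10",
  { Accept: "application/json" },
  "Bearer example-key",
);
```

The caller would send this request once and return whatever comes back, including an error, which is what keeps the fallback from looping.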

Does not fall back on: success, transient 5xx (502/503/504), connection/network errors, other 4xx, or a 404 whose body code is not browser_gone. These propagate unchanged, which fixes the latency problem of the old "retry the dead VM, then fall back" behavior: a transient 502 is now simply returned.

Body handling: the 404 body is inspected via response.clone(), so when we do not fall back the original response body is returned intact to the caller.
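A minimal sketch of that body check, assuming a fetch-style response object. ResponseLike is a hypothetical structural type standing in for the Fetch API Response, and both function names are illustrative rather than taken from the PR.

```typescript
// Hypothetical minimal view of a fetch Response: only what the check needs.
interface ResponseLike {
  status: number;
  clone(): { text(): Promise<string> };
}

// Body-code-only check: no content-type gate, non-JSON bodies are simply "not gone".
function hasBrowserGoneCode(bodyText: string): boolean {
  try {
    const parsed: unknown = JSON.parse(bodyText);
    return (
      typeof parsed === "object" &&
      parsed !== null &&
      (parsed as { code?: unknown }).code === "browser_gone"
    );
  } catch {
    return false; // body was not valid JSON
  }
}

// Reads from a clone so the original response body stays unread for the caller.
async function isBrowserGone(response: ResponseLike): Promise<boolean> {
  if (response.status !== 404) return false;
  return hasBrowserGoneCode(await response.clone().text());
}
```

Reading the clone as text and parsing it by hand means a non-JSON 404 from a live VM yields false instead of throwing.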

Per-endpoint opt-in registry

FALLBACK_ELIGIBLE_ROUTES is default-OFF for everything. It pre-registers only the prospective pull endpoint GET /browsers/{id}/telemetry/events (that SDK method does not exist yet; this wires the opt-in so fallback works the moment it ships). Adding a future eligible endpoint is a one-line registry edit.
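One way such a registry could look is sketched below. The names FALLBACK_ELIGIBLE_ROUTES and isFallbackEligible come from this PR, but the key format ("subresource/suffix") and the Set-based storage are assumptions; the actual definition in src/lib/browser-routing.ts is not reproduced here.

```typescript
// Default-OFF: a path is eligible only if it is listed here.
// Key format "subresource/suffix" is an assumption made for this sketch.
const FALLBACK_ELIGIBLE_ROUTES: ReadonlySet<string> = new Set<string>([
  "telemetry/events", // prospective GET /browsers/{id}/telemetry/events
]);

// True when the routed (subresource + suffix) path has opted in to fallback.
function isFallbackEligible(subresource: string, suffix: string): boolean {
  // Normalize leading/trailing slashes so "/events" and "events/" match.
  const trimmed = suffix.replace(/^\/+|\/+$/g, "");
  const key = trimmed === "" ? subresource : `${subresource}/${trimmed}`;
  return FALLBACK_ELIGIBLE_ROUTES.has(key);
}
```

With this layout, opting in another endpoint means adding one string to the set.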

Scope

  • Does not modify the default routing subresource list (owned by the separate telemetry-default-routing PR); git diff origin/next touches only src/lib/browser-routing.ts and its test.
  • Depends on kernel#2317.

Tests

tests/lib/browser-routing.test.ts covers: eligible GET + 404 browser_gone -> CP fallback (Authorization restored, jwt dropped, route evicted, exactly one CP re-issue); CP-also-errors -> returned as-is, no loop; not-eligible path + 404 browser_gone -> no fallback; eligible + 502 / connection error -> no fallback (propagated); eligible + 200 -> no fallback; eligible but POST -> no fallback; non-browser_gone 404 -> no fallback; body-code-only (no content-type gate) positive + non-JSON-body negative; non-routed cache-miss untouched.

Live QA (prod)

browser_gone and the telemetry/events endpoint are not deployed yet, so the positive fallback path is proven by unit tests (mocked 404 browser_gone). Live QA proves the trigger is correctly scoped:

  • (A) telemetry stream on a live browser went directly to the VM proxy host (not api.onkernel.com).
  • (B) after deleting the browser, streaming over the stale route returned the VM's 502 and the SDK returned it as-is, with zero control-plane re-issues, because the stream path is not eligible and a 502 is not browser_gone.

🤖 Generated with Claude Code

rgarcia and others added 5 commits June 3, 2026 11:41
Add "telemetry" to the default KERNEL_BROWSER_ROUTING_SUBRESOURCES list so
telemetry SSE streams are routed straight to the browser VM, and change the
telemetry stream method path from /browsers/{id}/telemetry to
/browsers/{id}/telemetry/stream so the direct-routing rewrite yields
{base_url}/telemetry/stream on the VM (the VM's /telemetry is a different,
non-streaming endpoint).

DEPENDS ON the control-plane PR renaming the public endpoint
/browsers/{id}/telemetry -> /browsers/{id}/telemetry/stream. Until that
deploys, telemetry.stream() only works via direct routing.

Verified with a live smoke test against prod: the telemetry stream request
is rewritten to the VM proxy host (.../telemetry/stream?jwt=...), the
Authorization header is stripped, and an api_call telemetry event arrives
within ~1s of generating activity.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
feat: route browser telemetry directly to the VM by default
Rework the routing-layer control-plane fallback to a tight, opt-in design
that pairs with metro-api kernel#2317 (routed requests to a deleted/gone
browser return HTTP 404 with body {"code":"browser_gone"}).

Replaces the previous broad trigger (fall back on any 5xx for any
idempotent GET), which retried the dead VM then fell back, adding latency
on transient errors.

New semantics in routeRequest, applied only when the request was actually
routed to the VM (allowlisted subresource + cached route): fall back IFF
method is GET, the routed (subresource + suffix) path is in the
fallback-eligible registry, and the VM returns HTTP 404 whose JSON body has
code == "browser_gone". On fallback, evict the cached route and re-issue the
ORIGINAL request to the control plane exactly once (original URL, restore
Authorization, drop the jwt query param); return that response, never loop.
Success, transient 5xx, network errors, other 4xx, and non-browser_gone
404s propagate unchanged. The 404 body is read via response.clone() so a
non-fallback response is returned intact.

Per kernel#2317 there is no special response header, so the gone check keys
off the body code only (no content-type gate), matching the python SDK.

Adds an isFallbackEligible(subresource, suffix) predicate backed by a
registry that is default-OFF. Pre-registers only the prospective pull
endpoint GET /browsers/{id}/telemetry/events; adding future eligible
endpoints is a one-line registry edit.

Scoped to the fallback only: does not modify the default routing
subresource list (owned by the telemetry-default-routing PR).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>