Skip to content

feat: fall back to the control plane when direct-to-VM routing hits a dead browser#110

Draft
rgarcia wants to merge 1 commit into
nextfrom
raf/telemetry-cp-fallback
Draft

feat: fall back to the control plane when direct-to-VM routing hits a dead browser#110
rgarcia wants to merge 1 commit into
nextfrom
raf/telemetry-cp-fallback

Conversation

@rgarcia
Copy link
Copy Markdown
Contributor

@rgarcia rgarcia commented Jun 3, 2026

Summary

Reworks the draft control-plane fallback for the direct-to-VM routing layer. The previous draft fell back on any 5xx for any idempotent GET, which forced a slow round-trip to a dead/slow VM before falling back. This replaces that with a tighter, opt-in, browser_gone-only trigger.

Depends on metro-api kernel#2317: when a routed request targets a deleted/gone browser, the VM proxy returns HTTP 404 with JSON body {"code":"browser_gone","message":"browser not found"} (no special header — we key off the body code only). A transient/real upstream failure still returns 5xx; a live VM's own 404 has no browser_gone code.

New fallback semantics

Fallback to the control plane fires iff ALL hold:

  1. the request was actually routed to the VM (allowlisted subresource and a cached route);
  2. HTTP method is GET;
  3. the routed path is in the fallback-eligible registry;
  4. the VM response is HTTP 404 AND its JSON body has code == "browser_gone".

On fallback: evict the cached route for that session (authoritatively gone), then re-issue the original request to the control plane exactly once (original CP URL, restore Authorization, drop the injected jwt query param). The CP response is returned verbatim — never loops back through VM routing.

Does not fall back on: success, transient 5xx (502/503/504), connection/network errors, other 4xx, or a 404 whose body code is not browser_gone. Those propagate unchanged — so a transient 502 is simply returned (fixes the old "retry the dead VM then fall back" latency problem).

Body handling: the response body is only inspected on a 404, and is buffered/restored so that when we do not fall back the original body is still returned intact to the caller.

Per-endpoint opt-in registry

Eligibility lives in a small routing-layer registry keyed by {subresource, suffix}, default-OFF for everything else. It pre-registers only the forthcoming pull endpoint:

  • PROSPECTIVE: GET /browsers/{id}/telemetry/events (subresource telemetry, suffix /events). That method does not exist yet; this pre-wires the opt-in so fallback works the moment it ships.

Adding a future eligible endpoint is a one-line registry edit.

Scoping

This PR is focused on the fallback. It does not modify the default routing subresource list (that belongs to the separate telemetry-default-routing PR). git diff origin/next touches only lib/browserrouting/route_cache.go and its test. Tests that need telemetry routed enable it explicitly.

Testing

  • go build ./..., gofmt -l, go vet ./..., go test -count=1 . ./lib/browserrouting/ (incl. -race) all pass.
  • Unit tests cover the positive path (eligible + GET + 404 browser_gone → CP fallback: Authorization restored, jwt dropped, route evicted, exactly one CP re-issue) and every negative: CP-also-errors-no-loop, ineligible path, eligible-subresource-wrong-suffix, transient 5xx, connection error, success, non-browser_gone 404, other 4xx (403), POST, and non-routed requests. The positive path is proven by unit tests because browser_gone is not yet deployed to prod.
  • Live scoping QA against prod (telemetry browser, KERNEL_BROWSER_ROUTING_SUBRESOURCES=telemetry):
    • (A) opening the telemetry stream on a live browser routed directly to the VM proxy host (proxy.<metro>.onkernel.com:8443, 200), not api.onkernel.com.
    • (B) re-opening the stream on a stale route after the browser was deleted returned the VM 502 and did not fall back — every outbound attempt went to the VM host, with zero api.onkernel.com re-issues (the stream path is not in the eligible registry and 502 is not browser_gone).

Replace the prior broad-trigger fallback (any 5xx on any idempotent GET)
with a tighter, opt-in design that assumes metro-api kernel#2317: a
routed request targeting a deleted/gone browser returns HTTP 404 with a
JSON body code "browser_gone".

Fallback now fires IFF all hold: the request was actually routed to the
VM (allowlisted subresource + cached route), method is GET, the routed
path is in a fallback-eligible registry, and the response is 404 with
body code "browser_gone". On fallback the cached route is evicted and
the original control-plane request is replayed exactly once (restoring
Authorization, dropping the jwt query param). Transient 5xx, connection
errors, other 4xx, and non-browser_gone 404s propagate unchanged.

The eligibility registry pre-registers only the prospective pull
endpoint GET /browsers/{id}/telemetry/events. Adding future eligible
endpoints is a one-line registry edit. The default routing subresource
list is intentionally not modified by this PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rgarcia rgarcia force-pushed the raf/telemetry-cp-fallback branch from f3c0e7c to 4e86417 Compare June 3, 2026 18:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant