Skip to content

feat: fall back to the control plane when direct-to-VM routing hits a dead browser#109

Draft
rgarcia wants to merge 1 commit into
nextfrom
raf/telemetry-cp-fallback
Draft

feat: fall back to the control plane when direct-to-VM routing hits a dead browser#109
rgarcia wants to merge 1 commit into
nextfrom
raf/telemetry-cp-fallback

Conversation

@rgarcia
Copy link
Copy Markdown
Contributor

@rgarcia rgarcia commented Jun 3, 2026

Opt-in control-plane fallback for browser_gone 404s

Reworks the prior draft, which fell back to the control plane on any 5xx for any idempotent GET. That broad trigger meant a transient 502 from a healthy VM caused a retry of the dead VM and then a control-plane re-issue, adding latency for failures that should just be returned.

This replaces it with a tighter, opt-in, per-endpoint design that assumes the paired metro-api change kernel#2317 lands.

Depends on kernel#2317

When a routed request targets a deleted/gone browser, the VM proxy returns HTTP 404 with JSON body {"code":"browser_gone","message":"browser not found"}. There is no special response header — we key off the body code only. A transient/real upstream failure still returns 5xx; a live VM's own 404 has no browser_gone code.

New fallback semantics

The routing layer keeps a small registry of fallback-eligible routed paths, expressed as (subresource, suffix) and default-OFF for everything else.

We fall back to the control plane iff ALL hold:

  1. the request was actually routed to the VM (allowlisted subresource + a cached route exists);
  2. the HTTP method is GET;
  3. the routed path is in the fallback-eligible registry;
  4. the VM returned 404 and its JSON body code == "browser_gone".

We do NOT fall back on: success, transient 5xx (502/503/504), connection/network errors, other 4xx, or a 404 whose body code is not browser_gone. Those propagate unchanged.

On fallback we evict the cached route for the session (it's authoritatively gone), then re-issue the original request to the control plane exactly once (original CP URL, Authorization restored, jwt query param dropped). Never loops.

Prospective endpoint

Only the forthcoming pull endpoint GET /browsers/{id}/telemetry/events (("telemetry", "/events")) is pre-registered, clearly commented as PROSPECTIVE — the telemetry.events(...) method does not exist yet. This wires the opt-in so fallback works the moment that method ships. Adding any future eligible endpoint is a one-line registry edit.

Scoping

This PR is focused on the fallback. It does NOT modify the default routing subresource list (that belongs to the separate telemetry-default-routing PR). git diff origin/next does not touch the default-subresources constant; tests enable telemetry routing locally.

Verification

  • tests/test_browser_routing.py: 33 passing. Covers eligible+GET+browser_gone -> CP fallback (Authorization restored, jwt dropped, route evicted, exactly one CP re-issue), browser_gone+CP-also-errors -> returned as-is no loop, non-eligible path / 502 / connection error / 200 / POST / plain-404 -> no fallback, and non-routed requests untouched, plus a unit test of the eligibility registry. Sync and async parity covered.
  • ruff check clean on all changed files.
  • Live scoping QA against prod (browser_gone not deployed yet; positive path proven by unit tests): (A) telemetry stream routed directly to the VM proxy host, not api.onkernel.com; (B) after deleting the browser, the stale telemetry-stream route returned the VM 502 to the caller with ZERO control-plane re-issues — correct, since /stream is not eligible and 502 is not browser_gone.

Files: src/kernel/_client.py, src/kernel/lib/browser_routing/routing.py, tests/test_browser_routing.py.

Replace the prior broad-trigger fallback (fall back on any 5xx for any
idempotent GET) with a tighter, opt-in design that assumes the paired
metro-api change (kernel#2317): a routed request against a deleted/gone
browser returns HTTP 404 with body {"code":"browser_gone"}.

The routing layer now keeps a small registry of fallback-ELIGIBLE routed
paths (subresource + suffix), default-OFF for everything else. Only the
prospective GET /browsers/{id}/telemetry/events endpoint is pre-registered;
adding future eligible endpoints is a one-line edit.

Kernel.request / AsyncKernel.request fall back to the control plane IFF the
request was actually routed to the VM, the method is GET, the routed path is
eligible, and the VM returned a 404 whose JSON body code == "browser_gone".
On fallback we evict the cached route and re-issue the ORIGINAL request to
the control plane exactly once (CP URL, Authorization restored, jwt param
dropped). Transient 5xx, connection errors, other 4xx, success, non-eligible
paths, POSTs, and non-routed requests all propagate unchanged.

This PR intentionally does NOT modify the default routing subresource list.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rgarcia rgarcia force-pushed the raf/telemetry-cp-fallback branch from fd1d2ec to cfc1aa1 Compare June 3, 2026 18:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant