feat: fall back to the control plane when direct-to-VM routing hits a dead browser#110
Draft
rgarcia wants to merge 1 commit into
Draft
feat: fall back to the control plane when direct-to-VM routing hits a dead browser#110rgarcia wants to merge 1 commit into
rgarcia wants to merge 1 commit into
Conversation
Replace the prior broad-trigger fallback (any 5xx on any idempotent GET)
with a tighter, opt-in design that assumes metro-api kernel#2317: a
routed request targeting a deleted/gone browser returns HTTP 404 with a
JSON body code "browser_gone".
Fallback now fires IFF all hold: the request was actually routed to the
VM (allowlisted subresource + cached route), method is GET, the routed
path is in a fallback-eligible registry, and the response is 404 with
body code "browser_gone". On fallback the cached route is evicted and
the original control-plane request is replayed exactly once (restoring
Authorization, dropping the jwt query param). Transient 5xx, connection
errors, other 4xx, and non-browser_gone 404s propagate unchanged.
The eligibility registry pre-registers only the prospective pull
endpoint GET /browsers/{id}/telemetry/events. Adding future eligible
endpoints is a one-line registry edit. The default routing subresource
list is intentionally not modified by this PR.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
f3c0e7c to
4e86417
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Reworks the draft control-plane fallback for the direct-to-VM routing layer. The previous draft fell back on any 5xx for any idempotent GET, which forced a slow round-trip to a dead/slow VM before falling back. This replaces that with a tighter, opt-in,
browser_gone-only trigger.Depends on metro-api kernel#2317: when a routed request targets a deleted/gone browser, the VM proxy returns HTTP 404 with JSON body
{"code":"browser_gone","message":"browser not found"}(no special header — we key off the body code only). A transient/real upstream failure still returns 5xx; a live VM's own 404 has nobrowser_gonecode.New fallback semantics
Fallback to the control plane fires iff ALL hold:
GET;code == "browser_gone".On fallback: evict the cached route for that session (authoritatively gone), then re-issue the original request to the control plane exactly once (original CP URL, restore
Authorization, drop the injectedjwtquery param). The CP response is returned verbatim — never loops back through VM routing.Does not fall back on: success, transient 5xx (502/503/504), connection/network errors, other 4xx, or a 404 whose body code is not
browser_gone. Those propagate unchanged — so a transient 502 is simply returned (fixes the old "retry the dead VM then fall back" latency problem).Body handling: the response body is only inspected on a 404, and is buffered/restored so that when we do not fall back the original body is still returned intact to the caller.
Per-endpoint opt-in registry
Eligibility lives in a small routing-layer registry keyed by
{subresource, suffix}, default-OFF for everything else. It pre-registers only the forthcoming pull endpoint:GET /browsers/{id}/telemetry/events(subresourcetelemetry, suffix/events). That method does not exist yet; this pre-wires the opt-in so fallback works the moment it ships.Adding a future eligible endpoint is a one-line registry edit.
Scoping
This PR is focused on the fallback. It does not modify the default routing subresource list (that belongs to the separate telemetry-default-routing PR).
git diff origin/nexttouches onlylib/browserrouting/route_cache.goand its test. Tests that needtelemetryrouted enable it explicitly.Testing
go build ./...,gofmt -l,go vet ./...,go test -count=1 . ./lib/browserrouting/(incl.-race) all pass.browser_gone→ CP fallback: Authorization restored, jwt dropped, route evicted, exactly one CP re-issue) and every negative: CP-also-errors-no-loop, ineligible path, eligible-subresource-wrong-suffix, transient 5xx, connection error, success, non-browser_gone404, other 4xx (403), POST, and non-routed requests. The positive path is proven by unit tests becausebrowser_goneis not yet deployed to prod.KERNEL_BROWSER_ROUTING_SUBRESOURCES=telemetry):proxy.<metro>.onkernel.com:8443, 200), notapi.onkernel.com.api.onkernel.comre-issues (the stream path is not in the eligible registry and 502 is notbrowser_gone).