feat: fall back to the control plane when direct-to-VM routing hits a dead browser #116
Draft
rgarcia wants to merge 5 commits into
Add "telemetry" to the default KERNEL_BROWSER_ROUTING_SUBRESOURCES list so
telemetry SSE streams are routed straight to the browser VM, and change the
telemetry stream method path from /browsers/{id}/telemetry to
/browsers/{id}/telemetry/stream so the direct-routing rewrite yields
{base_url}/telemetry/stream on the VM (the VM's /telemetry is a different,
non-streaming endpoint).
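The rewrite described above can be sketched roughly as follows. This is an illustration only, not the SDK's code: `rewriteToVm`, the `CachedRoute` shape, the `/browsers/{id}/...` path regex, and the example hostnames are all assumptions made for the sketch.

```typescript
// Hypothetical shape of a cached direct route; the real cache entry may differ.
interface CachedRoute {
  baseUrl: string; // VM proxy base URL for this browser
  jwt: string; // token sent as the `jwt` query parameter instead of Authorization
}

// Returns the rewritten VM URL, or null when the request should stay on the
// control plane (subresource not allowlisted, or no cached route).
function rewriteToVm(
  controlPlaneUrl: string,
  allowlist: ReadonlySet<string>,
  routes: ReadonlyMap<string, CachedRoute>,
): string | null {
  const url = new URL(controlPlaneUrl);
  // Assumed path layout: /browsers/{id}/{subresource}{suffix}
  const match = /^\/browsers\/([^/]+)\/([^/]+)(\/.*)?$/.exec(url.pathname);
  if (match === null) return null;
  const [, browserId, subresource, suffix = ""] = match;

  const route = routes.get(browserId);
  if (!allowlist.has(subresource) || route === undefined) return null;

  // /browsers/{id}/telemetry/stream -> {base_url}/telemetry/stream
  const target = new URL(route.baseUrl.replace(/\/$/, "") + "/" + subresource + suffix);
  url.searchParams.forEach((value, key) => target.searchParams.set(key, value));
  target.searchParams.set("jwt", route.jwt);
  return target.toString();
}
```

With "telemetry" in the allowlist, the suffix `/stream` is carried over unchanged, which is why the public path has to end in `/telemetry/stream` for the request to reach the VM's streaming endpoint rather than its non-streaming `/telemetry`.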
DEPENDS ON the control-plane PR renaming the public endpoint
/browsers/{id}/telemetry -> /browsers/{id}/telemetry/stream. Until that
deploys, telemetry.stream() only works via direct routing.
Verified with a live smoke test against prod: the telemetry stream request
is rewritten to the VM proxy host (.../telemetry/stream?jwt=...), the
Authorization header is stripped, and an api_call telemetry event arrives
within ~1s of generating activity.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
feat: route browser telemetry directly to the VM by default
Rework the routing-layer control-plane fallback to a tight, opt-in design
that pairs with metro-api kernel#2317 (routed requests to a deleted/gone
browser return HTTP 404 with body {"code":"browser_gone"}).
Replaces the previous broad trigger (fall back on any 5xx for any
idempotent GET), which retried the dead VM then fell back, adding latency
on transient errors.
New semantics in routeRequest, applied only when the request was actually
routed to the VM (allowlisted subresource + cached route): fall back IFF
method is GET, the routed (subresource + suffix) path is in the
fallback-eligible registry, and the VM returns HTTP 404 whose JSON body has
code == "browser_gone". On fallback, evict the cached route and re-issue the
ORIGINAL request to the control plane exactly once (original URL, restore
Authorization, drop the jwt query param); return that response, never loop.
Success, transient 5xx, network errors, other 4xx, and non-browser_gone
404s propagate unchanged. The 404 body is read via response.clone() so a
non-fallback response is returned intact.
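A minimal sketch of these semantics, assuming the caller keeps both the original control-plane request and the rewritten VM request around. `RoutedAttempt`, `FetchLike`, `hasBrowserGoneCode`, and `fetchRoutedWithFallback` are hypothetical names for illustration; the actual logic lives in routeRequest and may be structured differently.

```typescript
type FetchLike = (request: Request) => Promise<Response>;

// Hypothetical bundle of what the routing layer knows about a routed request.
interface RoutedAttempt {
  original: Request; // control-plane request as built by the caller (with Authorization)
  routed: Request; // rewritten VM request (jwt query param, no Authorization)
  eligible: boolean; // result of the fallback-eligible registry lookup
  evictRoute: () => void; // drops the cached route for this browser
}

// True only when the JSON body carries code == "browser_gone".
// Reads a clone so the original response body stays unconsumed.
async function hasBrowserGoneCode(response: Response): Promise<boolean> {
  try {
    const body: unknown = await response.clone().json();
    return (
      typeof body === "object" &&
      body !== null &&
      (body as { code?: unknown }).code === "browser_gone"
    );
  } catch {
    return false; // non-JSON body: not an authoritative "gone" signal
  }
}

async function fetchRoutedWithFallback(
  attempt: RoutedAttempt,
  fetchImpl: FetchLike,
): Promise<Response> {
  // Network errors reject here and propagate; there is no retry.
  const vmResponse = await fetchImpl(attempt.routed);

  const candidate =
    attempt.original.method === "GET" && attempt.eligible && vmResponse.status === 404;
  if (!candidate || !(await hasBrowserGoneCode(vmResponse))) {
    return vmResponse; // success, 5xx, other 4xx, plain 404: returned intact
  }

  attempt.evictRoute();
  // Exactly one re-issue; whatever the control plane answers is returned as-is.
  return fetchImpl(attempt.original);
}
```

Re-sending the untouched original request is one way to get "original URL, restore Authorization, drop the jwt query param" without undoing the rewrite field by field.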
Per kernel#2317 there is no special response header, so the gone check keys
off the body code only (no content-type gate), matching the Python SDK.
Adds an isFallbackEligible(subresource, suffix) predicate backed by a
registry that is default-OFF. Pre-registers only the prospective pull
endpoint GET /browsers/{id}/telemetry/events; adding future eligible
endpoints is a one-line registry edit.
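One plausible shape for such a registry is sketched below. The `"<subresource>/<suffix>"` key format and the slash normalization are assumptions for illustration; only the registry's default-OFF behavior and the `telemetry/events` entry come from the description above.

```typescript
// Default-OFF: a routed path is eligible only if it is listed here.
// Keys are paths relative to /browsers/{id} (assumed format).
const FALLBACK_ELIGIBLE_ROUTES: ReadonlySet<string> = new Set([
  "telemetry/events", // prospective pull endpoint GET /browsers/{id}/telemetry/events
]);

function isFallbackEligible(subresource: string, suffix: string): boolean {
  // Strip leading/trailing slashes so "/events" and "events/" hit the same key.
  const normalized = suffix.replace(/^\/+|\/+$/g, "");
  const key = normalized === "" ? subresource : `${subresource}/${normalized}`;
  return FALLBACK_ELIGIBLE_ROUTES.has(key);
}
```

Under this layout, opting in another endpoint means adding one string to the set; everything else, including `telemetry/stream`, stays ineligible.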
Scoped to the fallback only: does not modify the default routing
subresource list (owned by the telemetry-default-routing PR).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
acf0e0e to 94b7121
What
Reworks the direct-to-VM routing layer to fall back to the control plane only when a routed request hits an authoritatively gone browser. This replaces the previous broad-trigger draft, which fell back on any 5xx for any idempotent GET.
Pairs with metro-api kernel#2317 (assumed to deploy): a routed request to a deleted/gone browser returns HTTP 404 with JSON body `{"code":"browser_gone","message":"browser not found"}`. There is no special response header, so we key off the body code only. A transient or real upstream failure still returns 5xx; a live VM's own 404 has no `browser_gone` code.
New fallback semantics
Fallback fires iff all of these hold:
- the request was actually routed to the VM (allowlisted subresource + cached route),
- the method is GET,
- the routed (subresource + suffix) path is in the per-endpoint fallback-eligible registry,
- the VM returns HTTP 404,
- the JSON body has `code == "browser_gone"`.
On fallback: evict the now-stale cached route, then re-issue the original request to the control plane exactly once (original CP URL, restore Authorization, drop the `jwt` query param). Return that response. Never loops.
Does not fall back on: success, transient 5xx (502/503/504), connection/network errors, other 4xx, or a 404 whose body code is not `browser_gone`. These propagate unchanged, which fixes the old "retry the dead VM then fall back" latency problem (a transient 502 is just returned).
Body handling: the 404 body is inspected via `response.clone()`, so when we do not fall back the original response body is returned intact to the caller.
Per-endpoint opt-in registry
`FALLBACK_ELIGIBLE_ROUTES` is default-OFF for everything. It pre-registers only the prospective pull endpoint `GET /browsers/{id}/telemetry/events` (that SDK method does not exist yet; this wires the opt-in so fallback works the moment it ships). Adding a future eligible endpoint is a one-line registry edit.
Scope
`git diff origin/next` touches only `src/lib/browser-routing.ts` and its test.
Tests
`tests/lib/browser-routing.test.ts` covers:
- eligible GET + 404 `browser_gone` -> CP fallback (Authorization restored, jwt dropped, route evicted, exactly one CP re-issue);
- CP-also-errors -> returned as-is, no loop;
- not-eligible path + 404 `browser_gone` -> no fallback;
- eligible + 502 / connection error -> no fallback (propagated);
- eligible + 200 -> no fallback;
- eligible but POST -> no fallback;
- non-`browser_gone` 404 -> no fallback;
- body-code-only (no content-type gate) positive + non-JSON-body negative;
- non-routed cache-miss untouched.
Live QA (prod)
`browser_gone` and the `telemetry/events` endpoint are not deployed yet, so the positive fallback path is proven by unit tests (mocked 404 `browser_gone`). Live QA proves the trigger is correctly scoped: … `browser_gone`.
🤖 Generated with Claude Code