Rework telemetry event categories #268

Merged
Sayan- merged 4 commits into main from telemetry-category-cleanup
Jun 3, 2026

Sayan- commented Jun 3, 2026

Summary

Cleans up the browser telemetry category taxonomy before it's really in customers' hands. The driving problems: system had become a junk drawer (VM crashes, CDP-connection lifecycle, screenshots, and collector health all in one always-on bucket), a few events had no category and silently rode system, and system could not be turned off.

One consistent, intent-based category set

Events are categorized by what happened, and the set a caller configures equals the set they see:
console, network, page, interaction, control (was api), connection (CDP + live-view attach/detach), system (VM health only), screenshot, captcha, plus the auto-managed monitor (CDP-collector health).

Category is server-authoritative

kernel-images-api owns category assignment. A known event type is stamped with its canonical category from a generated CategoryForType lookup and any caller-supplied value is ignored; unknown custom types must carry an explicit category (400 otherwise). The lookup is generated from openapi.yaml via go:generate (wired into make oapi-generate), so the spec stays the single source of truth and CI catches drift on a dirty diff.
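
A minimal sketch of the publish-path rule described above, assuming a map-backed lookup. The event type names, the `validCategories` set, `ErrCategoryRequired`, and `ResolveCategory` are hypothetical stand-ins for illustration; the real table is generated from openapi.yaml and may be shaped differently.

```go
package main

import (
	"errors"
	"strings"
)

// categoryForType stands in for the generated CategoryForType lookup.
// The event type names below are illustrative, not taken from openapi.yaml.
var categoryForType = map[string]string{
	"console_log":     "console",
	"network_request": "network",
	"api_call":        "control",
	"service_crashed": "system",
}

// validCategories is the set a custom event type may claim. Assumed here to
// be the nine user-facing categories listed in the summary.
var validCategories = map[string]bool{
	"console": true, "network": true, "page": true, "interaction": true,
	"control": true, "connection": true, "system": true,
	"screenshot": true, "captcha": true,
}

// ErrCategoryRequired is a hypothetical error a handler would map to HTTP 400.
var ErrCategoryRequired = errors.New("unknown event type requires a valid category")

// ResolveCategory returns the category to stamp on an event.
// Known types always get their canonical category and the caller's value is
// ignored; unknown types must supply a valid category themselves.
func ResolveCategory(eventType, callerCategory string) (string, error) {
	if canonical, ok := categoryForType[eventType]; ok {
		return canonical, nil
	}
	c := strings.TrimSpace(callerCategory)
	if !validCategories[c] {
		return "", ErrCategoryRequired
	}
	return c, nil
}
```

With this shape, a caller that publishes a known type with a stale or wrong category still gets the canonical one, and only custom types can be rejected.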

Behavior

  • Nothing is force-always-on. The filter is simply "you get the categories you enabled," with monitor riding along automatically whenever a CDP category is captured.
  • Default (empty config / enabled: true) = every category except screenshot (heavy base64, opt-in).
  • The CDP collector starts only when a CDP category is enabled, so a system-only subscriber (e.g. "tell me about OOM/crashes, not page activity") pays no collector cost.
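
The three behaviors above can be sketched as a small resolution step. This is an illustration under stated assumptions, not the PR's code: `cdpCategories` is a guess (the PR says "a CDP category" without listing which ones the collector produces), and `DefaultCategories`, `NeedsCollector`, and `Effective` are hypothetical helper names.

```go
package main

// userCategories are the nine configurable categories named in the summary.
var userCategories = []string{
	"console", "network", "page", "interaction", "control",
	"connection", "system", "screenshot", "captcha",
}

// cdpCategories is an ASSUMPTION about which categories the CDP collector
// produces; the PR description does not enumerate them.
var cdpCategories = map[string]bool{
	"console": true, "network": true, "page": true, "interaction": true,
}

// DefaultCategories returns every user category except screenshot.
func DefaultCategories() map[string]bool {
	out := make(map[string]bool)
	for _, c := range userCategories {
		if c != "screenshot" {
			out[c] = true
		}
	}
	return out
}

// NeedsCollector reports whether any enabled category comes from the
// CDP collector, so a system-only config never starts it.
func NeedsCollector(enabled map[string]bool) bool {
	for c, on := range enabled {
		if on && cdpCategories[c] {
			return true
		}
	}
	return false
}

// Effective resolves an empty config to the defaults and adds the
// auto-managed monitor category whenever the collector is needed.
func Effective(configured map[string]bool) map[string]bool {
	if len(configured) == 0 {
		configured = DefaultCategories()
	}
	out := make(map[string]bool)
	for c, on := range configured {
		if on {
			out[c] = true
		}
	}
	if NeedsCollector(out) {
		out["monitor"] = true
	}
	return out
}
```

Under these assumptions, `Effective(nil)` yields the eight default categories plus `monitor`, while a config of only `system` yields just `system` and leaves the collector off.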

Test plan

  • go build ./..., go vet ./..., gofmt clean
  • Unit tests pass for telemetry, events, cdpmonitor, devtoolsproxy, sysmon, supervisord-shim, and the API handlers (added coverage for server-authoritative category, unknown-type 400, screenshot opt-in, and monitor ride-along)
  • make oapi-generate regenerates oapi.go + category_gen.go cleanly
  • Pre-existing lib/recorder ffmpeg/pulse-audio test failures are unrelated (fail on main too)

Made with Cursor


Note

Medium Risk
Changes telemetry filtering, publish validation, and PUT/PATCH apply semantics across API and collectors; misconfiguration or rollback bugs could drop events or leave partial state until fixed.

Overview
This PR reworks browser telemetry categories so they match intent, are server-authoritative, and align with configurable capture (nothing is force-always-on).

Taxonomy: Replaces the old five-knob model (api becomes control; lifecycle is split out of system) with connection (CDP/live view), screenshot (opt-in), captcha, and auto monitor (CDP collector health, not a user toggle). OpenAPI, generated oapi, and producers (cdpmonitor, devtoolsproxy, middleware) emit the new categories.

Publish path: CategoryForType (generated from openapi.yaml via categorygen, wired into make oapi-generate) stamps known event types; custom types must send a valid category or get 400. Client category on known types is ignored.

Config & runtime: Empty PUT/PATCH resolves to DefaultCategories (all on except screenshot). Clearing requires every user category off. TelemetrySession filters only enabled categories and adds monitor when any CDP category is on. PUT/PATCH commit the config and then run reconcileTelemetryState (CDP monitor + api_call middleware); a failed collector start rolls back to the prior config and can return 500 on PATCH too. Screenshots skip ffmpeg when the screenshot category is disabled.
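
A rough sketch of the commit-then-reconcile ordering with rollback, assuming the config swap is infallible and the collector start is the only step that can fail. `Session`, `Config`, `Apply`, and `ErrCollectorStart` are simplified, hypothetical names; the actual reconcileTelemetryState also handles the api_call middleware, which is omitted here.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Config is a simplified stand-in for the session's telemetry config.
type Config struct {
	Categories map[string]bool
}

// ErrCollectorStart is a hypothetical sentinel for a failed collector start.
var ErrCollectorStart = errors.New("collector failed to start")

// Session holds the live config that the event filter reads.
type Session struct {
	applyMu sync.Mutex   // serializes Apply calls
	mu      sync.RWMutex // guards cfg
	cfg     Config
}

// CategoryEnabled is what the publish filter would consult per event.
func (s *Session) CategoryEnabled(c string) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.cfg.Categories[c]
}

// swap installs next and returns the config it replaced. It cannot fail.
func (s *Session) swap(next Config) Config {
	s.mu.Lock()
	defer s.mu.Unlock()
	prev := s.cfg
	s.cfg = next
	return prev
}

// Apply commits the config first so the filter is live before the collector
// emits anything, then runs the fallible reconcile step. On failure it
// restores the prior config, which itself needs no fallible step.
func (s *Session) Apply(next Config, reconcile func(Config) error) error {
	s.applyMu.Lock()
	defer s.applyMu.Unlock()

	prev := s.swap(next)
	if err := reconcile(next); err != nil {
		s.swap(prev)
		return fmt.Errorf("apply telemetry config: %w", err)
	}
	return nil
}
```

The config lock is released before `reconcile` runs, so events the collector emits during startup can already pass through `CategoryEnabled` without deadlocking.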

Risk: Medium — broad telemetry/API behavior change and new failure modes on config apply; instance-side only per PR notes (public SDK/docs follow separately).

Reviewed by Cursor Bugbot for commit b1b4bd4. Bugbot is set up for automated code reviews on this repo.

Split the overloaded `system` category into intent-based categories so the
set a caller configures matches the set they see on events. `system` now means
VM health only (oom kills, service crashes); client attach/detach lifecycle
moves to `connection`, periodic screenshots to `screenshot`, captcha outcomes
to `captcha`, CDP-collector health to the auto-managed `monitor`, and `api`
is renamed `control`.

kernel-images-api is now authoritative on category: a known event type is
assigned its canonical category from a generated `CategoryForType` lookup
(derived from openapi.yaml via `go:generate`), and any caller-supplied value
is ignored. Unknown custom types must carry an explicit category.

Behavior changes:
- Nothing is force-always-on. The publish filter is "you get the categories
  you enabled", with `monitor` riding along automatically whenever a CDP
  category is captured.
- Default (empty config / enabled:true) captures every category except
  `screenshot`, which is heavy base64 data and opt-in.
- The CDP collector starts only when a CDP category is enabled, so a
  system-only subscriber pays no collector cost.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread server/cmd/api/api/telemetry.go
@firetiger-agent

Created a monitoring plan for this PR.

What this PR does: Expands the browser telemetry event taxonomy from 5 to 9 categories and renames the api category to control. Callers that previously configured api: enabled: true/false must switch to control. A new opt-in screenshot category is added (default-off due to high volume). The default category set grows from 5 to 8 (adding connection, system, captcha), so sessions started without explicit config will now receive more event types.

Intended effect:

  • StreamBrowserTelemetry error rate: baseline 0% 400s and 0% 500s over Jun 1–2; confirmed if it remains at 0% post-deploy (no callers rejected for sending the old api category)
  • Telemetry config errors: baseline 0 "failed to apply telemetry" ERROR logs on Jun 2; confirmed if 0 new instances appear after deploy

Risks:

  • Callers sending old api category: PublishTelemetryEvent 400 rate, alert if > 1% sustained (baseline 0%)
  • CDP monitor double-start via PATCH — any monitor_init_failed event type in telemetry stream or any 500 from PATCH /telemetry (new code path, previously 500 was not possible on PATCH)
  • screenshot accidentally enabled: StreamBrowserTelemetry request volume spike > 2× active-hour baseline (~2,622/hr), would indicate base64 image flood
  • Session connection/captcha/system event surprise — callers filtering on category who relied on the 5-category default will receive 3 new event types; alert if new "unexpected event type" errors appear in customer-facing logs

Status updates will be posted automatically on this PR as monitoring progresses.

Reconcile the CDP collector before committing the new config instead of
after. The collector start is the only fallible step, so doing it first means
a failure returns 500 without mutating the session config or middleware, for
both fresh and already-active PUT/PATCH. Resolves the default category set in
telemetryConfigFromOAPI so the effective categories are known at reconcile
time. Addresses Bugbot review on #268.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread server/cmd/api/api/telemetry.go Outdated
Sayan- and others added 2 commits June 2, 2026 21:44
The prior atomicity fix reconciled the collector before committing the
config, which dropped any events the collector emitted in the window before
the session filter went live. Commit the config first (infallible) so the
filter is active before the collector starts, then reconcile, and roll back
fully to the prior config on a collector start failure. Reverting never needs
a fallible collector start, so rollback cannot fail. Addresses the second
Bugbot review on #268.

Co-authored-by: Cursor <cursoragent@cursor.com>
The CDP collector captures a screenshot via ffmpeg on page-load and
uncaught-exception events (throttled to once per 2s). With the screenshot
category off by default, those captures ran and were then dropped at the
telemetry filter, spending ffmpeg on output nobody receives. Pass a
screenshotEnabled predicate (wired to TelemetrySession.CategoryEnabled) into
the monitor and skip the capture entirely when the category is disabled.
A nil predicate always captures, preserving existing test behavior.

Co-authored-by: Cursor <cursoragent@cursor.com>
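
A small sketch of the predicate gate described in the commit message above, assuming the check happens before the throttle and the capture. `Monitor`, `NewMonitor`, and `MaybeCapture` are hypothetical names, and the `capture` callback stands in for the ffmpeg call; only the nil-predicate rule and the 2s throttle come from the commit text.

```go
package main

import "time"

// Monitor is a simplified stand-in for the CDP monitor's screenshot path.
// It is not safe for concurrent use.
type Monitor struct {
	screenshotEnabled func() bool // nil means "always capture"
	capture           func()      // stands in for the ffmpeg capture
	minInterval       time.Duration
	last              time.Time
}

// NewMonitor wires the predicate (for example a closure over the session's
// category check) and the capture callback, with a 2s throttle.
func NewMonitor(screenshotEnabled func() bool, capture func()) *Monitor {
	return &Monitor{
		screenshotEnabled: screenshotEnabled,
		capture:           capture,
		minInterval:       2 * time.Second,
	}
}

// MaybeCapture runs the capture unless the category is disabled or the
// previous capture was too recent. It returns true when a capture ran.
func (m *Monitor) MaybeCapture() bool {
	// Skip before doing any work when the category is off.
	if m.screenshotEnabled != nil && !m.screenshotEnabled() {
		return false
	}
	now := time.Now()
	if !m.last.IsZero() && now.Sub(m.last) < m.minInterval {
		return false
	}
	m.last = now
	m.capture()
	return true
}
```

Checking the predicate first means a disabled category neither spends the capture nor consumes the throttle window.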

Sayan- commented Jun 3, 2026

bugbot run

cursor bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit b1b4bd4.

Sayan- changed the title from "Rework telemetry event categories (server-authoritative)" to "Rework telemetry event categories" on Jun 3, 2026

rgarcia left a comment

lgtm, just one thought about control naming and whether it applies to all API calls in a way that makes sense

also remember to update openapi.yaml in control plane api once this is live

Sayan- merged commit d9d2147 into main Jun 3, 2026
10 checks passed
Sayan- deleted the telemetry-category-cleanup branch June 3, 2026 20:23