Skip to content

feat(telemetry): instrument startup#943

Open
EhabY wants to merge 5 commits intomainfrom
feat/issue-904-startup-telemetry
Open

feat(telemetry): instrument startup#943
EhabY wants to merge 5 commits intomainfrom
feat/issue-904-startup-telemetry

Conversation

@EhabY
Copy link
Copy Markdown
Collaborator

@EhabY EhabY commented May 7, 2026

Summary

  • Trace activation startup with authState, duration, and result; activation no longer emits granular phase properties.
  • Trace fire-and-forget stored deployment initialization separately as activation.deployment_init.
  • Trace auth retrieval as a standalone remote.auth_retrieval event so every setup attempt records the metric (including attempts that exit before remote.setup runs).
  • Trace remote.setup with hierarchical child phases: workspace_lookup, workspace_ready, resolve_agent, and ssh_config_write.
  • Trace CLI binary download and signature verification (cli.download.verify, a child phase of cli.download), including download reason, downloadedBytes, verify outcome, and signature HTTP status when applicable.
  • Add span-owned property/measurement mutators so telemetry can record values discovered during traced work without mutating caller-owned objects. Post-emit mutations are dropped and logged.

Notes

  • Remote phases emit as hierarchical event names like remote.setup.workspace_lookup rather than remote.setup.phase with a phase property.
  • RemoteSetupTelemetry exposes a single typed phase(...) helper plus setOutcome(...) for non-throwing failure paths (workspace_not_found, incompatible_server).
  • cli.download.reason is one of missing, version_mismatch, or unreadable (binary present but cliVersion() threw).
  • cli.download.verify.outcome is one of verified, bypassed, or sig_not_found. When sig_not_found, sigStatus records the HTTP status of the signature download (e.g., 404 vs 500).
  • cli.download.measurements.downloadedBytes is emitted when bytes are written, including failed downloads that wrote a partial body; omitted when the server returns HTTP 304.
  • activation.authState is none / stored / valid_token / auth_failed / unknown. auth_failed covers any validation failure since setDeploymentIfValid returns a boolean (it does not distinguish 401 from network/DNS/cert/server errors).
  • Auth retrieval and the no-auth check run before any trace so a login retry does not produce nested remote.setup events with dialog wait time in durationMs.

Tests

  • pnpm lint
  • pnpm typecheck
  • pnpm test:extension ./test/unit/core/cliManager.test.ts
  • pnpm test:extension ./test/unit/core/cliManager.concurrent.test.ts
  • pnpm test:extension ./test/unit/telemetry/service.test.ts
  • pnpm test:extension ./test/unit/instrumentation/
  • pnpm format:check

Closes #904.

Implementation plan and decisions

Decisions

  1. Remote setup phases follow the current telemetry architecture: hierarchical phase event names such as remote.setup.workspace_lookup, not remote.setup.phase with a phase property.
  2. Activation telemetry preserves existing behavior. Stored deployment initialization remains fire-and-forget and is instrumented separately as activation.deployment_init (a sibling trace, since deployment init outlives the activation span).
  3. Activation phases were removed from the implementation; the activation event records bounded authState, durationMs, and result.
  4. Auth retrieval is a standalone remote.auth_retrieval top-level trace rather than a child phase, so every setup attempt records the metric and a login-required retry does not nest traces.
  5. CLI download byte telemetry records downloadedBytes once before the trace closes; partial failed downloads retain the last written byte count. HTTP 304 emits cli.download but omits the byte measurement.
  6. cli.download reasons use what the current code can determine: missing, version_mismatch, unreadable. There is no true forced download path today.
  7. cli.download.verify is a child phase of cli.download so it shares traceId and carries parentEventId. cli.download.durationMs includes verify time; subtract the child duration when computing throughput.
  8. ServiceContainer is still constructed before initVscodeProposed; construction does not call proposed APIs, and proposed API usage remains after initialization.

Implementation

  • Wire TelemetryService into CliManager through ServiceContainer.
  • Add Span.setProperty and Span.setMeasurement, with TelemetryService copying caller-owned objects at span start. Track a per-span completed flag and warn on post-emit mutations.
  • Split startup instrumentation helpers into activation and remote setup modules.
  • Instrument activation with TelemetryService.trace("activation", ...), using bounded authState values.
  • Preserve fire-and-forget deployment initialization and attach telemetry via the existing promise chain.
  • Instrument Remote.setup() with TelemetryService.trace("remote.setup", ...) and child phases. Hoist auth retrieval and the no-auth check out of the trace.
  • Keep phase wrapping at the orchestration level in remote.ts while calling helper methods without telemetry where possible.
  • Instrument actual CLI download attempts with cli.download and verify as cli.download.verify.
  • Add targeted unit tests around span mutation isolation, post-emit warning, CLI download measurements (missing / version_mismatch / unreadable), partial download error telemetry, verify outcomes (verified / bypassed / sig_not_found × HTTP status), and the activation/remote-setup orchestration layers.

Generated by Coder Agents.

@EhabY EhabY self-assigned this May 7, 2026
@EhabY EhabY force-pushed the feat/issue-904-startup-telemetry branch 2 times, most recently from f726148 to 2758a44 Compare May 7, 2026 12:07
@EhabY EhabY changed the title feat: instrument startup telemetry feat(telemetry): instrument startup telemetry May 7, 2026
@EhabY EhabY changed the title feat(telemetry): instrument startup telemetry feat(telemetry): instrument startup May 7, 2026
@EhabY EhabY force-pushed the feat/issue-904-startup-telemetry branch from 2758a44 to da3d8dc Compare May 7, 2026 12:53
@EhabY EhabY force-pushed the feat/issue-904-startup-telemetry branch 4 times, most recently from 620ce09 to 495e0c1 Compare May 7, 2026 14:28
EhabY and others added 2 commits May 7, 2026 18:07
- Bundle activation tracer methods (setAuthState, traceDeploymentInit) into a span-scoped ActivationTracer interface; drop instance-state from ActivationTelemetry
- Mirror the same shape for remote.setup: RemoteSetupTracer exposes phase() and is owned by RemoteSetupTelemetry; remote.ts no longer touches Span directly
- Build RemoteSetupContext after auth retrieval (no empty-default mutation); group setup args under args; rename baseUrlRaw -> baseUrl
- Reuse AuthorityParts from util instead of synthesizing a local alias
- Drop redundant post-download recordDownloadedBytes and the BinaryDownloadResult/DownloadResult types; per-chunk progress already records the final value
- Adopt createTestTelemetryService helper in commandManager.test.ts
@EhabY EhabY force-pushed the feat/issue-904-startup-telemetry branch from 495e0c1 to 374c156 Compare May 7, 2026 15:08
@EhabY EhabY requested a review from ethanndickson May 7, 2026 17:01
@EhabY
Copy link
Copy Markdown
Collaborator Author

EhabY commented May 8, 2026

/coder-agents-review

Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The telemetry infrastructure in this PR is well-designed. The Span.setProperty/setMeasurement API with shallow-copy isolation, the typed phase helpers, and the createTestTelemetryService consolidation are all clean work. The CLI download tests are thorough, covering success, cache hit, 304, partial failure, and verification. The remote.ts decomposition into focused private methods is a clear improvement.

The recurring theme across findings is that the instrumentation measures mechanism (throw/no-throw) rather than outcome. Several user-visible failure paths, including workspace not found, user cancels login, incompatible server, and signature bypass, all return normally and emit result: "success". The authState value on the activation span claims validation that only storage retrieval has performed. These issues converge: the telemetry exists to diagnose startup problems, but the most common failure modes are invisible to it.

Severity breakdown: 3 P2, 6 P3, 2 P4, 4 Nit.

"telemetry cannot distinguish verified binaries from user-bypassed verification. What percentage of your users are running unverified binaries? The telemetry says zero." (Hisoka)

🤖 This review was automatically generated with Coder Agents.

Comment thread src/remote/remote.ts
Comment thread src/instrumentation/activation.ts
Comment thread src/remote/remote.ts Outdated
Comment thread src/instrumentation/remoteSetup.ts
Comment thread src/core/cliManager.ts Outdated
Comment thread src/instrumentation/activation.ts
Comment thread src/remote/remote.ts Outdated
Comment thread src/remote/remote.ts Outdated
Comment thread src/remote/remote.ts
Comment thread src/instrumentation/activation.ts
Address PR review feedback on the startup telemetry instrumentation:

- Hoist auth retrieval and the no-auth check out of the `remote.setup`
  trace; auth retrieval is now its own `remote.auth_retrieval` top-level
  event so login retries no longer produce nested `remote.setup` events
  with dialog time in `durationMs`.
- Add `RemoteSetupTracer.setOutcome` and annotate the non-throwing
  failure paths (`workspace_not_found`, `incompatible_server`).
- Distinguish verified, bypassed, and sig-not-found verification
  outcomes; verify is now a `cli.download.verify` child phase.
- Add `unreadable` download reason for the case where a binary exists
  but `cliVersion()` throws.
- Use `stored` for an unvalidated stored token; reserve `valid_token`
  for post-validation. Default `unknown` on `activation.deployment_init`
  before `initFn` runs.
- Track and warn on post-emit span mutations; eliminate the only
  realistic case by accumulating download bytes locally and emitting
  the measurement once before `performBinaryDownload` returns.
- Rename `findReadyAgent` -> `resolveAgent`, phase `agent_ready` ->
  `resolve_agent`, parameter `setupContext` -> `context`.
- Add unit tests for `ActivationTelemetry` and `RemoteSetupTelemetry`.
@EhabY
Copy link
Copy Markdown
Collaborator Author

EhabY commented May 8, 2026

/coder-agents-review

Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All 18 R1 findings addressed or acknowledged. Fixes verified by Meruem (DEREM-7, -10, -11), Razor (DEREM-4, -5, -6, -7, -8, -11), Mafuuu (DEREM-6, -8). The span lifecycle, instrumentation wrappers, and remote setup refactor are structurally sound. New test coverage is thorough and proportional (activation: 7 tests, remote setup: 4 tests, CLI telemetry: 8 tests, span mutations: 3 tests).

Three new findings this round, all in the same class as R1's core theme (telemetry labels that assert more than the code measures). None are blocking.

Severity: 2 P3, 1 P4.

PS. The PR description still lists auth_retrieval as a hierarchical child phase of remote.setup, but the code now emits it as a standalone remote.auth_retrieval trace. Worth updating before merge so the description matches the shipped behavior.

"operator investigating token lifetimes when the cause is DNS outage" (Mafuuu)

🤖 This review was automatically generated with Coder Agents.

Comment thread test/unit/core/cliManager.test.ts
Comment thread src/extension.ts Outdated
Comment thread src/core/cliManager.ts Outdated
Address R2 review feedback:

- Rename `authState: "expired"` -> `"auth_failed"`. The label no longer
  asserts token expiration when `setDeploymentIfValid` returns false for
  any failure (401, network/DNS, cert, server errors).
- Record `sigStatus` (HTTP status) alongside `outcome: "sig_not_found"`
  on `cli.download.verify`, distinguishing 404 (no signatures published)
  from 5xx (signature download failure) without changing the outcome
  label.
- Cover the `unreadable` download reason in the parametric `cli.download`
  telemetry test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Telemetry: instrument startup path (activation, remote setup, CLI binary)

1 participant