Conversation
f726148 to
2758a44
Compare
2758a44 to
da3d8dc
Compare
620ce09 to
495e0c1
Compare
- Bundle activation tracer methods (setAuthState, traceDeploymentInit) into a span-scoped ActivationTracer interface; drop instance-state from ActivationTelemetry - Mirror the same shape for remote.setup: RemoteSetupTracer exposes phase() and is owned by RemoteSetupTelemetry; remote.ts no longer touches Span directly - Build RemoteSetupContext after auth retrieval (no empty-default mutation); group setup args under args; rename baseUrlRaw -> baseUrl - Reuse AuthorityParts from util instead of synthesizing a local alias - Drop redundant post-download recordDownloadedBytes and the BinaryDownloadResult/DownloadResult types; per-chunk progress already records the final value - Adopt createTestTelemetryService helper in commandManager.test.ts
495e0c1 to
374c156
Compare
|
/coder-agents-review |
There was a problem hiding this comment.
The telemetry infrastructure in this PR is well-designed. The Span.setProperty/setMeasurement API with shallow-copy isolation, the typed phase helpers, and the createTestTelemetryService consolidation are all clean work. The CLI download tests are thorough, covering success, cache hit, 304, partial failure, and verification. The remote.ts decomposition into focused private methods is a clear improvement.
The recurring theme across findings is that the instrumentation measures mechanism (throw/no-throw) rather than outcome. Several user-visible failure paths, including workspace not found, user cancels login, incompatible server, and signature bypass, all return normally and emit result: "success". The authState value on the activation span claims validation that only storage retrieval has performed. These issues converge: the telemetry exists to diagnose startup problems, but the most common failure modes are invisible to it.
Severity breakdown: 3 P2, 6 P3, 2 P4, 4 Nit.
"telemetry cannot distinguish verified binaries from user-bypassed verification. What percentage of your users are running unverified binaries? The telemetry says zero." (Hisoka)
🤖 This review was automatically generated with Coder Agents.
Address PR review feedback on the startup telemetry instrumentation: - Hoist auth retrieval and the no-auth check out of the `remote.setup` trace; auth retrieval is now its own `remote.auth_retrieval` top-level event so login retries no longer produce nested `remote.setup` events with dialog time in `durationMs`. - Add `RemoteSetupTracer.setOutcome` and annotate the non-throwing failure paths (`workspace_not_found`, `incompatible_server`). - Distinguish verified, bypassed, and sig-not-found verification outcomes; verify is now a `cli.download.verify` child phase. - Add `unreadable` download reason for the case where a binary exists but `cliVersion()` throws. - Use `stored` for an unvalidated stored token; reserve `valid_token` for post-validation. Default `unknown` on `activation.deployment_init` before `initFn` runs. - Track and warn on post-emit span mutations; eliminate the only realistic case by accumulating download bytes locally and emitting the measurement once before `performBinaryDownload` returns. - Rename `findReadyAgent` -> `resolveAgent`, phase `agent_ready` -> `resolve_agent`, parameter `setupContext` -> `context`. - Add unit tests for `ActivationTelemetry` and `RemoteSetupTelemetry`.
|
/coder-agents-review |
There was a problem hiding this comment.
All 18 R1 findings addressed or acknowledged. Fixes verified by Meruem (DEREM-7, -10, -11), Razor (DEREM-4, -5, -6, -7, -8, -11), Mafuuu (DEREM-6, -8). The span lifecycle, instrumentation wrappers, and remote setup refactor are structurally sound. New test coverage is thorough and proportional (activation: 7 tests, remote setup: 4 tests, CLI telemetry: 8 tests, span mutations: 3 tests).
Three new findings this round, all in the same class as R1's core theme (telemetry labels that assert more than the code measures). None are blocking.
Severity: 2 P3, 1 P4.
PS. The PR description still lists auth_retrieval as a hierarchical child phase of remote.setup, but the code now emits it as a standalone remote.auth_retrieval trace. Worth updating before merge so the description matches the shipped behavior.
"operator investigating token lifetimes when the cause is DNS outage" (Mafuuu)
🤖 This review was automatically generated with Coder Agents.
Address R2 review feedback: - Rename `authState: "expired"` -> `"auth_failed"`. The label no longer asserts token expiration when `setDeploymentIfValid` returns false for any failure (401, network/DNS, cert, server errors). - Record `sigStatus` (HTTP status) alongside `outcome: "sig_not_found"` on `cli.download.verify`, distinguishing 404 (no signatures published) from 5xx (signature download failure) without changing the outcome label. - Cover the `unreadable` download reason in the parametric `cli.download` telemetry test.
Summary
authState, duration, and result; activation no longer emits granular phase properties.activation.deployment_init.remote.auth_retrievalevent so every setup attempt records the metric (including attempts that exit beforeremote.setupruns).remote.setupwith hierarchical child phases:workspace_lookup,workspace_ready,resolve_agent, andssh_config_write.cli.download.verify, a child phase ofcli.download), including download reason,downloadedBytes, verify outcome, and signature HTTP status when applicable.Notes
remote.setup.workspace_lookuprather thanremote.setup.phasewith a phase property.RemoteSetupTelemetryexposes a single typedphase(...)helper plussetOutcome(...)for non-throwing failure paths (workspace_not_found,incompatible_server).cli.download.reasonis one ofmissing,version_mismatch, orunreadable(binary present butcliVersion()threw).cli.download.verify.outcomeis one ofverified,bypassed, orsig_not_found. Whensig_not_found,sigStatusrecords the HTTP status of the signature download (e.g.,404vs500).cli.download.measurements.downloadedBytesis emitted when bytes are written, including failed downloads that wrote a partial body; omitted when the server returns HTTP 304.activation.authStateisnone/stored/valid_token/auth_failed/unknown.auth_failedcovers any validation failure sincesetDeploymentIfValidreturns a boolean (it does not distinguish 401 from network/DNS/cert/server errors).remote.setupevents with dialog wait time indurationMs.Tests
pnpm lintpnpm typecheckpnpm test:extension ./test/unit/core/cliManager.test.tspnpm test:extension ./test/unit/core/cliManager.concurrent.test.tspnpm test:extension ./test/unit/telemetry/service.test.tspnpm test:extension ./test/unit/instrumentation/pnpm format:checkCloses #904.
Implementation plan and decisions
Decisions
remote.setup.workspace_lookup, notremote.setup.phasewith aphaseproperty.activation.deployment_init(a sibling trace, since deployment init outlives the activation span).authState,durationMs, andresult.remote.auth_retrievaltop-level trace rather than a child phase, so every setup attempt records the metric and a login-required retry does not nest traces.downloadedBytesonce before the trace closes; partial failed downloads retain the last written byte count. HTTP 304 emitscli.downloadbut omits the byte measurement.cli.downloadreasons use what the current code can determine:missing,version_mismatch,unreadable. There is no trueforceddownload path today.cli.download.verifyis a child phase ofcli.downloadso it sharestraceIdand carriesparentEventId.cli.download.durationMsincludes verify time; subtract the child duration when computing throughput.ServiceContaineris still constructed beforeinitVscodeProposed; construction does not call proposed APIs, and proposed API usage remains after initialization.Implementation
TelemetryServiceintoCliManagerthroughServiceContainer.Span.setPropertyandSpan.setMeasurement, withTelemetryServicecopying caller-owned objects at span start. Track a per-spancompletedflag and warn on post-emit mutations.TelemetryService.trace("activation", ...), using boundedauthStatevalues.Remote.setup()withTelemetryService.trace("remote.setup", ...)and child phases. Hoist auth retrieval and the no-auth check out of the trace.remote.tswhile calling helper methods without telemetry where possible.cli.downloadand verify ascli.download.verify.missing/version_mismatch/unreadable), partial download error telemetry, verify outcomes (verified/bypassed/sig_not_found× HTTP status), and the activation/remote-setup orchestration layers.Generated by Coder Agents.