Skip to content

feat(telemetry): roll up http requests#946

Open
EhabY wants to merge 4 commits intomainfrom
feat/issue-907-http-requests-rollup
Open

feat(telemetry): roll up http requests#946
EhabY wants to merge 4 commits intomainfrom
feat/issue-907-http-requests-rollup

Conversation

@EhabY
Copy link
Copy Markdown
Collaborator

@EhabY EhabY commented May 8, 2026

Summary

  • Add http.requests telemetry rollups for active (method, route) buckets from the existing axios logging interceptor path.
  • Normalize routes to strip queries and replace IDs/user/workspace names before telemetry emission.
  • Add coder.telemetry.local.httpRequests.windowSeconds as an advanced JSON setting with a 60s default.
  • Wire telemetry into the primary and remote-scoped CoderApi clients while keeping throwaway clients opt-in.

Closes #907.

Validation

  • pnpm test:extension ./test/unit/settings/telemetry.test.ts ./test/unit/logging/httpRequestsTelemetry.test.ts ./test/unit/api/coderApi.test.ts
  • pnpm test:extension
  • pnpm format:check
  • pnpm lint
  • pnpm typecheck
Implementation plan and decisions

Decisions

  1. coder.telemetry.local is treated as advanced local telemetry runtime configuration, not just local sink configuration.
  2. The HTTP rollup setting lives at coder.telemetry.local.httpRequests.windowSeconds with a default of 60.
  3. Throwaway CoderApi instances remain telemetry-free by default; long-lived primary and remote-scoped API clients receive telemetry.
  4. HTTP aggregation hooks into the existing axios logging interceptor path rather than adding a parallel interceptor.
  5. Route normalization rules are kept alongside the aggregator for easy route maintenance.

Implementation

  • Added HttpRequestsTelemetry for tumbling-window aggregation and http.requests emission.
  • Added per-route normalization for IDs, numeric segments, user names, and workspace names.
  • Added a small TelemetryReporter interface/no-op reporter to decouple CoderApi from concrete telemetry service construction.
  • Extended coder.telemetry.local schema and runtime readers with httpRequests.windowSeconds.
  • Added tests for same-bucket rollup, status buckets, empty windows, p95, route normalization, settings parsing, and axios interceptor wiring.

Generated by Coder Agents.

@EhabY EhabY self-assigned this May 8, 2026
@EhabY
Copy link
Copy Markdown
Collaborator Author

EhabY commented May 11, 2026

/coder-agents-review

Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good addition. The tumbling-window aggregation design is proportional to the problem, the TelemetryReporter interface extraction keeps coupling narrow, and the NOOP sentinel pattern for throwaway clients is clean. The test harness is well-structured with fake timers, meaningful assertions, and non-vacuous coverage.

Severity summary: 1 P2, 13 P3, 1 P4, 5 Nit, 2 Note.

The P2 is the UUID regex version nibble constraint ([1-5]), which rejects UUIDv6/v7/v8 in the fallback normalizer. Seven reviewers flagged it independently. The fix is one character.

Of the P3s, the most impactful cluster: dispose() drops accumulated data without flushing, the timer callback has no try/catch (a throw permanently kills the chain), and average()/percentile95() produce NaN/undefined on empty input. All are straightforward defensive fixes.

On the design side: the route normalization fallback passes name-bearing tail segments (like usernames after /members/) through unchanged, which partially contradicts the stated goal of stripping user/workspace names. And the interceptor registration order means telemetry records phantom errors for requests that are transparently retried by the cert-refresh or auth interceptors. Neither is urgent, but both are worth acknowledging.

One reviewer questioned whether the windowSeconds config knob earns its ~115 lines of schema, reader, watcher, update, and tests given nobody will change the default. Worth a human decision.

"A public surface with zero in-tree callers is a foot-gun, not dead code." (Kite, on the UUID fallback path)

🤖 This review was automatically generated with Coder Agents.

Comment thread src/logging/httpRequestsTelemetry.ts Outdated
Comment thread src/logging/httpRequestsTelemetry.ts
Comment thread src/logging/httpRequestsTelemetry.ts
Comment thread src/logging/httpRequestsTelemetry.ts
Comment thread src/logging/httpRequestsTelemetry.ts Outdated
Comment thread src/api/coderApi.ts
Comment thread src/api/coderApi.ts Outdated
Comment thread src/logging/httpRequestsTelemetry.ts
Comment thread CHANGELOG.md Outdated
Comment thread test/unit/logging/httpRequestsTelemetry.test.ts
EhabY and others added 3 commits May 11, 2026 15:53
Drop the recorder interface, noop sentinels, and creation helper —
construct HttpRequestsTelemetry unconditionally and let the noop
reporter swallow .log() calls. Flatten route normalization rules,
switch buckets to a nested map (no string key encoding), and stop
pushing zero-duration samples when metadata is absent so p95 isn't
skewed. Drop the LOCAL_SINK_SETTING backwards-compat alias.
@EhabY EhabY force-pushed the feat/issue-907-http-requests-rollup branch from cd6bd55 to 64f09d1 Compare May 11, 2026 12:53
- UUID regex matches any version (v6/v7/v8 now normalize). Then dropped
  UUID/numeric heuristics entirely; route normalization uses explicit
  rules only. Added api/v2/workspaces/{id}/builds/{id} for the main
  composite case. Trailing literal segments still pass through.
- Flush the pending bucket on dispose so the last partial window is
  not lost.
- Drop the coder.telemetry.local.httpRequests.windowSeconds knob and
  hardcode 60s, removing the schema, reader, watcher, update method,
  and related tests.
- Rename addLoggingInterceptors -> addRequestInterceptors. Document
  that cert-refresh retries re-enter the chain so each attempt is
  recorded in telemetry.
- Add explicit rule for /api/v2/organizations/{id}/members/{name}.
- Use toSorted, drop the unused export, comment normalizeByRule.
- Drop "aggregated" from the CHANGELOG entry.
- Tests: dispose-flush, no-rule fallback, undefined / "" inputs.
@EhabY
Copy link
Copy Markdown
Collaborator Author

EhabY commented May 11, 2026

/coder-agents-review

Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strong response to R1. All 19 addressed findings verified: UUID regex eliminated entirely (class-of-bug fix), config surface removed, dispose flushes, try/finally protects the timer chain, empty-array guards return 0, naming and docs cleaned up. The normalization redesign (position-based rules, no UUID regex) is cleaner than what it replaced.

DEREM-20 (contested, interceptor ordering): panel closed 3/3 accept. The author's defense holds: each interceptor-level attempt is a real network event, cert-refresh frequency is bounded, and the added comment at coderApi.ts:520-522 documents the counting semantics.

Severity summary for new findings: 1 P2, 4 P3.

The P2 is a regression from the R1 refactor: removing the NOOP sentinel check means every CoderApi.create() now constructs a real HttpRequestsTelemetry with a running 60-second timer, including 5-6 throwaway call sites that never dispose. The fix is straightforward: lazy-start the timer on first record() instead of in the constructor.

"What happens when the extension runs for a week in a remote window?" (Hisoka, on the timer leak)

🤖 This review was automatically generated with Coder Agents.


public constructor(telemetry: TelemetryReporter) {
this.#telemetry = telemetry;
this.#scheduleNextWindow();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 [DEREM-28] Constructor eagerly starts a 60-second repeating timer via this.#scheduleNextWindow(). Every CoderApi.create() call instantiates a real HttpRequestsTelemetry, including 5-6 throwaway call sites that never call dispose() (loginCoordinator.ts:194, loginCoordinator.ts:225, sessionManager.ts:291, authorizer.ts:67, container.ts:87, deploymentManager.ts:98). Each leaks a setTimeout chain that fires every 60s, iterates an empty bucket map, calls a NOOP log(), and reschedules forever. The timer closure prevents GC.

This is a regression from R1, where createHttpRequestsTelemetry checked telemetry === NOOP_TELEMETRY_REPORTER and returned a NOOP object with no timer. That sentinel was removed in the refactor.

prepareOAuthOperation runs on every OAuth token refresh (typically hourly). A day-long session accumulates ~24 leaked timers.

Fix (cheapest): defer #scheduleNextWindow() until the first #record() call. Clients that never record a request never start a timer:

#record(...): void {
  if (this.#disposed) { return; }
  if (!this.#timer) { this.#scheduleNextWindow(); }
  // ... rest of record
}

Remove the constructor call. (Hisoka P3, Kite P3)

🤖

);
}

public dispose(): void {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 [DEREM-29] dispose() is not exception-safe. If #flush() throws at line 85:

  1. #disposed stays false (line 86 not reached)
  2. Timer is never cleared (lines 87-89 not reached)
  3. The throw propagates into CoderApi.dispose(), skipping WebSocket and socket-set cleanup

The timer callback already has try/finally (DEREM-19 fix). Apply the same pattern here:

public dispose(): void {
  if (this.#disposed) { return; }
  try { this.#flush(); } finally {
    this.#disposed = true;
    if (this.#timer) {
      clearTimeout(this.#timer);
      this.#timer = null;
    }
  }
}

Current TelemetryService.log() uses #safeEmit (catch-all), so this requires a future reporter implementation that doesn't swallow. Defensive fix, same justification as DEREM-19. (Mafuuu P3, Meruem P3, Razor P3)

🤖

EVENT_NAME,
{ method, route },
{
window_seconds: WINDOW_SECONDS,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 [DEREM-30] Dispose-time flush emits window_seconds: 60 regardless of actual elapsed time since the last flush. When the extension deactivates mid-window, a consumer computing request rates from count / window_seconds will understate the rate for the final event.

This is the residual of DEREM-9 (which was closed because updateConfig was removed). The disposal case fires once at shutdown, so the practical impact is minor. Noting it because four reviewers flagged it independently and the fix (track window-start time, emit actual elapsed) would also benefit any future partial-window flush. (Chopper P3, Mafuuu Note, Kite Note, Meruem Note)

🤖

Comment thread src/telemetry/reporter.ts

import type { CallerMeasurements, CallerProperties } from "./event";

export interface TelemetryReporter {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 [DEREM-31] TelemetryReporter interface has no doc comment explaining why it exists. This is the new abstraction boundary between CoderApi and TelemetryService. A developer opening this file sees three method signatures and a NOOP constant with no explanation of the decoupling contract.

Similarly, HttpRequestsTelemetry (httpRequestsTelemetry.ts:51) has no doc comment. It starts a repeating timer in its constructor (a side effect not obvious from the class name), recordError silently drops non-Axios errors, and dispose flushes pending data. These are preconditions and postconditions that belong on the class.

Suggested:

// reporter.ts
/** Decouples callers from the concrete TelemetryService so throwaway
 *  or telemetry-disabled clients can use NOOP_TELEMETRY_REPORTER
 *  without constructing a full service. */

// httpRequestsTelemetry.ts
/** Aggregates HTTP request counts and latencies into 60-second tumbling
 *  windows and emits "http.requests" events via the provided reporter.
 *  Starts a repeating flush timer on construction. Call dispose() to
 *  flush pending data and stop the timer. */

(Leorio)

🤖

Comment thread src/api/coderApi.ts
const client = new CoderApi(output);
const httpRequestsTelemetry = new HttpRequestsTelemetry(telemetry);
const client = new CoderApi(output, httpRequestsTelemetry);
client.setCredentials(baseUrl, token);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 [DEREM-32] CoderApi.create doc comment says "Automatically sets up logging interceptors and certificate handling." This PR added a telemetry parameter and wires HttpRequestsTelemetry into the instance. The comment now lies by omission. Someone writing a second factory path or test helper might skip the telemetry parameter thinking it is cosmetic.

Add "and HTTP request telemetry" to the list. (Leorio P3, Gon Nit)

🤖

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Telemetry: http.requests rollup

1 participant