Skip to content

feat(cdk): add CloudWatch alarms for FanOut + ApprovalMetricsPublisher DLQs#208

Open
nizar-lahlali wants to merge 3 commits into
mainfrom
feat/117-dlq-cloudwatch-alarms
Open

feat(cdk): add CloudWatch alarms for FanOut + ApprovalMetricsPublisher DLQs#208
nizar-lahlali wants to merge 3 commits into
mainfrom
feat/117-dlq-cloudwatch-alarms

Conversation

@nizar-lahlali
Copy link
Copy Markdown
Contributor

Summary

  • Adds ApproximateNumberOfMessagesVisible >= 1 CloudWatch alarms on both FanOutDlq and ApprovalMetricsPublisherDlq (5-min window, Maximum statistic)
  • Exposes alarms as public readonly dlqAlarm for future SNS topic wiring
  • Adds construct-level unit tests verifying alarm presence and threshold configuration

Closes #117

Test plan

  • tsc --noEmit compiles cleanly
  • fanout-consumer.test.ts passes (8 tests including new alarm test)
  • approval-metrics-publisher-consumer.test.ts passes (8 tests including new alarm test)
  • CI green

…r DLQs (#117)

Adds ApproximateNumberOfMessagesVisible >= 1 alarms (5-min window,
Maximum statistic) on both DLQs so poison-pill records don't silently
accumulate. Each alarm includes a runbook pointer in alarmDescription
and is exposed as a public property for future SNS topic wiring.
@nizar-lahlali nizar-lahlali requested a review from a team as a code owner May 28, 2026 20:46
@krokoko
Copy link
Copy Markdown
Contributor

krokoko commented May 28, 2026

Review summary 🧾

Thanks for picking up #117 — silent DLQ accumulation is exactly the kind of invisible failure ABCA should surface, and the alarm configuration itself is clean and idiomatic (correct metric, Maximum/5-min window, >= 1 threshold, NOT_BREACHING). Tests pass locally (14/14). A few things to resolve before this is merge-ready.

Verdict: Request changes — the code is sound, but the change needs to be reconciled with the design decision it reverses and the operator copy finished.

Blocking 🚧

  1. Governance — issue Add CloudWatch alarms for FanOutConsumer + ApprovalMetricsPublisher DLQs #117 isn't approved. The issue currently has no labels (a member comment even flagged that P3/enhancement couldn't be applied and asked a triager to add them). Per ADR-003, the approved label needs to land before merge.

  2. Doc drift — §11.5 says the opposite of what we're shipping. docs/design/CEDAR_HITL_GATES.md §11.5 (lines 1588–1601) and the roadmap note at line 2073 still describe these DLQ alarms as deferred / out-of-scope for v1 until a notification channel exists — and the new code comments cite §11.5 as justification for the reverse. Since the alarms ship with no addAlarmAction/SNS target, they currently reproduce the exact "fires into the void" condition §11.5 warned about. Please update §11.5 + line 2073 to record this decision (alarms now ship without a notification action, SNS wiring as follow-up), then regenerate the Starlight mirror via cd docs && node scripts/sync-starlight.mjs and commit it.

  3. Runbook: TODO in operator-facing alarmDescription (fanout-consumer.ts:200, approval-metrics-publisher-consumer.ts:166). This string shows up in the CloudWatch console / notifications, so a TODO reads as unfinished during an incident. The diagnostic sentence after it is genuinely useful — suggest dropping the TODO token or pointing at Add CloudWatch alarms for FanOutConsumer + ApprovalMetricsPublisher DLQs #117.

Non-blocking / nits 💡

  • Misleading // §11.5: comments and (§11.5) test titles — same root as Docs: specify that you can't use the agent with the canned repo #2; reword to reference the implementing issue once the doc is updated.
  • dlqAlarm is unreachable today — both constructs are instantiated in agent.ts (lines 672, 696) without capturing the reference, so the "for future SNS wiring" property has no consumer yet. Mirrors the existing (also-unconsumed) errorAlarm, so not a regression — but consider wiring the SNS action here, or keeping the alarm private until a consumer exists.
  • Narrow the type to cloudwatch.IAlarmaddAlarmAction lives on IAlarm; exposing the concrete Alarm leaks the full mutable surface.
  • Add a JSDoc to dlqAlarm to match the errorAlarm precedent in task-orchestrator.ts:162.
  • Removed deferral rationale — the deleted comment explaining why alarms were deferred is still relevant (the SNS precondition is unmet); consider a one-liner noting the alarm intentionally has no action yet.

Tests 🧪

Coverage of the scalar alarm config is solid. Two cheap additions worth making while you're here — they guard the most plausible failure given the two constructs are near-identical copies:

  1. Assert the QueueName dimension so the alarm is pinned to its own DLQ — nothing currently catches a metric wired to the wrong queue.
  2. template.resourceCountIs('AWS::CloudWatch::Alarm', 1) — matches the count-assertion convention the rest of these test files already use.

Review agents run

code-reviewer, type-design-analyzer, comment-analyzer, pr-test-analyzer ✅ — silent-failure-hunter and /security-review omitted (no error-handling and no IAM/Cedar/network/secrets surface in the diff).

Really nice, focused change overall — once the doc reconciliation and runbook text are sorted it'll be in good shape. 🙏

@krokoko krokoko self-requested a review May 28, 2026 22:27
Copy link
Copy Markdown
Contributor

@krokoko krokoko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add CloudWatch alarms for FanOutConsumer + ApprovalMetricsPublisher DLQs

2 participants