Add Bugsnag error grouping with stable normalized keys #234

Merged
morgan-wowk merged 1 commit into master from bugsnag/error-grouping on May 15, 2026
Conversation

@morgan-wowk morgan-wowk commented May 9, 2026

Add Bugsnag error grouping with stable normalized keys

Introduces configurable error grouping so structurally identical exceptions collapse into a single group rather than creating a new entry per unique pod name, UUID, or memory address.

How it works

A new TANGLE_BUGSNAG_CUSTOM_GROUPING_KEY env var controls the metadata key name written on each Bugsnag event. When unset the feature is a complete no-op. When set (by a deployment), every notified exception gets a custom[<key>] tab containing a normalized string derived from the exception type and message.

System errors reported through record_system_error_exception are additionally prefixed with SYSTEM_ERROR: so they can be filtered or grouped separately from non-system errors.
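The behaviour described above can be sketched roughly as follows. This is an illustration rather than the PR's code: `build_grouping_metadata` and its parameters are hypothetical names, and treating an empty value as unset is an assumption. Only the environment variable name, the `custom` tab, and the `SYSTEM_ERROR` prefix come from the description.

```python
import os

ENV_VAR = "TANGLE_BUGSNAG_CUSTOM_GROUPING_KEY"


def build_grouping_metadata(normalized, prefix=None, environ=None):
    """Return the {tab: {key: value}} payload, or None when the feature is off.

    Hypothetical helper for illustration only.
    """
    environ = os.environ if environ is None else environ
    key_name = environ.get(ENV_VAR)
    if not key_name:
        # Env var unset (or empty, by assumption): the feature is a no-op.
        return None
    # System errors carry a prefix so they can be filtered separately.
    value = f"{prefix}: {normalized}" if prefix else normalized
    return {"custom": {key_name: value}}
```

Passing `environ` explicitly keeps the sketch testable without mutating the process environment.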

Error taxonomy

The following exception types, from one consumer use case, are normalized to stable grouping keys:

| Group | Normalized key |
| --- | --- |
| k8s pod not found | `kubernetes ApiException (404): NotFound: pods "{pod}" not found` |
| k8s container terminated | `kubernetes ApiException (400): BadRequest: container "main" in pod {pod} is terminated` |
| k8s pod initializing | `kubernetes ApiException (400): BadRequest: container "main" in pod {pod} is waiting to start: PodInitializing` |
| k8s container not available | `kubernetes ApiException (400): BadRequest: container "main" in pod {pod} is not available` |
| k8s webhook timeout | `kubernetes ApiException (500): InternalError: failed calling webhook "<>": context deadline exceeded` |
| UnicodeDecodeError | `UnicodeDecodeError: 'utf-8' codec can't decode byte at position {n}` |
| MaxRetryError | `MaxRetryError: k8s connection pool max retries exceeded (ReadTimeoutError)` |
| OrchestratorError | `OrchestratorError: Unexpected running container status: {object}` |
| Fallback | `ExceptionType: {message with addresses/UUIDs/IDs stripped}` |

Many exception types (e.g. AttributeError, sqlalchemy.exc.OperationalError) already produce stable messages and pass through the fallback unchanged.
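As a rough sketch of the fallback row, the generic stripper might look like the code below. The regexes, the `{uuid}`/`{address}`/`{id}` placeholders, and the name `fallback_key` are all assumptions made for illustration; the PR's actual patterns and replacement text may differ.

```python
import re

# Assumed patterns; not taken from the PR.
_UUID = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)
_HEX_ADDRESS = re.compile(r"\b0x[0-9a-fA-F]+\b")
# "Long ID": 16+ alphanumerics containing at least one digit,
# so ordinary long words are left alone.
_LONG_ID = re.compile(r"\b(?=[a-zA-Z0-9]*\d)[a-zA-Z0-9]{16,}\b")


def fallback_key(exception):
    """Build 'ExceptionType: message' with instance-specific values replaced."""
    message = str(exception)
    # UUIDs are replaced first so their segments are not caught by later patterns.
    message = _UUID.sub("{uuid}", message)
    message = _HEX_ADDRESS.sub("{address}", message)
    message = _LONG_ID.sub("{id}", message)
    return f"{type(exception).__name__}: {message}"
```

A message with nothing instance-specific in it, such as a typical `AttributeError`, passes through with only the type name prepended.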

Changes

  • error_normalization.py (new): one public function, normalize_error_message(*, exception), which dispatches to type-specific handlers before falling back to a generic stripper that removes hex addresses, UUIDs, and long alphanumeric IDs
  • bugsnag_instrumentation.py: reads TANGLE_BUGSNAG_CUSTOM_GROUPING_KEY; _before_notify attaches the normalized key when configured; supports an optional grouping_prefix passed through notify(**metadata)
  • orchestrator_sql.py: record_system_error_exception passes grouping_prefix="SYSTEM_ERROR" so system errors are visually distinct
  • test_error_normalization.py (new): 15 unit tests covering all error groups and the fallback path
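The dispatch described for error_normalization.py could be shaped roughly like this. The keyword-only signature matches the description, but the handler table, the helper name, and the simplified fallback are hypothetical; the real module may be organised differently.

```python
def _normalize_unicode_decode_error(exception):
    # Drop the byte value and offset, which differ on every occurrence.
    return (
        f"UnicodeDecodeError: '{exception.encoding}' codec "
        "can't decode byte at position {n}"
    )


# Hypothetical dispatch table: (exception type, handler) pairs checked in order.
_HANDLERS = (
    (UnicodeDecodeError, _normalize_unicode_decode_error),
)


def normalize_error_message(*, exception):
    """Return a stable grouping key for the given exception (sketch only)."""
    for exc_type, handler in _HANDLERS:
        if isinstance(exception, exc_type):
            return handler(exception)
    # Simplified fallback: the generic stripping step is omitted here.
    return f"{type(exception).__name__}: {exception}"
```

An ordered table keeps each handler small and makes adding a new error group a one-line change.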

OSS note

The grouping key name is not hardcoded — it is supplied entirely via TANGLE_BUGSNAG_CUSTOM_GROUPING_KEY at deploy time, so no internal platform names appear in OSS code.

@morgan-wowk morgan-wowk marked this pull request as ready for review May 9, 2026 01:00
@morgan-wowk morgan-wowk requested a review from Ark-kun as a code owner May 9, 2026 01:00
@morgan-wowk morgan-wowk force-pushed the bugsnag/error-grouping branch 2 times, most recently from 754d49c to 148b0ab Compare May 9, 2026 02:27
import re

_POD_NAME_PATTERN = re.compile(r"task-[a-zA-Z0-9]+-[a-zA-Z0-9]+")
_OBJECT_REPR_PATTERN = re.compile(r"<[^>]+ object at 0x[0-9a-fA-F]+>")
Contributor

Eventually, we might want to fix such address strings (if they are not informative).
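For context, here is a small sketch of what the object-repr pattern in the diff above does to a message. The regex is copied from the diff and the `{object}` placeholder comes from the taxonomy table; the function name and the sample message are made up for illustration.

```python
import re

# Pattern copied from the diff under review.
_OBJECT_REPR_PATTERN = re.compile(r"<[^>]+ object at 0x[0-9a-fA-F]+>")


def strip_object_reprs(message):
    """Replace default object reprs, which embed a memory address."""
    return _OBJECT_REPR_PATTERN.sub("{object}", message)


# Hypothetical message; the class name is invented.
sample = "Unexpected running container status: <some.module.Status object at 0x7f8b1c2d3e50>"
normalized = strip_object_reprs(sample)
```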

import json
import re

_POD_NAME_PATTERN = re.compile(r"task-[a-zA-Z0-9]+-[a-zA-Z0-9]+")
Contributor

@Ark-kun Ark-kun May 12, 2026

Another prefix: `tangle-ce-`. Should probably add `tangle-` too.

Collaborator Author

Added coverage for tangle-ce- and tangle- in general
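One way the widened pattern could look is sketched below. This regex is a guess at the shape of the change, not the merged code: it assumes pod names are a `task-`, `tangle-`, or `tangle-ce-` prefix followed by one or more hyphen-separated alphanumeric segments.

```python
import re

# Assumed extension of the reviewed pattern; the merged regex may differ.
_POD_NAME_PATTERN = re.compile(
    r"\b(?:task|tangle(?:-ce)?)-[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*"
)


def strip_pod_names(message):
    """Replace pod names with the {pod} placeholder used in the taxonomy."""
    return _POD_NAME_PATTERN.sub("{pod}", message)
```

A pattern this broad would also match unrelated hyphenated words that start with `task-` or `tangle-`, so a real implementation may want to anchor it more tightly.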

Contributor

@Ark-kun Ark-kun left a comment

Great idea for noise reduction. Thank you.

@morgan-wowk morgan-wowk force-pushed the bugsnag/error-grouping branch from 148b0ab to 641aba2 Compare May 13, 2026 01:03
@morgan-wowk morgan-wowk force-pushed the bugsnag/orchestrator-integration branch from c0ba47b to d653ac8 Compare May 13, 2026 01:25
@morgan-wowk morgan-wowk force-pushed the bugsnag/error-grouping branch from 641aba2 to 3ffdb41 Compare May 13, 2026 01:25
@morgan-wowk morgan-wowk force-pushed the bugsnag/orchestrator-integration branch from d653ac8 to 5e2e55c Compare May 13, 2026 01:47
@morgan-wowk morgan-wowk force-pushed the bugsnag/error-grouping branch from 3ffdb41 to bd437a4 Compare May 13, 2026 01:47
@morgan-wowk morgan-wowk force-pushed the bugsnag/orchestrator-integration branch from 5e2e55c to ca80b1b Compare May 13, 2026 18:43
@morgan-wowk morgan-wowk force-pushed the bugsnag/error-grouping branch from bd437a4 to 9894484 Compare May 13, 2026 18:43
@morgan-wowk morgan-wowk force-pushed the bugsnag/orchestrator-integration branch from ca80b1b to 9a21565 Compare May 13, 2026 20:41
@morgan-wowk morgan-wowk force-pushed the bugsnag/error-grouping branch from 9894484 to 61efdcb Compare May 13, 2026 20:41
@morgan-wowk morgan-wowk force-pushed the bugsnag/orchestrator-integration branch from 9a21565 to c4c2528 Compare May 13, 2026 20:54
@morgan-wowk morgan-wowk force-pushed the bugsnag/error-grouping branch from 61efdcb to 40b58e6 Compare May 13, 2026 20:54
@morgan-wowk morgan-wowk force-pushed the bugsnag/orchestrator-integration branch from c4c2528 to 4d63d7f Compare May 13, 2026 21:26
@morgan-wowk morgan-wowk force-pushed the bugsnag/error-grouping branch 2 times, most recently from 590ae88 to e742344 Compare May 13, 2026 23:09
@morgan-wowk morgan-wowk force-pushed the bugsnag/orchestrator-integration branch from 4d63d7f to 41bd6af Compare May 13, 2026 23:20
@morgan-wowk morgan-wowk force-pushed the bugsnag/error-grouping branch 3 times, most recently from 8239e55 to da294f0 Compare May 14, 2026 00:25
Collaborator Author

Verified this stack against known consumers ✅

Error grouping is working as intended

key_value = f"{prefix}: {normalized}" if prefix else normalized
event.add_tab("custom", {_CUSTOM_GROUPING_KEY: key_value})
if prefix and event.errors:
    try:
Collaborator Author

For extra safety against potential package changes in the future, this is wrapped in a try/except with a fallback to no prefixing.
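The guard described in this comment might be sketched as follows. The body of the `try` block is not visible in the excerpt, so what gets prefixed (here, the first error's `error_message`) is an assumption, as is the helper name `apply_prefix_safely`. A stand-in object is used in place of a real Bugsnag event.

```python
import logging
from types import SimpleNamespace

_logger = logging.getLogger(__name__)


def apply_prefix_safely(event, prefix):
    """Prefix the first error's message; return False and move on if that fails."""
    if not (prefix and getattr(event, "errors", None)):
        return False
    try:
        first = event.errors[0]
        # Assumption: the prefix is applied to the error message.
        first.error_message = f"{prefix}: {first.error_message}"
        return True
    except Exception:
        # Fallback: report the event without a prefix rather than fail the notify.
        _logger.debug("Could not apply grouping prefix", exc_info=True)
        return False


# Usage with a stand-in event (no Bugsnag dependency needed for the sketch).
event = SimpleNamespace(errors=[SimpleNamespace(error_message="boom")])
apply_prefix_safely(event, "SYSTEM_ERROR")
```

Catching broadly is deliberate here: a failure inside an error reporter should not mask the error being reported.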

Collaborator Author

morgan-wowk commented May 15, 2026

Merge activity

  • May 15, 5:05 PM UTC: A user started a stack merge that includes this pull request via Graphite.
  • May 15, 5:12 PM UTC: Graphite rebased this pull request as part of a merge.
  • May 15, 5:13 PM UTC: @morgan-wowk merged this pull request with Graphite.

@morgan-wowk morgan-wowk changed the base branch from bugsnag/orchestrator-integration to graphite-base/234 May 15, 2026 17:10
@morgan-wowk morgan-wowk changed the base branch from graphite-base/234 to master May 15, 2026 17:11
Introduces error_normalization.py which strips instance-specific values
(pod names, IDs, memory addresses, byte offsets) from exceptions so
structurally identical errors collapse to one group in Bugsnag.

TANGLE_BUGSNAG_CUSTOM_GROUPING_KEY controls the metadata key name — no-op
when unset, allowing Shopify deployments to set it without touching OSS code.
System errors reported via record_system_error_exception are prefixed with
"SYSTEM_ERROR: " for easy filtering.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@morgan-wowk morgan-wowk force-pushed the bugsnag/error-grouping branch from da294f0 to f0880fe Compare May 15, 2026 17:12
"""Tests for error_normalization module."""

import json
import unittest.mock as mock
@morgan-wowk morgan-wowk merged commit e08851a into master May 15, 2026
7 checks passed