Skip to content

[Backport 7.78.x] Fix pysnmp 7.x API compatibility and asyncio event-loop race condition#23158

Merged
dkirov-dd merged 2 commits into7.78.xfrom
backport-23125-to-7.78.x
Apr 7, 2026
Merged

[Backport 7.78.x] Fix pysnmp 7.x API compatibility and asyncio event-loop race condition#23158
dkirov-dd merged 2 commits into7.78.xfrom
backport-23125-to-7.78.x

Conversation

@dd-octo-sts
Copy link
Copy Markdown
Contributor

@dd-octo-sts dd-octo-sts bot commented Apr 3, 2026

Backport 7f3c485 from #23125.


Summary

Companion to #23124. Fixes all breakage introduced by the pysnmp 5.x → 7.x upgrade:

  • Migrates imports and call sites to the renamed 7.x API (sendVarBindssend_varbinds, runDispatcherrun_dispatcher, etc.)
  • Fixes varBindTable handling — pysnmp 7.x returns a flat list of (OID, val) pairs instead of nested rows
  • Fixes the event loop never stopping after an SNMP response (callbacks now call loop.stop())
  • Fixes MIB sharing, constraint enforcement, and thread event-loop setup in the discovery thread
  • Fixes asyncio event-loop race condition in discovery worker threads: pysnmp 7.x's AsyncioDispatcher captures asyncio.get_event_loop() at construction time. When multiple ThreadPoolExecutor workers shared the same loop, run_dispatcher() returned immediately on all but the first worker (loop.is_running() == True), leaving ctx={} and raising KeyError('error'). Each worker now creates its own isolated event loop and re-creates the SnmpEngine within that context.

Motivating issue: VULN-59246

#23125)

* Bump pyasn1 from 0.4.8 to 0.6.3 to fix CVE-2026-30922

CVE-2026-30922 (CVSS 7.5 High): pyasn1 prior to 0.6.3 is vulnerable
to DoS via unbounded recursion when decoding deeply nested ASN.1
structures. The fix introduces MAX_NESTING_DEPTH=100.

VULN-59246

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Add changelog entry for pyasn1 bump (PR #23095)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Update license metadata for pyasn1 0.6.3; fix orjson license parsing

- pyasn1 0.6.3 changed from BSD-3-Clause to BSD-2-Clause
- Add orjson override to .ddev/config.toml to handle its compound
  MPL-2.0 AND (Apache-2.0 OR MIT) license expression

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Bump pysnmp 5.1.0 to 7.1.22 to fix ModuleNotFoundError with pyasn1 0.6.3

pyasn1 0.6.x removed the `compat` module; pysnmp 5.x imported from it
(pysnmp/debug.py: `from pyasn1.compat.octets import octs2ints`). pysnmp
7.1.22 drops this import and explicitly supports pyasn1 0.6.3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Update pysnmp imports to 7.x API

pysnmp 7.x removed the asyncore backend; update import paths:
- CommunityData and friends: pysnmp.hlapi → pysnmp.hlapi.v3arch.asyncio
- lcd singleton: pysnmp.hlapi.asyncore.cmdgen → instantiate from
  pysnmp.hlapi.v3arch.asyncio.lcd.CommandGeneratorLcdConfigurator
- vbProcessor singleton: pysnmp.hlapi.asyncore.cmdgen → instantiate from
  pysnmp.hlapi.varbinds.CommandGeneratorVarBinds

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix import ordering after pysnmp 7.x API migration

Move module-level singleton instantiations (lcd, vbProcessor) to after
all import statements to satisfy E402 lint rule.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix pysnmp 7.x API incompatibilities

pysnmp 7.x removed the re-export of auth constants and ObjectType from
pysnmp.hlapi; use pysnmp.hlapi.v3arch.asyncio as the hlapi module alias
so getattr(hlapi, protocol_name) and hlapi.ObjectType still work.

UdpTransportTarget.create() is now async-only in 7.x; resolve the
transport address synchronously via socket.getaddrinfo() to preserve
the synchronous register_device_target() call path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix pysnmp 7.x renamed attributes

- CommunityData/UsmUserData: mpModel → message_processing_model
- SnmpEngine: getMibBuilder() → get_mib_builder()
- MibSource: getMibSources() → get_mib_sources(), fullPath() → full_path()
- CommandGeneratorVarBinds: getMibViewController() → get_mib_view_controller()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Migrate commands.py to pysnmp 7.x API

Rename makeVarBinds/unmakeVarBinds/sendVarBinds to snake_case variants.
Add per-engine WeakKeyDictionary cache since make_varbinds/unmake_varbinds
now expect a Dict rather than a SnmpEngine.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix pysnmp 7.x event loop never stopping after SNMP response

In pysnmp 7.x, run_dispatcher() calls loop.run_forever() which does not
self-terminate when pending requests finish (unlike pysnmp 5.x asyncore).
Stop the loop explicitly in each callback and use the non-deprecated
transport_dispatcher.run_dispatcher() API.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix pysnmp 7.x varBindTable format and ensure loop always stops

pysnmp 7.x changed the GETNEXT/BULK callback varBindTable from a list
of rows to a flat list of (OID, val) pairs. The old row-iteration logic
raised exceptions before loop.stop() could be called, causing asyncio
to swallow the exception and leave run_forever() running indefinitely.

Fix varBindTable processing for getnext and bulk callbacks, and wrap
loop.stop() in finally blocks so the event loop always exits.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix pysnmp 7.x compatibility: MIB sharing, constraint enforcement, thread event loop

- commands.py: Pre-populate engine cache with a MibViewController backed by the
  engine's own MibBuilder so vbProcessor.make_varbinds and the OIDResolver share
  the same MIB namespace. Without this, MIBs loaded during make_varbinds (TCP-MIB
  etc.) were invisible to the resolver when lookup_mib=False, causing metrics like
  tcpRtoAlgorithm to resolve to generic mib-2 entries instead of their names.

- commands.py: pysnmp 7.x bypasses constraint checking for received SimpleAsn1Type
  values in resolve_with_mib (via direct _value assignment). Restore the original
  enforce_mib_constraints behaviour by explicitly cloning through the MIB syntax
  after resolution; this raises SmiError/ValueConstraintError for out-of-range values.
  Capture callback exceptions in cbCtx so they survive the asyncio event loop and
  get re-raised to the caller after run_dispatcher() returns.

- discovery.py: Worker threads have no asyncio event loop by default in Python
  3.10+. pysnmp 7.x needs one at transport construction time (UdpTransport.__init__
  calls asyncio.get_event_loop()). Create and set a new loop at the top of
  discover_instances() so all SNMP operations in the thread work correctly.

- mibs.py: Use the non-deprecated add_mib_sources() instead of addMibSources().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix asyncio event-loop race condition in SNMP discovery workers

pysnmp 7.x's AsyncioDispatcher captures asyncio.get_event_loop() at
construction time. When multiple ThreadPoolExecutor workers share the
same event loop (L_main or L_discovery), run_dispatcher() returns
immediately on all but the first worker (loop.is_running() == True),
leaving ctx={} empty. A subsequent ctx['error'] access then raises
KeyError, surfaced as "Failed to collect metrics - 'error'".

Additionally, abandoned pending requests leave stale timer callbacks
on the shared loop that fire during subsequent checks, stopping
run_forever() prematurely before the SNMP response callback fires.

Fix: give each worker thread its own asyncio event loop and re-create
the SnmpEngine within that context (so AsyncioDispatcher captures the
worker-local loop). Pass isolated_loop=True from the executor submit
call; the non-discovery single-device path is unaffected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Close per-worker asyncio event loop after each device check

Each _check_device(isolated_loop=True) call creates a fresh asyncio loop
but never closed it, causing file descriptors and handles to accumulate in
reused ThreadPoolExecutor threads across check intervals.

Add an outer try/finally that cancels any pending tasks (pysnmp's
handle_timeout coroutine), waits for them to handle CancelledError, closes
the loop, and clears the thread's current-loop reference.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Address code-review issues in pysnmp 7.x compatibility layer

- commands.py: protect _engine_cache with a threading.Lock to eliminate
  the check-then-set race on the WeakKeyDictionary under concurrent workers
- commands.py: use ctx.get('error') in _handle_error so a missing key
  (callback never fired) raises a clear error instead of a bare KeyError
- commands.py: add except/re-raise to snmp_getnext and snmp_bulk callbacks
  matching the snmp_get pattern; unmake_varbinds exceptions previously
  produced an opaque KeyError on ctx['var_bind_table'] instead of the
  underlying SmiError
- snmp.py: shadow config with copy.copy() before reinitialize_engine() so
  the shared InstanceConfig in discovered_instances is never mutated by
  worker threads; each worker gets its own _snmp_engine and device
- discovery.py: wrap the discovery while-loop in try/finally to close the
  event loop when the thread exits, releasing UDP socket FDs accumulated
  during network scans

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix changelog entry: use PR #23125 instead of #23095

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Restore previous event loop instead of setting to None after cleanup

asyncio.set_event_loop(None) marks _set_called=True on the policy, so
any subsequent asyncio.get_event_loop() in the same thread raises
RuntimeError even in the main thread (Python 3.12+). This broke tests
that call discover_instances() directly in the main thread (e.g.
test_discovery_tags), leaving all subsequent tests unable to create an
SnmpCheck / InstanceConfig.

Save and restore the caller's event loop (or None if there wasn't one)
instead of unconditionally clearing it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Add clarifying comments to _check_device event loop management

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Trim verbose comments to single lines per project style

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Drain accumulated pysnmp tasks between host scans in discover_instances

pysnmp 7.x schedules handle_timeout as an asyncio task. When a response
arrives first, loop.stop() is called by the response callback but the
handle_timeout task remains pending. On the next host's run_forever()
call, the stale task fires immediately and calls loop.stop() before the
new SNMP request completes, leaving ctx={} and causing rapid KeyError
looping (observed as write_persistent_cache called 50k+ times in
test_cache_building).

Cancel and drain all pending tasks after each fetch_sysobject_oid call
so the loop is clean before scanning the next host.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 7f3c485)
@dd-octo-sts dd-octo-sts bot requested review from a team as code owners April 3, 2026 15:07
@dd-octo-sts dd-octo-sts bot requested a review from alpatriciodd April 3, 2026 15:07
@Kyle-Neale Kyle-Neale changed the base branch from 7.78.x to backport-23124-to-7.78.x April 3, 2026 15:30
@Kyle-Neale Kyle-Neale changed the base branch from backport-23124-to-7.78.x to 7.78.x April 3, 2026 15:31
@datadog-datadog-prod-us1
Copy link
Copy Markdown
Contributor

datadog-datadog-prod-us1 bot commented Apr 6, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: d7551bb | Docs | Datadog PR Page | Was this helpful? React with 👍/👎 or give us feedback!

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 6, 2026

Codecov Report

❌ Patch coverage is 92.33716% with 20 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (7.78.x@3b3b36b). Learn more about missing BASE report.

Additional details and impacted files
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@dkirov-dd dkirov-dd merged commit c4acd40 into 7.78.x Apr 7, 2026
28 of 30 checks passed
@dkirov-dd dkirov-dd deleted the backport-23125-to-7.78.x branch April 7, 2026 09:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants