[Backport 7.78.x] Fix pysnmp 7.x API compatibility and asyncio event-loop race condition#23158
Merged
[Backport 7.78.x] Fix pysnmp 7.x API compatibility and asyncio event-loop race condition#23158
Conversation
#23125) * Bump pyasn1 from 0.4.8 to 0.6.3 to fix CVE-2026-30922 CVE-2026-30922 (CVSS 7.5 High): pyasn1 prior to 0.6.3 is vulnerable to DoS via unbounded recursion when decoding deeply nested ASN.1 structures. The fix introduces MAX_NESTING_DEPTH=100. VULN-59246 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Add changelog entry for pyasn1 bump (PR #23095) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Update license metadata for pyasn1 0.6.3; fix orjson license parsing - pyasn1 0.6.3 changed from BSD-3-Clause to BSD-2-Clause - Add orjson override to .ddev/config.toml to handle its compound MPL-2.0 AND (Apache-2.0 OR MIT) license expression Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Bump pysnmp 5.1.0 to 7.1.22 to fix ModuleNotFoundError with pyasn1 0.6.3 pyasn1 0.6.x removed the `compat` module; pysnmp 5.x imported from it (pysnmp/debug.py: `from pyasn1.compat.octets import octs2ints`). pysnmp 7.1.22 drops this import and explicitly supports pyasn1 0.6.3. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Update pysnmp imports to 7.x API pysnmp 7.x removed the asyncore backend; update import paths: - CommunityData and friends: pysnmp.hlapi → pysnmp.hlapi.v3arch.asyncio - lcd singleton: pysnmp.hlapi.asyncore.cmdgen → instantiate from pysnmp.hlapi.v3arch.asyncio.lcd.CommandGeneratorLcdConfigurator - vbProcessor singleton: pysnmp.hlapi.asyncore.cmdgen → instantiate from pysnmp.hlapi.varbinds.CommandGeneratorVarBinds Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix import ordering after pysnmp 7.x API migration Move module-level singleton instantiations (lcd, vbProcessor) to after all import statements to satisfy E402 lint rule. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix pysnmp 7.x API incompatibilities pysnmp 7.x removed the re-export of auth constants and ObjectType from pysnmp.hlapi; use pysnmp.hlapi.v3arch.asyncio as the hlapi module alias so getattr(hlapi, protocol_name) and hlapi.ObjectType still work. UdpTransportTarget.create() is now async-only in 7.x; resolve the transport address synchronously via socket.getaddrinfo() to preserve the synchronous register_device_target() call path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix pysnmp 7.x renamed attributes - CommunityData/UsmUserData: mpModel → message_processing_model - SnmpEngine: getMibBuilder() → get_mib_builder() - MibSource: getMibSources() → get_mib_sources(), fullPath() → full_path() - CommandGeneratorVarBinds: getMibViewController() → get_mib_view_controller() Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Migrate commands.py to pysnmp 7.x API Rename makeVarBinds/unmakeVarBinds/sendVarBinds to snake_case variants. Add per-engine WeakKeyDictionary cache since make_varbinds/unmake_varbinds now expect a Dict rather than a SnmpEngine. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix pysnmp 7.x event loop never stopping after SNMP response In pysnmp 7.x, run_dispatcher() calls loop.run_forever() which does not self-terminate when pending requests finish (unlike pysnmp 5.x asyncore). Stop the loop explicitly in each callback and use the non-deprecated transport_dispatcher.run_dispatcher() API. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix pysnmp 7.x varBindTable format and ensure loop always stops pysnmp 7.x changed the GETNEXT/BULK callback varBindTable from a list of rows to a flat list of (OID, val) pairs. The old row-iteration logic raised exceptions before loop.stop() could be called, causing asyncio to swallow the exception and leave run_forever() running indefinitely. Fix varBindTable processing for getnext and bulk callbacks, and wrap loop.stop() in finally blocks so the event loop always exits. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix pysnmp 7.x compatibility: MIB sharing, constraint enforcement, thread event loop - commands.py: Pre-populate engine cache with a MibViewController backed by the engine's own MibBuilder so vbProcessor.make_varbinds and the OIDResolver share the same MIB namespace. Without this, MIBs loaded during make_varbinds (TCP-MIB etc.) were invisible to the resolver when lookup_mib=False, causing metrics like tcpRtoAlgorithm to resolve to generic mib-2 entries instead of their names. - commands.py: pysnmp 7.x bypasses constraint checking for received SimpleAsn1Type values in resolve_with_mib (via direct _value assignment). Restore the original enforce_mib_constraints behaviour by explicitly cloning through the MIB syntax after resolution; this raises SmiError/ValueConstraintError for out-of-range values. Capture callback exceptions in cbCtx so they survive the asyncio event loop and get re-raised to the caller after run_dispatcher() returns. - discovery.py: Worker threads have no asyncio event loop by default in Python 3.10+. pysnmp 7.x needs one at transport construction time (UdpTransport.__init__ calls asyncio.get_event_loop()). Create and set a new loop at the top of discover_instances() so all SNMP operations in the thread work correctly. - mibs.py: Use the non-deprecated add_mib_sources() instead of addMibSources(). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix asyncio event-loop race condition in SNMP discovery workers pysnmp 7.x's AsyncioDispatcher captures asyncio.get_event_loop() at construction time. When multiple ThreadPoolExecutor workers share the same event loop (L_main or L_discovery), run_dispatcher() returns immediately on all but the first worker (loop.is_running() == True), leaving ctx={} empty. A subsequent ctx['error'] access then raises KeyError, surfaced as "Failed to collect metrics - 'error'". Additionally, abandoned pending requests leave stale timer callbacks on the shared loop that fire during subsequent checks, stopping run_forever() prematurely before the SNMP response callback fires. Fix: give each worker thread its own asyncio event loop and re-create the SnmpEngine within that context (so AsyncioDispatcher captures the worker-local loop). Pass isolated_loop=True from the executor submit call; the non-discovery single-device path is unaffected. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Close per-worker asyncio event loop after each device check Each _check_device(isolated_loop=True) call creates a fresh asyncio loop but never closed it, causing file descriptors and handles to accumulate in reused ThreadPoolExecutor threads across check intervals. Add an outer try/finally that cancels any pending tasks (pysnmp's handle_timeout coroutine), waits for them to handle CancelledError, closes the loop, and clears the thread's current-loop reference. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Address code-review issues in pysnmp 7.x compatibility layer - commands.py: protect _engine_cache with a threading.Lock to eliminate the check-then-set race on the WeakKeyDictionary under concurrent workers - commands.py: use ctx.get('error') in _handle_error so a missing key (callback never fired) raises a clear error instead of a bare KeyError - commands.py: add except/re-raise to snmp_getnext and snmp_bulk callbacks matching the snmp_get pattern; unmake_varbinds exceptions previously produced an opaque KeyError on ctx['var_bind_table'] instead of the underlying SmiError - snmp.py: shadow config with copy.copy() before reinitialize_engine() so the shared InstanceConfig in discovered_instances is never mutated by worker threads; each worker gets its own _snmp_engine and device - discovery.py: wrap the discovery while-loop in try/finally to close the event loop when the thread exits, releasing UDP socket FDs accumulated during network scans Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix changelog entry: use PR #23125 instead of #23095 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Restore previous event loop instead of setting to None after cleanup asyncio.set_event_loop(None) marks _set_called=True on the policy, so any subsequent asyncio.get_event_loop() in the same thread raises RuntimeError even in the main thread (Python 3.12+). This broke tests that call discover_instances() directly in the main thread (e.g. test_discovery_tags), leaving all subsequent tests unable to create an SnmpCheck / InstanceConfig. Save and restore the caller's event loop (or None if there wasn't one) instead of unconditionally clearing it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Add clarifying comments to _check_device event loop management Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Trim verbose comments to single lines per project style Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Drain accumulated pysnmp tasks between host scans in discover_instances pysnmp 7.x schedules handle_timeout as an asyncio task. When a response arrives first, loop.stop() is called by the response callback but the handle_timeout task remains pending. On the next host's run_forever() call, the stale task fires immediately and calls loop.stop() before the new SNMP request completes, leaving ctx={} and causing rapid KeyError looping (observed as write_persistent_cache called 50k+ times in test_cache_building). Cancel and drain all pending tasks after each fetch_sysobject_oid call so the loop is clean before scanning the next host. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> (cherry picked from commit 7f3c485)
steveny91
approved these changes
Apr 6, 2026
Contributor
|
✅ Tests 🎉 All green!❄️ No new flaky tests detected 🔗 Commit SHA: d7551bb | Docs | Datadog PR Page | Was this helpful? React with 👍/👎 or give us feedback! |
Codecov Report❌ Patch coverage is Additional details and impacted files🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Backport 7f3c485 from #23125.
Summary
Companion to #23124. Fixes all breakage introduced by the pysnmp 5.x → 7.x upgrade:
sendVarBinds→send_varbinds,runDispatcher→run_dispatcher, etc.)varBindTablehandling — pysnmp 7.x returns a flat list of(OID, val)pairs instead of nested rowsloop.stop())AsyncioDispatchercapturesasyncio.get_event_loop()at construction time. When multipleThreadPoolExecutorworkers shared the same loop,run_dispatcher()returned immediately on all but the first worker (loop.is_running() == True), leavingctx={}and raisingKeyError('error'). Each worker now creates its own isolated event loop and re-creates theSnmpEnginewithin that context.Motivating issue: VULN-59246