Nightly regression: 6 tests in test_graph_builder.py fail intermittently on linux-aarch64 py3.14/3.14t since #2008 (#2290)

Description

@leofang

Since #2008 merged on 2026-06-29, the Nightly standard (linux-aarch64) / Python 3.14, CUDA 13.3.0 (local), GPU l4 (x2) job (and its 3.14t sibling) in the "CI: Nightly optional-deps" workflow has intermittently failed on the same 6 tests in cuda_core/tests/graph/test_graph_builder.py (all newly added in #2008):

  • test_graph_begin_building_twice: raises CUDAError(CUDA_ERROR_ILLEGAL_STATE) instead of the expected RuntimeError("Graph builder is already building.")
  • test_graph_split_requires_building: DID NOT RAISE RuntimeError (expected: "Graph builder must be building before it can be split.")
  • test_graph_complete_after_close_forked: raises RuntimeError("Graph has not finished building.") instead of "Graph builder has been closed."
  • test_graph_update_after_source_close: raises TypeError (int() argument must be … not 'NoneType') at _graph_builder.pyx:843, instead of ValueError("Source graph builder has been closed.")
  • test_graph_embed_non_builder: raises AttributeError ('object' object has no attribute '_building_ended') at _graph_builder.pyx:689, instead of the intended TypeError from the isinstance check
  • test_graph_close_is_idempotent: after graph.close() is called twice, int(graph.handle) == 0 fails because graph.handle is None (not the null-int handle the assertion expects)
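
For reference, the contract that three of these tests appear to expect can be sketched with a toy model. This is not the cuda.core implementation: `ToyGraphBuilder`, its attributes, and the `split(count)` signature are hypothetical stand-ins, and only the error messages and the `int(handle) == 0` expectation are taken from the list above. It shows state checks that run before any driver call, so the result does not depend on what ran earlier.

```python
class ToyGraphBuilder:
    """Hypothetical stand-in for illustration; not the cuda.core GraphBuilder."""

    def __init__(self):
        self._handle = 1        # pretend non-null native handle
        self._building = False
        self._closed = False

    @property
    def handle(self):
        # Assumption: a closed builder reports a null int handle, not None.
        return 0 if self._closed else self._handle

    def _check_open(self):
        if self._closed:
            raise RuntimeError("Graph builder has been closed.")

    def begin_building(self):
        self._check_open()
        # Checked in Python first, so the driver never sees an illegal state.
        if self._building:
            raise RuntimeError("Graph builder is already building.")
        self._building = True
        return self

    def split(self, count):
        self._check_open()
        if not self._building:
            raise RuntimeError(
                "Graph builder must be building before it can be split."
            )
        # Toy behavior: hand back `count` fresh builders that are building.
        return [ToyGraphBuilder().begin_building() for _ in range(count)]

    def close(self):
        # Idempotent: a second call changes nothing.
        self._closed = True
        self._building = False
```

Whether the real builder should return a null int or None after close() is the semantic question this issue leaves open; the toy model simply picks the behavior the current assertion expects.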

Nightly history on main

| Date | HEAD | aarch64 py3.14 | Notes |
| --- | --- | --- | --- |
| 2026-06-28 | ea0215fd09 | pass | pre-#2008 |
| 2026-06-29 | ea0215fd09 | pass | ran 03:25 UTC, before #2008 merged (21:04 UTC same day) |
| 2026-06-30 | dad6a421df | 6 graph_builder tests fail (py3.14 / py3.14t) | first nightly after #2008 (merge commit) |
| 2026-07-01 | f9f3849bd8 | ❌ (unrelated nvshmem failure), py3.14 / py3.14t | graph_builder tests passed this night; only test_locate_bitcode_lib[nvshmem_device] failed (a separate, already-fixed issue via #2286) |

Also observed on the PR #2283 push run (ec04c554bd, 2026-07-01T17:48 UTC): the same 6 graph_builder failures on py3.14 / py3.14t. A subsequent /ok to test rerun of the exact same tree (synthetic head 58fd95efb4, run 28542961462) passed all 6.

Direct link into the failure block for the 2026-06-30 run: https://github.com/NVIDIA/cuda-python/actions/runs/28418107489/job/84205333659#step:35:3873

Diagnosis

Two observed failures, two observed passes on the same tree — so the tests exhibit non-deterministic behavior on linux-aarch64 py3.14/3.14t (L4 x2). pytest-randomly is in the test group, which reshuffles order per invocation; some of the 6 failures (e.g. test_graph_close_is_idempotent finding graph.handle is None where the test expects int(graph.handle) == 0) point at test-to-test state leakage or a lifecycle assumption that only holds under a specific ordering. Not observed on linux-64, win-64, or linux-aarch64 py3.12/3.13.
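
If ordering is the trigger, replaying a failing run's shuffle seed should reproduce it locally. The sketch below only builds the pytest argument list. `build_pytest_args` is a hypothetical helper, while `--randomly-seed` and `-p no:randomly` are pytest-randomly's documented options. The seed to replay should appear in the failing job's pytest header; no seed is recorded in this issue, so the one in the usage comment is a placeholder.

```python
TEST_FILE = "cuda_core/tests/graph/test_graph_builder.py"


def build_pytest_args(seed=None, shuffle=True):
    """Build pytest arguments for a controlled-order run (hypothetical helper).

    seed:    replay a specific pytest-randomly shuffle when given.
    shuffle: False disables pytest-randomly, giving definition order.
    """
    args = [TEST_FILE, "-v"]
    if not shuffle:
        # Turn the plugin off to get a deterministic, unshuffled baseline.
        args += ["-p", "no:randomly"]
    elif seed is not None:
        # Replay the exact ordering of an earlier run.
        args.append(f"--randomly-seed={seed}")
    return args


# Usage (requires pytest and pytest-randomly; 12345 is a placeholder seed):
#   import pytest
#   pytest.main(build_pytest_args(seed=12345))
```

A failure that reproduces with one seed but not with `shuffle=False` would support the ordering hypothesis. A failure that appears and disappears under the same seed would point away from ordering and toward something else, such as timing or platform-specific behavior.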

Because #2008 introduced both the tests and the underlying refactor, the fix should live either in _graph_builder.pyx (make the invariants hold regardless of ordering) or in the new tests (make them order-independent — e.g. assert graph.handle in (0, None)), whichever matches the intended semantics.
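
As a minimal sketch of the test-side option, the check could accept either representation of a closed handle. `handle_is_null` and `FakeHandle` are hypothetical names for illustration, not part of cuda.core or its test suite, and this loosening is only appropriate if None is an acceptable post-close value under the intended semantics.

```python
def handle_is_null(handle):
    """Return True if the handle is None or converts to the integer 0."""
    return handle is None or int(handle) == 0


class FakeHandle:
    """Hypothetical handle-like object exposing only __int__."""

    def __init__(self, value):
        self._value = value

    def __int__(self):
        return self._value


# In a test, instead of `assert int(graph.handle) == 0`:
#   assert handle_is_null(graph.handle)
```

If the intended semantics are that a closed graph always exposes a null-int handle, the fix belongs in _graph_builder.pyx and the original, stricter assertion should stay.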

cc @Andy-Jost

Labels: P0 (High priority - Must do!), bug (Something isn't working), cuda.core (Everything related to the cuda.core module)