cuda.core: graph slot table for node attachment lifetimes by Andy-Jost · Pull Request #2280 · NVIDIA/cuda-python

Andy-Jost · 2026-06-30T00:18:22Z

Summary

CUDA graph nodes frequently reference Python-owned resources — kernel-argument buffers, host-callback functions and their user data, and memcpy/memset operands — but the driver does not keep those resources alive. The driver copies argument values into a node at add time; it does not retain the Python wrappers or the device allocations they point at. If the owning object is collected (or a Buffer is closed) before the graph is instantiated or launched, the node is left referencing freed memory.

This PR introduces a per-graph slot table that binds the lifetime of these attachments to the lifetime of the CUgraph itself. Each node gets a small, fixed set of slots; an owner placed in a slot is held until the graph is destroyed. The table is stored on the graph as a CUDA user object, so destroying (or cloning) the graph propagates ownership correctly with no per-node bookkeeping in Python.

It also exposes GraphBuilder.graph_definition, completing step 3 of #1330 by giving users an explicit GraphDefinition view of a captured graph.

Design

- OpaqueHandle: a type-erased owning handle with two factories: make_opaque_py retains a PyObject* (incref/decref), and make_opaque_malloc owns a malloc'd buffer. A slot can therefore hold any kind of owner uniformly.
- Per-graph slot table: a map from CUgraphNode to a fixed-size array of OpaqueHandle slots. It is created lazily on first attachment and retained on the CUgraph as a user object (cuUserObjectCreate + cuGraphRetainUserObject with MOVE). When the graph is destroyed, the user object's destructor frees the table and releases every owner it holds. Conditional-branch body graphs (ref handles) get their own table the same way.
- Attachment API: graph_set_slot(graph, node, slot, owner) installs an owner into a node slot and returns CUresult for HANDLE_RETURN-style error handling. This replaces the previous approach of attaching an ad-hoc CUDA user object per resource at each node, along with its per-type heap-copy deleters.
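For readers who want a concrete picture of the ownership model, here is a minimal pure-Python analogue. It is only a sketch: `SlotTable`, `set_slot`, `clear`, and `SLOTS_PER_NODE` are hypothetical names, the slot count of 2 is an assumption, and the real table lives in C++ and is released by the CUDA user-object destructor rather than by Python code.

```python
import gc
import weakref

SLOTS_PER_NODE = 2  # assumed slot count, for illustration only


class SlotTable:
    """Toy model: maps a node key to a fixed-size list of owner references."""

    def __init__(self):
        self._slots = {}  # starts empty; a row is added on first attachment

    def set_slot(self, node, slot, owner):
        if not 0 <= slot < SLOTS_PER_NODE:
            raise IndexError(f"slot {slot} out of range")
        row = self._slots.setdefault(node, [None] * SLOTS_PER_NODE)
        row[slot] = owner  # replaces (and thereby releases) any previous owner

    def clear(self):
        """Stand-in for the user-object destructor run at graph destruction."""
        self._slots.clear()


class Resource:
    """Placeholder for a Python-owned attachment such as a callback."""


table = SlotTable()
res = Resource()
probe = weakref.ref(res)

table.set_slot("node-a", 0, res)
del res  # the caller drops its own reference
gc.collect()
alive_while_graph_exists = probe() is not None

table.clear()  # graph destroyed: every owner is released
gc.collect()
alive_after_destroy = probe() is not None
```

The point of the model is that the caller never tracks lifetimes per node: dropping the table is the single release point, which mirrors binding the owners to the lifetime of the CUgraph.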

Changes

- Slot-table infrastructure in resource_handles.{hpp,cpp} (OpaqueHandle, the slot table, lazy creation, graph_set_slot) with its Cython surface in _resource_handles.{pxd,pyx}.
- Graph nodes in _graph_node.pyx (kernel, event-record, event-wait, host-callback, memcpy, and memset) now store their owning handles in node slots. Stream-captured callbacks recover the just-captured host node from cuStreamGetCaptureInfo and use the same path; forked builders share the primary builder's graph handle, so their attachments land in the same table.
- Kernel nodes retain the kernel-argument tuple (slot 1) so the Python objects backing the arguments (notably device Buffers) outlive the graph. This is the slot-table port of the user-object fix from #2041 (cuda.core: keep kernel-argument objects alive in graph kernel nodes).
- GraphNode.memcpy/memset (and the GraphDefinition pass-throughs) now accept either a Buffer or a raw int address for each operand. A Buffer operand is retained at the allocation level (its DevicePtrHandle), so close()/reset cannot free memory the graph still references; a raw int behaves exactly as before (caller owns the lifetime), keeping the change backward compatible. Keyword-only dst_owner/src_owner arguments let callers attach an arbitrary owner to a raw-pointer operand; combining an owner with a Buffer operand is rejected.
- graph/_utils is renamed to graph/_host_callback now that it holds only host-callback machinery, and a shared _attach_host_callback_owners unifies the eager (GN_callback) and capture (add_callback) attachment paths.
- GraphBuilder.graph_definition exposes the captured graph as a GraphDefinition that shares ownership of the underlying CUgraph. State-guard rules: valid on a primary builder only after end_building(); valid on a conditional body both before begin_building() and after end_building(); never valid on a forked builder (access through the primary instead).
- A test-only _utils/_weak_handles module provides weak_handle() for deterministic, refcount-free lifetime assertions.
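The Buffer-or-int operand rule for memcpy/memset can be sketched as a small resolver. This is an illustration under stated assumptions, not the PR's code: `FakeBuffer`, `alloc_handle`, and `resolve_operand` are hypothetical names, and the exception types are guesses, since the description only says the Buffer+owner combination "is rejected".

```python
class FakeBuffer:
    """Hypothetical stand-in for a device Buffer: an address plus an allocation handle."""

    def __init__(self, address):
        self.address = address
        self.alloc_handle = object()  # models the allocation-level owner


def resolve_operand(operand, owner=None):
    """Return (address, owner_to_retain) for a memcpy/memset operand."""
    if isinstance(operand, FakeBuffer):
        if owner is not None:
            # Assumed exception type; the source only says this is rejected.
            raise ValueError("an explicit owner cannot be combined with a Buffer operand")
        return operand.address, operand.alloc_handle
    if isinstance(operand, int):
        # Raw address: nothing is retained unless the caller supplies an owner.
        return operand, owner
    raise TypeError(f"expected a Buffer or int, got {type(operand).__name__}")


buf = FakeBuffer(0x1000)
addr, retained = resolve_operand(buf)              # Buffer: allocation is retained
raw_addr, raw_owner = resolve_operand(0x2000)      # raw int: caller owns lifetime
keeper = bytearray(16)
_, explicit_owner = resolve_operand(0x3000, owner=keeper)  # raw int plus owner
```

In this model the value returned as `owner_to_retain` is what would be placed in the node's slot; a `None` owner means nothing is attached and the previous raw-pointer behavior is preserved.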

Stream-capture lifetime contract

Operations recorded during stream capture reference caller-owned memory and are not retained, unlike explicit GraphDefinition construction. Host callbacks are the one exception: they are retained on both the capture and explicit paths. This contract is documented on GraphBuilder.

Test coverage

- Slot-table lifetime tests for Buffer memcpy/memset operands (including clone), and for capture host callbacks retained after dropping their Python references.
- dst_owner/src_owner retention verified with weakrefs, plus rejection of Buffer+owner combinations.
- Device-allocation lifetime tests using weak_handle() to confirm an allocation survives reset/close while a graph still references it.
- graph_definition tests: happy path, both hybrid conditional-body flows (populate-via-explicit-API and capture-then-augment), the three error states (forked, capturing, primary pre-capture), and the shared-ownership guarantee (the GraphDefinition survives the builder's close()).

Related work

- Builds on the handle-layer plumbing from #2008 (cuda.core: Cythonize GraphBuilder and Graph with handle-layer cleanup).
- Ports and generalizes the kernel-argument user-object fix from #2041 (cuda.core: keep kernel-argument objects alive in graph kernel nodes).
- Completes step 3 of #1330 (CUDA graph phase N - graph updates).

Completes step 3 of NVIDIA#1330 by exposing the captured graph as an explicit `GraphDefinition` view that shares ownership of the underlying `CUgraph`. The handle-layer plumbing landed in PR NVIDIA#2008; this commit wires up the user-facing surface and locks in the state-guard rules.

State semantics:
- PRIMARY builder: only valid after `end_building()`. Before `begin_building()` no graph exists; during capture the driver is the sole writer, so explicit access is unsafe.
- CONDITIONAL_BODY builder: valid both before `begin_building()` (the body graph is allocated at conditional-node creation time) and after `end_building()`. This enables a hybrid flow where a conditional body is populated entirely via the explicit API, with no capture at all.
- FORKED builder: never valid. Forked builders share the primary's graph; access through the primary instead.

Tests cover the happy path, both hybrid flows on conditional bodies (populate-via-explicit-API and capture-then-augment), the three error states (forked, capturing, primary pre-capture), and the shared-ownership guarantee (the `GraphDefinition` survives the builder's `close()`).

Co-authored-by: Cursor <cursoragent@cursor.com>
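The state-guard rules can be read as a small truth table. The sketch below encodes them in plain Python; the builder kinds come from the commit message, while the `State` names and the `graph_definition_allowed` helper are hypothetical and chosen only for illustration. Treating capture as invalid for every kind follows the "capturing" error state listed in the tests.

```python
from enum import Enum, auto


class Kind(Enum):
    PRIMARY = auto()
    CONDITIONAL_BODY = auto()
    FORKED = auto()


class State(Enum):
    NOT_STARTED = auto()  # before begin_building()
    CAPTURING = auto()    # between begin_building() and end_building()
    ENDED = auto()        # after end_building()


def graph_definition_allowed(kind, state):
    """Return True when accessing graph_definition would be valid in this model."""
    if kind is Kind.FORKED:
        return False  # always go through the primary builder instead
    if state is State.CAPTURING:
        return False  # the driver is the sole writer during capture
    if kind is Kind.PRIMARY:
        return state is State.ENDED  # no graph exists before capture
    return True  # CONDITIONAL_BODY: body graph exists before and after capture


allowed = {
    (kind.name, state.name): graph_definition_allowed(kind, state)
    for kind in Kind
    for state in State
}
```

Iterating over `allowed` gives one row per (kind, state) pair, which is a convenient shape for a parametrized test of the three error states and the two conditional-body flows.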

copy-pr-bot · 2026-06-30T00:18:25Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Andy-Jost · 2026-06-30T00:19:27Z

/ok to test


github-actions · 2026-06-30T00:38:42Z

Doc Preview CI
🚀 View preview at https://nvidia.github.io/cuda-python/pr-preview/pr-2280/
https://nvidia.github.io/cuda-python/pr-preview/pr-2280/cuda-core/
https://nvidia.github.io/cuda-python/pr-preview/pr-2280/cuda-bindings/
https://nvidia.github.io/cuda-python/pr-preview/pr-2280/cuda-pathfinder/
Preview will be ready when the GitHub Pages deployment is complete.

Replace the per-resource CUDA user objects attached at each graph node with the per-graph slot table from phase 1. Kernel, event-record, event-wait, and host-callback nodes now store their owning handles in node slots via graph_set_slot. Stream-captured callbacks map the just-captured host node from cuStreamGetCaptureInfo and use the same path; forked builders share the primary's graph handle so their attachments reach the same table.

Refine the phase 1 surface to support this: the slot table is created lazily on first attachment, so conditional-branch bodies (ref handles) get one too, and graph_set_slot returns CUresult for HANDLE_RETURN-style error checking. Removes _attach_user_object and the per-type heap-copy deleters.

Rename graph/_utils to graph/_host_callback now that it holds only host-callback machinery (the trampoline, _is_py_host_trampoline, and _resolve_host_callback), matching the concept-named files around it, and update the three cimport sites.

Add _attach_host_callback_owners to share the "callback -> slot 0, user_data -> slot 1" attachment between the eager (GN_callback) and capture (add_callback) paths. Guard a zero-length user_data copy against malloc(0) and hoist the per-call ctypes import.

Attach the kernel-argument tuple to the kernel node's slot 1 so the Python objects backing the arguments -- notably device Buffers -- outlive the graph. The driver copies argument values into the node at add time but does not keep the referenced device memory alive, so without this a kernel node could be left with a stale device pointer. This is the slot-table port of the user-object fix from NVIDIA#2041 (currently only on main).

GraphNode.memcpy/memset (and the GraphDefinition pass-throughs) now accept a Buffer or a raw int for each address. A new _resolve_ptr helper reads the device pointer from a Buffer and returns it as an owner; a raw int casts through with no owner. GN_memcpy attaches a Buffer dst to slot 0 and src to slot 1, and GN_memset attaches dst to slot 0, so buffers passed by value outlive the graph. Raw ints behave exactly as before (caller owns the lifetime), so this is backward compatible.

Document the stream-capture lifetime contract on GraphBuilder: operations recorded during capture reference caller-owned memory and are not retained, unlike explicit GraphDefinition construction. Host callbacks are the one exception, retained on both the capture and explicit paths.



Store DevicePtrHandle in slot table instead of Buffer wrappers so reset/close cannot release memory while a graph still references it. Add test-only weak_handle() for deterministic allocation lifetime checks and extend graph lifetime tests accordingly.
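Why the slot holds the allocation handle rather than the Buffer wrapper can be shown with a toy shared-ownership model. This is a sketch, not the real classes: `Allocation` and `Buffer` here are simplified stand-ins, and "freed" is modelled as the allocation object being garbage collected.

```python
import gc
import weakref

freed = []  # names of allocations that have been released


class Allocation:
    """Models a device allocation; it counts as freed once nothing references it."""

    def __init__(self, name):
        self.name = name
        # Record the name when this object is collected.
        weakref.finalize(self, freed.append, name)


class Buffer:
    """Wrapper whose close() gives up its reference to the allocation."""

    def __init__(self, name):
        self.handle = Allocation(name)

    def close(self):
        self.handle = None


# Case 1: the graph slot keeps the wrapper. close() still frees the memory.
wrapper_slot = Buffer("a")
wrapper_slot.close()
gc.collect()
freed_with_wrapper_retained = "a" in freed

# Case 2: the graph slot keeps the allocation handle itself.
buf = Buffer("b")
handle_slot = buf.handle
buf.close()
gc.collect()
freed_with_handle_retained = "b" in freed

handle_slot = None  # graph destroyed: the slot releases its owner
gc.collect()
freed_after_graph_destroyed = "b" in freed
```

In case 1 the wrapper survives but the memory does not, which is the failure mode this commit closes; in case 2 the allocation outlives `close()` until the slot itself is released.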

Andy-Jost · 2026-06-30T17:38:12Z

/ok to test

rwgk · 2026-06-30T23:21:39Z

I'm trying to help a little bit using gpt-5.5 on my side.

It'd be great to get into the habit of letting our agents add the authorship markers, at least to agent-generated tests.

Findings

- P2: memset is no longer backward-compatible for existing positional height/pitch calls. Both GraphDefinition.memset(...) and GraphNode.memset(...) now insert * before height, so calls like g.memset(dst, value, width, height, pitch) start raising TypeError. That is a public API break unless intentional and documented. See cuda_core/cuda/core/graph/_graph_definition.pyx:158 and cuda_core/cuda/core/graph/_graph_node.pyx:285.
- P2: GraphBuilder.graph_definition returns a GraphDefinition wrapping an empty/reset graph handle after gb.close(). close() resets _h_graph and sets CLOSED, but the property does not call GB_check_open() or check CLOSED, so later nodes(), instantiate(), etc. hit CUDA with a null graph instead of raising a clear builder-closed error. This only affects accessing the property after close; a view obtained before close is intended to remain valid. See cuda_core/cuda/core/graph/_graph_builder.pyx:280 and cuda_core/cuda/core/graph/_graph_builder.pyx:339.
- P3: The newly added tests lack the repo’s explicit authorship markers. pytest.ini registers agent_authored, human_reviewed, and human_authored, and the repo guidance says newly added unit tests should carry one marker immediately above each test. This PR adds many tests without one, starting at cuda_core/tests/graph/test_graph_builder.py:295, cuda_core/tests/graph/test_graph_builder.py:444, and cuda_core/tests/graph/test_graph_definition_lifetime.py:634.
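The first P2 finding is a general Python effect and can be reproduced without CUDA. In the sketch below, `memset_before` and `memset_after` are hypothetical stand-ins for the old and new signatures; the default values are assumptions, and only the position of `*` matters.

```python
def memset_before(dst, value, width, height=1, pitch=0):
    """Old shape: height and pitch may be passed positionally."""
    return (dst, value, width, height, pitch)


def memset_after(dst, value, width, *, height=1, pitch=0):
    """New shape: the bare * makes height and pitch keyword-only."""
    return (dst, value, width, height, pitch)


positional_ok = memset_before(0x1000, 0, 64, 4, 256)
keyword_ok = memset_after(0x1000, 0, 64, height=4, pitch=256)

try:
    memset_after(0x1000, 0, 64, 4, 256)  # the existing call style
except TypeError as exc:
    positional_error = str(exc)
else:
    positional_error = None
```

Callers that already pass `height=` and `pitch=` by keyword are unaffected; only positional call sites break, which is why the finding asks for the change to be either reverted or documented.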

Andy-Jost added this to the cuda.core next milestone Jun 30, 2026

Andy-Jost added the P0 (High priority - Must do!), feature (New feature or request), and cuda.core (Everything related to the cuda.core module) labels Jun 30, 2026

Andy-Jost self-assigned this Jun 30, 2026

cuda.core: add graph slot table infrastructure (phase 1)

c3740e6

Introduce OpaqueHandle and a per-graph slot table retained on the CUgraph as a user object, preparing to replace ad-hoc per-resource user objects when wiring graph node attachments in a follow-up change.

Andy-Jost added 6 commits June 29, 2026 17:46

cuda.core: add slot-table lifetime tests for Buffer memcpy/memset and capture callbacks

813058d

Cover GraphDefinition memset/memcpy with Buffer operands (including clone), and GraphBuilder capture host callbacks retained after dropping Python refs.

cuda.core: add explicit dst/src_owner for graph memcpy/memset

10268ff

Keyword-only *_owner args retain arbitrary objects for raw pointer operands; Buffer+owner combinations are rejected. Strengthen owner tests with weakref retention checks and add src_owner rejection test.

Andy-Jost force-pushed the ajost/graph-slots branch from 13dab04 to 621ade8 June 30, 2026 17:29

Merge branch 'main' into ajost/graph-slots

8c1c420

Andy-Jost changed the title ~~cuda.core: graph slot table for node attachment lifetimes (draft)~~ cuda.core: graph slot table for node attachment lifetimes Jun 30, 2026

Andy-Jost marked this pull request as ready for review June 30, 2026 22:14

Andy-Jost wants to merge 9 commits into NVIDIA:main from Andy-Jost:ajost/graph-slots

2 participants
