
MD-TRT Support, Compile/Export, C++ and Python #4183

Open
narendasan wants to merge 30 commits into main from push-vqqzkszwrvyx

Conversation

@narendasan
Collaborator

Description

Opening this to test the CI

Fixes # (issue)

Type of change

Please delete options that are not relevant and/or add your own.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist:

  • My code follows the style guidelines of this project (You can use the linters)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas and hacks
  • I have made corresponding changes to the documentation
  • I have added tests to verify my fix or my feature
  • New and existing unit tests pass locally with my changes
  • I have added the relevant labels to my PR so that relevant reviewers are notified

apbose and others added 11 commits April 12, 2026 11:41
- C++ runtime: NCCL communicator init via c10d, rank/world_size serialization, DynamicOutputAllocator, ABI version bump to 8
- Python runtime: distributed support in PythonTorchTensorRTModule and TorchTensorRTModule, NCCL library auto-detection
- Conversion: native TRT DistCollective API (AllGather, ReduceScatter, AllReduce) with TRT-LLM plugin fallback
- Graph lowering: fuse c10d_functional collectives + wait_tensor into single ops
- Feature detection: native_trt_collectives flag, platform validation, graceful fallback chain
- Build: conditional NCCL compilation via torch_nccl toolchain
- Examples: tensor_parallel_simple_example.py, tensor_parallel_llama_llm.py
…hapes
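The graph-lowering commit above fuses c10d_functional collectives with their trailing wait_tensor into single ops. As a minimal sketch of that fusion pattern (illustrative only; the real pass operates on torch.fx graph nodes, and the names here are hypothetical):

```python
# Illustrative sketch: collapse (c10d_functional.<collective>, wait_tensor)
# pairs in a flat op list into one fused op. The real pass rewrites fx nodes.

def fuse_collectives(ops):
    """Fuse each collective op with the wait_tensor that immediately follows it."""
    fused, i = [], 0
    while i < len(ops):
        op = ops[i]
        nxt = ops[i + 1] if i + 1 < len(ops) else None
        if op.startswith("c10d_functional.") and nxt == "wait_tensor":
            fused.append(op.replace("c10d_functional.", "fused."))
            i += 2  # consume the wait_tensor as well
        else:
            fused.append(op)
            i += 1
    return fused

ops = ["linear", "c10d_functional.all_reduce", "wait_tensor", "relu"]
print(fuse_collectives(ops))  # ['linear', 'fused.all_reduce', 'relu']
```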

Five interconnected fixes:

1. fold_get_attr_item_calls: fold scalar param .item() calls into Python
   scalars before AOT tracing. Inside FakeTensorMode, even real-tensor
   .item() calls raise DataDependentOutputException.

2. backends.py: three changes:
   - call fold_get_attr_item_calls before entering FakeTensorMode
   - detect vmap/higher-order ops and route them through aot_autograd
     instead of aot_export_joint_simple (which doesn't handle HOPs)
   - on TRT build failure, strip TRT-only kwargs (use_fp32_acc) from
     the fallback graph before returning it to PyTorch

3. _decompositions.py: prevent SDPA from leaking back into the decomp
   table via Core ATen Interchange ops even after being removed from
   TORCH_TRT_DECOMPOSITIONS.

4. partitioning/common.py: lower the default max dynamic shape from
   min*2^16 to min*2^12 — 65536 is too large for TRT to find kernel
   implementations for attention ops.

5. _TorchTensorRTModule.py: move CPU scalar inputs to CUDA before
   execution — aot_autograd lifts scalar attributes (e.g. head_dim^-0.5)
   as explicit graph inputs; TRT requires all inputs on CUDA.
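The bound change in fix 4 is simple arithmetic; for a dynamic dimension with minimum size `min_size`, the default maximum becomes `min_size * 2**12` instead of `min_size * 2**16` (the helper name below is illustrative, not the actual API):

```python
# Default max dynamic shape derived from the min shape.
# Old default: min * 2**16 = min * 65536; new default: min * 2**12 = min * 4096.

def default_max_shape(min_size: int, exponent: int = 12) -> int:
    return min_size * 2 ** exponent

print(default_max_shape(1, 16))  # 65536 -- old default, too large for TRT attention kernels
print(default_max_shape(1))      # 4096  -- new default
```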

Also fixes remove_sym_nodes to match tensor sources by equality rather
than local_name so that GetItemSource bases (from torch.compile
dynamic=True) are matched correctly, and updates register_sdpa.py to
handle aten.scaled_dot_product_attention.default (the form produced after
aot_autograd) in addition to the flash/efficient variants.
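The TRT-only-kwarg stripping from fix 2 can be sketched as a filter over node kwargs (illustrative: plain dicts stand in for fx.Node objects, and `TRT_ONLY_KWARGS` is a hypothetical name):

```python
# Illustrative sketch: remove TRT-only kwargs (e.g. use_fp32_acc) from graph
# nodes before returning the fallback graph to PyTorch, which cannot
# interpret them.

TRT_ONLY_KWARGS = {"use_fp32_acc"}

def strip_trt_kwargs(nodes):
    for node in nodes:
        node["kwargs"] = {
            k: v for k, v in node.get("kwargs", {}).items()
            if k not in TRT_ONLY_KWARGS
        }
    return nodes

nodes = [{"op": "sdpa", "kwargs": {"use_fp32_acc": True, "scale": 0.125}}]
print(strip_trt_kwargs(nodes))  # [{'op': 'sdpa', 'kwargs': {'scale': 0.125}}]
```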
@meta-cla meta-cla bot added the cla signed label Apr 12, 2026
@github-actions github-actions bot added documentation Improvements or additions to documentation component: tests Issues re: Tests component: lowering Issues re: The lowering / preprocessing passes component: conversion Issues re: Conversion stage component: core Issues re: The core compiler component: converters Issues re: Specific op converters component: build system Issues re: Build system component: api [Python] Issues re: Python API component: runtime component: dynamo Issues relating to the `torch.compile` or `torch._dynamo.export` paths component: torch_compile labels Apr 12, 2026
@github-actions github-actions bot requested a review from zewenli98 April 12, 2026 19:09
Comment thread py/torch_tensorrt/distributed/_distributed.py
if id(engine) not in seen:
    seen.add(id(engine))
    if getattr(engine, "is_md", False):
        engine.set_group_name(group_name)
Collaborator


Why is this added?

Collaborator Author


We need to track instances for cleanup.
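One way to read the snippet in this thread: deduplicate engines by identity and record the multi-device ones so the group binding can be undone later. A minimal self-contained sketch (the `Engine` class and `bind_group` helper are hypothetical stand-ins, not the actual torch_tensorrt API):

```python
# Sketch: track multi-device engines by id() when binding a process-group
# name, so they can be cleaned up when the distributed context exits.

class Engine:
    def __init__(self, is_md):
        self.is_md = is_md
        self.group_name = None

    def set_group_name(self, name):
        self.group_name = name

def bind_group(engines, group_name, seen, tracked):
    for engine in engines:
        if id(engine) not in seen:  # each engine handled at most once
            seen.add(id(engine))
            if getattr(engine, "is_md", False):
                engine.set_group_name(group_name)
                tracked.append(engine)  # remembered for later cleanup
    return tracked

e1, e2 = Engine(True), Engine(False)
seen, tracked = set(), []
bind_group([e1, e2, e1], "tp_group", seen, tracked)
print([e.group_name for e in (e1, e2)])  # ['tp_group', None]
```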

Comment thread py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py
@apbose apbose force-pushed the push-vqqzkszwrvyx branch from 04b606e to 1b4e559 on April 16, 2026 22:51

@github-actions github-actions bot left a comment


There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py	2026-04-18 20:14:39.307048+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py	2026-04-18 20:15:03.544385+00:00
@@ -386,11 +386,10 @@
                logger.debug(
                    "Barrier after execution context creation (distributed NCCL engine)"
                )
                dist.barrier()

-
        if ENABLED_FEATURES.tensorrt_rtx:
            self._setup_runtime_config()

        self.context = self._create_context()
        assert self.context is not None, "Failed to create execution context"


@github-actions github-actions bot left a comment


There are some changes that do not conform to C++ style guidelines:

diff --git a/home/runner/work/TensorRT/TensorRT/core/runtime/TRTEngine.h b/tmp/changes.txt
index cd8af65..615600d 100644
--- a/home/runner/work/TensorRT/TensorRT/core/runtime/TRTEngine.h
+++ b/tmp/changes.txt
@@ -33,17 +33,17 @@ namespace core {
namespace runtime {

using FlattenedState = std::tuple<
-    std::tuple<std::string, std::string>,  // ABI_VERSION
-    std::tuple<std::string, std::string>,  // name
-    std::tuple<std::string, std::string>,  // device
-    std::tuple<std::string, std::string>,  // engine
-    std::tuple<std::string, std::string>,  // input binding names
-    std::tuple<std::string, std::string>,  // output binding names
-    std::tuple<std::string, std::string>,  // HW compatibility
-    std::tuple<std::string, std::string>,  // requires_output_allocator
-    std::tuple<std::string, std::string>,  // serialized metadata
-    std::tuple<std::string, std::string>,  // Platform
-    std::tuple<std::string, std::string>,  // Resource Allocation Strategy
+    std::tuple<std::string, std::string>, // ABI_VERSION
+    std::tuple<std::string, std::string>, // name
+    std::tuple<std::string, std::string>, // device
+    std::tuple<std::string, std::string>, // engine
+    std::tuple<std::string, std::string>, // input binding names
+    std::tuple<std::string, std::string>, // output binding names
+    std::tuple<std::string, std::string>, // HW compatibility
+    std::tuple<std::string, std::string>, // requires_output_allocator
+    std::tuple<std::string, std::string>, // serialized metadata
+    std::tuple<std::string, std::string>, // Platform
+    std::tuple<std::string, std::string>, // Resource Allocation Strategy
    std::tuple<std::string, std::string>>; // requires_multidevice

struct TorchTRTRuntimeStates {
ERROR: Some files do not conform to style guidelines

@github-actions github-actions bot left a comment


There are some changes that do not conform to C++ style guidelines:

diff --git a/home/runner/work/TensorRT/TensorRT/core/runtime/TRTEngine.cpp b/tmp/changes.txt
index 4b91415..ae5232b 100644
--- a/home/runner/work/TensorRT/TensorRT/core/runtime/TRTEngine.cpp
+++ b/tmp/changes.txt
@@ -573,14 +573,16 @@ bool TRTEngine::bind_nccl_comm() {
    } else if (nccl_groups.size() > 1) {
      std::string names;
      for (const auto& n : nccl_groups) {
-        if (!names.empty()) names += ", ";
+        if (!names.empty())
+          names += ", ";
        names += "'" + n + "'";
      }
      LOG_WARNING(
          "This TRT engine requires NCCL but multiple NCCL process groups are registered ("
-          << names << "). Cannot auto-select a group — NCCL bind deferred. "
-          "Use the recommended workflow: "
-          "with torch_tensorrt.distributed.distributed_context(group, model) as m: m(inp)");
+          << names
+          << "). Cannot auto-select a group — NCCL bind deferred. "
+             "Use the recommended workflow: "
+             "with torch_tensorrt.distributed.distributed_context(group, model) as m: m(inp)");
    } else {
      LOG_WARNING(
          "This TRT engine requires NCCL (requires_native_multidevice=true) but no NCCL process group "
ERROR: Some files do not conform to style guidelines
