MD-TRT Support, Compile/Export, C++ and Python #4183
narendasan wants to merge 30 commits into main from
Conversation
- C++ runtime: NCCL communicator init via c10d, rank/world_size serialization, DynamicOutputAllocator, ABI version bump to 8
- Python runtime: distributed support in PythonTorchTensorRTModule and TorchTensorRTModule, NCCL library auto-detection
- Conversion: native TRT DistCollective API (AllGather, ReduceScatter, AllReduce) with TRT-LLM plugin fallback
- Graph lowering: fuse c10d_functional collectives + wait_tensor into single ops
- Feature detection: native_trt_collectives flag, platform validation, graceful fallback chain
- Build: conditional NCCL compilation via the torch_nccl toolchain
- Examples: tensor_parallel_simple_example.py, tensor_parallel_llama_llm.py
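The C++ runtime's NCCL bind step (see the TRTEngine.cpp warning quoted further down in this thread) auto-selects a process group only when exactly one NCCL group is registered, and otherwise defers the bind. A rough Python sketch of that selection rule; the function name is hypothetical and the real logic lives in C++:

```python
def select_nccl_group(nccl_groups):
    """Mirror of the group auto-selection rule in TRTEngine::bind_nccl_comm
    (hypothetical Python sketch of the C++ logic)."""
    if len(nccl_groups) == 1:
        return nccl_groups[0]  # unambiguous: bind immediately
    # Zero groups: nothing to bind yet. Multiple groups: cannot auto-select,
    # so the bind is deferred until the user scopes the model, e.g. with
    # torch_tensorrt.distributed.distributed_context(group, model).
    return None
```

This matches the warning text in the diff: with multiple registered groups the engine refuses to guess and asks the caller to pick one explicitly.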
Five interconnected fixes:
1. fold_get_attr_item_calls: fold scalar param .item() calls into Python
scalars before AOT tracing. Inside FakeTensorMode, even real-tensor
.item() calls raise DataDependentOutputException.
2. backends.py: three changes:
- call fold_get_attr_item_calls before entering FakeTensorMode
- detect vmap/higher-order ops and route them through aot_autograd
instead of aot_export_joint_simple (which doesn't handle HOPs)
- on TRT build failure, strip TRT-only kwargs (use_fp32_acc) from
the fallback graph before returning it to PyTorch
3. _decompositions.py: prevent SDPA from leaking back into the decomp
table via Core ATen Interchange ops even after being removed from
TORCH_TRT_DECOMPOSITIONS.
4. partitioning/common.py: lower the default dynamic-shape maximum from
   min*2^16 to min*2^12; a 65536x multiplier is too large for TRT to find
   kernel implementations for the attention ops.
5. _TorchTensorRTModule.py: move CPU scalar inputs to CUDA before
   execution. aot_autograd lifts scalar attributes (e.g. head_dim^-0.5)
   as explicit graph inputs, and TRT requires all inputs on CUDA.
Also fixes remove_sym_nodes to match tensor sources by equality rather
than local_name so that GetItemSource bases (from torch.compile
dynamic=True) are matched correctly, and updates register_sdpa.py to
handle aten.scaled_dot_product_attention.default (the form produced after
aot_autograd) in addition to the flash/efficient variants.
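Fix 5 can be sketched without TensorRT in the loop. FakeTensor below is a stand-in for torch.Tensor, and the helper name is hypothetical; the point is the promotion rule: only 0-dim CPU tensors are moved.

```python
class FakeTensor:
    """Stand-in for torch.Tensor in this sketch (hypothetical)."""
    def __init__(self, device="cpu", ndim=0):
        self.device, self.ndim = device, ndim
    def cuda(self):
        return FakeTensor(device="cuda", ndim=self.ndim)

def move_scalar_inputs_to_cuda(inputs):
    # aot_autograd lifts scalar attributes (e.g. head_dim**-0.5) as
    # explicit graph inputs; TRT requires every input on CUDA, so
    # promote 0-dim CPU tensors before execution and leave the rest alone.
    return [
        t.cuda() if t.device == "cpu" and t.ndim == 0 else t
        for t in inputs
    ]
```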
if id(engine) not in seen:
    seen.add(id(engine))
    if getattr(engine, "is_md", False):
        engine.set_group_name(group_name)
We need to track instances for cleanup
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py 2026-04-18 20:14:39.307048+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py 2026-04-18 20:15:03.544385+00:00
@@ -386,11 +386,10 @@
logger.debug(
"Barrier after execution context creation (distributed NCCL engine)"
)
dist.barrier()
-
if ENABLED_FEATURES.tensorrt_rtx:
self._setup_runtime_config()
self.context = self._create_context()
assert self.context is not None, "Failed to create execution context"
There are some changes that do not conform to C++ style guidelines:
diff --git a/home/runner/work/TensorRT/TensorRT/core/runtime/TRTEngine.h b/tmp/changes.txt
index cd8af65..615600d 100644
--- a/home/runner/work/TensorRT/TensorRT/core/runtime/TRTEngine.h
+++ b/tmp/changes.txt
@@ -33,17 +33,17 @@ namespace core {
namespace runtime {
using FlattenedState = std::tuple<
- std::tuple<std::string, std::string>, // ABI_VERSION
- std::tuple<std::string, std::string>, // name
- std::tuple<std::string, std::string>, // device
- std::tuple<std::string, std::string>, // engine
- std::tuple<std::string, std::string>, // input binding names
- std::tuple<std::string, std::string>, // output binding names
- std::tuple<std::string, std::string>, // HW compatibility
- std::tuple<std::string, std::string>, // requires_output_allocator
- std::tuple<std::string, std::string>, // serialized metadata
- std::tuple<std::string, std::string>, // Platform
- std::tuple<std::string, std::string>, // Resource Allocation Strategy
+ std::tuple<std::string, std::string>, // ABI_VERSION
+ std::tuple<std::string, std::string>, // name
+ std::tuple<std::string, std::string>, // device
+ std::tuple<std::string, std::string>, // engine
+ std::tuple<std::string, std::string>, // input binding names
+ std::tuple<std::string, std::string>, // output binding names
+ std::tuple<std::string, std::string>, // HW compatibility
+ std::tuple<std::string, std::string>, // requires_output_allocator
+ std::tuple<std::string, std::string>, // serialized metadata
+ std::tuple<std::string, std::string>, // Platform
+ std::tuple<std::string, std::string>, // Resource Allocation Strategy
std::tuple<std::string, std::string>>; // requires_multidevice
struct TorchTRTRuntimeStates {
ERROR: Some files do not conform to style guidelines
There are some changes that do not conform to C++ style guidelines:
diff --git a/home/runner/work/TensorRT/TensorRT/core/runtime/TRTEngine.cpp b/tmp/changes.txt
index 4b91415..ae5232b 100644
--- a/home/runner/work/TensorRT/TensorRT/core/runtime/TRTEngine.cpp
+++ b/tmp/changes.txt
@@ -573,14 +573,16 @@ bool TRTEngine::bind_nccl_comm() {
} else if (nccl_groups.size() > 1) {
std::string names;
for (const auto& n : nccl_groups) {
- if (!names.empty()) names += ", ";
+ if (!names.empty())
+ names += ", ";
names += "'" + n + "'";
}
LOG_WARNING(
"This TRT engine requires NCCL but multiple NCCL process groups are registered ("
- << names << "). Cannot auto-select a group — NCCL bind deferred. "
- "Use the recommended workflow: "
- "with torch_tensorrt.distributed.distributed_context(group, model) as m: m(inp)");
+ << names
+ << "). Cannot auto-select a group — NCCL bind deferred. "
+ "Use the recommended workflow: "
+ "with torch_tensorrt.distributed.distributed_context(group, model) as m: m(inp)");
} else {
LOG_WARNING(
"This TRT engine requires NCCL (requires_native_multidevice=true) but no NCCL process group "
ERROR: Some files do not conform to style guidelines
Description
Opening this to test the CI