Expert Parallelism: common C API + NCCL EP backend by phu0ngng · Pull Request #3034 · NVIDIA/TransformerEngine

phu0ngng · 2026-05-22T02:42:51Z

Summary

First PR in the TE Expert Parallelism (EP) series. Lands the common C API and NCCL EP backend that later framework PRs (PyTorch, JAX) build on. No Python bindings yet — common-lib foundation plus build wiring only. Build/load works on any arch; SM and NCCL version gates fire at runtime.

Every network-bound payload tensor takes an optional NVTECommWindow. When a window is provided, the backend uses NCCL EP's symmetric-memory zero-copy path, which skips the device-to-device (D2D) memcpy from the user buffers to the symmetric staging buffers.

Implementation

Public C API (`transformer_engine/common/include/transformer_engine/{ep.h,comm_window.h}`)

Types: NVTEEpGroupConfig, NVTEEpLayerConfig, NVTEEpHandle, NVTECommWindow (side-band {ncclWindow_t window, size_t offset}; NCCL peer handles are not carried on NVTETensor).

Lifecycle (host-only, eager):

void     nvte_ep_initialize(void* ep_comm, NVTEEpGroupConfig group_config);
void     nvte_ep_shutdown(void);

uint64_t nvte_ep_register_layer(NVTEEpLayerConfig layer_config, size_t* handle_mem_size);

nvte_ep_initialize — borrow an external ncclComm_t for the EP sub-group and init the singleton backend.
nvte_ep_shutdown — tear down the backend; idempotent; does not destroy ep_comm.
nvte_ep_register_layer — reserve a handle_id for a layer config and report the handle_mem buffer size the caller must allocate. The pair {id, mem} becomes the per-step NVTEEpHandle.

Per-step (allocation-free, CUDA-graph capturable):

void nvte_ep_prepare(NVTEEpHandle handle, NVTETensor topk_idx, NVTETensor token_counts,
                     size_t dispatch_output_per_expert_alignment, cudaStream_t stream);

void nvte_ep_dispatch(NVTEEpHandle handle, NVTETensor topk_idx,
                      NVTETensor tokens, NVTECommWindow tokens_win,
                      NVTETensor topk_weights, NVTECommWindow topk_weights_win,
                      NVTETensor recv_tokens, NVTECommWindow recv_tokens_win,
                      NVTETensor recv_topk_weights,  NVTECommWindow recv_topk_weights_win,
                      cudaStream_t stream);

void nvte_ep_combine(NVTEEpHandle handle, NVTETensor expert_out, NVTECommWindow expert_out_win,
                     NVTETensor result, cudaStream_t stream);

void nvte_ep_dispatch_bwd(NVTEEpHandle handle, NVTETensor grad, NVTECommWindow grad_win,
                          NVTETensor g_recv_topk_weights, NVTECommWindow g_recv_topk_weights_win,
                          NVTETensor grad_tokens, NVTETensor grad_topk_weights, cudaStream_t stream);

void nvte_ep_combine_bwd(NVTEEpHandle handle, NVTETensor grad, NVTECommWindow grad_win,
                         NVTETensor grad_expert_out, NVTECommWindow grad_expert_out_win,
                         cudaStream_t stream);

nvte_ep_prepare — all-gather the routing map and write routing maps to handle.mem.
nvte_ep_dispatch — scatter tokens and routing weights from source ranks to expert ranks. tokens, topk_weights, recv_tokens, recv_topk_weights each accept an optional symm-mem window for zero-copy.
nvte_ep_combine — scatter-sum expert outputs back to source ranks (unweighted; caller pre-multiplies by recv_topk_weights). expert_out accepts a window.
nvte_ep_dispatch_bwd — backward of dispatch; routes token and weight grads back to source. grad and g_recv_topk_weights accept windows; the gathered outputs (grad_tokens, grad_topk_weights) do not.
nvte_ep_combine_bwd — backward of combine; grad and grad_expert_out accept windows. Padded slots in grad_expert_out are zeroed.
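
To make the dispatch/combine contract concrete, here is a single-process toy model in plain Python. It is only a sketch of the semantics described above (dispatch fans each token out to its top-k experts; combine is an unweighted sum, so the caller applies the routing weights first). The function names and the per-slot dictionary layout are illustrative assumptions and do not correspond to the C API or to NCCL EP internals.

```python
def toy_dispatch(tokens, topk_idx, topk_weights):
    """Fan each token out to its top-k experts (one slot per token/expert pair)."""
    slots = []
    for src, (experts, weights) in enumerate(zip(topk_idx, topk_weights)):
        for expert, weight in zip(experts, weights):
            slots.append(
                {"src": src, "expert": expert, "value": list(tokens[src]), "weight": weight}
            )
    return slots


def toy_combine(slots, num_tokens, hidden_dim):
    """Unweighted scatter-sum: add every slot's value back into its source row."""
    result = [[0.0] * hidden_dim for _ in range(num_tokens)]
    for slot in slots:
        for d in range(hidden_dim):
            result[slot["src"]][d] += slot["value"][d]
    return result


def toy_moe_identity(tokens, topk_idx, topk_weights):
    """Dispatch, run an identity 'expert', pre-multiply by weights, then combine."""
    slots = toy_dispatch(tokens, topk_idx, topk_weights)
    for slot in slots:
        # The caller applies the routing weight; combine itself never does.
        slot["value"] = [slot["weight"] * v for v in slot["value"]]
    return toy_combine(slots, len(tokens), len(tokens[0]))
```

With an identity expert and weights that sum to 1 per token, the weighted path returns the original tokens, while calling toy_combine directly on the dispatched slots returns each token multiplied by top_k.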

Backend + build

NCCL EP backend (transformer_engine/common/ep/): EPBackend singleton, HT-mode dispatch/combine over NCCL EP (libnccl_ep.so), group/layer registration. Internal helper make_payload_tensor() builds the per-call ncclEpTensor_t: when the caller's NVTECommWindow.window != nullptr it sets win_hdl + win_offset (zero-copy); otherwise it sets data from nvte_tensor_data(t) (HBM fallback).
Runtime gates (in EPBackend::initialize): SM>=90 (via cudaDeviceGetAttribute), NCCL>=2.30.4 (via ncclGetVersion), CUDA multicast/NVLS support.
Stub path: when NVTE_WITH_NCCL_EP=OFF, ep/ep_api_stub.cpp provides throwing nvte_ep_* stubs so framework bindings link unconditionally; failure surfaces at first nvte_ep_initialize.
Build wiring
- setup.py builds libnccl_ep.so from 3rdparty/nccl by default; auto-disables NCCL EP when no requested CUDA arch >= 90. Explicit NVTE_BUILD_WITH_NCCL_EP=1 with all archs < 90 is treated as a user error. Set NVTE_BUILD_WITH_NCCL_EP=0 to opt out.
- NCCL_HOME resolved dynamically: explicit env → /opt/nvidia/nccl, /usr/local/nccl, /usr → ldconfig -p fallback.
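
The window-or-pointer choice described for make_payload_tensor() above can be modelled in a few lines. This Python sketch only mirrors the stated decision logic; the real helper is C++ and fills an ncclEpTensor_t, and the class and dictionary keys used here (CommWindow, win_hdl, win_offset, data) are stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CommWindow:
    """Stand-in for NVTECommWindow: a window handle plus an offset."""
    window: Optional[object] = None  # None models window == nullptr
    offset: int = 0


def make_payload(data_ptr: int, win: CommWindow) -> dict:
    """Pick the zero-copy path when a window is supplied, else the raw pointer."""
    if win.window is not None:
        # Zero-copy: the backend addresses the payload through the window.
        return {"win_hdl": win.window, "win_offset": win.offset, "data": None}
    # Fallback: plain device pointer, copied into the staging buffers.
    return {"win_hdl": None, "win_offset": 0, "data": data_ptr}
```

Passing a default-constructed CommWindow therefore selects the fallback, which matches the "optional window" wording in the summary.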

Testing

C++ distributed tests under tests/cpp_distributed/.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

greptile-apps · 2026-05-22T02:48:16Z

Greptile Summary

This PR lands the Expert Parallelism (EP) common C API and NCCL EP backend for TransformerEngine, providing the foundation for upcoming PyTorch and JAX framework bindings. It adds the nvte_ep_* C API in ep.h/comm_window.h, an EPBackend singleton backed by a pointer-keyed LRU handle cache, and build wiring that auto-gates on SM ≥ 90 / NCCL ≥ 2.30.4 at both configure and runtime.

Public C API (ep.h, comm_window.h): lifecycle (initialize/shutdown/handle_mem_size), per-step CUDA-graph-capturable ops (prepare/dispatch/combine and their backward passes), and NVTECommWindow for optional zero-copy symmetric-memory payloads. The stubs for the NVTE_WITH_NCCL_EP=OFF build path are co-located in ep_api.cpp under a preprocessor guard, ensuring signature consistency.
EPBackend singleton (ep_backend.cpp/ep_backend.h): implements an LRU handle cache keyed on handle_mem device pointer (with an XLA buffer-relocation fallback via fallback_layer_cfg_), dtype-width validation against the group's max_token_dtype, and a caller-owned NVTEShape pattern to keep ncclEpTensor_t.sizes pointers valid across NCCL calls.
Build wiring (setup.py, CMakeLists.txt): builds libnccl_ep.a from the 3rdparty/nccl submodule, expands NVTE_CUDA_ARCHS=native via nvidia-smi, raises a RuntimeError when no SM ≥ 90 arch is found after expansion, and warns when NCCL_HOME points to a path without nccl.h.
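
As a rough illustration of the SM ≥ 90 / NCCL ≥ 2.30.4 gate: ncclGetVersion reports the version as a single integer, which for recent NCCL releases is major*10000 + minor*100 + patch, so 2.30.4 corresponds to 23004. The Python sketch below models the three runtime checks listed in the PR description. The function names are hypothetical, and the actual checks are C++ code in EPBackend::initialize.

```python
MIN_SM = 90
MIN_NCCL = (2, 30, 4)


def nccl_version_code(major: int, minor: int, patch: int) -> int:
    """Encode a version the way NCCL's NCCL_VERSION macro does for >= 2.9."""
    return major * 10000 + minor * 100 + patch


def check_runtime_gates(sm: int, nccl_code: int, has_multicast: bool) -> list:
    """Return a list of human-readable gate failures (empty means all pass)."""
    failures = []
    if sm < MIN_SM:
        failures.append(f"SM {sm} < {MIN_SM}")
    if nccl_code < nccl_version_code(*MIN_NCCL):
        failures.append(f"NCCL version code {nccl_code} < {nccl_version_code(*MIN_NCCL)}")
    if not has_multicast:
        failures.append("CUDA multicast/NVLS not supported")
    return failures
```

Collecting all failures instead of stopping at the first is a choice made for this sketch; it says nothing about how the backend reports errors.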

Confidence Score: 5/5

New foundational EP backend with no blocking correctness issues in the current code; the previously identified races, dangling pointers, and stub-signature mismatches appear addressed.

All previously raised issues (initialized_ TOCTOU, ep_group_ mutex races, desc.sizes lifetime, stub signature mismatch, comm_window.h NCCL header dependency, max_token_bytes hardcoding, empty arch_list guard) appear resolved in the current revision. The new findings are limited to an LRU-eviction latency concern and a missing first-call guard for mismatched layer_cfg — both non-blocking robustness issues that don't affect correct usage of the API.

transformer_engine/common/ep/ep_backend.cpp — LRU eviction path and prepare_handle_locked deserve a second look before the framework bindings land on top of this.
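
The eviction concern is about how long the backend mutex is held. One common way to shorten that window is to unlink the victim under the lock and run the expensive teardown after releasing it. The sketch below shows that pattern in Python with an OrderedDict. It is not the EPBackend implementation, and create/destroy are placeholder callbacks standing in for handle initialization and ncclEpHandleDestroy. Whether deferring the destroy is safe in the real backend depends on stream ordering and handle lifetime, which this toy ignores.

```python
import threading
from collections import OrderedDict


class LruHandleCache:
    """Key-indexed LRU cache that destroys evicted handles outside the lock."""

    def __init__(self, capacity, create, destroy):
        self._capacity = capacity
        self._create = create      # placeholder: builds a handle for a key
        self._destroy = destroy    # placeholder: expensive teardown
        self._entries = OrderedDict()
        self._lock = threading.Lock()

    def get(self, key):
        evicted = []
        with self._lock:
            if key in self._entries:
                self._entries.move_to_end(key)  # mark as most recently used
                handle = self._entries[key]
            else:
                handle = self._create(key)
                self._entries[key] = handle
                while len(self._entries) > self._capacity:
                    # Unlink the least recently used entry while locked...
                    _, victim = self._entries.popitem(last=False)
                    evicted.append(victim)
        # ...but run the slow teardown only after the lock is released.
        for victim in evicted:
            self._destroy(victim)
        return handle
```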

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/common/ep/ep_backend.cpp | Core NCCL EP backend singleton — implements LRU handle cache, per-step ops (prepare/dispatch/combine/bwd), mutex discipline, and dtype validation. Previous thread issues (dangling sizes pointer, initialized_ TOCTOU, ep_group_ races, register_layer config mismatch) appear addressed; one new concern about ncclEpHandleDestroy called under mutex during LRU eviction. |
| transformer_engine/common/ep/ep_api.cpp | C API shim: thin delegations to EPBackend when built with NCCL EP; complete throwing stubs in the same file under !NVTE_WITH_NCCL_EP guard. Stub signatures match ep.h; the previously flagged stub-divergence issue is resolved by co-locating stubs in this file. |
| transformer_engine/common/include/transformer_engine/ep.h | Public C API header — well-structured with clear Doxygen, NVTEEpGroupConfig/NVTEEpLayerConfig, and lifecycle + per-step function declarations. Layer config mismatch between handle_mem_size and prepare is not enforced; no ABI version field (acknowledged TODO). |
| transformer_engine/common/include/transformer_engine/comm_window.h | Public header for NVTECommWindow. Uses forward declaration of struct ncclWindow_vidmem instead of #include <nccl.h>, which resolves the build-time NCCL dependency but couples the public ABI to an internal NCCL type name. |
| setup.py | Build wiring for NCCL EP submodule: native arch expansion via nvidia-smi, empty arch_list guard (RuntimeError), NCCL_HOME warning on invalid path. Previously flagged issues with native-arch handling and empty gencode appear addressed. |
| transformer_engine/common/ep/ep_backend.h | Internal EPBackend singleton declaration. LRU cache (list + unordered_map), per-member mutex discipline, atomic initialized_ flag, and fallback_layer_cfg_ optional for XLA buffer-relocation workaround are all clearly documented. |
| tests/cpp_distributed/test_ep.cu | Distributed C++ tests covering prepare/dispatch/combine and their backward passes with closed-form expected values. Deterministic routing and BF16/FP16/FP32 type coverage look solid. |
| qa/L1_cpp_distributed/test.sh | QA harness updated to build test_comm_gemm and test_ep independently, accumulate failures instead of hard-stopping, and write JUnit XML. Error handling is improved over the original set -e approach. |
| tests/cpp_distributed/run_test_ep.sh | MPI launch wrapper for EP tests. Pre-Hopper SM check via nvidia-smi exits cleanly with code 0; GTEST_XML_PREFIX per-rank output avoids write races. set -euo pipefail is appropriate here. |

Sequence Diagram

sequenceDiagram
    participant FW as Framework (PyTorch/JAX)
    participant API as nvte_ep_* (ep_api.cpp)
    participant BE as EPBackend singleton
    participant Cache as LRU Handle Cache
    participant NCCL as NCCL EP (ncclEp*)

    FW->>API: nvte_ep_initialize(ep_comm, group_config)
    API->>BE: EPBackend::initialize()
    BE->>NCCL: ncclGetVersion() [version gate]
    BE->>NCCL: ncclEpCreateGroup(ep_group_)

    FW->>API: nvte_ep_handle_mem_size(layer_cfg)
    API->>BE: handle_mem_size() [acquires mutex]
    BE->>NCCL: ncclEpHandleMemSize()
    BE-->>FW: hm_size bytes

    Note over FW: allocate handle_mem[hm_size]

    FW->>API: nvte_ep_prepare(handle_mem, topk_idx, ...)
    API->>BE: prepare() [acquires mutex]
    BE->>Cache: prepare_handle_locked(handle_mem, layer_cfg)
    Cache->>NCCL: ncclEpInitHandle() [on cache miss]
    BE->>NCCL: ncclEpUpdateHandle() [routing AllGather, async stream]

    FW->>API: nvte_ep_dispatch(handle_mem, tokens, ...)
    API->>BE: dispatch() [acquires mutex]
    BE->>Cache: lookup_handle_locked(handle_mem)
    BE->>NCCL: ncclEpDispatch() [async stream]

    FW->>API: nvte_ep_combine(handle_mem, expert_out, ...)
    API->>BE: combine() [acquires mutex]
    BE->>Cache: lookup_handle_locked(handle_mem)
    BE->>NCCL: ncclEpCombine() [async stream]

    FW->>API: nvte_ep_shutdown()
    API->>BE: EPBackend::shutdown() [acquires mutex]
    BE->>NCCL: ncclEpHandleDestroy() [each LRU entry]
    BE->>NCCL: ncclEpGroupDestroy(ep_group_)

Reviews (21): Last reviewed commit: "make core to be RTLD_LAZY"

ptrendx · 2026-05-27T19:04:13Z

+typedef struct {
+  int ep_size;             /*!< EP world size. */
+  int num_experts;         /*!< Total experts across all ranks. */
+  int max_tokens_per_rank; /*!< Upper bound on tokens this rank sends per dispatch. */
+  /*! Upper bound on tokens received per dispatch (worst-case top_k fan-out; must be > 0). */
+  int max_recv_tokens_per_rank;
+  int hidden_dim;  /*!< Token hidden dimension. */
+  int max_num_sms; /*!< Max SMs for EP kernels. 0 = auto. */
+  /*! 0 (default): throw on relocated handle_mem for a cached handle_id. 1: silently rebuild. */
+  int allow_handle_mem_reloc;
+} NVTEEpGroupConfig;
+
+/*! \brief Per-layer EP configuration. */
+typedef struct {
+  int num_local_experts; /*!< Reserved for ABI stability (derived from group config). */
+  int top_k;             /*!< Per-token expert fan-out. Required. */
+  size_t dispatch_output_per_expert_alignment;
+  /*!< Per-expert zone alignment in tokens (pow2; 0/1 = no padding). Must match
+   *   between nvte_ep_register_layer and nvte_ep_prepare. */
+} NVTEEpLayerConfig;


If we make this a public API then we should probably version those?

Hi, no other TE public struct is versioned, so I think EP should follow the same convention for now. We can add versioning for all structs in a follow-up PR.

A lot of other structs like this (like the quantization config) are opaque though. I'm fine with it being not versioned for now if we are fairly sure that there is not going to be churn with them.

…cclWindow in public header Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

…t failures in L1 CI Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

for more information, see https://pre-commit.ci

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

…irun in run_test_ep.sh Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

… NCCL EP build Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

…de dir from its prefix Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

…routing to global counter Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

…ader version log Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

…ticast Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

…CL EP Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

phu0ngng · 2026-06-12T19:14:09Z

/te-ci L1

phu0ngng · 2026-06-12T20:10:57Z

Pipeline 54600556

timmoon10 · 2026-06-12T22:25:33Z

Nit: expert_parallel.h or expert_parallelism.h would be a less ambiguous file name, especially in a public-facing header.

Asking a few AIs "What does EP mean in deep learning?":

Claude: "Expectation Propagation", "Energy Propagation", or "Episode"

Gemini: "Epoch", various non-MoE models (EP-DNN, EP-PINN, EP-RNN)

ChatGPT: "Expert Parallelism", "Epoch", "Embedded Projection"

timmoon10 · 2026-06-12T22:33:08Z

EP optimization is specific to particular MoE models and workflows, and the field is advancing too quickly for me to have confidence that we can treat it as stable.

Best to get the design right before anyone starts using this code. Opaque structs are better for experimentation since we can more easily add or deprecate options, and it also allows us to have defaults that are more complicated than zero-initialization. It is more annoying to implement, but we already have a few examples and it's something Codex can do very well.
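
To illustrate the trade-off being argued here, the following Python sketch models an opaque config: callers hold a handle and read or write options by attribute ID, so options can be added later and defaults need not be zero. Everything in it (EpConfigAttr, OpaqueEpConfig, the chosen attributes and default values) is hypothetical and is not an existing or proposed TE API; a C version would expose create/set/get/destroy functions instead of a class.

```python
from enum import Enum


class EpConfigAttr(Enum):
    """Hypothetical attribute IDs; new ones can be appended without breaking callers."""
    ALIGNMENT = 0
    MAX_NUM_SMS = 1
    ALLOW_HANDLE_MEM_RELOC = 2


class OpaqueEpConfig:
    """Callers never see the fields, only get/set by attribute ID."""

    # Defaults need not be zero, unlike a zero-initialized plain struct.
    _DEFAULTS = {
        EpConfigAttr.ALIGNMENT: 1,
        EpConfigAttr.MAX_NUM_SMS: 0,
        EpConfigAttr.ALLOW_HANDLE_MEM_RELOC: 0,
    }

    def __init__(self):
        self._values = dict(self._DEFAULTS)

    def set(self, attr: EpConfigAttr, value: int) -> None:
        if attr not in self._values:
            raise KeyError(f"unknown attribute: {attr}")
        self._values[attr] = value

    def get(self, attr: EpConfigAttr) -> int:
        return self._values[attr]
```

Because callers only name the attributes they care about, adding a fourth attribute with its own default would not change any existing call site, which is the flexibility the comment is asking for.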

timmoon10 · 2026-06-12T23:12:37Z

    return [remove_dups(reqs) for reqs in [install_reqs, test_reqs]]


+def _discover_nccl_home() -> str:


This could be a standalone utility function in build_utils/utils.py. It's actually quite general and not specific to NCCL EP.
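
For context, the resolution order given in the PR description (explicit env, then /opt/nvidia/nccl, /usr/local/nccl, /usr, then an ldconfig -p fallback) could look roughly like this as a standalone helper. This is a sketch, not the PR's _discover_nccl_home: the function names, the include/nccl.h probe, and the way a prefix is derived from the ldconfig line are assumptions, and the environment and ldconfig output are passed in as arguments so the logic can be exercised without touching the system.

```python
import os
import re
from typing import Callable, Mapping, Optional

DEFAULT_PREFIXES = ("/opt/nvidia/nccl", "/usr/local/nccl", "/usr")


def has_nccl_header(prefix: str) -> bool:
    """True if <prefix>/include/nccl.h exists on this machine."""
    return os.path.isfile(os.path.join(prefix, "include", "nccl.h"))


def prefix_from_ldconfig(ldconfig_output: str) -> Optional[str]:
    """Derive an install prefix from an `ldconfig -p` line that lists libnccl.so."""
    for line in ldconfig_output.splitlines():
        match = re.search(r"=>\s*(\S*/libnccl\.so\S*)", line)
        if not match:
            continue
        parts = match.group(1).split("/")
        # Walk up to the nearest lib or lib64 component and keep what precedes it.
        for i in range(len(parts) - 1, 0, -1):
            if parts[i] in ("lib", "lib64"):
                return "/".join(parts[:i]) or "/"
    return None


def discover_nccl_home(
    env: Mapping[str, str],
    ldconfig_output: str = "",
    header_exists: Callable[[str], bool] = has_nccl_header,
) -> Optional[str]:
    """Explicit NCCL_HOME, then well-known prefixes, then the ldconfig cache."""
    explicit = env.get("NCCL_HOME")
    if explicit:
        return explicit
    for prefix in DEFAULT_PREFIXES:
        if header_exists(prefix):
            return prefix
    return prefix_from_ldconfig(ldconfig_output)
```

A caller would typically pass os.environ and the captured output of `ldconfig -p`; nothing in the helper is specific to NCCL EP, which is the point of the suggestion.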

timmoon10 · 2026-06-12T23:37:07Z

+            print(f"[NCCL EP] No arch >= 90 in NVTE_CUDA_ARCHS ('{archs}'); skipping build.")
+            build_with_nccl_ep = False
+    if build_with_nccl_ep:
+        nccl_home = build_nccl_ep_submodule()


Flagging that we compile libnccl_ep.a at a different time than when we compile libtransformer_engine.so.

libnccl_ep.a: Compiled during the Setuptools configuration stage, while processing setup.py.

libtransformer_engine.so: Compiled during the Setuptools build stage, configured via a custom setuptools.Extension.

It's not obviously wrong, but it does seem messy. I suppose there are also some edge case problems (if NCCL is not installed but is listed as a dependency, compiling NCCL EP in the configuration stage will fail but building during the build stage can succeed).

It seems the "proper" approach would be to treat libnccl_ep.a compilation as basically a part of compiling libtransformer_engine.so. However, CMakeExtension is a very general class for CMake projects and I'm not sure how messy it would be to add the NCCL EP compilation logic.
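
For reference, the arch gate in the hunk above, together with the rule from the PR description that an explicit NVTE_BUILD_WITH_NCCL_EP=1 with all archs < 90 is a user error, amounts to a small decision function. The sketch below is an approximation: the function names are made up, the arch string is assumed to be a semicolon- or comma-separated list such as '80;90a', and 'native' expansion (which the PR does via nvidia-smi) is left out.

```python
from typing import Optional


def parse_archs(archs: str) -> list:
    """Turn an NVTE_CUDA_ARCHS-style string such as '80;90a' into [80, 90]."""
    result = []
    for token in archs.replace(",", ";").split(";"):
        digits = "".join(ch for ch in token.strip() if ch.isdigit())
        if digits:
            result.append(int(digits))
    return result


def should_build_nccl_ep(archs: str, explicit: Optional[str]) -> bool:
    """explicit is the raw NVTE_BUILD_WITH_NCCL_EP value, or None if unset."""
    if explicit == "0":
        return False  # explicit opt-out always wins
    has_sm90 = any(arch >= 90 for arch in parse_archs(archs))
    if not has_sm90:
        if explicit == "1":
            raise RuntimeError(f"NCCL EP requested but no arch >= 90 in '{archs}'")
        return False  # default: skip the build quietly
    return True
```

Keeping the decision in a pure function like this would also make it straightforward to call from either the configuration stage or the build stage.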

timmoon10 · 2026-06-12T23:38:43Z

+        )
+    gencode = " ".join(f"-gencode=arch=compute_{a},code=sm_{a}" for a in arch_list)
+
+    nproc = os.cpu_count() or 8


This seems likely to break users with wimpy nodes. We can reuse an existing API to control build parallelism:

Suggested change

-    nproc = os.cpu_count() or 8
+    nproc = get_max_jobs_for_parallel_build()
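
A small sketch of the two pieces touched by this hunk: the gencode string and a job count that honours a user-supplied cap before falling back to the CPU count. It does not reproduce get_max_jobs_for_parallel_build; pick_job_count and the MAX_JOBS variable name are assumptions used only to show why a cap helps on small build machines.

```python
def gencode_flags(arch_list):
    """Build nvcc -gencode flags, mirroring the join in the hunk above."""
    return " ".join(f"-gencode=arch=compute_{a},code=sm_{a}" for a in arch_list)


def pick_job_count(env, cpu_count, default=8):
    """Prefer an explicit cap from the environment over the raw CPU count."""
    raw = env.get("MAX_JOBS", "")  # assumed variable name, for illustration only
    if raw.isdigit() and int(raw) > 0:
        return int(raw)
    return cpu_count or default
```

In a real build script the caller would pass os.environ and os.cpu_count(), so a user on a memory-constrained node can lower the parallelism without editing setup.py.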

phu0ngng · 2026-06-13T00:29:18Z

Pipeline 54600556 - all tests passed.

timmoon10

This still has some minor issues, but deadlines are what they are and the changes are well-contained. We should incorporate suggestions into future PRs.

Revert "Expert Parallelism: common C API + NCCL EP backend (#3034)" This reverts commit c3396ee. Signed-off-by: Tim Moon <tmoon@nvidia.com>

phu0ngng requested a review from ptrendx as a code owner May 22, 2026 02:42

greptile-apps Bot reviewed May 22, 2026

Comment thread transformer_engine/common/ep/ep_backend.cpp Outdated

Comment thread transformer_engine/common/ep/ep_backend.cpp

Comment thread setup.py Outdated

Comment thread setup.py

This was referenced May 22, 2026:

- [PyTorch] Expert Parallelism: PyTorch wrapper + autograd ops with symm-mem zero-copy #3035 (Open)
- [JAX] Expert Parallelism: JAX primitives + VJPs #3036 (Open)
- [Common] Initial NCCL EP integration + Distributed CPP unit tests #3023 (Open)

phu0ngng requested a review from timmoon10 May 22, 2026 16:17

greptile-apps Bot reviewed May 22, 2026

Comment thread transformer_engine/common/ep/ep_backend.cpp Outdated

phu0ngng force-pushed the phuong/ep-2-commwindow branch from 099857f to 17e5126 Compare May 22, 2026 23:07

ptrendx reviewed May 26, 2026

Comment thread tests/cpp_distributed/CMakeLists.txt Outdated

ptrendx reviewed May 26, 2026

Comment thread tests/cpp_distributed/CMakeLists.txt Outdated