Expert Parallelism: common C API + NCCL EP backend#3034

Merged
timmoon10 merged 55 commits into NVIDIA:main from phu0ngng:phuong/ep-2-commwindow
Jun 13, 2026

@phu0ngng

Collaborator

Summary

First PR in the TE Expert Parallelism (EP) series. Lands the common C API and NCCL EP backend that later framework PRs (PyTorch, JAX) build on. No Python bindings yet — common-lib foundation plus build wiring only. Build/load works on any arch; SM and NCCL version gates fire at runtime.

Every network-bound payload tensor takes an optional NVTECommWindow. When a window is provided, the backend uses NCCL EP's symmetric-memory zero-copy path, which skips the device-to-device (D2D) memcpy from the user buffers into the symmetric staging buffers.
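
The window-or-fallback choice described above can be modeled in a few lines. This is a minimal Python sketch of the selection logic only, not the C++ implementation: `CommWindow`, `Payload`, and `make_payload` are hypothetical stand-ins for NVTECommWindow, the per-call payload descriptor, and the backend helper, and plain integers stand in for device pointers and window handles.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CommWindow:
    """Hypothetical stand-in for NVTECommWindow: window handle plus offset."""
    window: Optional[int] = None  # None models window == nullptr
    offset: int = 0


@dataclass
class Payload:
    """Hypothetical stand-in for the per-call payload descriptor."""
    data_ptr: Optional[int] = None
    win_hdl: Optional[int] = None
    win_offset: int = 0

    @property
    def zero_copy(self) -> bool:
        # A payload described by a window handle takes the zero-copy path.
        return self.win_hdl is not None


def make_payload(data_ptr: int, win: CommWindow) -> Payload:
    # Window provided: describe the payload by (window, offset) so the
    # staging copy can be skipped.
    if win.window is not None:
        return Payload(win_hdl=win.window, win_offset=win.offset)
    # No window: fall back to the raw device pointer (staged-copy path).
    return Payload(data_ptr=data_ptr)
```

Under these assumptions, passing a default-constructed `CommWindow()` is how a caller opts out of zero-copy for a given tensor; the other tensors in the same call can still carry windows.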

Implementation

Public C API (transformer_engine/common/include/transformer_engine/{ep.h,comm_window.h})

Types: NVTEEpGroupConfig, NVTEEpLayerConfig, NVTEEpHandle, NVTECommWindow (side-band {ncclWindow_t window, size_t offset}; NCCL peer handles are not carried on NVTETensor).

Lifecycle (host-only, eager):

void     nvte_ep_initialize(void* ep_comm, NVTEEpGroupConfig group_config);
void     nvte_ep_shutdown(void);

uint64_t nvte_ep_register_layer(NVTEEpLayerConfig layer_config, size_t* handle_mem_size);
  • nvte_ep_initialize — borrow an external ncclComm_t for the EP sub-group and init the singleton backend.

  • nvte_ep_shutdown — tear down the backend; idempotent; does not destroy ep_comm.

  • nvte_ep_register_layer — reserve a handle_id for a layer config and report the handle_mem buffer size the caller must allocate. The pair {id, mem} becomes the per-step NVTEEpHandle.

Per-step (allocation-free, CUDA-graph capturable):

void nvte_ep_prepare(NVTEEpHandle handle, NVTETensor topk_idx, NVTETensor token_counts,
                     size_t dispatch_output_per_expert_alignment, cudaStream_t stream);

void nvte_ep_dispatch(NVTEEpHandle handle, NVTETensor topk_idx,
                      NVTETensor tokens, NVTECommWindow tokens_win,
                      NVTETensor topk_weights, NVTECommWindow topk_weights_win,
                      NVTETensor recv_tokens, NVTECommWindow recv_tokens_win,
                      NVTETensor recv_topk_weights,  NVTECommWindow recv_topk_weights_win,
                      cudaStream_t stream);

void nvte_ep_combine(NVTEEpHandle handle, NVTETensor expert_out, NVTECommWindow expert_out_win,
                     NVTETensor result, cudaStream_t stream);

void nvte_ep_dispatch_bwd(NVTEEpHandle handle, NVTETensor grad, NVTECommWindow grad_win,
                          NVTETensor g_recv_topk_weights, NVTECommWindow g_recv_topk_weights_win,
                          NVTETensor grad_tokens, NVTETensor grad_topk_weights, cudaStream_t stream);

void nvte_ep_combine_bwd(NVTEEpHandle handle, NVTETensor grad, NVTECommWindow grad_win,
                         NVTETensor grad_expert_out, NVTECommWindow grad_expert_out_win,
                         cudaStream_t stream);
  • nvte_ep_prepare — all-gather the routing map and write routing maps to handle.mem.
  • nvte_ep_dispatch — scatter tokens and routing weights from source ranks to expert ranks. tokens, topk_weights, recv_tokens, recv_topk_weights each accept an optional symm-mem window for zero-copy.
  • nvte_ep_combine — scatter-sum expert outputs back to source ranks (unweighted; caller pre-multiplies by recv_topk_weights). expert_out accepts a window.
  • nvte_ep_dispatch_bwd — backward of dispatch; routes token and weight grads back to source. grad and g_recv_topk_weights accept windows; the gathered outputs (grad_tokens, grad_topk_weights) do not.
  • nvte_ep_combine_bwd — backward of combine; grad and grad_expert_out accept windows. Padded slots in grad_expert_out are zeroed.
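
To make the "unweighted combine" contract concrete, here is a single-process toy model in Python. It is a sketch under simplifying assumptions, not the NCCL EP data path: tokens are scalars, there is one rank, and the `dispatch`/`combine` helpers and the toy expert function (multiply by `e + 1`) are hypothetical. It shows that the caller must apply the received routing weights before combine, because combine only sums.

```python
from collections import defaultdict


def dispatch(tokens, topk_idx, topk_weights):
    """Route a copy of each token (and its routing weight) to each chosen expert."""
    recv = defaultdict(list)  # expert id -> list of (source index, token, weight)
    for src, (tok, experts, weights) in enumerate(zip(tokens, topk_idx, topk_weights)):
        for e, w in zip(experts, weights):
            recv[e].append((src, tok, w))
    return recv


def combine(expert_out, num_tokens):
    """Unweighted sum of expert outputs back at each source token."""
    result = [0.0] * num_tokens
    for rows in expert_out.values():
        for src, value in rows:
            result[src] += value
    return result


tokens = [1.0, 2.0]
topk_idx = [[0, 1], [1, 2]]              # top_k = 2
topk_weights = [[0.25, 0.75], [0.5, 0.5]]

recv = dispatch(tokens, topk_idx, topk_weights)
# Toy expert e computes tok * (e + 1); the caller pre-multiplies by the
# received routing weight, since combine itself applies no weights.
expert_out = {
    e: [(src, w * (tok * (e + 1))) for src, tok, w in rows]
    for e, rows in recv.items()
}
result = combine(expert_out, len(tokens))  # [1.75, 5.0]
```

Skipping the pre-multiplication would give `[3.0, 10.0]` here, the plain sum of expert outputs.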

Backend + build

  • NCCL EP backend (transformer_engine/common/ep/): EPBackend singleton, HT-mode dispatch/combine over NCCL EP (libnccl_ep.so), group/layer registration. Internal helper make_payload_tensor() builds the per-call ncclEpTensor_t: when the caller's NVTECommWindow.window != nullptr it sets win_hdl + win_offset (zero-copy); otherwise it sets data from nvte_tensor_data(t) (HBM fallback).
  • Runtime gates (in EPBackend::initialize): SM>=90 (via cudaDeviceGetAttribute), NCCL>=2.30.4 (via ncclGetVersion), CUDA multicast/NVLS support.
  • Stub path: when NVTE_WITH_NCCL_EP=OFF, ep/ep_api_stub.cpp provides throwing nvte_ep_* stubs so framework bindings link unconditionally; failure surfaces at first nvte_ep_initialize.
  • Build wiring
    • setup.py builds libnccl_ep.so from 3rdparty/nccl by default and auto-disables NCCL EP when no requested CUDA arch is >= 90. Explicit NVTE_BUILD_WITH_NCCL_EP=1 with all archs < 90 is treated as a user error; set NVTE_BUILD_WITH_NCCL_EP=0 to opt out.
    • NCCL_HOME resolved dynamically: explicit env → /opt/nvidia/nccl, /usr/local/nccl, /usr → ldconfig -p fallback.
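
The runtime and build-time gates above reduce to two small checks, sketched here in Python. This is illustrative only: the function names are hypothetical, architectures are modeled as plain integers (suffixes such as "90a" are not handled), and the CUDA multicast/NVLS gate is omitted. The version decoding assumes the NCCL >= 2.9 encoding of the integer reported by ncclGetVersion, `major * 10000 + minor * 100 + patch`.

```python
MIN_SM = 90
MIN_NCCL = (2, 30, 4)


def decode_nccl_version(code: int) -> tuple:
    """Decode a version integer, assuming major*10000 + minor*100 + patch."""
    return (code // 10000, (code // 100) % 100, code % 100)


def runtime_gate_ok(sm: int, nccl_code: int) -> bool:
    """Runtime gate: SM >= 90 and NCCL >= 2.30.4 (multicast check omitted)."""
    return sm >= MIN_SM and decode_nccl_version(nccl_code) >= MIN_NCCL


def should_build_nccl_ep(archs, explicit=None) -> bool:
    """Build-time gate.

    archs:    requested CUDA archs as ints, e.g. [80, 90].
    explicit: None if the opt-in flag is unset, else True/False.
    """
    has_sm90_plus = any(a >= MIN_SM for a in archs)
    if explicit is False:
        return False  # user opted out
    if explicit is True and not has_sm90_plus:
        # Explicit opt-in with only pre-SM90 archs is a user error.
        raise RuntimeError("NCCL EP requested but no CUDA arch >= 90")
    return has_sm90_plus  # default: auto-enable only when an arch qualifies
```

For example, `decode_nccl_version(23004)` yields `(2, 30, 4)`, so a device reporting SM 90 with that NCCL version passes the runtime gate, while SM 89 does not.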

Testing

  • C++ distributed tests under tests/cpp_distributed/.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@phu0ngng phu0ngng requested a review from ptrendx as a code owner May 22, 2026 02:42
@greptile-apps

greptile-apps Bot commented May 22, 2026

Contributor

Greptile Summary

This PR lands the Expert Parallelism (EP) common C API and NCCL EP backend for TransformerEngine, providing the foundation for upcoming PyTorch and JAX framework bindings. It adds the nvte_ep_* C API in ep.h/comm_window.h, an EPBackend singleton backed by a pointer-keyed LRU handle cache, and build wiring that auto-gates on SM ≥ 90 / NCCL ≥ 2.30.4 at both configure and runtime.

  • Public C API (ep.h, comm_window.h): lifecycle (initialize/shutdown/handle_mem_size), per-step CUDA-graph-capturable ops (prepare/dispatch/combine and their backward passes), and NVTECommWindow for optional zero-copy symmetric-memory payloads. The stubs for the NVTE_WITH_NCCL_EP=OFF build path are co-located in ep_api.cpp under a preprocessor guard, ensuring signature consistency.
  • EPBackend singleton (ep_backend.cpp/ep_backend.h): implements an LRU handle cache keyed on handle_mem device pointer (with an XLA buffer-relocation fallback via fallback_layer_cfg_), dtype-width validation against the group's max_token_dtype, and a caller-owned NVTEShape pattern to keep ncclEpTensor_t.sizes pointers valid across NCCL calls.
  • Build wiring (setup.py, CMakeLists.txt): builds libnccl_ep.a from the 3rdparty/nccl submodule, expands NVTE_CUDA_ARCHS=native via nvidia-smi, raises a RuntimeError when no SM ≥ 90 arch is found after expansion, and warns when NCCL_HOME points to a path without nccl.h.
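
For readers unfamiliar with the pointer-keyed LRU handle cache mentioned in this summary, the following Python sketch shows the general technique. It is not the EPBackend code: `HandleCache` and its `create`/`destroy` callbacks are hypothetical, integers stand in for handle_mem device pointers, and locking is not modeled.

```python
from collections import OrderedDict


class HandleCache:
    """Toy LRU cache keyed on the handle_mem address (an int here)."""

    def __init__(self, capacity, create, destroy):
        self.capacity = capacity
        self.create = create      # called on a cache miss
        self.destroy = destroy    # called on the evicted handle
        self.entries = OrderedDict()

    def get(self, handle_mem_ptr):
        if handle_mem_ptr in self.entries:
            # Hit: mark as most recently used.
            self.entries.move_to_end(handle_mem_ptr)
            return self.entries[handle_mem_ptr]
        # Miss: build a new handle, then evict the least recently used
        # entry if the cache is over capacity.
        handle = self.create(handle_mem_ptr)
        self.entries[handle_mem_ptr] = handle
        if len(self.entries) > self.capacity:
            _, evicted = self.entries.popitem(last=False)
            self.destroy(evicted)
        return handle
```

In this sketch the eviction callback runs inline on the miss path, so a slow `destroy` delays the caller, which is the shape of the eviction-latency concern raised in the review.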

Confidence Score: 5/5

New foundational EP backend with no blocking correctness issues in the current code; the previously identified races, dangling pointers, and stub-signature mismatches appear addressed.

All previously raised issues (initialized_ TOCTOU, ep_group_ mutex races, desc.sizes lifetime, stub signature mismatch, comm_window.h NCCL header dependency, max_token_bytes hardcoding, empty arch_list guard) appear resolved in the current revision. The new findings are limited to an LRU-eviction latency concern and a missing first-call guard for mismatched layer_cfg — both non-blocking robustness issues that don't affect correct usage of the API.

transformer_engine/common/ep/ep_backend.cpp — LRU eviction path and prepare_handle_locked deserve a second look before the framework bindings land on top of this.

Important Files Changed

| Filename | Overview |
|---|---|
| transformer_engine/common/ep/ep_backend.cpp | Core NCCL EP backend singleton — implements LRU handle cache, per-step ops (prepare/dispatch/combine/bwd), mutex discipline, and dtype validation. Previous thread issues (dangling sizes pointer, initialized_ TOCTOU, ep_group_ races, register_layer config mismatch) appear addressed; one new concern about ncclEpHandleDestroy called under mutex during LRU eviction. |
| transformer_engine/common/ep/ep_api.cpp | C API shim: thin delegations to EPBackend when built with NCCL EP; complete throwing stubs in the same file under !NVTE_WITH_NCCL_EP guard. Stub signatures match ep.h; the previously flagged stub-divergence issue is resolved by co-locating stubs in this file. |
| transformer_engine/common/include/transformer_engine/ep.h | Public C API header — well-structured with clear Doxygen, NVTEEpGroupConfig/NVTEEpLayerConfig, and lifecycle + per-step function declarations. Layer config mismatch between handle_mem_size and prepare is not enforced; no ABI version field (acknowledged TODO). |
| transformer_engine/common/include/transformer_engine/comm_window.h | Public header for NVTECommWindow. Uses forward declaration of struct ncclWindow_vidmem instead of #include <nccl.h>, which resolves the build-time NCCL dependency but couples the public ABI to an internal NCCL type name. |
| setup.py | Build wiring for NCCL EP submodule: native arch expansion via nvidia-smi, empty arch_list guard (RuntimeError), NCCL_HOME warning on invalid path. Previously flagged issues with native-arch handling and empty gencode appear addressed. |
| transformer_engine/common/ep/ep_backend.h | Internal EPBackend singleton declaration. LRU cache (list + unordered_map), per-member mutex discipline, atomic initialized_ flag, and fallback_layer_cfg_ optional for XLA buffer-relocation workaround are all clearly documented. |
| tests/cpp_distributed/test_ep.cu | Distributed C++ tests covering prepare/dispatch/combine and their backward passes with closed-form expected values. Deterministic routing and BF16/FP16/FP32 type coverage look solid. |
| qa/L1_cpp_distributed/test.sh | QA harness updated to build test_comm_gemm and test_ep independently, accumulate failures instead of hard-stopping, and write JUnit XML. Error handling is improved over the original set -e approach. |
| tests/cpp_distributed/run_test_ep.sh | MPI launch wrapper for EP tests. Pre-Hopper SM check via nvidia-smi exits cleanly with code 0; GTEST_XML_PREFIX per-rank output avoids write races. set -euo pipefail is appropriate here. |

Sequence Diagram

sequenceDiagram
    participant FW as Framework (PyTorch/JAX)
    participant API as nvte_ep_* (ep_api.cpp)
    participant BE as EPBackend singleton
    participant Cache as LRU Handle Cache
    participant NCCL as NCCL EP (ncclEp*)

    FW->>API: nvte_ep_initialize(ep_comm, group_config)
    API->>BE: EPBackend::initialize()
    BE->>NCCL: ncclGetVersion() [version gate]
    BE->>NCCL: ncclEpCreateGroup(ep_group_)

    FW->>API: nvte_ep_handle_mem_size(layer_cfg)
    API->>BE: handle_mem_size() [acquires mutex]
    BE->>NCCL: ncclEpHandleMemSize()
    BE-->>FW: hm_size bytes

    Note over FW: allocate handle_mem[hm_size]

    FW->>API: nvte_ep_prepare(handle_mem, topk_idx, ...)
    API->>BE: prepare() [acquires mutex]
    BE->>Cache: prepare_handle_locked(handle_mem, layer_cfg)
    Cache->>NCCL: ncclEpInitHandle() [on cache miss]
    BE->>NCCL: ncclEpUpdateHandle() [routing AllGather, async stream]

    FW->>API: nvte_ep_dispatch(handle_mem, tokens, ...)
    API->>BE: dispatch() [acquires mutex]
    BE->>Cache: lookup_handle_locked(handle_mem)
    BE->>NCCL: ncclEpDispatch() [async stream]

    FW->>API: nvte_ep_combine(handle_mem, expert_out, ...)
    API->>BE: combine() [acquires mutex]
    BE->>Cache: lookup_handle_locked(handle_mem)
    BE->>NCCL: ncclEpCombine() [async stream]

    FW->>API: nvte_ep_shutdown()
    API->>BE: EPBackend::shutdown() [acquires mutex]
    BE->>NCCL: ncclEpHandleDestroy() [each LRU entry]
    BE->>NCCL: ncclEpGroupDestroy(ep_group_)

Reviews (21): Last reviewed commit: "make core to be RTLD_LAZY"

Comment thread transformer_engine/common/ep/ep_backend.cpp Outdated
Comment thread transformer_engine/common/ep/ep_backend.cpp
Comment thread setup.py Outdated
Comment thread setup.py
Comment thread transformer_engine/common/ep/ep_backend.cpp Outdated
@phu0ngng phu0ngng force-pushed the phuong/ep-2-commwindow branch from 099857f to 17e5126 Compare May 22, 2026 23:07
Comment thread tests/cpp_distributed/CMakeLists.txt Outdated
Comment thread tests/cpp_distributed/CMakeLists.txt Outdated
Comment thread tests/cpp_distributed/CMakeLists.txt Outdated
Comment thread tests/cpp_distributed/test_ep_common.h Outdated
Comment thread tests/cpp_distributed/test_ep_common.h Outdated
Comment thread tests/cpp_distributed/test_ep_common.h
Comment thread tests/cpp_distributed/test_ep_init.cu Outdated
Comment thread tests/cpp_distributed/test_ep_pipeline.cu Outdated
Comment thread tests/cpp_distributed/test_ep_pipeline.cu Outdated
Comment thread tests/cpp_distributed/test_ep.cu Outdated
Comment thread tests/cpp_distributed/test_ep_pipeline.cu Outdated
Comment thread tests/cpp_distributed/test_ep_coverage.cu Outdated
Comment thread transformer_engine/common/ep/ep_backend.h Outdated
Comment thread transformer_engine/common/ep/ep_backend.h Outdated
Comment on lines +28 to +47
typedef struct {
  int ep_size; /*!< EP world size. */
  int num_experts; /*!< Total experts across all ranks. */
  int max_tokens_per_rank; /*!< Upper bound on tokens this rank sends per dispatch. */
  /*! Upper bound on tokens received per dispatch (worst-case top_k fan-out; must be > 0). */
  int max_recv_tokens_per_rank;
  int hidden_dim; /*!< Token hidden dimension. */
  int max_num_sms; /*!< Max SMs for EP kernels. 0 = auto. */
  /*! 0 (default): throw on relocated handle_mem for a cached handle_id. 1: silently rebuild. */
  int allow_handle_mem_reloc;
} NVTEEpGroupConfig;

/*! \brief Per-layer EP configuration. */
typedef struct {
  int num_local_experts; /*!< Reserved for ABI stability (derived from group config). */
  int top_k; /*!< Per-token expert fan-out. Required. */
  size_t dispatch_output_per_expert_alignment;
  /*!< Per-expert zone alignment in tokens (pow2; 0/1 = no padding). Must match
   *   between nvte_ep_register_layer and nvte_ep_prepare. */
} NVTEEpLayerConfig;

Member

If we make this a public API then we should probably version those?

Collaborator Author

Hi, no other TE public struct is versioned, so I think EP should follow the same convention for now. We can add versioning for all structs in a follow-up PR.

Member

A lot of other structs like this (like the quantization config) are opaque though. I'm fine with it being not versioned for now if we are fairly sure that there is not going to be churn with them.

Member

EP optimization is specific to particular MoE models and workflows, and the field is advancing too quickly for me to have confidence that we can treat it as stable.

Best to get the design right before anyone starts using this code. Opaque structs are better for experimentation since we can more easily add or deprecate options, and it also allows us to have defaults that are more complicated than zero-initialization. It is more annoying to implement, but we already have a few examples and it's something Codex can do very well.

Comment thread transformer_engine/common/ep/ep_backend.cpp Outdated
@phu0ngng phu0ngng force-pushed the phuong/ep-2-commwindow branch from 1e74f99 to 319f9d5 Compare June 2, 2026 23:20
Comment thread transformer_engine/common/ep/ep_backend.cpp Outdated
Comment thread transformer_engine/common/ep/ep_backend.cpp Outdated
Comment thread tests/cpp_distributed/CMakeLists.txt Outdated
Comment thread tests/cpp_distributed/CMakeLists.txt Outdated
Comment thread tests/cpp_distributed/CMakeLists.txt Outdated
phu0ngng and others added 18 commits June 12, 2026 12:12
…cclWindow in public header

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…t failures in L1 CI

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…irun in run_test_ep.sh

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
… NCCL EP build

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…de dir from its prefix

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…routing to global counter

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…ader version log

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…ticast

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…CL EP

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@phu0ngng phu0ngng force-pushed the phuong/ep-2-commwindow branch from 1abf478 to 3459461 Compare June 12, 2026 19:12
@phu0ngng

Collaborator Author

/te-ci L1

@phu0ngng

Collaborator Author

Pipeline 54600556

Member

Nit: expert_parallel.h or expert_parallelism.h would be a less ambiguous file name, especially in a public-facing header.

Asking a few AIs "What does EP mean in deep learning?":

  • Claude: "Expectation Propagation", "Energy Propagation", or "Episode"
  • Gemini: "Epoch", various non-MoE models (EP-DNN, EP-PINN, EP-RNN)
  • ChatGPT: "Expert Parallelism", "Epoch", "Embedded Projection"

Comment thread setup.py
return [remove_dups(reqs) for reqs in [install_reqs, test_reqs]]


def _discover_nccl_home() -> str:

Member

This could be a standalone utility function in build_utils/utils.py. It's actually quite general and not specific to NCCL EP.
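
A standalone helper along the lines suggested here might look like the Python sketch below. It is hypothetical and not the code under review: the name `discover_nccl_home`, the injected `env` and `exists` parameters, and the `include/nccl.h` probe are assumptions made for illustration. The prefix order follows the PR description, where the last entry (`/usr`) is a best reading of a garbled line, and the `ldconfig -p` fallback is left to the caller.

```python
import os
from pathlib import Path

# Candidate install prefixes, in the order given in the PR description.
DEFAULT_PREFIXES = ("/opt/nvidia/nccl", "/usr/local/nccl", "/usr")


def discover_nccl_home(env=None, prefixes=DEFAULT_PREFIXES, exists=None):
    """Return an NCCL install prefix, or None if no candidate matches.

    env:    mapping to read NCCL_HOME from (defaults to os.environ).
    exists: predicate for "this header file exists" (injectable for tests).
    """
    env = os.environ if env is None else env
    exists = exists or (lambda p: Path(p).is_file())

    # 1. An explicit NCCL_HOME always wins.
    explicit = env.get("NCCL_HOME")
    if explicit:
        return explicit

    # 2. Otherwise probe well-known prefixes for the NCCL header.
    for prefix in prefixes:
        if exists(f"{prefix}/include/nccl.h"):
            return prefix

    # 3. Nothing found; the caller may fall back to other discovery.
    return None
```

Injecting `env` and `exists` keeps the helper free of NCCL EP specifics and easy to unit-test without touching the filesystem.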

Comment thread setup.py
        print(f"[NCCL EP] No arch >= 90 in NVTE_CUDA_ARCHS ('{archs}'); skipping build.")
        build_with_nccl_ep = False
    if build_with_nccl_ep:
        nccl_home = build_nccl_ep_submodule()

Member

Flagging that we compile libnccl_ep.a at a different time than when we compile libtransformer_engine.so.

  • libnccl_ep.a: Compiled during the Setuptools configuration stage, while processing setup.py.
  • libtransformer_engine.so: Compiled during the Setuptools build stage, configured via a custom setuptools.Extension.

It's not obviously wrong, but it does seem messy. I suppose there are also some edge case problems (if NCCL is not installed but is listed as a dependency, compiling NCCL EP in the configuration stage will fail but building during the build stage can succeed).

It seems the "proper" approach would be to treat libnccl_ep.a compilation as basically a part of compiling libtransformer_engine.so. However, CMakeExtension is a very general class for CMake projects and I'm not sure how messy it would be to add the NCCL EP compilation logic.

Comment thread setup.py
)
gencode = " ".join(f"-gencode=arch=compute_{a},code=sm_{a}" for a in arch_list)

nproc = os.cpu_count() or 8

Member

This seems likely to break users with wimpy nodes. We can reuse an existing API to control build parallelism:

Suggested change
- nproc = os.cpu_count() or 8
+ nproc = get_max_jobs_for_parallel_build()

@phu0ngng

Collaborator Author

Pipeline 54600556 - all tests passed.

@timmoon10 timmoon10 left a comment

Member

This still has some minor issues, but deadlines are what they are and the changes are well-contained. We should incorporate suggestions into future PRs.

@timmoon10 timmoon10 merged commit c3396ee into NVIDIA:main Jun 13, 2026
22 of 34 checks passed
timmoon10 pushed a commit that referenced this pull request Jun 13, 2026
This reverts commit c3396ee.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
timmoon10 pushed a commit that referenced this pull request Jun 13, 2026
Revert "Expert Parallelism: common C API + NCCL EP backend (#3034)"

This reverts commit c3396ee.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
3 participants