Expert Parallelism: common C API + NCCL EP backend#3034
Greptile Summary
This PR lands the Expert Parallelism (EP) common C API and NCCL EP backend for TransformerEngine, providing the foundation for upcoming PyTorch and JAX framework bindings. It adds the
Confidence Score: 5/5
New foundational EP backend with no blocking correctness issues in the current code; the previously identified races, dangling pointers, and stub-signature mismatches appear addressed.
All previously raised issues (initialized_ TOCTOU, ep_group_ mutex races, desc.sizes lifetime, stub signature mismatch, comm_window.h NCCL header dependency, max_token_bytes hardcoding, empty arch_list guard) appear resolved in the current revision. The new findings are limited to an LRU-eviction latency concern and a missing first-call guard for mismatched layer_cfg; both are non-blocking robustness issues that don't affect correct usage of the API.
transformer_engine/common/ep/ep_backend.cpp: the LRU eviction path and prepare_handle_locked deserve a second look before the framework bindings land on top of this.
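To make the LRU-eviction latency concern concrete, here is a toy Python model of a handle cache that runs a destroy hook when it evicts. It is only a sketch under stated assumptions: `HandleCache`, `capacity`, and the `init`/`destroy` callables are hypothetical stand-ins for the C++ `prepare_handle_locked`/`lookup_handle_locked` paths and the `ncclEpInitHandle`/`ncclEpHandleDestroy` calls, whose real signatures are not shown on this page.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class HandleCache:
    """Toy LRU cache: builds a handle on a miss, destroys the oldest on overflow."""

    def __init__(self, capacity: int,
                 init: Callable[[Hashable], object],
                 destroy: Callable[[object], None]) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._init = init          # hypothetical stand-in for handle creation
        self._destroy = destroy    # hypothetical stand-in for handle teardown
        self._entries: "OrderedDict[Hashable, object]" = OrderedDict()

    def prepare(self, key: Hashable) -> object:
        """Return the handle for `key`, creating it (and evicting) on a miss."""
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        if len(self._entries) >= self._capacity:
            # Eviction runs the destroy hook inline, so a slow destroy
            # adds latency to this call.
            _, victim = self._entries.popitem(last=False)
            self._destroy(victim)
        handle = self._init(key)
        self._entries[key] = handle
        return handle

    def lookup(self, key: Hashable) -> object:
        """Return an existing handle; raise KeyError if it was never prepared."""
        handle = self._entries[key]
        self._entries.move_to_end(key)
        return handle

    def shutdown(self) -> None:
        """Destroy every cached handle, oldest first."""
        while self._entries:
            _, handle = self._entries.popitem(last=False)
            self._destroy(handle)
```

In this model the cost of an eviction is paid by whichever `prepare` call triggers it, which is the shape of the latency concern raised above.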
Sequence Diagram
sequenceDiagram
participant FW as Framework (PyTorch/JAX)
participant API as nvte_ep_* (ep_api.cpp)
participant BE as EPBackend singleton
participant Cache as LRU Handle Cache
participant NCCL as NCCL EP (ncclEp*)
FW->>API: nvte_ep_initialize(ep_comm, group_config)
API->>BE: EPBackend::initialize()
BE->>NCCL: ncclGetVersion() [version gate]
BE->>NCCL: ncclEpCreateGroup(ep_group_)
FW->>API: nvte_ep_handle_mem_size(layer_cfg)
API->>BE: handle_mem_size() [acquires mutex]
BE->>NCCL: ncclEpHandleMemSize()
BE-->>FW: hm_size bytes
Note over FW: allocate handle_mem[hm_size]
FW->>API: nvte_ep_prepare(handle_mem, topk_idx, ...)
API->>BE: prepare() [acquires mutex]
BE->>Cache: prepare_handle_locked(handle_mem, layer_cfg)
Cache->>NCCL: ncclEpInitHandle() [on cache miss]
BE->>NCCL: ncclEpUpdateHandle() [routing AllGather, async stream]
FW->>API: nvte_ep_dispatch(handle_mem, tokens, ...)
API->>BE: dispatch() [acquires mutex]
BE->>Cache: lookup_handle_locked(handle_mem)
BE->>NCCL: ncclEpDispatch() [async stream]
FW->>API: nvte_ep_combine(handle_mem, expert_out, ...)
API->>BE: combine() [acquires mutex]
BE->>Cache: lookup_handle_locked(handle_mem)
BE->>NCCL: ncclEpCombine() [async stream]
FW->>API: nvte_ep_shutdown()
API->>BE: EPBackend::shutdown() [acquires mutex]
BE->>NCCL: ncclEpHandleDestroy() [each LRU entry]
BE->>NCCL: ncclEpGroupDestroy(ep_group_)
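The version gate at the top of the diagram compares the integer reported by `ncclGetVersion()` against a minimum. As a hedged illustration (in Python for brevity; the backend does this in C++), the sketch below encodes versions the way the `NCCL_VERSION` macro in `nccl.h` does and checks them against 2.30.4, the minimum quoted in the PR description. The helper names are hypothetical.

```python
def nccl_version_code(major: int, minor: int, patch: int) -> int:
    """Encode a version like nccl.h's NCCL_VERSION(X, Y, Z) macro."""
    if major <= 2 and minor <= 8:
        # Releases up to 2.8 used a narrower encoding.
        return major * 1000 + minor * 100 + patch
    return major * 10000 + minor * 100 + patch


# Minimum taken from the PR description (NCCL >= 2.30.4).
MIN_NCCL_FOR_EP = nccl_version_code(2, 30, 4)


def nccl_supports_ep(runtime_code: int) -> bool:
    """True if the code reported by ncclGetVersion() meets the minimum."""
    return runtime_code >= MIN_NCCL_FOR_EP
```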
Reviews (21): Last reviewed commit: "make core to be RTLD_LAZY"
099857f to
17e5126
Compare
| typedef struct { | ||
| int ep_size; /*!< EP world size. */ | ||
| int num_experts; /*!< Total experts across all ranks. */ | ||
| int max_tokens_per_rank; /*!< Upper bound on tokens this rank sends per dispatch. */ | ||
| /*! Upper bound on tokens received per dispatch (worst-case top_k fan-out; must be > 0). */ | ||
| int max_recv_tokens_per_rank; | ||
| int hidden_dim; /*!< Token hidden dimension. */ | ||
| int max_num_sms; /*!< Max SMs for EP kernels. 0 = auto. */ | ||
| /*! 0 (default): throw on relocated handle_mem for a cached handle_id. 1: silently rebuild. */ | ||
| int allow_handle_mem_reloc; | ||
| } NVTEEpGroupConfig; | ||
|
|
||
| /*! \brief Per-layer EP configuration. */ | ||
| typedef struct { | ||
| int num_local_experts; /*!< Reserved for ABI stability (derived from group config). */ | ||
| int top_k; /*!< Per-token expert fan-out. Required. */ | ||
| size_t dispatch_output_per_expert_alignment; | ||
| /*!< Per-expert zone alignment in tokens (pow2; 0/1 = no padding). Must match | ||
| * between nvte_ep_register_layer and nvte_ep_prepare. */ | ||
| } NVTEEpLayerConfig; |
There was a problem hiding this comment.
If we make this a public API then we should probably version those?
There was a problem hiding this comment.
Hi, no other TE public struct is versioned, so I think EP should follow the same convention for now. We can add versioning for all structs in a follow-up PR.
There was a problem hiding this comment.
A lot of other structs like this (like the quantization config) are opaque though. I'm fine with it being not versioned for now if we are fairly sure that there is not going to be churn with them.
There was a problem hiding this comment.
EP optimization is specific to particular MoE models and workflows, and the field is advancing too quickly for me to have confidence that we can treat it as stable.
Best to get the design right before anyone starts using this code. Opaque structs are better for experimentation since we can more easily add or deprecate options, and it also allows us to have defaults that are more complicated than zero-initialization. It is more annoying to implement, but we already have a few examples and it's something Codex can do very well.
1e74f99 to
319f9d5
Compare
…cclWindow in public header Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…t failures in L1 CI Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
for more information, see https://pre-commit.ci
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…irun in run_test_ep.sh Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
for more information, see https://pre-commit.ci
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
… NCCL EP build Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…de dir from its prefix Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…routing to global counter Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…ader version log Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…ticast Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…CL EP Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
for more information, see https://pre-commit.ci
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
1abf478 to
3459461
Compare
|
/te-ci L1 |
|
Pipeline 54600556 |
There was a problem hiding this comment.
Nit: expert_parallel.h or expert_parallelism.h would be a less ambiguous file name, especially in a public-facing header.
Asking a few AIs "What does EP mean in deep learning?":
- Claude: "Expectation Propagation", "Energy Propagation", or "Episode"
- Gemini: "Epoch", various non-MoE models (EP-DNN, EP-PINN, EP-RNN)
- ChatGPT: "Expert Parallelism", "Epoch", "Embedded Projection"
| return [remove_dups(reqs) for reqs in [install_reqs, test_reqs]] | ||
|
|
||
|
|
||
| def _discover_nccl_home() -> str: |
There was a problem hiding this comment.
This could be a standalone utility function in build_utils/utils.py. It's actually quite general and not specific to NCCL EP.
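A rough sketch of what such a general utility might look like, assuming the lookup order given in the PR description (explicit environment variable first, then a list of well-known prefixes). The name `discover_library_home` and its parameters are hypothetical, and the `ldconfig -p` fallback is left out to keep the example short.

```python
import os
from pathlib import Path
from typing import Iterable, Mapping, Optional


def discover_library_home(
    env_var: str,
    candidates: Iterable[str],
    marker: str,
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first install prefix that contains `marker`, or None.

    env_var:    variable that overrides discovery, e.g. "NCCL_HOME".
    candidates: prefixes to probe in order, e.g. ["/usr/local/nccl", "/usr"].
    marker:     file expected under the prefix, e.g. "include/nccl.h".
    """
    environ = os.environ if environ is None else environ
    explicit = environ.get(env_var)
    if explicit:
        # An explicit setting always wins; validating it is left to the caller.
        return explicit
    for prefix in candidates:
        if (Path(prefix) / marker).is_file():
            return prefix
    return None
```

Passing the environment in as a parameter keeps the helper easy to unit-test without touching the real process environment.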
| print(f"[NCCL EP] No arch >= 90 in NVTE_CUDA_ARCHS ('{archs}'); skipping build.") | ||
| build_with_nccl_ep = False | ||
| if build_with_nccl_ep: | ||
| nccl_home = build_nccl_ep_submodule() |
There was a problem hiding this comment.
Flagging that we compile libnccl_ep.a at a different time than when we compile libtransformer_engine.so.
- libnccl_ep.a: Compiled during the Setuptools configuration stage, while processing setup.py.
- libtransformer_engine.so: Compiled during the Setuptools build stage, configured via a custom setuptools.Extension.
It's not obviously wrong, but it does seem messy. I suppose there are also some edge case problems (if NCCL is not installed but is listed as a dependency, compiling NCCL EP in the configuration stage will fail but building during the build stage can succeed).
It seems the "proper" approach would be to treat libnccl_ep.a compilation as basically a part of compiling libtransformer_engine.so. However, CMakeExtension is a very general class for CMake projects and I'm not sure how messy it would be to add the NCCL EP compilation logic.
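For reference, the gate in the snippet under discussion (skip the build when `NVTE_CUDA_ARCHS` has no arch >= 90), together with the rule from the PR description that an explicit request with only older archs is a user error, might be sketched as follows. It assumes the arch list is semicolon-separated with optional suffixes such as `90a`; `should_build_nccl_ep` is a hypothetical helper, not the function in setup.py.

```python
import re
from typing import Optional


def should_build_nccl_ep(archs: str, requested: Optional[bool]) -> bool:
    """Decide whether to build NCCL EP.

    archs:     semicolon-separated CUDA archs, e.g. "80;90a" (assumed format).
    requested: True/False if the user set the flag explicitly, None for default.
    """
    if requested is False:
        return False  # explicit opt-out
    numbers = []
    for entry in archs.split(";"):
        match = re.match(r"\d+", entry.strip())
        if match:
            numbers.append(int(match.group()))  # "90a" -> 90
    has_sm90 = any(n >= 90 for n in numbers)
    if requested and not has_sm90:
        # Explicit opt-in with only older archs is treated as a user error.
        raise ValueError(f"NCCL EP requested, but no arch >= 90 in '{archs}'")
    return has_sm90
```

Because the decision is a pure function of its inputs, it could run at either the configuration or the build stage.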
| ) | ||
| gencode = " ".join(f"-gencode=arch=compute_{a},code=sm_{a}" for a in arch_list) | ||
|
|
||
| nproc = os.cpu_count() or 8 |
There was a problem hiding this comment.
This seems likely to break users with wimpy nodes. We can reuse an existing API to control build parallelism:
| nproc = os.cpu_count() or 8 | |
| nproc = get_max_jobs_for_parallel_build() |
|
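The point of the suggestion is that build parallelism should respect a user-set cap instead of defaulting to every core (or to 8 when the count is unknown). A minimal sketch of that idea follows. It is not the body of `get_max_jobs_for_parallel_build()`, and both the `resolve_build_jobs` name and the `MAX_JOBS` variable are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional


def resolve_build_jobs(environ: Optional[Mapping[str, str]] = None,
                       env_var: str = "MAX_JOBS") -> int:
    """Return the number of parallel build jobs, honoring a user cap if set."""
    environ = os.environ if environ is None else environ
    raw = environ.get(env_var, "").strip()
    if raw:
        jobs = int(raw)  # raises ValueError on a malformed setting
        if jobs < 1:
            raise ValueError(f"{env_var} must be >= 1, got {jobs}")
        return jobs
    # No cap requested: use every core, or a single job if the count is unknown.
    return os.cpu_count() or 1
```

Falling back to one job when the core count is unknown is the conservative choice for small machines.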
Pipeline 54600556 - all tests passed. |
timmoon10
left a comment
There was a problem hiding this comment.
This still has some minor issues, but deadlines are what they are and the changes are well-contained. We should incorporate suggestions into future PRs.
This reverts commit c3396ee. Signed-off-by: Tim Moon <tmoon@nvidia.com>
Summary
First PR in the TE Expert Parallelism (EP) series. Lands the common C API and NCCL EP backend that later framework PRs (PyTorch, JAX) build on. No Python bindings yet — common-lib foundation plus build wiring only. Build/load works on any arch; SM and NCCL version gates fire at runtime.
Every network-bound payload tensor takes an optional NVTECommWindow. When the window is provided, the backend uses NCCL EP's symmetric-memory zero-copy path, which skips the D2D Memcpy from the user buffers to the Symmetric Staging Buffers.
Implementation
Public C API (transformer_engine/common/include/transformer_engine/{ep.h,comm_window.h})
Types: NVTEEpGroupConfig, NVTEEpLayerConfig, NVTEEpHandle, NVTECommWindow (side-band {ncclWindow_t window, size_t offset}; NCCL peer handles are not carried on NVTETensor).
Lifecycle (host-only, eager):
- nvte_ep_initialize: borrow an external ncclComm_t for the EP sub-group and init the singleton backend.
- nvte_ep_shutdown: tear down the backend; idempotent; does not destroy ep_comm.
- nvte_ep_register_layer: reserve a handle_id for a layer config and report the handle_mem buffer size the caller must allocate. The pair {id, mem} becomes the per-step NVTEEpHandle.
Per-step (allocation-free, CUDA-graph capturable):
- nvte_ep_prepare: all-gather the routing map and write routing maps to handle.mem.
- nvte_ep_dispatch: scatter tokens and routing weights from source ranks to expert ranks. tokens, topk_weights, recv_tokens, recv_topk_weights each accept an optional symm-mem window for zero-copy.
- nvte_ep_combine: scatter-sum expert outputs back to source ranks (unweighted; caller pre-multiplies by recv_topk_weights). expert_out accepts a window.
- nvte_ep_dispatch_bwd: backward of dispatch; routes token and weight grads back to source. grad and g_recv_topk_weights accept windows; the gathered outputs (grad_tokens, grad_topk_weights).
- nvte_ep_combine_bwd: backward of combine; grad and grad_expert_out accept windows. Padded slots in grad_expert_out are zeroed.
Backend + build
- (transformer_engine/common/ep/): EPBackend singleton, HT-mode dispatch/combine over NCCL EP (libnccl_ep.so), group/layer registration. Internal helper make_payload_tensor() builds the per-call ncclEpTensor_t: when the caller's NVTECommWindow.window != nullptr it sets win_hdl + win_offset (zero-copy); otherwise it sets data from nvte_tensor_data(t) (HBM fallback).
- (EPBackend::initialize): SM>=90 (via cudaDeviceGetAttribute), NCCL>=2.30.4 (via ncclGetVersion), CUDA multicast/NVLS support.
- With NVTE_WITH_NCCL_EP=OFF, ep/ep_api_stub.cpp provides throwing nvte_ep_* stubs so framework bindings link unconditionally; failure surfaces at first nvte_ep_initialize.
- setup.py builds libnccl_ep.so from 3rdparty/nccl by default; auto-disables NCCL EP when no requested CUDA arch >= 90. Explicit NVTE_BUILD_WITH_NCCL_EP=1 with all archs < 90 is treated as user error. Set NVTE_BUILD_WITH_NCCL_EP=0 to opt out.
- NCCL_HOME resolved dynamically: explicit env → /opt/nvidia/nccl, /usr/local/nccl, /usr → ldconfig -p fallback.
Testing
tests/cpp_distributed/.
Type of change
Checklist: