Full MXFP4 Training Recipe #537

Merged
sudhu2k merged 65 commits into dev from feature/mxfp4-recipe-212 on Apr 28, 2026

Conversation

@sarthak-amd (Contributor) commented Apr 13, 2026

Summary

  1. Introduces nvte_cast_transpose_mxfp4_fused_shuffle — a fused HIP kernel for MXFP4 cast+transpose with an optional Hadamard transform and memory-layout shuffling for GEMM.

  2. Adds fp4_gemm_handler, which dispatches A4W4 GEMM calls to the AITER backend with the right layouts.

  3. Adds MXFP4 weight caching in Linear.forward() and LayerNormLinear.forward() that persists quantized MXFP4TensorStorage weights across forward passes (sketched below).
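
As a rough illustration of the weight caching in (3) — a sketch with hypothetical attribute and helper names (`_mxfp4_weight_cache`, `_get_mxfp4_weight`), not the PR's exact code:

```python
# Hypothetical sketch of the weight-cache idea: quantize the weight to MXFP4 once,
# then reuse the cached MXFP4TensorStorage on later forward passes until the
# weight tensor's version counter changes (i.e. until an optimizer step updates it).
def _get_mxfp4_weight(module, weight, mxfp4_quantizer):
    cached = getattr(module, "_mxfp4_weight_cache", None)
    if cached is not None and cached["version"] == weight._version:
        return cached["storage"]
    storage = mxfp4_quantizer(weight)  # cast + transpose (+ optional Hadamard/shuffle)
    module._mxfp4_weight_cache = {"storage": storage, "version": weight._version}
    return storage
```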

The training recipe can be enabled using:

export FP4_RECIPE=mxfp4
export NVTE_MXFP4_USE_HADAMARD=1
export FP4=True
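
For reference, a minimal end-to-end sketch of enabling the recipe from Python. The environment variables come from this PR; the import path of `MXFP4BlockScaling` and the `fp8_autocast` arguments are assumptions about how the new recipe plugs into the existing API, not confirmed usage:

```python
import os
# Set before importing Transformer Engine so the recipe is picked up.
os.environ["FP4_RECIPE"] = "mxfp4"
os.environ["NVTE_MXFP4_USE_HADAMARD"] = "1"
os.environ["FP4"] = "True"

import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import MXFP4BlockScaling  # recipe class added by this PR

layer = te.Linear(4096, 4096, bias=False).cuda()
x = torch.randn(8192, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=MXFP4BlockScaling()):
    y = layer(x)
y.sum().backward()
```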

MXFP4 Flow: [image]

Loss graph: [image]

Test plan

…rd kernel

Ports the MXFP4 training recipe from TE 2.8 (mxfp-fused-2.8) to the dev
branch (TE 2.10). Uses the HIP-based cast+transpose kernel with fused
shuffle and Hadamard transform for FP4 quantization, and AITER ASM a4w4
kernels for FP4 GEMM.

Changes:
- csrc: Add mxfp4_hip.cpp wrapper + cast_transpose_mxfp4_kernel_shuffled.cu
- csrc: Register cast_transpose_mxfp4_fused_shuffle in pybind + extensions.h
- tensor/mxfp4_tensor.py: Replace Triton quantize with HIP kernel, add
  quantize_impl(), respect USE_HADAMARD env var
- module/linear.py, layernorm_linear.py: Add MXFP4 weight cache with
  rowwise-only optimization and quantized-norm bypass
- module/fp4_handler_gemm.py: AITER ASM a4w4 GEMM handler with shape-aware
  kernel selection and float4_e2m1fn_x2 dtype conversion
- cpp_extensions/gemm.py: Route MXFP4TensorStorage to fp4_handler_gemm
- build_tools/pytorch.py: Collect .cu files for hipify/hipcc compilation

Made-with: Cursor
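
As context for the gemm.py routing bullet above, the dispatch idea is roughly the following sketch; `general_gemm_dispatch` and `_default_gemm` are illustrative names, and only `MXFP4TensorStorage` and `fp4_gemm_handler` come from the PR:

```python
# Illustrative sketch: send MXFP4 operands to the AITER a4w4 handler and
# leave every other tensor type on the existing GEMM paths.
def general_gemm_dispatch(A, B, out=None, **kwargs):
    if isinstance(A, MXFP4TensorStorage) or isinstance(B, MXFP4TensorStorage):
        # The handler picks an AITER ASM a4w4 kernel based on the problem shape
        # and presents the packed FP4 data as float4_e2m1fn_x2 for AITER.
        return fp4_gemm_handler(A, B, out=out, **kwargs)
    return _default_gemm(A, B, out=out, **kwargs)
```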
@sudhu2k sudhu2k added the ci-level 3 CI test level 3 label Apr 13, 2026
sarthak-amd and others added 22 commits April 13, 2026 14:42
Made-with: Cursor
- Updated .gitignore to exclude specific MXFP4 HIP files.
- Introduced new scaling mode NVTE_MXFP4_1D_SCALING in common headers.
- Enhanced scaling checks in CheckScaleTensorShape and CheckInputTensor functions to accommodate MXFP4.
- Added MXFP4Quantizer class for handling MXFP4 tensor quantization, including tensor creation and parameter setting.
- Updated quantization implementation in MXFP4Quantizer to utilize a new quantization function based on environment settings.

This commit improves the handling of MXFP4 tensors and their quantization process, ensuring compatibility with the latest scaling modes and tensor operations.
For LoRA SFT, frozen base weights have is_grad_enabled=False, so
columnwise quantization is skipped (no wgrad is needed). For full
pretraining, all weights have gradients, so this is a no-op.

Made-with: Cursor
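
The gate described above amounts to something like the following sketch; the real logic sits in the quantizer usage setup inside Linear.forward(), and `weight_quantizer` here is illustrative:

```python
# Sketch: only build the columnwise (transposed) MXFP4 copy of the weight when a
# weight gradient will actually be computed; frozen LoRA base weights skip it.
needs_wgrad = is_grad_enabled and weight.requires_grad
weight_quantizer.set_usage(rowwise=True, columnwise=needs_wgrad)
```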
- Introduced MXFP4 quantization logic in the quantization dispatch, including Hadamard transform and data shuffling options.
- Added new MXFP4 tensor storage and quantizer classes to manage MXFP4 data formats and operations.
- Updated CMakeLists.txt to include new MXFP4 source files and dependencies.
- Enhanced common headers to define new quantization configuration attributes for MXFP4.
- Implemented MXFP4 quantization in the PyTorch interface, allowing for flexible tensor operations.

This commit significantly improves the MXFP4 support in the transformer engine, enabling advanced quantization techniques and optimizing performance for AMD architectures.
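
For readers unfamiliar with the format: MXFP4 in the OCP microscaling sense stores FP4 E2M1 elements in blocks of 32 that share one power-of-two (E8M0) scale. A pure-Python reference of that scheme, ignoring the Hadamard transform and AITER shuffling this PR adds, might look like the sketch below (not the PR's kernel):

```python
import torch

def mxfp4_quantize_ref(x: torch.Tensor, block: int = 32):
    """Reference MXFP4-style quantization along the last dim (no Hadamard, no shuffle).

    Returns dequantized float values plus per-block scales instead of packed
    nibbles, so the result is easy to compare against a real kernel's output.
    """
    rows, cols = x.shape
    assert cols % block == 0
    blocks = x.float().reshape(rows, cols // block, block)
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-30)
    # Shared power-of-two (E8M0-style) scale; E2M1's largest exponent is 2 (max value 6.0).
    scale = torch.exp2(torch.floor(torch.log2(amax)) - 2)
    # Snap scaled values to the signed E2M1 grid {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
    grid = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], device=x.device)
    q = blocks / scale
    q = grid[(q.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)] * q.sign()
    return (q * scale).reshape(rows, cols), scale.squeeze(-1)
```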
Move FP4 AITER GEMM handler from module/fp4_handler_gemm.py into
cpp_extensions/gemm.py alongside other GEMM dispatch paths. Remove
mxfp4_hip.cpp which is no longer needed after the nvte_quantize_v2
refactor.
…handling for MXFP4 alongside existing FP8/NVFP4 paths.

Expose MXFP4 device support via check_mxfp4_support / is_mxfp4_available
(and FP8GlobalStateManager), validate MXFP4 in check_recipe_support, and
return a larger alignment when recipe.mxfp4().

Teach TransformerEngineBase.set_meta_tensor and LayerNormMLP activation
fusion gating to treat MXFP4 like other non-fused recipes.

Extend test_numerics fp8_recipes with MXFP4BlockScaling when supported.

Add default fp8_format on MXFP4BlockScaling for callers expecting
recipe.fp8_format.
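
A sketch of what such a device gate can look like on ROCm; the gfx target check here is an assumption for illustration (the commit does not state which architectures qualify):

```python
import torch

def check_mxfp4_support() -> tuple[bool, str]:
    """Illustrative capability check; the PR's real check sits next to the FP8 ones."""
    if not torch.cuda.is_available():
        return False, "No GPU available."
    arch = getattr(torch.cuda.get_device_properties(0), "gcnArchName", "")
    if "gfx950" in arch:  # assumption: MXFP4 a4w4 kernels target MI350-class devices
        return True, ""
    return False, f"MXFP4 is not supported on {arch or 'this device'}."

def is_mxfp4_available() -> bool:
    return check_mxfp4_support()[0]
```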
- Introduced `is_mxfp4_available` import in the PyTorch interface.
- Added `check_fp8_block_scaling_support` function to validate FP8 block scaling availability based on device compute capability and CUDA version.
- Cleaned up imports in `gemm.py` by moving them to appropriate locations.
- Implemented support for new MXFP4 quantization configuration attributes: `mxfp4_use_hadamard` and `mxfp4_shuffle`.
- Updated `nvte_get_quantization_config_attribute` and `nvte_set_quantization_config_attribute` functions to handle these new attributes, enhancing the flexibility of quantization settings in the transformer engine.
- Introduced a new helper function `_round_up` to round values up to the nearest multiple, aiding in scale padding.
- Updated the MXFP4 quantization process to pad scales to match the native allocator layout, ensuring compatibility with the expected tensor dimensions.
- Adjusted the handling of scales in the `MXFP4QuantizerRef` class to accommodate padded scales, improving the robustness of the quantization process.
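
The helper and the padding step are straightforward; a sketch follows, where the 128x4 padding multiples are placeholders for whatever the native allocator layout actually requires (not values taken from the PR):

```python
import torch

def _round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple."""
    return ((value + multiple - 1) // multiple) * multiple

def pad_scales(scales: torch.Tensor, row_multiple: int = 128, col_multiple: int = 4):
    """Pad a (rows, cols) scale tensor to the expected allocator dimensions.

    The multiples above are illustrative; the real values come from the allocator layout.
    """
    rows, cols = scales.shape
    padded = scales.new_zeros(_round_up(rows, row_multiple), _round_up(cols, col_multiple))
    padded[:rows, :cols] = scales
    return padded
```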
The LAUNCH_KERNEL macro hardcoded SHUFFLE_SCALES=true, ignoring the
runtime shuffle_scales parameter. This caused scales to be written in
shuffled layout even when shuffle was disabled, producing incorrect
output. Refactor dispatch into do-while-wrapped macros to also fix
dangling-else issues. Add tolerant comparison helpers to the MXFP4
quantize test for C++/HIP backend rounding differences (±1 nibble).
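
A "±1 nibble" tolerance can be implemented along these lines (helper names are made up for illustration; the test's real helpers may differ):

```python
import torch

def unpack_nibbles(packed: torch.Tensor) -> torch.Tensor:
    """Split packed FP4 bytes (two values per uint8) into individual nibble codes."""
    lo = packed & 0x0F
    hi = (packed >> 4) & 0x0F
    return torch.stack([lo, hi], dim=-1).flatten(-2)

def assert_close_fp4(test: torch.Tensor, ref: torch.Tensor, max_mismatch_frac: float = 0.0):
    """Allow off-by-one nibble codes to absorb C++/HIP vs. reference rounding differences."""
    t, r = unpack_nibbles(test).int(), unpack_nibbles(ref).int()
    mismatches = ((t - r).abs() > 1).float().mean().item()
    assert mismatches <= max_mismatch_frac, f"{mismatches:.2%} nibbles differ by more than 1"
```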
…ions

- Updated the QuantizationConfig structure to include separate flags for scale and data shuffling.
- Modified the MXFP4 quantization logic to utilize these new flags, enhancing flexibility in quantization settings.
- Adjusted related functions and classes to accommodate the new shuffling parameters, ensuring correct behavior during quantization.
- Updated tests and kernel dispatch to reflect these changes, improving the overall robustness of the MXFP4 quantization process.
…fling options

- Introduced a new function `un_shuffle_scales` to invert the AITER scale shuffle permutation, improving test accuracy.
- Updated `check_quantization_mxfp4_versus_reference` to conditionally unshuffle scales based on the `shuffle_scales` parameter, ensuring correct comparisons.
- Added new parameters for `use_hadamard`, `shuffle_B_matrix_for_aiter`, and `shuffle_scales` in the quantization tests to enhance flexibility and coverage.
- Implemented a new function `_shuffle_fp4_data` in the quantization logic to support shuffling of packed FP4 data for AITER GEMM kernels.
- Adjusted the `MXFP4QuantizerRef` class to utilize the new shuffling function, ensuring compatibility with the updated quantization process.
- Modified the test to view rowwise scales as `torch.uint8` for consistency.
- Implemented conditional unshuffling of scales based on the `shuffle_scales` parameter, enhancing test accuracy and flexibility.
- Ensured that both contiguous and non-contiguous scale tensors are correctly compared in the quantization tests.
- Introduced a new test file `test_mxfp4_gemm_exact.py` to validate the accuracy of MXFP4 GEMM operations against a Python reference implementation.
- Implemented a parameterized test function `test_mxfp4_gemm_versus_reference` to cover various matrix dimensions and data types.
- Enhanced the quantization process by integrating native MXFP4 quantization and reference quantization methods, ensuring robust comparisons.
- Added checks for NaN values and ensured proper handling of output accumulation in the tests.
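
In outline, such a test quantizes both operands, runs the native GEMM, and compares against a dequantize-then-matmul reference in higher precision. A sketch with hypothetical helper names (`quantize_mxfp4_native`, `dequantize_mxfp4`, `mxfp4_gemm`):

```python
import pytest
import torch

@pytest.mark.parametrize("m,n,k", [(256, 256, 256), (1024, 4096, 4096)])
@pytest.mark.parametrize("dtype", [torch.bfloat16, torch.float16])
def test_mxfp4_gemm_versus_reference(m, n, k, dtype):
    a = torch.randn(m, k, device="cuda", dtype=dtype)
    b = torch.randn(n, k, device="cuda", dtype=dtype)

    a_q = quantize_mxfp4_native(a)   # hypothetical: native HIP quantization path
    b_q = quantize_mxfp4_native(b)
    out = mxfp4_gemm(a_q, b_q, out_dtype=dtype)  # hypothetical AITER a4w4 entry point

    # Reference: dequantize both operands and run the matmul in float32.
    ref = dequantize_mxfp4(a_q).float() @ dequantize_mxfp4(b_q).float().t()

    assert not torch.isnan(out).any()
    torch.testing.assert_close(out.float(), ref, rtol=0.05, atol=0.05)
```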
@sudhu2k sudhu2k requested a review from wangye805 April 22, 2026 17:47
@sudhu2k sudhu2k added ci-level 1 CI test level 1 and removed ci-level 3 CI test level 3 labels Apr 22, 2026
Comment thread tests/cpp/operator/test_cast_mxfp4_transpose.cu
Comment thread tests/cpp/operator/test_cast_mxfp4_transpose.cu Outdated
Comment thread tests/cpp/operator/test_cast_mxfp4_transpose.cu
Comment thread transformer_engine/pytorch/cpp_extensions/gemm.py
Comment thread transformer_engine/common/recipe/__init__.py
@sudhu2k sudhu2k requested a review from ipanfilo April 23, 2026 21:34
Comment thread ci/pytorch.sh Outdated
@sudhu2k sudhu2k requested a review from ipanfilo April 23, 2026 22:03
Comment thread ci/pytorch.sh Outdated
@sudhu2k sudhu2k requested a review from ipanfilo April 24, 2026 02:27
Comment thread ci/pytorch.sh Outdated
@sudhu2k sudhu2k requested a review from ipanfilo April 24, 2026 05:06
@sudhu2k sudhu2k added ci-level 3 CI test level 3 and removed ci-level 1 CI test level 1 labels Apr 26, 2026
@sudhu2k (Contributor) commented Apr 28, 2026

All level 3 tests passed with the new aiter installed image:
https://github.com/ROCm/TransformerEngine/actions/runs/25014847020/job/73259852413

@sudhu2k sudhu2k merged commit dcfae3e into dev Apr 28, 2026
7 of 9 checks passed