Conversation
…g and refined test case generation for various configurations. - Cleaned up unused variables and improved code readability in the FSDPAGTensor class by removing unnecessary parameters.
… FusedAdam. Added debug print for DTensor in MultiTensorApply.
… tolerances for tensor comparisons. Updated test logic to accommodate new tolerance parameters for improved accuracy in floating-point comparisons.
…l differences in gradient calculations. Clean up unused debug print statements in MultiTensorApply and ensure proper newline at the end of the FSDPAGTensor serialization method.
```python
if not isinstance(quantizer, MXFP8Quantizer) and not self._keep_fp8_weight_transpose_cache:
quantizer = module.quantizers["scaling_fwd"][self._fp8_meta_index]
if not isinstance(quantizer, MXFP8Quantizer):
    quantizer.set_usage(columnwise=False)
```
For FSDP2 with FP8, keep_fp8_weight_transpose_cache should be False. Caching the transposed weight would imply an all-gather of the transposed tensor as well, increasing memory and communication and negating the advantages of FSDP2’s sharded parameter layout.
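As a back-of-the-envelope illustration (sizes hypothetical; only the 2x factor matters), caching the FP8 transpose means a second full copy of each weight must be all-gathered every step:

```python
# Hypothetical sizes for a single square weight matrix.
hidden = 4096
fp8_bytes_per_elem = 1
weight_bytes = hidden * hidden * fp8_bytes_per_elem

no_cache_traffic = weight_bytes        # all-gather the weight only
with_cache_traffic = 2 * weight_bytes  # weight plus cached transpose gathered too

assert with_cache_traffic == 2 * no_cache_traffic
```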
```diff
      data = torch.zeros_like(param, dtype=torch.int16)
  else:
-     data = torch.empty(param.shape, dtype=dtype, device=param.device)
+     data = torch.empty_like(param, dtype=dtype)
```
When using FSDP2, parameters are DTensors, but `torch.zeros()` and `torch.empty()` create regular PyTorch tensors. This was causing:

```
[rank1]: RuntimeError: aten.copy_.default: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!
[rank7]:   File "/workspace/TransformerEngine/transformer_engine/pytorch/optimizers/fused_adam.py", line 422, in initialize_state
[rank7]:     self.set_scaled_state(param, "master_param", param.clone().detach().float())
[rank7]:   File "/workspace/TransformerEngine/transformer_engine/pytorch/optimizers/fused_adam.py", line 363, in set_scaled_state
[rank7]:     state[state_name].copy_(unscaled_state)
```
Fix:
Keep optimizer state consistent with the parameter type: when parameters are DTensors, state should be DTensors as well. Using torch.empty_like(param, ...) (and the same idea for other state buffers) creates state as a DTensor with the same placement as param, so both sides of copy_ are DTensors and the error is avoided.
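A minimal repro of the dispatch difference, using a plain `torch.Tensor` subclass as a hypothetical stand-in for DTensor (a real DTensor needs an initialized process group):

```python
import torch

class TaggedTensor(torch.Tensor):
    """Hypothetical subclass standing in for DTensor."""
    pass

param = torch.ones(4).as_subclass(TaggedTensor)

# Factory call: only sees a shape, so the subclass (placement) is lost.
plain = torch.empty(param.shape, dtype=param.dtype, device=param.device)
# *_like call: dispatches through the input tensor, so the subclass survives.
like = torch.empty_like(param)

assert type(plain) is torch.Tensor
assert type(like) is TaggedTensor
```

With DTensor parameters, the same mechanism makes `torch.empty_like(param)` produce state with the same placement as `param`, so both sides of `copy_` are DTensors.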
Is this a cherry-pick of an upstream fix?
Upstream fixes this in TEv2.12, along with a few other fixes:
NVIDIA/TransformerEngine@fe8fad5#diff-0801a8d92a56d458946da1439b62e0add1613b7da83d31bc218a852b6b9e42b1
This wasn't cherry-picked.
…by adding a newline character after the pass statement in the test_dummy function.
```python
# Zero the parameter gradients
optimizer.zero_grad()
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
```
Does `with te.fp8_autocast(enabled=args.fp8_autocast, ...)` do the same?
It does, but since TEv2.10 `te.fp8_autocast` has been replaced with `te.autocast`, so I've made the change to be consistent.
So will `with te.autocast(enabled=args.fp8_autocast, recipe=...)` do the same as the if/else?
Yes, it should. I'll make the changes.
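The pattern can be sketched with a toy context manager (a hypothetical stand-in for `te.autocast`, nothing TE-specific): passing `enabled=args.fp8_autocast` collapses the if/else into a single `with` statement.

```python
from contextlib import contextmanager

calls = []

@contextmanager
def autocast(enabled=True, recipe=None):
    # Stand-in: a real implementation would enable FP8 execution only when
    # enabled=True; with enabled=False it is a plain pass-through.
    calls.append((enabled, recipe))
    yield

# One call handles both cases, no if/else needed:
for fp8_flag in (True, False):
    with autocast(enabled=fp8_flag, recipe="some_recipe" if fp8_flag else None):
        pass

assert calls == [(True, "some_recipe"), (False, None)]
```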
```python
assert len(l1) == len(l2), "Unequal number of outputs."
for i, (t1, t2) in enumerate(zip(l1, l2)):
    result = torch.allclose(t1, t2, atol=0, rtol=0)
    tols = dict(atol=atol)
```
Move the `tols` calculation out of the loop.
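A sketch of the suggestion (function and variable names assumed; `math.isclose` stands in for `torch.allclose` to keep the example dependency-free): the tolerance dict is identical on every iteration, so it can be built once before the loop.

```python
import math

def assert_all_outputs_close(l1, l2, atol=1e-3):
    assert len(l1) == len(l2), "Unequal number of outputs."
    tols = dict(abs_tol=atol)  # hoisted: does not change per iteration
    for i, (t1, t2) in enumerate(zip(l1, l2)):
        for a, b in zip(t1, t2):
            assert math.isclose(a, b, **tols), f"Mismatch in output {i}"

assert_all_outputs_close([[1.0, 2.0]], [[1.0004, 2.0]])  # within atol, passes
```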
…s for improved clarity and consistency.
Manually ported the fix from upstream commit 139c863. The full commit was not cherry-picked due to unrelated changes across many files. Addressed PR comments.
- Updated model loading to conditionally use weights_only based on quantized_init.
- Modified optimizer initialization to remove the master_weights parameter when not using quantized_init.
- Improved test setup for quantized models by integrating quantized_model_init with appropriate scaling recipes.
- Adjusted tolerance comments in tests to clarify FP8 behavior with FSDP2 and DDP.
- Added checks in the base module to skip FP8 reduction for FSDP2 based on primary weights status.
- Updated tolerance logic in assert_allclose to use the second tensor for relative tolerance calculations.
- Adjusted tolerance values based on quantization initialization conditions to ensure accurate testing of FP8 behavior with FSDP2.
- Added preservation of amax/scales when copying an fp32 tensor to an already existing fp8 tensor.
- Removed unnecessary model loading logic for quantized initialization in the training script, since we already use the same random seed.
- Excluded the quantized_init + non-autocast test case.
- Updated comments to clarify tolerance handling in FP8 tests.
```python
# DDP broadcast path: _broadcast_coalesced dequantizes Float8Tensors
# (via aten::cat fallback) then copies the plain tensor back.
# Re-quantize while preserving the original quantizer state.
if isinstance(dst, Float8Tensor) and not isinstance(src, Float8Tensor):
```
When PyTorch's native DDP broadcasts parameters, `broadcast_coalesced` concatenates all module parameters (FP8 weights and FP32 biases) via `aten::cat`, which triggers Float8Tensor's fallback dispatch path and dequantizes the FP8 weights to FP32. After the broadcast, `aten::copy_` writes the FP32 data back into the Float8Tensor parameter, calling `quantize_()`, which recomputes amax and scale from the dequantized values. These differ slightly from the originals due to FP8 round-trip error, causing numerical divergence from FSDP2.

The fix intercepts this plain-tensor-to-Float8Tensor `copy_` path, saves the quantizer's amax and scale before re-quantization, and restores them afterward, so the DDP broadcast becomes a no-op with respect to quantizer state. This makes the test pass with zero tolerance for a model with LayerNormMLP, LayerNormLinear, and Linear modules.
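The round-trip effect can be mimicked with a toy grid quantizer (all values hypothetical; real FP8 uses amax-based scaling, but the mechanism is analogous): recomputing amax from dequantized data yields a different value, while re-quantizing with the saved state reproduces the stored values exactly.

```python
def quantize(values, scale):
    # Toy quantizer: snap each value to a grid of step `scale`.
    return [round(v / scale) * scale for v in values]

orig = [0.37, -1.12, 0.05]
amax = max(abs(v) for v in orig)   # quantizer state before broadcast: 1.12
scale = 0.25                       # hypothetical fixed scale

q = quantize(orig, scale)          # stored FP8-like weights
deq = list(q)                      # DDP broadcast dequantizes to plain floats

# Naive copy_: recompute amax from the dequantized data -> state drifts.
amax_recomputed = max(abs(v) for v in deq)
assert amax_recomputed != amax

# Fix: restore the saved (amax, scale) before re-quantizing; the broadcast
# then becomes a no-op for both the values and the quantizer state.
assert quantize(deq, scale) == q
```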
- Added `linear_only` parameter to `SimpleNet` to allow usage of only the final linear layer.
- Updated model initialization and forward pass logic to conditionally skip LayerNorm layers when `linear_only` is enabled.
- Modified argument parsing to include a `--linear-only` flag for test scripts, ensuring compatibility with the quantized initialization scenario.
Description
This PR adds unit tests covering different configurations such as:
All the unit tests compare FSDP2 vs DDP grads/outputs.
This PR also cleans up fsdp2_all_gather_tensor to match upstream's methods.
This PR also fixes an issue with fused_adam when used with FSDP2.
Fixes # (https://github.com/ROCm/frameworks-internal/issues/15291)
Type of change
Checklist: