[Common/PyTorch] Grouped-quantize kernels for 1D and 2D FP8 block-scaling#3135

Open
denera wants to merge 10 commits into
NVIDIA:mainfrom
denera:common/fp8-block-scaling-grouped-quantize
@denera

@denera denera commented Jun 17, 2026

Collaborator

Description

Implements grouped-tensor quantize for the FP8 1D (1x128) and 2D (128x128) block-scaling recipes in row-wise (RW), column-wise (CW) and BOTH quantization directions. A single CUDA kernel launch walks 128x128 tiles across every tensor in the group, with each CTA decoding its owning tensor from the device-side GroupedTensor metadata with (N, R, K) shapes. Supports SAME_BOTH_DIMS (all tensors identical) and VARYING_FIRST_DIM (constant K, varying R) shape representations.
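As a rough host-side illustration of the per-CTA tensor lookup described above, the sketch below maps a global tile-row index to its owning tensor with a binary search over per-tensor tile-row offsets. The function name and the offsets layout are hypothetical and chosen only for illustration; the actual device-side GroupedTensor metadata may be organized differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of the lookup each CTA performs.
// row_block_offsets[i] is the first 128-row tile-row owned by tensor i; the
// last entry is the total number of tile-rows (size == num_tensors + 1).
// Precondition: tile_row < row_block_offsets.back().
inline std::size_t find_owning_tensor(const std::vector<std::size_t>& row_block_offsets,
                                      std::size_t tile_row) {
  // upper_bound finds the first offset strictly greater than tile_row;
  // the owning tensor is the entry immediately before it.
  auto it = std::upper_bound(row_block_offsets.begin(), row_block_offsets.end(), tile_row);
  return static_cast<std::size_t>(it - row_block_offsets.begin()) - 1;
}
```

For example, with offsets `{0, 2, 5, 6}` (three tensors spanning 2, 3 and 1 tile-rows), tile-rows 0 and 1 resolve to tensor 0, tile-rows 2 to 4 to tensor 1, and tile-row 5 to tensor 2.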

Three kernels share the dispatcher in group_quantize_blockwise_{1d,2d}:

  • group_block_scaled_1d_rw_kernel: RW-only dispatch; 8 threads/row, reads global memory directly into vec-16 registers; bypasses TMA because the shared memory round-trip and ptx::mbarrier synchronization buy nothing when there is no CW pass to reuse the cached tile.
  • group_block_scaled_1d_tma_kernel: CW-only and BOTH dispatch; TMA bulk-load fills the shared memory input cache. BOTH runs the RW pass first (8 threads/row, vec-16 read from shared memory) then the CW pass (2 threads/column, 64-row register stage); CW-only skips the RW pass. The CW path writes the transposed-FP8 tile to a shared memory transpose staging buffer, then drains to global memory.
  • group_block_scaled_2d_tma_kernel: RW-only, CW-only and BOTH dispatch; TMA bulk-load fills the shared memory input cache. Pass 1 stages 8 IVecs/thread in registers while computing the per-tile scalar amax. Pass 2 quantizes from registers, emits row-wise output, stages column-wise output to the shared memory transpose staging buffer, then drains to global memory.
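The amax-then-quantize structure shared by these kernels can be sketched on the host as follows. This is a simplified model under stated assumptions, not the kernel code: it assumes an E4M3 output (largest finite magnitude 448), stops at the scaled float values instead of casting to FP8, and picks an arbitrary scale of 1 for an all-zero block. `scale_block` and `ScaledBlock` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr float kFp8E4M3Max = 448.0f;  // largest finite E4M3 magnitude

struct ScaledBlock {
  std::vector<float> scaled;  // values scaled into FP8 range (not yet cast to FP8)
  float scale_inv;            // factor that maps scaled values back to the input range
};

// Pass 1: find the block's absolute maximum. Pass 2: scale every element so
// that the amax lands on the FP8 maximum. A "block" is 128 elements for the
// 1D recipe or a 128x128 tile for the 2D recipe.
inline ScaledBlock scale_block(const std::vector<float>& block) {
  float amax = 0.0f;
  for (float v : block) amax = std::max(amax, std::fabs(v));

  // Assumption: an all-zero block uses scale 1 to avoid dividing by zero.
  const float scale = (amax > 0.0f) ? kFp8E4M3Max / amax : 1.0f;

  ScaledBlock out;
  out.scale_inv = 1.0f / scale;
  out.scaled.reserve(block.size());
  for (float v : block) {
    const float s = v * scale;
    out.scaled.push_back(std::min(std::max(s, -kFp8E4M3Max), kFp8E4M3Max));
  }
  return out;
}
```

For the block `{0.5, -2.0, 1.0}` the amax is 2.0, so the scale is 224 and the scaled values are `{112, -448, 224}`.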

Kernels are gated to Hopper (sm_90) at the host dispatcher because cuBLASLt grouped GEMM supports FP8 block-scaling only on Hopper.

PR includes PyTorch integration.

JAX integration is intentionally left out-of-scope and deferred to a follow-up PR because it requires non-trivial new scaffolding on the framework side.

Resolves #2525

Performance

Table below measures performance on H200 with a sweep of grouped tensors in (N, M, K) shapes with:

  • N ∈ {4, 8, 16, 32, 64, 128} (# of device-local experts)
  • M = 4096 @ N = 4 —> M = 128 @ N = 128 (# of tokens/expert, scaling inversely with # of experts)
  • K ∈ {1024, 1792, 2048, 3584, 4096, 7168} (device-local shard of TP-hidden/intermediate-FFN dim)

The shapes are split into two buckets:

  • Small/Unsaturated (S): N x M x K <= 32M (< 2048 tiles and < 15 waves on H200's 132 SMs)
  • Large/Saturated (L): N x M x K > 32M (> 2048 tiles with enough work to keep SMs busy across many waves)
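The bucket boundaries can be reproduced with a few lines of arithmetic. The sketch below assumes that "32M" means 32 × 2^20 elements (which matches 2048 tiles of 128x128 elements) and that every tensor in the group has the same (M, K) shape; the helper names are illustrative only.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kTileDim = 128;

constexpr std::size_t div_up(std::size_t x, std::size_t y) { return (x + y - 1) / y; }

// Number of 128x128 tiles walked for N tensors of shape (M, K).
constexpr std::size_t num_tiles(std::size_t n, std::size_t m, std::size_t k) {
  return n * div_up(m, kTileDim) * div_up(k, kTileDim);
}

// Assumption: "32M" is 32 * 2^20 elements, i.e. exactly 2048 full tiles.
constexpr bool is_large_bucket(std::size_t n, std::size_t m, std::size_t k) {
  return n * m * k > (std::size_t{32} << 20);
}
```

For instance, (N, M, K) = (4, 4096, 1024) gives 1024 tiles and falls in the S bucket, while (128, 128, 7168) gives 7168 tiles and falls in the L bucket.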

Reported kernel times and throughput ratios are bucket medians.

Speedup is measured relative to the split-quantized fallback that loops over the grouped tensor and sequentially quantizes each one.

% of "mono" throughput is measured relative to the throughput of a single non-grouped FP8 block-scaling quantize kernel invoked with the equivalent monolithic (NxM, K) tensor, where the # of experts is folded into the # of tokens/expert.

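| Bucket | Path | Grouped (ms) | Split (ms) | Speedup | % memcpy tput | % mono tput |
|--------|------|--------------|------------|---------|---------------|-------------|
| S | 1D RW | 0.028 | 0.082 | 4.53× | 76.5 % | 117.2 % |
| S | 1D CW | 0.031 | 0.089 | 4.44× | 66.1 % | 116.9 % |
| S | 1D BOTH | 0.044 | 0.116 | 4.04× | 63.5 % | 115.4 % |
| S | 2D RW | 0.027 | 0.075 | 4.25× | 74.2 % | 99.7 % |
| S | 2D CW | 0.028 | 0.086 | 4.74× | 72.3 % | 128.9 % |
| S | 2D BOTH | 0.037 | 0.088 | 3.66× | 74.5 % | 97.6 % |
| L | 1D RW | 0.056 | 0.195 | 2.24× | 88.9 % | 119.9 % |
| L | 1D CW | 0.065 | 0.211 | 2.10× | 79.9 % | 122.1 % |
| L | 1D BOTH | 0.093 | 0.281 | 1.94× | 74.0 % | 118.4 % |
| L | 2D RW | 0.056 | 0.177 | 2.01× | 88.6 % | 99.6 % |
| L | 2D CW | 0.059 | 0.211 | 2.22× | 85.8 % | 135.0 % |
| L | 2D BOTH | 0.078 | 0.210 | 1.69× | 84.2 % | 99.1 % |

| # experts (N) | S bucket | L bucket |
|---------------|----------|----------|
| 4 | 1.67× | 1.45× |
| 8 | 2.51× | 1.49× |
| 16 | 4.34× | 1.97× |
| 32 | 5.66× | 2.92× |
| 64 | 10.08× | 6.40× |
| 128 | 20.18× | 9.06× |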

Notes

  • % of mono throughput is roughly consistent across buckets for every path, confirming that the new kernels add no per-expert overhead.
  • Cases above 100% mono throughput are due to TMA bulk-loads, register staging and vec-16 reads, which the non-grouped FP8 block-scaling kernels lack.
  • Speedup over split-quantize scales as expected with # of experts (roughly linearly in the unsaturated regime).

Known Sub-Optimalities

1D CW has bank conflicts on ~35% of load wavefronts (reading from the shared memory input-cache)

  • No possible stride padding or XOR swizzle to alleviate.
  • TMA hardware swizzle with CU_TENSOR_MAP_SWIZZLE_128B has the right pattern but caps FP16/BF16 at 64 elements; this does not fit the 128-element tile for FP8 block-scaling without doubling per-tile launch overhead (quadrupling for FP32).
  • Restructuring the threading shifts the bottleneck with no perf gain. Increasing threads/column loses the savings to additional cross-warp amax reduction plus sync. Decreasing to 1 thread/column collapses occupancy to 1 CTA/SM under higher register pressure and shared memory footprint.
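The doubling and quadrupling figures for the 128-byte hardware swizzle follow from its 128-byte inner span. A minimal sketch of that arithmetic, with hypothetical helper names:

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kTileDim = 128;           // elements per tile row
constexpr std::size_t kSwizzleSpanBytes = 128;  // inner span of the 128B swizzle mode

constexpr std::size_t div_up(std::size_t x, std::size_t y) { return (x + y - 1) / y; }

// Number of swizzled TMA boxes needed to cover one 128-element tile row for an
// input element of the given size (2 bytes for FP16/BF16, 4 bytes for FP32).
constexpr std::size_t swizzled_boxes_per_tile_row(std::size_t elem_bytes) {
  return div_up(kTileDim, kSwizzleSpanBytes / elem_bytes);
}
```

A 2-byte input fits 64 elements per span and therefore needs 2 boxes per tile row; a 4-byte input fits 32 elements and needs 4.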

1D BOTH reads the shared memory input-cache twice

  • The RW (8 threads/row) and CW (2 threads/column) passes have different threading.
  • Attempted to unify with 8 threads/row for both RW and CW. This caused bank conflicts on ~76% of store wavefronts (writing to the shared memory transpose buffer), reduced to ~43% with an XOR swizzle but not enough to beat separate RW/CW passes.
  • Did not pursue the 2 threads/column unification; it costs 40x more shfl ops than the 8 threads/row attempt, plus a shared memory partial buffer and sync.

2D CW/BOTH has bank conflicts on ~16% of store wavefronts (when writing to the shared memory transpose buffer)

  • Already reduced from ~75% via an XOR swizzle; further reduction was not possible.
  • Minimal impact (< 5%) on kernel time.
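For readers unfamiliar with the technique, the effect of an XOR swizzle on shared memory bank conflicts can be modeled on the host. The sketch below is the generic textbook case (a 32x32 tile of 4-byte words, 32 banks, each lane of a warp writing one element of the same column), not the swizzle pattern used in this PR, which operates on byte-sized FP8 elements and is not shown in this description.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

constexpr int kBanks = 32;  // shared memory is modeled as 32 four-byte banks

// Worst-case number of lanes in one warp that hit the same bank when lane
// `row` writes word (row, col) of a 32x32-word tile stored row-major.
inline int max_bank_conflict(int col, bool swizzle) {
  std::array<int, kBanks> hits{};
  for (int row = 0; row < 32; ++row) {
    // With the swizzle, the physical column is permuted per row via XOR.
    const int phys_col = swizzle ? (col ^ row) : col;
    const int word = row * 32 + phys_col;
    ++hits[word % kBanks];
  }
  return *std::max_element(hits.begin(), hits.end());
}
```

Without the swizzle all 32 lanes land in one bank (a 32-way conflict); with it, each lane lands in a distinct bank.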

No TMA-store

  • The MXFP8 grouped quantize kernel leverages TMA-store by decomposing a 128x128 tile into 32-row sub-stages, each with its own independent 32x1 or 1x32 scale. The shared memory footprint is based on a single sub-stage, each sub-stage can be quantized and TMA-stored independently, and the TMA-store of one stage hides under the compute of the next.
  • The FP8 block-scaling 128-element scale block spans the entire 128-row tile, so the tile cannot be decomposed into independent sub-stages with pipelined TMA-stores. A single non-pipelined TMA-store requires holding the transposed staging buffer until all work on the tile is finished, which blows up the shared memory footprint and collapses occupancy to 2 CTAs/SM. The recipe itself is the roadblock.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Implements grouped-tensor quantize for the FP8 1D (1x128) and 2D (128x128)
block-scaling recipes. A single CUDA kernel launch walks 128x128 tiles
across every tensor in the group, with each CTA decoding its owning
tensor from the device-side GroupedTensor metadata.

Supported shape representations:
  - SAME_BOTH_DIMS (all tensors identical)
  - VARYING_FIRST_DIM (constant K, varying R - the common MoE topology)

Supported directions: rowwise-only, columnwise-only, and both.

These kernels are gated to Hopper (sm_90) at the host dispatcher because
the consumer cuBLAS FP8 block-scaling *grouped* GEMM is itself
Hopper-only (cuBLAS does not provide native FP8 block-scaling grouped
GEMM on Blackwell; the recommended quantization recipe on Blackwell is
MXFP8). The device-side kernel bodies are gated on __CUDA_ARCH__ >= 900
so the kernels compile and link as part of multi-arch builds, but the
host gate prevents launches on Blackwell.

Three kernels share the dispatcher in
group_quantize_blockwise_{1d,2d}:

| Kernel | Dispatched when | Threading | Smem |
|--------|-----------------|-----------|------|
| group_block_scaled_1d_rw_kernel  | 1D RW-only       | 8 threads/row x 32 row-warps x 4 iters; reads gmem directly into vec-16 registers | none |
| group_block_scaled_1d_tma_kernel | 1D CW or 1D BOTH | TMA bulk-load fills 32 KB input cache. BOTH runs RW pass first (8 t/row, vec-16) then CW pass (2 t/col, 64-row register stage); CW-only skips the RW pass. CW writes the transposed-FP8 tile to a 16.5 KB smem_T staging buffer, then drains to gmem. | 32 KB + 16.5 KB |
| group_block_scaled_2d_tma_kernel | 2D RW / CW / BOTH | TMA bulk-load fills 32 KB cache. Pass 1 stages 8 IVecs/thread in registers while computing the per-tile scalar amax. Pass 2 quantizes from registers, emits rowwise output, stages columnwise output to smem_T, then drains. | 32 KB + 16.5 KB |
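The shared memory sizes in the table can be sanity-checked with simple arithmetic. The sketch below assumes a 2-byte input type (BF16/FP16) for the 32 KB input cache and a 1-byte FP8 output for the staging buffer. The commit message does not say what the extra 0.5 KB in the 16.5 KB staging buffer is for; 512 bytes would match, for example, 4 bytes of padding per 128-byte row, but that is a guess.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kTileDim = 128;

// Bytes needed to cache one 128x128 input tile (elem_bytes = 2 for BF16/FP16).
constexpr std::size_t input_cache_bytes(std::size_t elem_bytes) {
  return kTileDim * kTileDim * elem_bytes;
}

// Bytes of FP8 payload in one transposed 128x128 output tile (1 byte/element).
constexpr std::size_t fp8_tile_payload_bytes() { return kTileDim * kTileDim; }

// 16.5 KB, as quoted for the smem_T staging buffer.
constexpr std::size_t kStagingBytesQuoted = 16 * 1024 + 512;

// Bytes in the staging buffer beyond the raw FP8 payload (purpose not stated).
constexpr std::size_t staging_slack_bytes() {
  return kStagingBytesQuoted - fp8_tile_payload_bytes();
}
```

Under these assumptions the input cache is exactly 32 KB and the staging buffer holds 16 KB of payload plus 512 bytes of slack.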

The RW-only 1D path bypasses TMA because a streaming read has no reuse
- the smem round-trip and mbarrier overhead would just add latency.

The C++ test tests/cpp/operator/test_cast_float8blockwise_grouped.cu
exercises 72 configurations covering RW/CW/BOTH x 1D/2D x SAME/VARYING
shape representations against a per-tensor split-quantize reference.

Signed-off-by: Alp Dener <adener@nvidia.com>
@denera denera requested review from ptrendx and vthumbe1503 June 17, 2026 13:01
@denera denera self-assigned this Jun 17, 2026
@denera denera added performance Performance issues FP8 MoE labels Jun 17, 2026
constexpr int kThreadsPerBlock = 256;
constexpr int kNumWarps = kThreadsPerBlock / kThreadsPerWarp;

// Align a dynamic-smem pointer to 128 bytes (TMA requirement).

Could we reuse the existing align_smem_ptr_per_TMA_requirements() helper from transformer_engine/cast/core/common.h here?

size_t total_row_blocks) {
using namespace transformer_engine::dispatch::mxfp8::swizzle;
const size_t num_tiles_X =
(total_row_blocks + GEMM_SWIZZLED_SCALE_TILE_DIM_X - 1) / GEMM_SWIZZLED_SCALE_TILE_DIM_X;

We can also reuse the existing DIVUP() helper here (defined in transformer_engine/common/common.h).


// ---- Tensor-lookup helpers ----------------------------------------------------

// Map a global tile-row index to its owning tensor by binary-searching

We can also reuse the existing get_current_tensor_id() helper defined in transformer_engine/cast/core/common.cuh

@greptile-apps

greptile-apps Bot commented Jun 17, 2026

Greptile Summary

This PR adds fused grouped-quantize CUDA kernels for FP8 1D (1×128) and 2D (128×128) block-scaling, targeting Hopper (SM90–SM99). A single kernel launch walks 128×128 tiles across every tensor in a group, decoding per-expert tensor metadata from device-side GroupedTensor structures. The implementation also adds matching dequantize kernels, PyTorch dispatch plumbing, and test coverage for all (recipe × direction × usage) combinations.

  • Three kernels share the dispatcher: group_block_scaled_1d_rw_kernel (RW fast path, no smem), group_block_scaled_1d_tma_kernel (CW/BOTH, TMA-loaded smem input cache), and group_block_scaled_2d_tma_kernel (all paths; two-pass amax→quantize with register staging).
  • grouped_linear.py is refactored: NVFP4 is removed from the fused grouped-GEMM path, the CC guard for that path is raised to SM100+, and bgrad_group_quantize is extended to also support Float8BlockwiseQuantizers.

Confidence Score: 5/5

Safe to merge; the new CUDA kernels are well-structured, per-expert scale layouts are consistent between quantize and dequantize, and host-side allocations correctly account for per-expert alignment slack.

The three new kernels cover all (1D/2D × RW/CW/BOTH) combinations with consistent layout logic verified across quantize and dequantize sides. PyTorch dispatch wiring, scale-buffer sizing, and the force_pow_2_scales guard are all correct. The CC guard change in grouped_linear.py and the smem-attribute condition are non-blocking concerns that do not affect correctness or stability.

grouped_linear.py for the dropped Hopper CC guard; group_quantize_fp8_blockwise.cuh for the cudaFuncSetAttribute condition.

Important Files Changed

| Filename | Overview |
|----------|----------|
| transformer_engine/common/cast/fp8_blockwise/group_quantize_fp8_blockwise.cuh | New 1D+2D grouped-quantize kernels (RW, CW, BOTH); per-expert scale layout helpers are correct; the cudaFuncSetAttribute condition checks total smem instead of the dynamic-only portion, though this is harmless. |
| transformer_engine/common/cast/fp8_blockwise/group_dequantize_fp8_blockwise.cuh | New dequantize kernels that mirror the quantize layouts; scale and data indexing match the write side correctly for all four (1D/2D × RW/CW) combinations. |
| transformer_engine/pytorch/module/grouped_linear.py | Refactors _is_grouped_tensor_path_supported: raises CC guard to SM100+, dropping Hopper BF16/FP16 and NVFP4 from the fused path; extends bgrad_group_quantize to all FP8 quantizers; NVFP4 import removed. |
| transformer_engine/pytorch/csrc/quantizer.cpp | Adds create_grouped_output for Float8BlockQuantizer with correct scale buffer sizing (including 2D CW slack), force_pow_2_scales rejection, and compact (un-swizzled) scale layout. |
| transformer_engine/pytorch/csrc/extensions/cast.cpp | Adds FP8_BLOCKWISE_GROUPED_QUANTIZE dispatch branch; fixes scale_dtype for block-scaling dequantize (kFloat32 instead of kFloat8E8M0); extends bgrad_group_quantize guard to include Float8Blockwise. |
| transformer_engine/common/util/ptx.cuh | Lowers mbarrier/TMA PTX guards from SM100 to SM90 and adds cp_async_bulk_tensor_2d_global_to_shared_cta (shared::cta variant needed by the new Hopper kernels). |
| transformer_engine/common/cast/dispatch/quantize.cuh | Wires NVTE_BLOCK_SCALING_1D and NVTE_BLOCK_SCALING_2D into group_quantize_fwd_helper and group_quantize_bwd_helper; IS_ACT / IS_DACT guards correctly reject unimplemented paths. |
| tests/cpp/operator/test_cast_float8blockwise_grouped.cu | New C++ test covers SAME_BOTH_DIMS and VARYING_FIRST_DIM for all four kernel paths; swizzled-scale path is still unexercised (noted in a prior review thread). |
| tests/pytorch/test_grouped_tensor.py | Adds comprehensive fp8_blockwise grouped quantize / dequantize / bgrad tests alongside the existing mxfp8 and nvfp4 cases; arch-skip marks correctly gate by SM tier. |
| transformer_engine/common/cast/swizzle.cuh | Moved from cast/mxfp8/ to cast/ and namespace updated from mxfp8::swizzle to dispatch::swizzle; call-sites in group_quantize_mxfp8.cuh and cublaslt_grouped_gemm.cu updated accordingly. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[group_quantize / bgrad_group_quantize] --> B{scaling_mode?}
    B -->|NVTE_BLOCK_SCALING_1D| C[group_quantize_blockwise_1d]
    B -->|NVTE_BLOCK_SCALING_2D| D[group_quantize_blockwise_2d]
    B -->|NVTE_MXFP8| E[group_quantize_mxfp8]
    B -->|NVTE_NVFP4| F[group_quantize_transpose_nvfp4]
    C --> G{use_colwise or dbias?}
    G -->|No| H[group_block_scaled_1d_rw_kernel no smem vec-16 global loads]
    G -->|Yes| I[group_block_scaled_1d_tma_kernel TMA bulk-load smem input cache]
    D --> J[group_block_scaled_2d_tma_kernel Pass 1 tile amax via register staging Pass 2 quantize + smem_T transpose drain]
    I --> K[RW pass 8 t/row vec-16 from smem rowwise gmem]
    I --> L[CW pass 2 t/col 64-row regs smem_T colwise gmem]
    J --> M[Rowwise gmem write from registers]
    J --> N[smem_T XOR-swizzle stage colwise gmem drain]
    C & D --> O[grouped_reduce_dbias if dbias requested]

Reviews (4): Last reviewed commit: "Add grouped FP8 block-scaling dequantize..."

Comment thread tests/cpp/operator/test_cast_float8blockwise_grouped.cu
}

CType amax = compute_row_amax<IType, CType, kVec>(in_vec[it]);
amax = fmaxf(amax, __shfl_xor_sync(0xffffffff, amax, 1));

Could we reuse the existing amax warp-reduction helpers (warp_reduce_max() or reduce_max()) from transformer_engine/common/utils.cuh here?

Comment on lines +535 to +537
amax = fmaxf(amax, __shfl_xor_sync(0xffffffff, amax, 1));
amax = fmaxf(amax, __shfl_xor_sync(0xffffffff, amax, 2));
amax = fmaxf(amax, __shfl_xor_sync(0xffffffff, amax, 4));

We can also reuse reduce_max() or warp_reduce_max() here.
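The three `__shfl_xor_sync` steps quoted above (lane masks 1, 2, 4) form a butterfly max-reduction across groups of 8 adjacent lanes, which matches the 8 threads/row layout. The host-side model below simulates a 32-lane warp to show that every lane ends up holding its group's maximum. It is an illustration of the pattern only and does not reproduce the warp_reduce_max() or reduce_max() helpers, whose definitions are not shown in this thread.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <initializer_list>

// Host model of a 32-lane warp: lanes[i] is the value held by lane i.
using Warp = std::array<float, 32>;

// Models `v = fmaxf(v, __shfl_xor_sync(0xffffffff, v, lane_mask))` for all lanes.
inline Warp xor_max_step(const Warp& lanes, int lane_mask) {
  Warp out{};
  for (int lane = 0; lane < 32; ++lane) {
    out[lane] = std::max(lanes[lane], lanes[lane ^ lane_mask]);
  }
  return out;
}

// After masks 1, 2 and 4, every lane holds the max of its aligned 8-lane group.
inline Warp reduce_max_groups_of_8(Warp lanes) {
  for (int mask : {1, 2, 4}) lanes = xor_max_step(lanes, mask);
  return lanes;
}
```

If lane i initially holds the value i, lanes 0 to 7 all end with 7, lanes 8 to 15 with 15, and so on.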


// ----- Host-side dispatchers --------------------------------------------------------------------

inline size_t align_up_to(size_t x, size_t a) { return ((x + a - 1) / a) * a; }

We can reuse DIVUP_TO_MULTIPLE() defined in transformer_engine/common/common.h.

NVTE_CHECK(info.tensor_offsets_d != nullptr,
"VARYING_FIRST_DIM requires tensor_offsets to be set on the GroupedTensor.");
}
info.total_row_blocks = (info.R_total + kTileDim - 1) / kTileDim;

Suggested change
- info.total_row_blocks = (info.R_total + kTileDim - 1) / kTileDim;
+ info.total_row_blocks = DIVUP(info.R_total, kTileDim);

"VARYING_FIRST_DIM requires tensor_offsets to be set on the GroupedTensor.");
}
info.total_row_blocks = (info.R_total + kTileDim - 1) / kTileDim;
info.blocks_X = (info.K + kTileDim - 1) / kTileDim;

Suggested change
- info.blocks_X = (info.K + kTileDim - 1) / kTileDim;
+ info.blocks_X = DIVUP(info.K, kTileDim);

info.same_both_dims = same_both_dims;
info.num_tensors = output->num_tensors;
info.K = output->get_common_last_dim();
NVTE_CHECK(info.K % 16 == 0, "Last dim must be multiple of 16 (FP8 alignment).");

If this is a TMA requirement, we can use the TMA_GMEM_ALIGNMENT constant defined in transformer_engine/common/common.h

const float* noop_ptr =
(noop != nullptr) ? reinterpret_cast<const float*>(noop->data.dptr) : nullptr;

const size_t scale_stride_y = align_up_to(info.blocks_X, 4);

Suggested change
- const size_t scale_stride_y = align_up_to(info.blocks_X, 4);
+ const size_t scale_stride_y = DIVUP_TO_MULTIPLE(info.blocks_X, 4);

const size_t scale_stride_y = align_up_to(info.blocks_X, 4);
// CW scales are stored [blocks_X, align4(total_row_blocks)] -- transposed to
// match the physically-transposed columnwise data the TN cuBLAS GEMM consumes.
const size_t scale_t_stride_y = align_up_to(info.total_row_blocks, 4);

Suggested change
- const size_t scale_t_stride_y = align_up_to(info.total_row_blocks, 4);
+ const size_t scale_t_stride_y = DIVUP_TO_MULTIPLE(info.total_row_blocks, 4);
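To make the transposed scale layout from the code comment concrete, the sketch below computes flat indices into a column-wise scale buffer of shape [blocks_X, align4(total_row_blocks)]. The row-wise counterpart, with shape [total_row_blocks, align4(blocks_X)], is an assumption inferred from `scale_stride_y` and is not confirmed by the quoted code. The function names are illustrative, and this reflects the revision under review rather than any later per-expert layout.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t align4(std::size_t x) { return (x + 3) / 4 * 4; }

// Column-wise (transposed) scales: shape [blocks_X, align4(total_row_blocks)].
constexpr std::size_t colwise_scale_idx(std::size_t row_block, std::size_t col_block,
                                        std::size_t total_row_blocks) {
  return col_block * align4(total_row_blocks) + row_block;
}

// Assumed row-wise scales: shape [total_row_blocks, align4(blocks_X)].
constexpr std::size_t rowwise_scale_idx(std::size_t row_block, std::size_t col_block,
                                        std::size_t blocks_X) {
  return row_block * align4(blocks_X) + col_block;
}
```

With 5 tile-rows and 3 tile-columns, tile (row_block = 2, col_block = 1) maps to index 10 in the column-wise buffer (stride 8) and index 9 in the assumed row-wise buffer (stride 4).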

const float* noop_ptr =
(noop != nullptr) ? reinterpret_cast<const float*>(noop->data.dptr) : nullptr;

const size_t scale_stride_aligned_R = align_up_to(info.R_total, 4);

Suggested change
- const size_t scale_stride_aligned_R = align_up_to(info.R_total, 4);
+ const size_t scale_stride_aligned_R = DIVUP_TO_MULTIPLE(info.R_total, 4);

(noop != nullptr) ? reinterpret_cast<const float*>(noop->data.dptr) : nullptr;

const size_t scale_stride_aligned_R = align_up_to(info.R_total, 4);
const size_t scale_t_stride_aligned_K = align_up_to(info.K, 4);

Suggested change
- const size_t scale_t_stride_aligned_K = align_up_to(info.K, 4);
+ const size_t scale_t_stride_aligned_K = DIVUP_TO_MULTIPLE(info.K, 4);

denera and others added 2 commits June 22, 2026 22:49
- Reuse shared helpers (DIVUP, DIVUP_TO_MULTIPLE, TMA_GMEM_ALIGNMENT,
  align_smem_ptr_per_TMA_requirements, get_current_tensor_id,
  subwarp_reduce_max_broadcast) in place of local equivalents.
- Add proxy-async fence after mbarrier_init in 2D + 1D TMA kernels.
- Enforce per-tensor first_dim % 128 device-side for VARYING_FIRST_DIM
  (matches MXFP8 grouped quantize behavior).
- Fix Hopper SM range wording in 1D dispatcher.
- Extend cpp tests to cover with_gemm_swizzled_scales path.

Signed-off-by: Alp Dener <adener@nvidia.com>
@denera denera requested a review from Oleg-Goncharov June 22, 2026 23:06
// num_tiles_X = DIVUP(total_row_blocks, TILE_DIM_X=4)
__device__ __forceinline__ size_t swizzled_colwise_scale_idx(size_t i, size_t j,
size_t total_row_blocks) {
using namespace transformer_engine::dispatch::mxfp8::swizzle;

@vthumbe1503 vthumbe1503 Jun 22, 2026

I think we should rename the namespace for swizzle...given that we use the same constants for mxfp8, nvfp4, fp8 block scaling

denera added 6 commits June 23, 2026 05:20
The swizzle helpers are shared across MXFP8, NVFP4, and FP8 block scaling.
Relocate swizzle.cuh from cast/mxfp8/ to cast/ and drop the mxfp8::
namespace layer so callers don't reach across precisions.

Signed-off-by: Alp Dener <adener@nvidia.com>
create_grouped_tensor hardcoded with_gemm_swizzled_scales=false, so the
swizzled-scale kernel branch was unreachable from the PyTorch grouped
quantize API even when the quantizer requested optimize_for_gemm. Plumb
optimize_for_gemm into both the C++ wrapper and the Python GroupedTensor
kwarg, and size the colwise scale buffer for the swizzled layout via a
new with_gemm_swizzled_scales parameter on get_scale_shape.

Add a parameterized PyTorch test that asserts the flag propagation
end-to-end so a future regression of this kind fails loudly.

Also fixes two swizzle.cuh includes missed in the prior namespace-rename
commit (mxfp8/dequantize_mxfp8.cuh + mxfp8/quantize_mxfp8.cuh).

Signed-off-by: Alp Dener <adener@nvidia.com>
…EMM swizzle

cuBLAS FP8 block-scaling grouped GEMM reads each expert's scales from a
contiguous compact per-expert sub-block, so the kernels write that layout
directly and size the scale buffers from the logical totals — no per-tensor
device->host copy, which keeps allocation CUDA-graph-safe.

The MXFP8-style GEMM-swizzled scale layout is never consumed by FP8 block
scaling, so its kernel template, per-expert index helper, and dispatch are
removed. (The internal bank-conflict shared-memory swizzle is unrelated and
kept.)

Enable FP8 block scaling on the GroupedLinear CUDA-graph-safe path, Hopper-only
(cuBLAS rejects it on Blackwell, where MXFP8 is preferred).

Signed-off-by: Alp Dener <adener@nvidia.com>
Restore tests/pytorch/test_grouped_linear.py and
transformer_engine/pytorch/module/grouped_linear.py to their main-branch
state. The FP8 block-scaling GroupedTensor + Quantizer plumbing stays in
this branch; the module-level wiring will be revisited in a follow-up PR
alongside the cuBLAS H100 grouped-WGRAD miscompute investigation.

Signed-off-by: Alp Dener <adener@nvidia.com>
Adds three TEST_P variants (CudaGraphCapture, CudaGraphCaptureDiscreteOut,
CudaGraphCaptureDiscreteIn) that capture each of the three grouped GEMM
API entry points on a non-blocking stream, instantiate a CUDA graph, and
replay twice with per-replay output verification. Asserts the pure C++
implementation is graph-safe across all existing parametrizations
(recipes, layouts, shape cases, NULL-C, output dtypes).

Signed-off-by: Alp Dener <adener@nvidia.com>
Extend the GroupedTensor FP8 block-scaling path to match MXFP8's grouped
quantize/dequantize/bgrad coverage on Hopper.

Dequantize:
* New group_dequantize_fp8_blockwise.cuh with kernels for all four modes
  (1D/2D x rowwise/columnwise), inverting the per-expert layouts that the
  grouped quantize kernels write. Wired into the group_dequantize dispatch.
* group_dequantize (PyTorch) derives the scale_inv dtype from the scaling
  mode (FP8BS -> Float32) instead of hardcoding E8M0.

Bias gradient:
* bgrad_group_quantize now accepts Float8Block quantizers. dbias is computed
  in-kernel as a per-tile column partial (mirroring MXFP8) and reduced per
  expert via the existing common::grouped_reduce_dbias; no separate reduction
  kernel. The TMA quantize kernels compute the partial from the smem-resident
  tile; RW-only-with-dbias is routed through the TMA kernel.
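The two-stage bias-gradient scheme described above can be modeled on the host as follows. This is a sketch under stated assumptions: the gradient is a row-major [rows, cols] float matrix, rows is a multiple of 128 (consistent with the per-tensor first_dim % 128 check), and experts are described by tile-row offsets. The function names are hypothetical and do not correspond to common::grouped_reduce_dbias or the kernel code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kTileRows = 128;

// Stage 1 (in-kernel in the PR): one column-sum partial per 128-row tile.
// Returns a row-major [rows / 128, cols] buffer.
inline std::vector<float> column_partials(const std::vector<float>& grad,
                                          std::size_t rows, std::size_t cols) {
  std::vector<float> partial((rows / kTileRows) * cols, 0.0f);
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = 0; c < cols; ++c) {
      partial[(r / kTileRows) * cols + c] += grad[r * cols + c];
    }
  }
  return partial;
}

// Stage 2: sum the partials that belong to each expert.
// tile_row_offsets has num_experts + 1 entries (first tile-row of each expert,
// then the total number of tile-rows).
inline std::vector<float> reduce_per_expert(const std::vector<float>& partial,
                                            const std::vector<std::size_t>& tile_row_offsets,
                                            std::size_t cols) {
  const std::size_t num_experts = tile_row_offsets.size() - 1;
  std::vector<float> dbias(num_experts * cols, 0.0f);
  for (std::size_t e = 0; e < num_experts; ++e) {
    for (std::size_t t = tile_row_offsets[e]; t < tile_row_offsets[e + 1]; ++t) {
      for (std::size_t c = 0; c < cols; ++c) dbias[e * cols + c] += partial[t * cols + c];
    }
  }
  return dbias;
}
```

With a 384x2 gradient of all ones and tile-row offsets `{0, 1, 3}`, each partial is 128, expert 0 (one tile-row) gets a dbias of 128 per column, and expert 1 (two tile-rows) gets 256.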

Scale constraints:
* The fused grouped FP8BS path supports only unconstrained FP32 scales.
  force_pow_2_scales=True is rejected in Float8BlockQuantizer::create_grouped_tensor,
  and the pow_2_scaling code path is stripped from the grouped quantize kernels.
  Power-of-2 scales remain available on the non-grouped / unfused split-quantize
  path (used for the Blackwell MXFP8 emulation), which is unchanged.

Tests:
* Consolidate test_grouped_tensor.py so MXFP8, NVFP4, and FP8 block scaling
  share parametrized quantize / dequantize / bgrad tests across tensor-usage
  directions, block-scaling dims, and output dtypes.

Signed-off-by: Alp Dener <adener@nvidia.com>
Development

Successfully merging this pull request may close these issues.

Blockwise (1x128 and 128x128) FP8 grouped quantization

3 participants