
Microbenchmarking, Torch+CSV-based #478

Open
matthiasdiener wants to merge 42 commits into dev from mdiener/ci-microbench

@matthiasdiener
Contributor

@matthiasdiener matthiasdiener commented Mar 10, 2026

Description

See also #487.

PyTorch benchmark timing: https://docs.pytorch.org/tutorials/recipes/recipes/benchmark.html

Open questions:

  • What to do with performance-related environment variables (e.g., using Triton kernels for some operations, ck_tile for grouped GEMM, ...)? Decision: use the defaults for now.
  • Do we need to rebuild the PR branch after perf testing is done?

Partly addresses https://github.com/ROCm/frameworks-internal/issues/15863

Microbenchmarking (not just) for CI.

  • Implemented 5 microbenchmarks (fp16 gemm, fp8 gemm, fp16 grouped gemm, normalization, casting)
  • Implemented a performance regression test for CI:
    1. Run the benchmarks for the PR branch
    2. Check out the base branch, rebuild it, and re-run the benchmarks
    3. Compare the results between the base branch and the PR branch
    4. Report the comparison results as a PR comment, but don't fail CI if performance regresses
  • Could be expanded to non-CI use cases (like nightly performance regression tests)
    • If the additional CI time (currently ~5 minutes) is an issue, this could run only as part of a Level 3 CI run
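
Steps 1 to 3 of the regression test can be sketched as follows. This is a minimal illustration, not the PR's actual script: the CSV layout (`case`, `metric`, `value`) and the helper names `load_results` and `compare` are hypothetical. It assumes higher-is-better throughput metrics, so that speedup = PR / base, which is consistent with the ratios shown in the report.

```python
import csv
import io


def load_results(csv_text):
    """Parse benchmark CSV text into {(case, metric): value}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {(row["case"], row["metric"]): float(row["value"]) for row in reader}


def compare(base, pr):
    """Return {(case, metric): pr / base} for keys present in both runs."""
    speedups = {}
    for key, base_value in base.items():
        if key in pr and base_value > 0:
            speedups[key] = pr[key] / base_value
    return speedups


# Hypothetical CSV layout; the values are taken from the first attention row
# of the report (Llama3-8B/TP1, seq_len 1024).
BASE_CSV = """case,metric,value
Llama3-8B/TP1-1024,TE Forward,257.82
Llama3-8B/TP1-1024,TE Backward,180.08
"""
PR_CSV = """case,metric,value
Llama3-8B/TP1-1024,TE Forward,256.21
Llama3-8B/TP1-1024,TE Backward,198.09
"""

speedups = compare(load_results(BASE_CSV), load_results(PR_CSV))
for (case, metric), ratio in sorted(speedups.items()):
    print(f"{case} {metric}: {ratio:.3f}x")
```

Cases that exist in only one of the two runs are skipped here rather than reported, which is one possible choice when a PR adds or removes benchmark cases.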

TODOs:

  • add attention, normalization, casting
  • print commit in header of comment

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Added a set of 6 microbenchmarks (attention, fp16 gemm, fp8 gemm, fp16 grouped gemm, normalization, casting)
  • Created a test within CI that compares microbenchmark performance between the PR branch and the base branch

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@matthiasdiener matthiasdiener self-assigned this Mar 10, 2026
@matthiasdiener matthiasdiener force-pushed the mdiener/ci-microbench branch 2 times, most recently from d9f25f2 to ce0775a Compare March 10, 2026 20:28
@matthiasdiener matthiasdiener force-pushed the mdiener/ci-microbench branch from ce0775a to 8a0ea47 Compare March 10, 2026 22:12
@github-actions

This comment was marked as outdated.

@github-actions

github-actions Bot commented Mar 11, 2026

Performance Regression Report

MI325

PR commit: ddd17d4 | Base: dev | 2026-03-17 18:30:52 CDT

| Benchmark suite | Median speedup | Min speedup | Max speedup |
|---|---|---|---|
| benchmark_attention | 1.000x | 0.650x | 1.390x |
| benchmark_casting | 0.998x | 0.920x | 1.034x |
| benchmark_gemm | 1.001x | 0.322x | 2.361x |
| benchmark_gemm_fp8 | 0.986x | 0.285x | 1.610x |
| benchmark_grouped_gemm | 1.000x | 0.439x | 2.438x |
| benchmark_normalization | 1.006x | 0.633x | 2.066x |
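
Each row of the per-suite summary can be derived from the per-case speedups of that suite. The sketch below assumes the summary is simply the median, minimum, and maximum over all of a suite's speedup values; the function name `summarize` is hypothetical. The input is only a handful of attention speedups from the report, so the result is not the full-suite figure.

```python
from statistics import median


def summarize(speedups):
    """Reduce a collection of per-case speedups to median/min/max."""
    values = list(speedups)
    return {"median": median(values), "min": min(values), "max": max(values)}


# A few forward/backward speedups copied from the attention table
# (illustrative subset only).
sample = [0.994, 1.100, 0.997, 1.000, 0.650]

summary = summarize(sample)
print(
    f"median {summary['median']:.3f}x, "
    f"min {summary['min']:.3f}x, max {summary['max']:.3f}x"
)
```

Reporting the median alongside the extremes keeps a single noisy case from dominating the headline number while still surfacing it for inspection.
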
benchmark_attention (median 1.000x, min 0.650x, max 1.390x)
Case batch seq_len num_q_heads num_kv_heads head_dim TE Forward Base TE Forward PR TE Forward Speedup TE Backward Base TE Backward PR TE Backward Speedup
Llama3-8B/TP1 2 1024 32 8 128 257.82 256.21 0.994x 180.08 198.09 1.100x
Llama3-8B/TP1 2 2048 32 8 128 385.63 384.63 0.997x 236.46 236.36 1.000x
Llama3-8B/TP1 2 4096 32 8 128 543.04 543.30 1.000x 275.26 275.64 1.001x
Llama3-8B/TP1 2 8192 32 8 128 705.72 709.27 1.005x 373.42 372.25 0.997x
Llama3-8B/TP8 2 1024 4 1 128 34.92 34.50 0.988x 22.25 21.58 0.970x
Llama3-8B/TP8 2 2048 4 1 128 125.61 124.96 0.995x 91.81 89.84 0.979x
Llama3-8B/TP8 2 4096 4 1 128 305.17 306.69 1.005x 255.60 255.78 1.001x
Llama3-8B/TP8 2 8192 4 1 128 406.33 523.55 1.288x 338.50 288.46 0.852x
Llama3-70B/TP8 2 1024 8 1 128 69.69 69.41 0.996x 44.22 44.54 1.007x
Llama3-70B/TP8 2 2048 8 1 128 236.62 238.34 1.007x 188.51 187.98 0.997x
Llama3-70B/TP8 2 4096 8 1 128 496.23 322.34 0.650x 238.89 300.79 1.259x
Llama3-70B/TP8 2 8192 8 1 128 617.34 616.97 0.999x 353.27 269.61 0.763x
Llama3-405B/TP8 2 1024 16 1 128 140.00 138.95 0.992x 89.30 124.16 1.390x
Llama3-405B/TP8 2 2048 16 1 128 364.79 365.62 1.002x 237.26 238.10 1.004x
Llama3-405B/TP8 2 4096 16 1 128 527.40 526.97 0.999x 258.35 240.51 0.931x
Llama3-405B/TP8 2 8192 16 1 128 697.06 697.91 1.001x 361.86 363.01 1.003x
Qwen2.5-7B/TP1 2 1024 28 4 128 230.63 234.43 1.016x 182.59 181.72 0.995x
Qwen2.5-7B/TP1 2 2048 28 4 128 404.59 405.18 1.001x 272.98 271.30 0.994x
Qwen2.5-7B/TP1 2 4096 28 4 128 590.77 517.93 0.877x 290.83 291.70 1.003x
Qwen2.5-7B/TP1 2 8192 28 4 128 707.55 687.44 0.972x 370.92 371.85 1.003x
Qwen2.5-72B/TP8 2 1024 8 1 128 67.26 68.40 1.017x 63.04 44.08 0.699x
Qwen2.5-72B/TP8 2 2048 8 1 128 242.45 238.28 0.983x 232.54 156.39 0.673x
Qwen2.5-72B/TP8 2 4096 8 1 128 493.67 494.51 1.002x 259.64 259.79 1.001x
Qwen2.5-72B/TP8 2 8192 8 1 128 615.70 616.16 1.001x 340.26 352.74 1.037x
benchmark_casting (median 0.998x, min 0.920x, max 1.034x)
Case M hidden_size dtype_str Cast GB/s Base Cast GB/s PR Cast GB/s Speedup
Llama3-8B/BF16-to-FP8-E4M3 1024 4096 BF16-to-FP8-E4M3 731.70 704.60 0.963x
Llama3-8B/BF16-to-FP8-E4M3 2048 4096 BF16-to-FP8-E4M3 1254.70 1250.10 0.996x
Llama3-8B/BF16-to-FP8-E4M3 4096 4096 BF16-to-FP8-E4M3 2160.70 2220.80 1.028x
Llama3-8B/BF16-to-FP8-E4M3 8192 4096 BF16-to-FP8-E4M3 2583.80 2569.50 0.994x
Llama3-8B/FP8-E4M3-to-BF16 1024 4096 FP8-E4M3-to-BF16 1115.30 1026.00 0.920x
Llama3-8B/FP8-E4M3-to-BF16 2048 4096 FP8-E4M3-to-BF16 2365.20 2337.20 0.988x
Llama3-8B/FP8-E4M3-to-BF16 4096 4096 FP8-E4M3-to-BF16 3631.80 3662.40 1.008x
Llama3-8B/FP8-E4M3-to-BF16 8192 4096 FP8-E4M3-to-BF16 3916.40 3737.90 0.954x
Llama3-8B/BF16-to-FP8-E5M2 1024 4096 BF16-to-FP8-E5M2 747.60 750.00 1.003x
Llama3-8B/BF16-to-FP8-E5M2 2048 4096 BF16-to-FP8-E5M2 1260.10 1250.20 0.992x
Llama3-8B/BF16-to-FP8-E5M2 4096 4096 BF16-to-FP8-E5M2 2221.30 2217.50 0.998x
Llama3-8B/BF16-to-FP8-E5M2 8192 4096 BF16-to-FP8-E5M2 2489.00 2573.20 1.034x
Llama3-8B/FP8-E5M2-to-BF16 1024 4096 FP8-E5M2-to-BF16 1167.00 1195.00 1.024x
Llama3-8B/FP8-E5M2-to-BF16 2048 4096 FP8-E5M2-to-BF16 2411.00 2411.60 1.000x
Llama3-8B/FP8-E5M2-to-BF16 4096 4096 FP8-E5M2-to-BF16 3701.80 3693.20 0.998x
Llama3-8B/FP8-E5M2-to-BF16 8192 4096 FP8-E5M2-to-BF16 3962.00 3980.50 1.005x
Llama3-70B/BF16-to-FP8-E4M3 1024 8192 BF16-to-FP8-E4M3 1230.40 1233.00 1.002x
Llama3-70B/BF16-to-FP8-E4M3 2048 8192 BF16-to-FP8-E4M3 2298.20 2310.10 1.005x
Llama3-70B/BF16-to-FP8-E4M3 4096 8192 BF16-to-FP8-E4M3 2682.70 2576.40 0.960x
Llama3-70B/BF16-to-FP8-E4M3 8192 8192 BF16-to-FP8-E4M3 1747.30 1723.50 0.986x
Llama3-70B/FP8-E4M3-to-BF16 1024 8192 FP8-E4M3-to-BF16 2368.60 2244.90 0.948x
Llama3-70B/FP8-E4M3-to-BF16 2048 8192 FP8-E4M3-to-BF16 3685.50 3692.10 1.002x
Llama3-70B/FP8-E4M3-to-BF16 4096 8192 FP8-E4M3-to-BF16 4001.20 3972.50 0.993x
Llama3-70B/FP8-E4M3-to-BF16 8192 8192 FP8-E4M3-to-BF16 4310.00 4264.40 0.989x
Llama3-70B/BF16-to-FP8-E5M2 1024 8192 BF16-to-FP8-E5M2 1173.10 1170.30 0.998x
Llama3-70B/BF16-to-FP8-E5M2 2048 8192 BF16-to-FP8-E5M2 2295.70 2306.90 1.005x
Llama3-70B/BF16-to-FP8-E5M2 4096 8192 BF16-to-FP8-E5M2 2682.40 2576.90 0.961x
Llama3-70B/BF16-to-FP8-E5M2 8192 8192 BF16-to-FP8-E5M2 1761.80 1678.20 0.953x
Llama3-70B/FP8-E5M2-to-BF16 1024 8192 FP8-E5M2-to-BF16 2414.50 2253.20 0.933x
Llama3-70B/FP8-E5M2-to-BF16 2048 8192 FP8-E5M2-to-BF16 3677.70 3674.90 0.999x
Llama3-70B/FP8-E5M2-to-BF16 4096 8192 FP8-E5M2-to-BF16 4029.50 3958.10 0.982x
Llama3-70B/FP8-E5M2-to-BF16 8192 8192 FP8-E5M2-to-BF16 4307.00 4271.90 0.992x
Llama3-405B/BF16-to-FP8-E4M3 1024 16384 BF16-to-FP8-E4M3 2063.50 2076.70 1.006x
Llama3-405B/BF16-to-FP8-E4M3 2048 16384 BF16-to-FP8-E4M3 2323.90 2321.60 0.999x
Llama3-405B/BF16-to-FP8-E4M3 4096 16384 BF16-to-FP8-E4M3 1858.90 1836.20 0.988x
Llama3-405B/BF16-to-FP8-E4M3 8192 16384 BF16-to-FP8-E4M3 1183.70 1178.40 0.996x
Llama3-405B/FP8-E4M3-to-BF16 1024 16384 FP8-E4M3-to-BF16 3685.10 3698.70 1.004x
Llama3-405B/FP8-E4M3-to-BF16 2048 16384 FP8-E4M3-to-BF16 4006.40 4030.60 1.006x
Llama3-405B/FP8-E4M3-to-BF16 4096 16384 FP8-E4M3-to-BF16 4426.90 4412.00 0.997x
Llama3-405B/FP8-E4M3-to-BF16 8192 16384 FP8-E4M3-to-BF16 3714.70 3748.10 1.009x
Llama3-405B/BF16-to-FP8-E5M2 1024 16384 BF16-to-FP8-E5M2 2067.20 2096.30 1.014x
Llama3-405B/BF16-to-FP8-E5M2 2048 16384 BF16-to-FP8-E5M2 2320.10 2335.40 1.007x
Llama3-405B/BF16-to-FP8-E5M2 4096 16384 BF16-to-FP8-E5M2 1855.60 1834.70 0.989x
Llama3-405B/BF16-to-FP8-E5M2 8192 16384 BF16-to-FP8-E5M2 1183.80 1175.30 0.993x
Llama3-405B/FP8-E5M2-to-BF16 1024 16384 FP8-E5M2-to-BF16 3697.50 3692.20 0.999x
Llama3-405B/FP8-E5M2-to-BF16 2048 16384 FP8-E5M2-to-BF16 3997.10 4031.70 1.009x
Llama3-405B/FP8-E5M2-to-BF16 4096 16384 FP8-E5M2-to-BF16 4427.20 4429.10 1.000x
Llama3-405B/FP8-E5M2-to-BF16 8192 16384 FP8-E5M2-to-BF16 3692.60 3740.70 1.013x
Qwen2.5-7B/BF16-to-FP8-E4M3 1024 3584 BF16-to-FP8-E4M3 505.90 503.60 0.995x
Qwen2.5-7B/BF16-to-FP8-E4M3 2048 3584 BF16-to-FP8-E4M3 967.80 957.80 0.990x
Qwen2.5-7B/BF16-to-FP8-E4M3 4096 3584 BF16-to-FP8-E4M3 1404.30 1409.20 1.003x
Qwen2.5-7B/BF16-to-FP8-E4M3 8192 3584 BF16-to-FP8-E4M3 2593.90 2608.70 1.006x
Qwen2.5-7B/FP8-E4M3-to-BF16 1024 3584 FP8-E4M3-to-BF16 1000.80 1012.10 1.011x
Qwen2.5-7B/FP8-E4M3-to-BF16 2048 3584 FP8-E4M3-to-BF16 2081.80 1999.70 0.961x
Qwen2.5-7B/FP8-E4M3-to-BF16 4096 3584 FP8-E4M3-to-BF16 3708.30 3714.50 1.002x
Qwen2.5-7B/FP8-E4M3-to-BF16 8192 3584 FP8-E4M3-to-BF16 3864.60 3862.80 1.000x
Qwen2.5-7B/BF16-to-FP8-E5M2 1024 3584 BF16-to-FP8-E5M2 505.70 503.60 0.996x
Qwen2.5-7B/BF16-to-FP8-E5M2 2048 3584 BF16-to-FP8-E5M2 968.00 959.20 0.991x
Qwen2.5-7B/BF16-to-FP8-E5M2 4096 3584 BF16-to-FP8-E5M2 1406.20 1408.70 1.002x
Qwen2.5-7B/BF16-to-FP8-E5M2 8192 3584 BF16-to-FP8-E5M2 2575.50 2594.60 1.007x
Qwen2.5-7B/FP8-E5M2-to-BF16 1024 3584 FP8-E5M2-to-BF16 1038.10 998.30 0.962x
Qwen2.5-7B/FP8-E5M2-to-BF16 2048 3584 FP8-E5M2-to-BF16 2111.80 2033.60 0.963x
Qwen2.5-7B/FP8-E5M2-to-BF16 4096 3584 FP8-E5M2-to-BF16 3715.70 3724.60 1.002x
Qwen2.5-7B/FP8-E5M2-to-BF16 8192 3584 FP8-E5M2-to-BF16 3895.90 3881.40 0.996x
Qwen2.5-72B/BF16-to-FP8-E4M3 1024 8192 BF16-to-FP8-E4M3 1242.90 1243.80 1.001x
Qwen2.5-72B/BF16-to-FP8-E4M3 2048 8192 BF16-to-FP8-E4M3 2292.30 2307.50 1.007x
Qwen2.5-72B/BF16-to-FP8-E4M3 4096 8192 BF16-to-FP8-E4M3 2690.00 2577.50 0.958x
Qwen2.5-72B/BF16-to-FP8-E4M3 8192 8192 BF16-to-FP8-E4M3 1781.50 1748.70 0.982x
Qwen2.5-72B/FP8-E4M3-to-BF16 1024 8192 FP8-E4M3-to-BF16 2415.80 2333.40 0.966x
Qwen2.5-72B/FP8-E4M3-to-BF16 2048 8192 FP8-E4M3-to-BF16 3688.30 3697.70 1.003x
Qwen2.5-72B/FP8-E4M3-to-BF16 4096 8192 FP8-E4M3-to-BF16 3996.00 3997.00 1.000x
Qwen2.5-72B/FP8-E4M3-to-BF16 8192 8192 FP8-E4M3-to-BF16 4299.10 4295.80 0.999x
Qwen2.5-72B/BF16-to-FP8-E5M2 1024 8192 BF16-to-FP8-E5M2 1182.50 1174.30 0.993x
Qwen2.5-72B/BF16-to-FP8-E5M2 2048 8192 BF16-to-FP8-E5M2 2295.40 2310.90 1.007x
Qwen2.5-72B/BF16-to-FP8-E5M2 4096 8192 BF16-to-FP8-E5M2 2699.50 2582.60 0.957x
Qwen2.5-72B/BF16-to-FP8-E5M2 8192 8192 BF16-to-FP8-E5M2 1784.60 1746.70 0.979x
Qwen2.5-72B/FP8-E5M2-to-BF16 1024 8192 FP8-E5M2-to-BF16 2393.60 2315.80 0.967x
Qwen2.5-72B/FP8-E5M2-to-BF16 2048 8192 FP8-E5M2-to-BF16 3703.50 3698.20 0.999x
Qwen2.5-72B/FP8-E5M2-to-BF16 4096 8192 FP8-E5M2-to-BF16 4009.30 4007.20 0.999x
Qwen2.5-72B/FP8-E5M2-to-BF16 8192 8192 FP8-E5M2-to-BF16 4419.90 4283.00 0.969x
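
The Cast GB/s columns report an effective memory bandwidth. The report does not define how the figure is computed. One plausible definition, stated here purely as an assumption, counts the bytes read plus the bytes written and divides by the elapsed time. The function name, the byte-size table, and the example timing below are hypothetical.

```python
# Assumed storage size per element for the dtypes named in the report.
BYTES_PER_ELEMENT = {"BF16": 2, "FP8-E4M3": 1, "FP8-E5M2": 1}


def cast_bandwidth_gbs(m, hidden_size, dtype_str, seconds):
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes) for an M x hidden_size cast.

    Assumes every element is read once in the source dtype and written once
    in the destination dtype; any extra traffic (e.g. scale factors) is ignored.
    """
    src, dst = dtype_str.split("-to-")
    elements = m * hidden_size
    total_bytes = elements * (BYTES_PER_ELEMENT[src] + BYTES_PER_ELEMENT[dst])
    return total_bytes / seconds / 1e9


# Hypothetical timing of 40 microseconds for an 8192 x 4096 BF16-to-FP8 cast.
gbs = cast_bandwidth_gbs(8192, 4096, "BF16-to-FP8-E4M3", 40e-6)
print(f"{gbs:.1f} GB/s")
```
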
benchmark_gemm (median 1.001x, min 0.322x, max 2.361x)
Case M N K dtype TE Forward Base TE Forward PR TE Forward Speedup TE Backward Base TE Backward PR TE Backward Speedup
Llama3-8B/TP1-QKV 1024 6144 4096 torch.bfloat16 591.68 540.88 0.914x 565.78 578.25 1.022x
Llama3-8B/TP1-AttnOut 1024 4096 4096 torch.bfloat16 553.06 551.29 0.997x 524.17 526.60 1.005x
Llama3-8B/TP1-GateUp 1024 28672 4096 torch.bfloat16 746.97 747.83 1.001x 416.28 414.18 0.995x
Llama3-8B/TP1-Down 1024 4096 14336 torch.bfloat16 582.34 581.93 0.999x 668.69 666.49 0.997x
Llama3-8B/TP1-QKV 2048 6144 4096 torch.bfloat16 677.20 680.06 1.004x 655.94 632.32 0.964x
Llama3-8B/TP1-AttnOut 2048 4096 4096 torch.bfloat16 564.90 565.15 1.000x 658.52 650.44 0.988x
Llama3-8B/TP1-GateUp 2048 28672 4096 torch.bfloat16 648.45 731.64 1.128x 535.40 544.40 1.017x
Llama3-8B/TP1-Down 2048 4096 14336 torch.bfloat16 644.70 644.60 1.000x 707.26 707.21 1.000x
Llama3-8B/TP1-QKV 4096 6144 4096 torch.bfloat16 682.29 728.15 1.067x 726.50 709.37 0.976x
Llama3-8B/TP1-AttnOut 4096 4096 4096 torch.bfloat16 677.77 701.16 1.035x 735.75 733.85 0.997x
Llama3-8B/TP1-GateUp 4096 28672 4096 torch.bfloat16 675.27 758.75 1.124x 810.45 760.78 0.939x
Llama3-8B/TP1-Down 4096 4096 14336 torch.bfloat16 748.83 642.09 0.857x 782.79 848.88 1.084x
Llama3-8B/TP1-QKV 8192 6144 4096 torch.bfloat16 744.54 743.87 0.999x 737.18 680.03 0.922x
Llama3-8B/TP1-AttnOut 8192 4096 4096 torch.bfloat16 730.74 735.51 1.007x 751.13 743.40 0.990x
Llama3-8B/TP1-GateUp 8192 28672 4096 torch.bfloat16 755.67 769.35 1.018x 769.86 728.81 0.947x
Llama3-8B/TP1-Down 8192 4096 14336 torch.bfloat16 710.39 712.41 1.003x 770.32 733.15 0.952x
Llama3-8B/TP8-QKV 1024 768 4096 torch.bfloat16 152.18 151.90 0.998x 57.00 55.64 0.976x
Llama3-8B/TP8-AttnOut 1024 4096 512 torch.bfloat16 101.15 101.61 1.005x 36.73 37.60 1.024x
Llama3-8B/TP8-GateUp 1024 3584 4096 torch.bfloat16 523.70 524.64 1.002x 275.69 278.79 1.011x
Llama3-8B/TP8-Down 1024 4096 1792 torch.bfloat16 355.82 359.89 1.011x 125.42 126.47 1.008x
Llama3-8B/TP8-QKV 2048 768 4096 torch.bfloat16 299.05 300.83 1.006x 111.31 110.81 0.996x
Llama3-8B/TP8-AttnOut 2048 4096 512 torch.bfloat16 203.38 203.75 1.002x 74.00 74.44 1.006x
Llama3-8B/TP8-GateUp 2048 3584 4096 torch.bfloat16 591.88 583.00 0.985x 619.07 622.05 1.005x
Llama3-8B/TP8-Down 2048 4096 1792 torch.bfloat16 549.18 561.59 1.023x 265.03 272.37 1.028x
Llama3-8B/TP8-QKV 4096 768 4096 torch.bfloat16 492.65 497.52 1.010x 223.78 229.66 1.026x
Llama3-8B/TP8-AttnOut 4096 4096 512 torch.bfloat16 399.21 407.65 1.021x 145.15 136.23 0.939x
Llama3-8B/TP8-GateUp 4096 3584 4096 torch.bfloat16 708.43 709.10 1.001x 702.34 720.58 1.026x
Llama3-8B/TP8-Down 4096 4096 1792 torch.bfloat16 642.92 639.09 0.994x 590.33 598.71 1.014x
Llama3-8B/TP8-QKV 8192 768 4096 torch.bfloat16 591.15 588.95 0.996x 507.42 509.56 1.004x
Llama3-8B/TP8-AttnOut 8192 4096 512 torch.bfloat16 489.20 501.70 1.026x 332.85 329.49 0.990x
Llama3-8B/TP8-GateUp 8192 3584 4096 torch.bfloat16 764.35 765.54 1.002x 728.82 734.48 1.008x
Llama3-8B/TP8-Down 8192 4096 1792 torch.bfloat16 608.34 654.30 1.076x 701.91 667.24 0.951x
Llama3-70B/TP8-QKV 1024 1280 8192 torch.bfloat16 417.74 414.70 0.993x 198.45 193.12 0.973x
Llama3-70B/TP8-AttnOut 1024 8192 1024 torch.bfloat16 406.15 411.43 1.013x 145.32 146.47 1.008x
Llama3-70B/TP8-GateUp 1024 7168 8192 torch.bfloat16 700.72 697.09 0.995x 498.79 652.31 1.308x
Llama3-70B/TP8-Down 1024 8192 3584 torch.bfloat16 580.39 597.17 1.029x 570.68 565.70 0.991x
Llama3-70B/TP8-QKV 2048 1280 8192 torch.bfloat16 498.32 500.43 1.004x 439.08 454.43 1.035x
Llama3-70B/TP8-AttnOut 2048 8192 1024 torch.bfloat16 554.32 541.12 0.976x 316.73 329.96 1.042x
Llama3-70B/TP8-GateUp 2048 7168 8192 torch.bfloat16 758.46 754.91 0.995x 607.20 742.45 1.223x
Llama3-70B/TP8-Down 2048 8192 3584 torch.bfloat16 699.38 691.39 0.989x 514.35 656.10 1.276x
Llama3-70B/TP8-QKV 4096 1280 8192 torch.bfloat16 519.74 527.64 1.015x 614.42 610.07 0.993x
Llama3-70B/TP8-AttnOut 4096 8192 1024 torch.bfloat16 599.72 598.24 0.998x 609.34 597.44 0.980x
Llama3-70B/TP8-GateUp 4096 7168 8192 torch.bfloat16 326.94 771.77 2.361x 2374.71 764.81 0.322x
Llama3-70B/TP8-Down 4096 8192 3584 torch.bfloat16 696.83 695.70 0.998x 739.45 739.89 1.001x
Llama3-70B/TP8-QKV 8192 1280 8192 torch.bfloat16 732.43 731.77 0.999x 676.91 678.74 1.003x
Llama3-70B/TP8-AttnOut 8192 8192 1024 torch.bfloat16 608.01 601.17 0.989x 657.81 671.50 1.021x
Llama3-70B/TP8-GateUp 8192 7168 8192 torch.bfloat16 795.07 795.29 1.000x 780.07 778.58 0.998x
Llama3-70B/TP8-Down 8192 8192 3584 torch.bfloat16 707.68 709.46 1.003x 761.20 761.80 1.001x
Llama3-405B/TP8-QKV 1024 2304 16384 torch.bfloat16 547.37 539.73 0.986x 603.38 601.90 0.998x
Llama3-405B/TP8-AttnOut 1024 16384 2048 torch.bfloat16 564.40 566.00 1.003x 537.46 536.20 0.998x
Llama3-405B/TP8-GateUp 1024 13312 16384 torch.bfloat16 671.74 679.53 1.012x 708.74 700.37 0.988x
Llama3-405B/TP8-Down 1024 16384 6656 torch.bfloat16 661.08 660.55 0.999x 517.57 515.50 0.996x
Llama3-405B/TP8-QKV 2048 2304 16384 torch.bfloat16 593.80 593.45 0.999x 688.76 686.37 0.997x
Llama3-405B/TP8-AttnOut 2048 16384 2048 torch.bfloat16 616.14 614.20 0.997x 646.91 651.50 1.007x
Llama3-405B/TP8-GateUp 2048 13312 16384 torch.bfloat16 664.29 671.27 1.011x 745.92 705.88 0.946x
Llama3-405B/TP8-Down 2048 16384 6656 torch.bfloat16 694.89 693.52 0.998x 640.86 702.29 1.096x
Llama3-405B/TP8-QKV 4096 2304 16384 torch.bfloat16 702.97 709.60 1.009x 626.73 719.54 1.148x
Llama3-405B/TP8-AttnOut 4096 16384 2048 torch.bfloat16 702.37 703.01 1.001x 518.03 600.56 1.159x
Llama3-405B/TP8-GateUp 4096 13312 16384 torch.bfloat16 621.31 658.78 1.060x 788.88 762.88 0.967x
Llama3-405B/TP8-Down 4096 16384 6656 torch.bfloat16 735.20 729.72 0.993x 713.31 716.14 1.004x
Llama3-405B/TP8-QKV 8192 2304 16384 torch.bfloat16 623.53 611.59 0.981x 735.79 744.59 1.012x
Llama3-405B/TP8-AttnOut 8192 16384 2048 torch.bfloat16 706.78 705.36 0.998x 748.39 681.01 0.910x
Llama3-405B/TP8-GateUp 8192 13312 16384 torch.bfloat16 679.56 668.72 0.984x 712.11 776.83 1.091x
Llama3-405B/TP8-Down 8192 16384 6656 torch.bfloat16 725.54 724.35 0.998x 733.33 717.45 0.978x
Qwen2.5-7B/TP1-QKV 1024 4608 3584 torch.bfloat16 560.14 563.73 1.006x 304.27 313.21 1.029x
Qwen2.5-7B/TP1-AttnOut 1024 3584 3584 torch.bfloat16 508.56 496.35 0.976x 233.33 235.40 1.009x
Qwen2.5-7B/TP1-GateUp 1024 37888 3584 torch.bfloat16 647.35 645.31 0.997x 592.55 594.47 1.003x
Qwen2.5-7B/TP1-Down 1024 3584 18944 torch.bfloat16 597.84 603.16 1.009x 657.60 662.62 1.008x
Qwen2.5-7B/TP1-QKV 2048 4608 3584 torch.bfloat16 636.82 637.18 1.001x 606.07 598.09 0.987x
Qwen2.5-7B/TP1-AttnOut 2048 3584 3584 torch.bfloat16 574.44 584.08 1.017x 561.73 512.25 0.912x
Qwen2.5-7B/TP1-GateUp 2048 37888 3584 torch.bfloat16 627.21 638.95 1.019x 532.69 527.39 0.990x
Qwen2.5-7B/TP1-Down 2048 3584 18944 torch.bfloat16 618.05 620.19 1.003x 705.68 708.02 1.003x
Qwen2.5-7B/TP1-QKV 4096 4608 3584 torch.bfloat16 645.12 638.90 0.990x 719.66 722.77 1.004x
Qwen2.5-7B/TP1-AttnOut 4096 3584 3584 torch.bfloat16 694.93 664.31 0.956x 677.88 671.21 0.990x
Qwen2.5-7B/TP1-GateUp 4096 37888 3584 torch.bfloat16 680.47 686.40 1.009x 725.63 729.91 1.006x
Qwen2.5-7B/TP1-Down 4096 3584 18944 torch.bfloat16 617.73 715.70 1.159x 844.02 773.91 0.917x
Qwen2.5-7B/TP1-QKV 8192 4608 3584 torch.bfloat16 696.76 705.02 1.012x 736.71 733.10 0.995x
Qwen2.5-7B/TP1-AttnOut 8192 3584 3584 torch.bfloat16 743.13 745.12 1.003x 610.92 717.88 1.175x
Qwen2.5-7B/TP1-GateUp 8192 37888 3584 torch.bfloat16 675.14 706.40 1.046x 785.37 767.06 0.977x
Qwen2.5-7B/TP1-Down 8192 3584 18944 torch.bfloat16 675.69 673.23 0.996x 733.91 734.10 1.000x
Qwen2.5-72B/TP8-QKV 1024 1280 8192 torch.bfloat16 409.01 409.16 1.000x 190.95 189.72 0.994x
Qwen2.5-72B/TP8-AttnOut 1024 8192 1024 torch.bfloat16 404.01 414.30 1.025x 146.48 146.40 0.999x
Qwen2.5-72B/TP8-GateUp 1024 7392 8192 torch.bfloat16 642.62 642.71 1.000x 612.83 614.72 1.003x
Qwen2.5-72B/TP8-Down 1024 8192 3696 torch.bfloat16 611.45 604.58 0.989x 479.04 485.55 1.014x
Qwen2.5-72B/TP8-QKV 2048 1280 8192 torch.bfloat16 497.06 496.35 0.999x 446.73 442.49 0.991x
Qwen2.5-72B/TP8-AttnOut 2048 8192 1024 torch.bfloat16 541.94 528.69 0.976x 322.38 327.45 1.016x
Qwen2.5-72B/TP8-GateUp 2048 7392 8192 torch.bfloat16 676.61 678.88 1.003x 699.78 696.51 0.995x
Qwen2.5-72B/TP8-Down 2048 8192 3696 torch.bfloat16 627.88 626.47 0.998x 577.34 577.04 0.999x
Qwen2.5-72B/TP8-QKV 4096 1280 8192 torch.bfloat16 522.99 526.33 1.006x 608.79 607.74 0.998x
Qwen2.5-72B/TP8-AttnOut 4096 8192 1024 torch.bfloat16 590.00 595.77 1.010x 594.40 592.51 0.997x
Qwen2.5-72B/TP8-GateUp 4096 7392 8192 torch.bfloat16 712.77 713.02 1.000x 723.54 660.74 0.913x
Qwen2.5-72B/TP8-Down 4096 8192 3696 torch.bfloat16 701.61 699.06 0.996x 672.36 674.50 1.003x
Qwen2.5-72B/TP8-QKV 8192 1280 8192 torch.bfloat16 716.06 718.70 1.004x 669.16 676.93 1.012x
Qwen2.5-72B/TP8-AttnOut 8192 8192 1024 torch.bfloat16 601.04 601.07 1.000x 656.66 668.00 1.017x
Qwen2.5-72B/TP8-GateUp 8192 7392 8192 torch.bfloat16 754.89 757.19 1.003x 702.16 692.87 0.987x
Qwen2.5-72B/TP8-Down 8192 8192 3696 torch.bfloat16 593.87 712.95 1.201x 769.46 702.40 0.913x
benchmark_gemm_fp8 (median 0.986x, min 0.285x, max 1.610x)
Case M N K dtype FP8 Forward Base FP8 Forward PR FP8 Forward Speedup FP8 Backward Base FP8 Backward PR FP8 Backward Speedup
Llama3-8B/TP1-QKV 1024 6144 4096 torch.bfloat16 378.14 380.23 1.006x 396.68 222.57 0.561x
Llama3-8B/TP1-AttnOut 1024 4096 4096 torch.bfloat16 261.48 258.23 0.988x 180.05 149.01 0.828x
Llama3-8B/TP1-GateUp 1024 28672 4096 torch.bfloat16 441.97 446.80 1.011x 940.07 948.99 1.009x
Llama3-8B/TP1-Down 1024 4096 14336 torch.bfloat16 418.10 419.15 1.003x 938.10 737.39 0.786x
Llama3-8B/TP1-QKV 2048 6144 4096 torch.bfloat16 688.50 654.45 0.951x 927.69 841.70 0.907x
Llama3-8B/TP1-AttnOut 2048 4096 4096 torch.bfloat16 516.39 507.14 0.982x 890.86 343.41 0.385x
Llama3-8B/TP1-GateUp 2048 28672 4096 torch.bfloat16 703.37 698.54 0.993x 991.17 1000.88 1.010x
Llama3-8B/TP1-Down 2048 4096 14336 torch.bfloat16 644.41 656.80 1.019x 1026.81 1000.97 0.975x
Llama3-8B/TP1-QKV 4096 6144 4096 torch.bfloat16 860.17 855.63 0.995x 1006.47 1008.00 1.002x
Llama3-8B/TP1-AttnOut 4096 4096 4096 torch.bfloat16 756.96 750.96 0.992x 994.86 627.90 0.631x
Llama3-8B/TP1-GateUp 4096 28672 4096 torch.bfloat16 891.83 872.75 0.979x 1077.84 1074.88 0.997x
Llama3-8B/TP1-Down 4096 4096 14336 torch.bfloat16 776.55 777.02 1.001x 1209.00 1205.00 0.997x
Llama3-8B/TP1-QKV 8192 6144 4096 torch.bfloat16 978.53 972.66 0.994x 1052.30 925.52 0.880x
Llama3-8B/TP1-AttnOut 8192 4096 4096 torch.bfloat16 885.61 957.24 1.081x 1133.74 1074.68 0.948x
Llama3-8B/TP1-GateUp 8192 28672 4096 torch.bfloat16 1084.89 1095.76 1.010x 1098.03 1068.04 0.973x
Llama3-8B/TP1-Down 8192 4096 14336 torch.bfloat16 859.54 782.10 0.910x 1195.36 1279.06 1.070x
Llama3-8B/TP8-QKV 1024 768 4096 torch.bfloat16 46.68 45.77 0.981x 49.17 30.54 0.621x
Llama3-8B/TP8-AttnOut 1024 4096 512 torch.bfloat16 31.01 23.09 0.745x 26.42 18.70 0.708x
Llama3-8B/TP8-GateUp 1024 3584 4096 torch.bfloat16 216.19 212.29 0.982x 396.30 206.62 0.521x
Llama3-8B/TP8-Down 1024 4096 1792 torch.bfloat16 107.89 105.90 0.982x 90.11 102.33 1.136x
Llama3-8B/TP8-QKV 2048 768 4096 torch.bfloat16 92.12 90.78 0.985x 71.60 86.40 1.207x
Llama3-8B/TP8-AttnOut 2048 4096 512 torch.bfloat16 60.85 60.43 0.993x 51.91 39.90 0.769x
Llama3-8B/TP8-GateUp 2048 3584 4096 torch.bfloat16 427.92 420.70 0.983x 358.77 276.59 0.771x
Llama3-8B/TP8-Down 2048 4096 1792 torch.bfloat16 212.98 209.85 0.985x 179.69 51.16 0.285x
Llama3-8B/TP8-QKV 4096 768 4096 torch.bfloat16 181.83 180.56 0.993x 153.02 98.67 0.645x
Llama3-8B/TP8-AttnOut 4096 4096 512 torch.bfloat16 121.14 119.86 0.989x 102.16 66.09 0.647x
Llama3-8B/TP8-GateUp 4096 3584 4096 torch.bfloat16 746.10 734.26 0.984x 748.24 468.74 0.626x
Llama3-8B/TP8-Down 4096 4096 1792 torch.bfloat16 418.60 385.64 0.921x 359.98 231.24 0.642x
Llama3-8B/TP8-QKV 8192 768 4096 torch.bfloat16 359.00 353.96 0.986x 303.24 194.26 0.641x
Llama3-8B/TP8-AttnOut 8192 4096 512 torch.bfloat16 239.68 235.41 0.982x 198.77 130.20 0.655x
Llama3-8B/TP8-GateUp 8192 3584 4096 torch.bfloat16 882.32 857.75 0.972x 853.81 1082.94 1.268x
Llama3-8B/TP8-Down 8192 4096 1792 torch.bfloat16 678.54 690.40 1.017x 764.79 474.50 0.620x
Llama3-70B/TP8-QKV 1024 1280 8192 torch.bfloat16 149.68 147.15 0.983x 126.81 80.49 0.635x
Llama3-70B/TP8-AttnOut 1024 8192 1024 torch.bfloat16 119.57 117.04 0.979x 99.28 63.90 0.644x
Llama3-70B/TP8-GateUp 1024 7168 8192 torch.bfloat16 427.00 426.75 0.999x 804.59 584.35 0.726x
Llama3-70B/TP8-Down 1024 8192 3584 torch.bfloat16 415.27 407.05 0.980x 349.06 223.73 0.641x
Llama3-70B/TP8-QKV 2048 1280 8192 torch.bfloat16 297.18 292.71 0.985x 248.77 159.93 0.643x
Llama3-70B/TP8-AttnOut 2048 8192 1024 torch.bfloat16 236.61 232.61 0.983x 196.43 127.03 0.647x
Llama3-70B/TP8-GateUp 2048 7168 8192 torch.bfloat16 610.89 502.97 0.823x 902.80 1453.88 1.610x
Llama3-70B/TP8-Down 2048 8192 3584 torch.bfloat16 700.47 713.50 1.019x 733.98 451.37 0.615x
Llama3-70B/TP8-QKV 4096 1280 8192 torch.bfloat16 542.17 546.10 1.007x 506.34 315.65 0.623x
Llama3-70B/TP8-AttnOut 4096 8192 1024 torch.bfloat16 463.04 458.40 0.990x 391.22 273.45 0.699x
Llama3-70B/TP8-GateUp 4096 7168 8192 torch.bfloat16 841.38 820.94 0.976x 1151.14 1186.75 1.031x
Llama3-70B/TP8-Down 4096 8192 3584 torch.bfloat16 882.45 842.82 0.955x 1026.79 1054.82 1.027x
Llama3-70B/TP8-QKV 8192 1280 8192 torch.bfloat16 443.03 446.33 1.007x 962.61 955.33 0.992x
Llama3-70B/TP8-AttnOut 8192 8192 1024 torch.bfloat16 675.66 670.79 0.993x 546.66 544.59 0.996x
Llama3-70B/TP8-GateUp 8192 7168 8192 torch.bfloat16 863.77 860.41 0.996x 1236.13 1158.05 0.937x
Llama3-70B/TP8-Down 8192 8192 3584 torch.bfloat16 976.25 972.75 0.996x 1045.82 1051.59 1.006x
Llama3-405B/TP8-QKV 1024 2304 16384 torch.bfloat16 488.99 492.35 1.007x 439.79 433.71 0.986x
Llama3-405B/TP8-AttnOut 1024 16384 2048 torch.bfloat16 454.92 464.26 1.021x 384.34 381.92 0.994x
Llama3-405B/TP8-GateUp 1024 13312 16384 torch.bfloat16 479.77 482.22 1.005x 1055.76 1047.58 0.992x
Llama3-405B/TP8-Down 1024 16384 6656 torch.bfloat16 466.09 467.39 1.003x 925.94 921.43 0.995x
Llama3-405B/TP8-QKV 2048 2304 16384 torch.bfloat16 674.19 649.05 0.963x 951.73 954.59 1.003x
Llama3-405B/TP8-AttnOut 2048 16384 2048 torch.bfloat16 706.04 701.10 0.993x 814.34 819.39 1.006x
Llama3-405B/TP8-GateUp 2048 13312 16384 torch.bfloat16 689.36 700.17 1.016x 1124.65 1040.51 0.925x
Llama3-405B/TP8-Down 2048 16384 6656 torch.bfloat16 665.66 559.19 0.840x 1046.41 1222.82 1.169x
Llama3-405B/TP8-QKV 4096 2304 16384 torch.bfloat16 599.33 606.85 1.013x 1133.79 1107.40 0.977x
Llama3-405B/TP8-AttnOut 4096 16384 2048 torch.bfloat16 824.37 826.55 1.003x 823.46 829.47 1.007x
Llama3-405B/TP8-GateUp 4096 13312 16384 torch.bfloat16 835.02 828.55 0.992x 1180.13 1212.87 1.028x
Llama3-405B/TP8-Down 4096 16384 6656 torch.bfloat16 860.97 857.20 0.996x 1136.42 1138.12 1.001x
Llama3-405B/TP8-QKV 8192 2304 16384 torch.bfloat16 741.53 740.34 0.998x 1145.48 1037.09 0.905x
Llama3-405B/TP8-AttnOut 8192 16384 2048 torch.bfloat16 969.80 969.06 0.999x 882.59 810.85 0.919x
Llama3-405B/TP8-GateUp 8192 13312 16384 torch.bfloat16 945.27 974.19 1.031x 1296.04 1276.34 0.985x
Llama3-405B/TP8-Down 8192 16384 6656 torch.bfloat16 913.89 982.00 1.075x 1208.61 1154.32 0.955x
Qwen2.5-7B/TP1-QKV 1024 4608 3584 torch.bfloat16 216.95 215.66 0.994x 168.68 113.82 0.675x
Qwen2.5-7B/TP1-AttnOut 1024 3584 3584 torch.bfloat16 171.68 168.54 0.982x 141.76 88.56 0.625x
Qwen2.5-7B/TP1-GateUp 1024 37888 3584 torch.bfloat16 464.55 465.51 1.002x 915.88 923.65 1.008x
Qwen2.5-7B/TP1-Down 1024 3584 18944 torch.bfloat16 397.47 400.37 1.007x 985.68 660.54 0.670x
Qwen2.5-7B/TP1-QKV 2048 4608 3584 torch.bfloat16 438.89 430.90 0.982x 335.96 222.31 0.662x
Qwen2.5-7B/TP1-AttnOut 2048 3584 3584 torch.bfloat16 340.03 334.53 0.984x 272.67 173.02 0.635x
Qwen2.5-7B/TP1-GateUp 2048 37888 3584 torch.bfloat16 606.30 676.60 1.116x 1020.92 935.10 0.916x
Qwen2.5-7B/TP1-Down 2048 3584 18944 torch.bfloat16 605.81 596.32 0.984x 1001.46 1020.38 1.019x
Qwen2.5-7B/TP1-QKV 4096 4608 3584 torch.bfloat16 780.96 443.45 0.568x 737.00 575.81 0.781x
Qwen2.5-7B/TP1-AttnOut 4096 3584 3584 torch.bfloat16 618.43 632.15 1.022x 573.33 348.09 0.607x
Qwen2.5-7B/TP1-GateUp 4096 37888 3584 torch.bfloat16 879.99 894.78 1.017x 1058.47 1053.23 0.995x
Qwen2.5-7B/TP1-Down 4096 3584 18944 torch.bfloat16 731.04 727.08 0.995x 1158.43 1051.43 0.908x
Qwen2.5-7B/TP1-QKV 8192 4608 3584 torch.bfloat16 906.38 921.77 1.017x 1066.33 1057.70 0.992x
Qwen2.5-7B/TP1-AttnOut 8192 3584 3584 torch.bfloat16 844.81 852.78 1.009x 1020.12 784.32 0.769x
Qwen2.5-7B/TP1-GateUp 8192 37888 3584 torch.bfloat16 988.73 990.40 1.002x 1046.09 1071.67 1.024x
Qwen2.5-7B/TP1-Down 8192 3584 18944 torch.bfloat16 826.23 823.97 0.997x 1142.17 1225.21 1.073x
Qwen2.5-72B/TP8-QKV 1024 1280 8192 torch.bfloat16 138.15 135.14 0.978x 107.97 68.12 0.631x
Qwen2.5-72B/TP8-AttnOut 1024 8192 1024 torch.bfloat16 110.14 107.58 0.977x 87.93 52.25 0.594x
Qwen2.5-72B/TP8-GateUp 1024 7392 8192 torch.bfloat16 389.01 389.72 1.002x 821.18 800.21 0.974x
Qwen2.5-72B/TP8-Down 1024 8192 3696 torch.bfloat16 330.70 334.50 1.011x 348.97 204.55 0.586x
Qwen2.5-72B/TP8-QKV 2048 1280 8192 torch.bfloat16 272.61 267.24 0.980x 218.58 135.16 0.618x
Qwen2.5-72B/TP8-AttnOut 2048 8192 1024 torch.bfloat16 218.03 214.46 0.984x 173.37 107.57 0.620x
Qwen2.5-72B/TP8-GateUp 2048 7392 8192 torch.bfloat16 637.36 630.76 0.990x 983.71 1040.77 1.058x
Qwen2.5-72B/TP8-Down 2048 8192 3696 torch.bfloat16 327.51 483.54 1.476x 1393.90 456.44 0.327x
Qwen2.5-72B/TP8-QKV 4096 1280 8192 torch.bfloat16 526.68 315.77 0.600x 435.24 321.30 0.738x
Qwen2.5-72B/TP8-AttnOut 4096 8192 1024 torch.bfloat16 425.69 424.82 0.998x 344.86 213.62 0.619x
Qwen2.5-72B/TP8-GateUp 4096 7392 8192 torch.bfloat16 806.53 794.63 0.985x 1090.67 1111.11 1.019x
Qwen2.5-72B/TP8-Down 4096 8192 3696 torch.bfloat16 643.16 640.06 0.995x 1008.94 1007.99 0.999x
Qwen2.5-72B/TP8-QKV 8192 1280 8192 torch.bfloat16 433.62 312.92 0.722x 938.52 1233.54 1.314x
Qwen2.5-72B/TP8-AttnOut 8192 8192 1024 torch.bfloat16 653.31 655.29 1.003x 532.67 453.51 0.851x
Qwen2.5-72B/TP8-GateUp 8192 7392 8192 torch.bfloat16 849.70 856.01 1.007x 1240.85 1234.23 0.995x
Qwen2.5-72B/TP8-Down 8192 8192 3696 torch.bfloat16 836.43 838.64 1.003x 962.93 964.94 1.002x
benchmark_grouped_gemm (median 1.000x, min 0.439x, max 2.438x)
Case B M N K dtype TE (CK_Tile) Forward Base TE (CK_Tile) Forward PR TE (CK_Tile) Forward Speedup TE (CK_Tile) Backward Base TE (CK_Tile) Backward PR TE (CK_Tile) Backward Speedup
DSV2-Lite-GateUP 2 512 2816 2048 torch.bfloat16 136.61 136.55 1.000x 146.06 146.00 1.000x
DSV2-Lite-Down 2 512 2048 1408 torch.bfloat16 94.93 94.59 0.996x 147.73 147.64 0.999x
DSV2-Lite-GateUP 2 1024 2816 2048 torch.bfloat16 107.97 263.21 2.438x 241.01 246.45 1.023x
DSV2-Lite-Down 2 1024 2048 1408 torch.bfloat16 184.01 183.88 0.999x 248.51 247.66 0.997x
DSV2-Lite-GateUP 2 2048 2816 2048 torch.bfloat16 443.08 194.70 0.439x 346.09 353.65 1.022x
DSV2-Lite-Down 2 2048 2048 1408 torch.bfloat16 339.55 341.79 1.007x 361.04 361.20 1.000x
DSV2-Lite-GateUP 2 4096 2816 2048 torch.bfloat16 451.10 452.78 1.004x 453.46 435.64 0.961x
DSV2-Lite-Down 2 4096 2048 1408 torch.bfloat16 535.49 535.92 1.001x 387.76 385.58 0.994x
DSV2-Lite-GateUP 4 512 2816 2048 torch.bfloat16 261.10 261.21 1.000x 224.11 223.33 0.997x
DSV2-Lite-Down 4 512 2048 1408 torch.bfloat16 182.07 182.05 1.000x 223.48 223.01 0.998x
DSV2-Lite-GateUP 4 1024 2816 2048 torch.bfloat16 433.24 433.93 1.002x 337.89 337.33 0.998x
DSV2-Lite-Down 4 1024 2048 1408 torch.bfloat16 334.81 333.68 0.997x 329.57 329.23 0.999x
DSV2-Lite-GateUP 4 2048 2816 2048 torch.bfloat16 425.44 424.19 0.997x 437.01 437.32 1.001x
DSV2-Lite-Down 4 2048 2048 1408 torch.bfloat16 523.87 525.98 1.004x 337.00 348.13 1.033x
DSV2-Lite-GateUP 4 4096 2816 2048 torch.bfloat16 570.78 563.79 0.988x 446.51 446.61 1.000x
DSV2-Lite-Down 4 4096 2048 1408 torch.bfloat16 544.30 554.23 1.018x 412.63 410.77 0.995x
DSV2-Lite-GateUP 8 512 2816 2048 torch.bfloat16 423.34 419.98 0.992x 335.54 332.86 0.992x
DSV2-Lite-Down 8 512 2048 1408 torch.bfloat16 325.58 324.71 0.997x 320.35 319.85 0.998x
DSV2-Lite-GateUP 8 1024 2816 2048 torch.bfloat16 407.18 405.29 0.995x 440.04 441.57 1.003x
DSV2-Lite-Down 8 1024 2048 1408 torch.bfloat16 504.75 501.55 0.994x 364.48 364.15 0.999x
DSV2-Lite-GateUP 8 2048 2816 2048 torch.bfloat16 567.09 567.78 1.001x 495.92 474.38 0.957x
DSV2-Lite-Down 8 2048 2048 1408 torch.bfloat16 536.33 536.70 1.001x 459.26 459.74 1.001x
DSV2-Lite-GateUP 8 4096 2816 2048 torch.bfloat16 611.51 615.57 1.007x 514.81 515.23 1.001x
DSV2-Lite-Down 8 4096 2048 1408 torch.bfloat16 570.06 570.33 1.000x 499.43 496.88 0.995x
DSV2-GateUP 5 512 3072 5120 torch.bfloat16 356.91 360.34 1.010x 422.01 410.45 0.973x
DSV2-Down 5 512 5120 1536 torch.bfloat16 443.37 440.61 0.994x 254.27 253.67 0.998x
DSV2-GateUP 5 1024 3072 5120 torch.bfloat16 606.34 606.51 1.000x 480.05 485.13 1.011x
DSV2-Down 5 1024 5120 1536 torch.bfloat16 459.41 460.46 1.002x 388.05 388.54 1.001x
DSV2-GateUP 5 2048 3072 5120 torch.bfloat16 579.84 579.33 0.999x 559.41 559.62 1.000x
DSV2-Down 5 2048 5120 1536 torch.bfloat16 585.94 586.28 1.001x 538.69 539.68 1.002x
DSV2-GateUP 5 4096 3072 5120 torch.bfloat16 612.67 611.57 0.998x 567.12 573.83 1.012x
DSV2-Down 5 4096 5120 1536 torch.bfloat16 590.31 589.30 0.998x 558.32 558.86 1.001x
DSV2-GateUP 10 512 3072 5120 torch.bfloat16 526.42 524.88 0.997x 418.98 417.29 0.996x
DSV2-Down 10 512 5120 1536 torch.bfloat16 446.07 446.85 1.002x 347.50 348.18 1.002x
DSV2-GateUP 10 1024 3072 5120 torch.bfloat16 493.04 490.06 0.994x 519.32 514.08 0.990x
DSV2-Down 10 1024 5120 1536 torch.bfloat16 548.76 543.97 0.991x 482.00 482.63 1.001x
DSV2-GateUP 10 2048 3072 5120 torch.bfloat16 596.15 596.18 1.000x 527.45 528.58 1.002x
DSV2-Down 10 2048 5120 1536 torch.bfloat16 544.78 546.27 1.003x 535.06 534.90 1.000x
DSV2-GateUP 10 4096 3072 5120 torch.bfloat16 654.99 653.80 0.998x 579.65 578.41 0.998x
DSV2-Down 10 4096 5120 1536 torch.bfloat16 567.31 569.53 1.004x 555.25 556.35 1.002x
DSV2-GateUP 20 512 3072 5120 torch.bfloat16 516.01 515.30 0.999x 438.67 437.76 0.998x
DSV2-Down 20 512 5120 1536 torch.bfloat16 490.73 495.44 1.010x 421.15 421.12 1.000x
DSV2-GateUP 20 1024 3072 5120 torch.bfloat16 544.96 541.73 0.994x 513.68 512.47 0.998x
DSV2-Down 20 1024 5120 1536 torch.bfloat16 512.41 512.31 1.000x 472.34 472.73 1.001x
DSV2-GateUP 20 2048 3072 5120 torch.bfloat16 641.16 641.96 1.001x 542.03 556.34 1.026x
DSV2-Down 20 2048 5120 1536 torch.bfloat16 556.16 557.08 1.002x 508.18 507.29 0.998x
DSV2-GateUP 20 4096 3072 5120 torch.bfloat16 662.95 663.15 1.000x 563.79 556.14 0.986x
DSV2-Down 20 4096 5120 1536 torch.bfloat16 546.03 548.11 1.004x 551.84 564.05 1.022x
DSV3-GateUP 8 512 4096 7168 torch.bfloat16 561.83 561.96 1.000x 439.28 439.53 1.001x
DSV3-Down 8 512 7168 2048 torch.bfloat16 506.00 502.19 0.992x 363.25 363.13 1.000x
DSV3-GateUP 8 1024 4096 7168 torch.bfloat16 593.38 594.00 1.001x 537.98 537.92 1.000x
DSV3-Down 8 1024 7168 2048 torch.bfloat16 605.11 607.23 1.004x 512.19 513.62 1.003x
DSV3-GateUP 8 2048 4096 7168 torch.bfloat16 629.67 619.95 0.985x 566.65 567.10 1.001x
DSV3-Down 8 2048 7168 2048 torch.bfloat16 628.29 630.02 1.003x 555.61 516.41 0.929x
DSV3-GateUP 8 4096 4096 7168 torch.bfloat16 656.86 682.34 1.039x 574.16 573.92 1.000x
DSV3-Down 8 4096 7168 2048 torch.bfloat16 639.12 641.34 1.003x 552.53 573.03 1.037x
DSV3-GateUP 16 512 4096 7168 torch.bfloat16 544.52 543.91 0.999x 450.89 462.98 1.027x
DSV3-Down 16 512 7168 2048 torch.bfloat16 535.66 535.10 0.999x 439.26 439.21 1.000x
DSV3-GateUP 16 1024 4096 7168 torch.bfloat16 567.41 538.63 0.949x 524.70 524.45 1.000x
DSV3-Down 16 1024 7168 2048 torch.bfloat16 594.85 595.96 1.002x 490.64 490.53 1.000x
DSV3-GateUP 16 2048 4096 7168 torch.bfloat16 669.68 638.09 0.953x 555.96 556.79 1.001x
DSV3-Down 16 2048 7168 2048 torch.bfloat16 626.21 626.69 1.001x 528.65 550.39 1.041x
DSV3-GateUP 16 4096 4096 7168 torch.bfloat16 657.35 670.19 1.020x 583.75 579.27 0.992x
DSV3-Down 16 4096 7168 2048 torch.bfloat16 632.78 632.01 0.999x 582.36 572.27 0.983x
DSV3-GateUP 32 512 4096 7168 torch.bfloat16 538.44 538.55 1.000x 426.59 426.21 0.999x
DSV3-Down 32 512 7168 2048 torch.bfloat16 526.51 459.17 0.872x 402.05 426.37 1.060x
DSV3-GateUP 32 1024 4096 7168 torch.bfloat16 644.04 642.56 0.998x 506.65 514.44 1.015x
DSV3-Down 32 1024 7168 2048 torch.bfloat16 588.36 587.85 0.999x 472.44 473.26 1.002x
DSV3-GateUP 32 2048 4096 7168 torch.bfloat16 656.46 641.24 0.977x 554.86 559.71 1.009x
DSV3-Down 32 2048 7168 2048 torch.bfloat16 613.63 613.77 1.000x 545.77 554.32 1.016x
DSV3-GateUP 32 4096 4096 7168 torch.bfloat16 667.88 674.54 1.010x 571.30 571.35 1.000x
DSV3-Down 32 4096 7168 2048 torch.bfloat16 599.49 610.80 1.019x 574.82 575.19 1.001x
Grok-V2-GateUP 1 512 32768 8192 torch.bfloat16 514.60 515.22 1.001x 553.62 555.90 1.004x
Grok-V2-Down 1 512 8192 16384 torch.bfloat16 480.71 483.43 1.006x 570.76 577.15 1.011x
Grok-V2-GateUP 1 1024 32768 8192 torch.bfloat16 614.69 614.51 1.000x 646.34 605.78 0.937x
Grok-V2-Down 1 1024 8192 16384 torch.bfloat16 534.74 533.95 0.999x 645.96 645.59 0.999x
Grok-V2-GateUP 1 2048 32768 8192 torch.bfloat16 659.29 659.61 1.000x 746.55 747.23 1.001x
Grok-V2-Down 1 2048 8192 16384 torch.bfloat16 605.26 605.95 1.001x 718.30 716.29 0.997x
Grok-V2-GateUP 1 4096 32768 8192 torch.bfloat16 724.20 724.67 1.001x 761.82 762.57 1.001x
Grok-V2-Down 1 4096 8192 16384 torch.bfloat16 626.54 627.74 1.002x 743.74 746.09 1.003x
benchmark_normalization (median 1.006x, min 0.633x, max 2.066x)
Case | M | hidden_size | dtype | TE Forward GB/s Base | TE Forward GB/s PR | TE Forward GB/s Speedup | TE Backward GB/s Base | TE Backward GB/s PR | TE Backward GB/s Speedup
Llama3-8B/RMSNorm 1024 4096 torch.bfloat16 609.30 578.90 0.950x 312.20 644.60 2.065x
Llama3-8B/RMSNorm 2048 4096 torch.bfloat16 1205.50 1163.00 0.965x 643.80 1329.80 2.066x
Llama3-8B/RMSNorm 4096 4096 torch.bfloat16 2449.20 2366.20 0.966x 2595.00 2612.60 1.007x
Llama3-8B/RMSNorm 8192 4096 torch.bfloat16 3812.50 3856.40 1.012x 4284.00 4211.50 0.983x
Llama3-8B/LayerNorm 1024 4096 torch.bfloat16 528.60 491.40 0.930x 561.00 591.70 1.055x
Llama3-8B/LayerNorm 2048 4096 torch.bfloat16 1045.80 1020.30 0.976x 1111.10 1144.10 1.030x
Llama3-8B/LayerNorm 4096 4096 torch.bfloat16 2031.30 2021.20 0.995x 2286.40 2294.30 1.003x
Llama3-8B/LayerNorm 8192 4096 torch.bfloat16 3497.10 3520.10 1.007x 4018.80 3071.20 0.764x
Llama3-70B/RMSNorm 1024 8192 torch.bfloat16 1215.30 1176.60 0.968x 1320.10 844.50 0.640x
Llama3-70B/RMSNorm 2048 8192 torch.bfloat16 2402.30 2365.50 0.985x 2684.10 1698.30 0.633x
Llama3-70B/RMSNorm 4096 8192 torch.bfloat16 3086.30 3054.70 0.990x 3292.10 3278.60 0.996x
Llama3-70B/RMSNorm 8192 8192 torch.bfloat16 3639.30 3637.10 0.999x 3560.70 3486.10 0.979x
Llama3-70B/LayerNorm 1024 8192 torch.bfloat16 1034.70 1024.60 0.990x 657.30 556.50 0.847x
Llama3-70B/LayerNorm 2048 8192 torch.bfloat16 2054.00 2042.30 0.994x 731.70 737.80 1.008x
Llama3-70B/LayerNorm 4096 8192 torch.bfloat16 2970.00 2961.50 0.997x 640.90 678.80 1.059x
Llama3-70B/LayerNorm 8192 8192 torch.bfloat16 3524.30 3538.60 1.004x 599.10 625.10 1.043x
Llama3-405B/RMSNorm 1024 16384 torch.bfloat16 664.60 681.30 1.025x 536.60 519.80 0.969x
Llama3-405B/RMSNorm 2048 16384 torch.bfloat16 714.30 718.80 1.006x 522.70 535.30 1.024x
Llama3-405B/RMSNorm 4096 16384 torch.bfloat16 729.40 726.20 0.996x 455.10 464.00 1.020x
Llama3-405B/RMSNorm 8192 16384 torch.bfloat16 608.40 608.50 1.000x 477.20 479.80 1.005x
Llama3-405B/LayerNorm 1024 16384 torch.bfloat16 672.50 681.50 1.013x 614.40 624.50 1.016x
Llama3-405B/LayerNorm 2048 16384 torch.bfloat16 673.40 672.10 0.998x 597.90 637.10 1.066x
Llama3-405B/LayerNorm 4096 16384 torch.bfloat16 701.00 701.40 1.001x 522.90 548.70 1.049x
Llama3-405B/LayerNorm 8192 16384 torch.bfloat16 577.60 582.30 1.008x 606.40 607.30 1.001x
Qwen2.5-7B/RMSNorm 1024 3584 torch.bfloat16 522.70 525.70 1.006x 181.50 296.20 1.632x
Qwen2.5-7B/RMSNorm 2048 3584 torch.bfloat16 1049.10 1036.50 0.988x 377.00 588.70 1.562x
Qwen2.5-7B/RMSNorm 4096 3584 torch.bfloat16 2108.70 2067.30 0.980x 734.20 997.60 1.359x
Qwen2.5-7B/RMSNorm 8192 3584 torch.bfloat16 2905.40 2927.20 1.008x 1521.10 1640.50 1.078x
Qwen2.5-7B/LayerNorm 1024 3584 torch.bfloat16 444.50 447.00 1.006x 146.90 248.20 1.690x
Qwen2.5-7B/LayerNorm 2048 3584 torch.bfloat16 902.80 899.30 0.996x 296.40 498.60 1.682x
Qwen2.5-7B/LayerNorm 4096 3584 torch.bfloat16 1767.00 1768.60 1.001x 583.20 891.80 1.529x
Qwen2.5-7B/LayerNorm 8192 3584 torch.bfloat16 2531.80 2573.90 1.017x 1223.00 1436.60 1.175x
Qwen2.5-72B/RMSNorm 1024 8192 torch.bfloat16 1168.60 1188.60 1.017x 396.20 672.10 1.696x
Qwen2.5-72B/RMSNorm 2048 8192 torch.bfloat16 2348.80 2354.40 1.002x 816.50 1374.00 1.683x
Qwen2.5-72B/RMSNorm 4096 8192 torch.bfloat16 3032.00 3021.70 0.997x 1770.80 2928.60 1.654x
Qwen2.5-72B/RMSNorm 8192 8192 torch.bfloat16 3568.60 3604.10 1.010x 3541.80 3523.90 0.995x
Qwen2.5-72B/LayerNorm 1024 8192 torch.bfloat16 1039.00 1006.90 0.969x 319.90 543.40 1.699x
Qwen2.5-72B/LayerNorm 2048 8192 torch.bfloat16 2085.00 2058.50 0.987x 645.70 741.00 1.148x
Qwen2.5-72B/LayerNorm 4096 8192 torch.bfloat16 2987.20 2967.70 0.993x 643.30 679.60 1.056x
Qwen2.5-72B/LayerNorm 8192 8192 torch.bfloat16 3434.70 3502.90 1.020x 592.70 626.00 1.056x
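
Several rows above sit far from 1.000x while their neighbours barely move. A minimal sketch of how such rows could be labelled for the PR comment, assuming the Speedup column is PR divided by Base (consistent with the values shown) and using a hypothetical 5% noise band; `TOLERANCE` and `classify` are illustrative names, not part of the benchmark scripts. In line with the PR description, the sketch only reports and never fails the run.

```python
# Hypothetical noise band: ratios within +/-5% of 1.0 are treated as noise.
TOLERANCE = 0.05

# (case, M, base GB/s, PR GB/s) copied from three backward rows of the
# benchmark_normalization table above.
rows = [
    ("Llama3-8B/RMSNorm", 1024, 312.20, 644.60),
    ("Llama3-70B/RMSNorm", 2048, 2684.10, 1698.30),
    ("Llama3-70B/RMSNorm", 8192, 3560.70, 3486.10),
]


def classify(base, pr, tolerance=TOLERANCE):
    """Return (ratio, label), where ratio is PR throughput over base throughput."""
    ratio = pr / base
    if ratio < 1.0 - tolerance:
        return ratio, "regression"
    if ratio > 1.0 + tolerance:
        return ratio, "improvement"
    return ratio, "within noise"


report = [(case, m, *classify(base, pr)) for case, m, base, pr in rows]
for case, m, ratio, label in report:
    print(f"{case} M={m}: {ratio:.3f}x ({label})")
```

A fixed band is the simplest choice; it cannot tell a real change from run-to-run variance, so flagged rows would still need a re-run to confirm.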

MI355

PR commit: ddd17d4 | Base: dev | 2026-03-17 18:30:52 CDT

Benchmark suite | Median speedup | Min speedup | Max speedup
benchmark_attention | 1.000x | 0.969x | 1.014x
benchmark_casting | 0.998x | 0.898x | 1.138x
benchmark_gemm | 1.001x | 0.935x | 1.094x
benchmark_gemm_fp8 | 0.988x | 0.401x | 1.038x
benchmark_grouped_gemm | 0.999x | 0.912x | 1.134x
benchmark_normalization | 0.993x | 0.399x | 1.490x
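
The summary rows above appear to pool each suite's forward and backward speedups and report their median, minimum, and maximum. A minimal sketch of that arithmetic, assuming speedup is PR throughput divided by base throughput; it uses only the four Llama3-8B/TP1 rows of benchmark_attention, so its median differs from the full-suite figure. The `summarize` helper is illustrative, not the script's actual function.

```python
from statistics import median

# (base, PR) throughput pairs copied from the four Llama3-8B/TP1 rows of
# benchmark_attention (seq_len 1024, 2048, 4096, 8192).
forward = [(298.06, 298.56), (531.54, 535.00), (681.26, 681.83), (943.78, 946.38)]
backward = [(246.01, 238.32), (273.78, 273.40), (297.55, 298.62), (392.95, 393.58)]


def summarize(pairs):
    """Median, min, and max of PR/base ratios over all given pairs."""
    ratios = [pr / base for base, pr in pairs]
    return {"median": median(ratios), "min": min(ratios), "max": max(ratios)}


# Pool forward and backward, as the suite-level summary appears to do.
summary = summarize(forward + backward)
print(", ".join(f"{name} {value:.3f}x" for name, value in summary.items()))
```

The minimum of this subset, about 0.969x, comes from the seq_len 1024 backward row and matches the suite-level minimum reported for benchmark_attention.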
benchmark_attention (median 1.000x, min 0.969x, max 1.014x)
Case | batch | seq_len | num_q_heads | num_kv_heads | head_dim | TE Forward Base | TE Forward PR | TE Forward Speedup | TE Backward Base | TE Backward PR | TE Backward Speedup
Llama3-8B/TP1 2 1024 32 8 128 298.06 298.56 1.002x 246.01 238.32 0.969x
Llama3-8B/TP1 2 2048 32 8 128 531.54 535.00 1.007x 273.78 273.40 0.999x
Llama3-8B/TP1 2 4096 32 8 128 681.26 681.83 1.001x 297.55 298.62 1.004x
Llama3-8B/TP1 2 8192 32 8 128 943.78 946.38 1.003x 392.95 393.58 1.002x
Llama3-8B/TP8 2 1024 4 1 128 38.23 37.85 0.990x 38.08 38.14 1.002x
Llama3-8B/TP8 2 2048 4 1 128 151.71 149.89 0.988x 146.00 146.67 1.005x
Llama3-8B/TP8 2 4096 4 1 128 421.15 417.61 0.992x 289.94 290.16 1.001x
Llama3-8B/TP8 2 8192 4 1 128 697.50 700.77 1.005x 312.90 313.96 1.003x
Llama3-70B/TP8 2 1024 8 1 128 75.79 75.91 1.002x 76.97 75.59 0.982x
Llama3-70B/TP8 2 2048 8 1 128 304.14 301.02 0.990x 270.62 274.53 1.014x
Llama3-70B/TP8 2 4096 8 1 128 615.32 616.78 1.002x 301.91 301.68 0.999x
Llama3-70B/TP8 2 8192 8 1 128 752.84 759.49 1.009x 351.40 353.16 1.005x
Llama3-405B/TP8 2 1024 16 1 128 153.61 152.41 0.992x 152.84 153.13 1.002x
Llama3-405B/TP8 2 2048 16 1 128 494.52 496.16 1.003x 277.21 277.06 0.999x
Llama3-405B/TP8 2 4096 16 1 128 667.38 670.21 1.004x 298.69 298.46 0.999x
Llama3-405B/TP8 2 8192 16 1 128 900.01 900.16 1.000x 382.68 382.27 0.999x
Qwen2.5-7B/TP1 2 1024 28 4 128 266.49 262.94 0.987x 221.11 222.98 1.008x
Qwen2.5-7B/TP1 2 2048 28 4 128 491.33 491.07 0.999x 247.61 247.99 1.002x
Qwen2.5-7B/TP1 2 4096 28 4 128 685.47 685.32 1.000x 302.15 302.30 1.000x
Qwen2.5-7B/TP1 2 8192 28 4 128 953.99 952.98 0.999x 395.01 394.68 0.999x
Qwen2.5-72B/TP8 2 1024 8 1 128 76.69 75.99 0.991x 76.62 75.80 0.989x
Qwen2.5-72B/TP8 2 2048 8 1 128 306.02 302.08 0.987x 271.20 274.05 1.011x
Qwen2.5-72B/TP8 2 4096 8 1 128 618.80 617.37 0.998x 301.97 302.10 1.000x
Qwen2.5-72B/TP8 2 8192 8 1 128 752.92 756.74 1.005x 353.63 353.38 0.999x
benchmark_casting (median 0.998x, min 0.898x, max 1.138x)
Case | M | hidden_size | dtype_str | Cast GB/s Base | Cast GB/s PR | Cast GB/s Speedup
Llama3-8B/BF16-to-FP8-E4M3 1024 4096 BF16-to-FP8-E4M3 764.70 756.30 0.989x
Llama3-8B/BF16-to-FP8-E4M3 2048 4096 BF16-to-FP8-E4M3 1224.70 1232.50 1.006x
Llama3-8B/BF16-to-FP8-E4M3 4096 4096 BF16-to-FP8-E4M3 2145.90 2178.90 1.015x
Llama3-8B/BF16-to-FP8-E4M3 8192 4096 BF16-to-FP8-E4M3 1738.20 1746.90 1.005x
Llama3-8B/FP8-E4M3-to-BF16 1024 4096 FP8-E4M3-to-BF16 1160.80 1092.60 0.941x
Llama3-8B/FP8-E4M3-to-BF16 2048 4096 FP8-E4M3-to-BF16 2545.40 2392.40 0.940x
Llama3-8B/FP8-E4M3-to-BF16 4096 4096 FP8-E4M3-to-BF16 4642.80 4641.40 1.000x
Llama3-8B/FP8-E4M3-to-BF16 8192 4096 FP8-E4M3-to-BF16 5711.60 5694.70 0.997x
Llama3-8B/BF16-to-FP8-E5M2 1024 4096 BF16-to-FP8-E5M2 764.10 763.20 0.999x
Llama3-8B/BF16-to-FP8-E5M2 2048 4096 BF16-to-FP8-E5M2 1221.40 1234.10 1.010x
Llama3-8B/BF16-to-FP8-E5M2 4096 4096 BF16-to-FP8-E5M2 2159.10 2178.30 1.009x
Llama3-8B/BF16-to-FP8-E5M2 8192 4096 BF16-to-FP8-E5M2 1742.10 1750.10 1.005x
Llama3-8B/FP8-E5M2-to-BF16 1024 4096 FP8-E5M2-to-BF16 1249.40 1122.30 0.898x
Llama3-8B/FP8-E5M2-to-BF16 2048 4096 FP8-E5M2-to-BF16 2542.20 2353.60 0.926x
Llama3-8B/FP8-E5M2-to-BF16 4096 4096 FP8-E5M2-to-BF16 4713.90 4636.80 0.984x
Llama3-8B/FP8-E5M2-to-BF16 8192 4096 FP8-E5M2-to-BF16 5684.70 5697.40 1.002x
Llama3-70B/BF16-to-FP8-E4M3 1024 8192 BF16-to-FP8-E4M3 1248.20 1253.10 1.004x
Llama3-70B/BF16-to-FP8-E4M3 2048 8192 BF16-to-FP8-E4M3 2254.30 2260.60 1.003x
Llama3-70B/BF16-to-FP8-E4M3 4096 8192 BF16-to-FP8-E4M3 1584.10 1802.80 1.138x
Llama3-70B/BF16-to-FP8-E4M3 8192 8192 BF16-to-FP8-E4M3 1903.50 1903.10 1.000x
Llama3-70B/FP8-E4M3-to-BF16 1024 8192 FP8-E4M3-to-BF16 2533.20 2439.10 0.963x
Llama3-70B/FP8-E4M3-to-BF16 2048 8192 FP8-E4M3-to-BF16 4719.90 4698.20 0.995x
Llama3-70B/FP8-E4M3-to-BF16 4096 8192 FP8-E4M3-to-BF16 5733.00 5705.20 0.995x
Llama3-70B/FP8-E4M3-to-BF16 8192 8192 FP8-E4M3-to-BF16 6595.90 6578.60 0.997x
Llama3-70B/BF16-to-FP8-E5M2 1024 8192 BF16-to-FP8-E5M2 1253.10 1261.70 1.007x
Llama3-70B/BF16-to-FP8-E5M2 2048 8192 BF16-to-FP8-E5M2 2257.00 2269.50 1.006x
Llama3-70B/BF16-to-FP8-E5M2 4096 8192 BF16-to-FP8-E5M2 1812.90 1801.70 0.994x
Llama3-70B/BF16-to-FP8-E5M2 8192 8192 BF16-to-FP8-E5M2 1902.80 1871.40 0.983x
Llama3-70B/FP8-E5M2-to-BF16 1024 8192 FP8-E5M2-to-BF16 2548.60 2453.50 0.963x
Llama3-70B/FP8-E5M2-to-BF16 2048 8192 FP8-E5M2-to-BF16 4702.40 4717.80 1.003x
Llama3-70B/FP8-E5M2-to-BF16 4096 8192 FP8-E5M2-to-BF16 5736.00 5727.50 0.999x
Llama3-70B/FP8-E5M2-to-BF16 8192 8192 FP8-E5M2-to-BF16 6451.10 6457.60 1.001x
Llama3-405B/BF16-to-FP8-E4M3 1024 16384 BF16-to-FP8-E4M3 2267.60 2261.30 0.997x
Llama3-405B/BF16-to-FP8-E4M3 2048 16384 BF16-to-FP8-E4M3 1674.70 1669.70 0.997x
Llama3-405B/BF16-to-FP8-E4M3 4096 16384 BF16-to-FP8-E4M3 1866.20 1867.10 1.000x
Llama3-405B/BF16-to-FP8-E4M3 8192 16384 BF16-to-FP8-E4M3 1080.60 1091.40 1.010x
Llama3-405B/FP8-E4M3-to-BF16 1024 16384 FP8-E4M3-to-BF16 4635.60 4611.80 0.995x
Llama3-405B/FP8-E4M3-to-BF16 2048 16384 FP8-E4M3-to-BF16 5696.40 5685.10 0.998x
Llama3-405B/FP8-E4M3-to-BF16 4096 16384 FP8-E4M3-to-BF16 6602.00 6505.70 0.985x
Llama3-405B/FP8-E4M3-to-BF16 8192 16384 FP8-E4M3-to-BF16 5371.20 5383.90 1.002x
Llama3-405B/BF16-to-FP8-E5M2 1024 16384 BF16-to-FP8-E5M2 2268.10 2269.90 1.001x
Llama3-405B/BF16-to-FP8-E5M2 2048 16384 BF16-to-FP8-E5M2 1673.80 1670.60 0.998x
Llama3-405B/BF16-to-FP8-E5M2 4096 16384 BF16-to-FP8-E5M2 1860.20 1863.20 1.002x
Llama3-405B/BF16-to-FP8-E5M2 8192 16384 BF16-to-FP8-E5M2 1076.80 1083.20 1.006x
Llama3-405B/FP8-E5M2-to-BF16 1024 16384 FP8-E5M2-to-BF16 4643.10 4630.50 0.997x
Llama3-405B/FP8-E5M2-to-BF16 2048 16384 FP8-E5M2-to-BF16 5709.10 5703.70 0.999x
Llama3-405B/FP8-E5M2-to-BF16 4096 16384 FP8-E5M2-to-BF16 6578.80 6595.10 1.002x
Llama3-405B/FP8-E5M2-to-BF16 8192 16384 FP8-E5M2-to-BF16 5379.10 5371.30 0.999x
Qwen2.5-7B/BF16-to-FP8-E4M3 1024 3584 BF16-to-FP8-E4M3 711.50 684.10 0.961x
Qwen2.5-7B/BF16-to-FP8-E4M3 2048 3584 BF16-to-FP8-E4M3 1187.40 1195.50 1.007x
Qwen2.5-7B/BF16-to-FP8-E4M3 4096 3584 BF16-to-FP8-E4M3 2155.40 2156.50 1.001x
Qwen2.5-7B/BF16-to-FP8-E4M3 8192 3584 BF16-to-FP8-E4M3 2538.90 2645.00 1.042x
Qwen2.5-7B/FP8-E4M3-to-BF16 1024 3584 FP8-E4M3-to-BF16 1109.70 1078.90 0.972x
Qwen2.5-7B/FP8-E4M3-to-BF16 2048 3584 FP8-E4M3-to-BF16 2205.50 2084.20 0.945x
Qwen2.5-7B/FP8-E4M3-to-BF16 4096 3584 FP8-E4M3-to-BF16 4411.40 4350.20 0.986x
Qwen2.5-7B/FP8-E4M3-to-BF16 8192 3584 FP8-E4M3-to-BF16 5542.90 5526.80 0.997x
Qwen2.5-7B/BF16-to-FP8-E5M2 1024 3584 BF16-to-FP8-E5M2 711.50 688.30 0.967x
Qwen2.5-7B/BF16-to-FP8-E5M2 2048 3584 BF16-to-FP8-E5M2 1195.20 1195.50 1.000x
Qwen2.5-7B/BF16-to-FP8-E5M2 4096 3584 BF16-to-FP8-E5M2 2151.90 2157.80 1.003x
Qwen2.5-7B/BF16-to-FP8-E5M2 8192 3584 BF16-to-FP8-E5M2 2540.10 2647.00 1.042x
Qwen2.5-7B/FP8-E5M2-to-BF16 1024 3584 FP8-E5M2-to-BF16 1048.60 1077.50 1.028x
Qwen2.5-7B/FP8-E5M2-to-BF16 2048 3584 FP8-E5M2-to-BF16 2245.20 2116.50 0.943x
Qwen2.5-7B/FP8-E5M2-to-BF16 4096 3584 FP8-E5M2-to-BF16 4392.40 4353.40 0.991x
Qwen2.5-7B/FP8-E5M2-to-BF16 8192 3584 FP8-E5M2-to-BF16 5548.20 5512.70 0.994x
Qwen2.5-72B/BF16-to-FP8-E4M3 1024 8192 BF16-to-FP8-E4M3 1251.50 1255.90 1.004x
Qwen2.5-72B/BF16-to-FP8-E4M3 2048 8192 BF16-to-FP8-E4M3 2256.70 2264.90 1.004x
Qwen2.5-72B/BF16-to-FP8-E4M3 4096 8192 BF16-to-FP8-E4M3 1811.40 1805.90 0.997x
Qwen2.5-72B/BF16-to-FP8-E4M3 8192 8192 BF16-to-FP8-E4M3 1902.90 1871.90 0.984x
Qwen2.5-72B/FP8-E4M3-to-BF16 1024 8192 FP8-E4M3-to-BF16 2540.70 2473.90 0.974x
Qwen2.5-72B/FP8-E4M3-to-BF16 2048 8192 FP8-E4M3-to-BF16 4730.20 4719.70 0.998x
Qwen2.5-72B/FP8-E4M3-to-BF16 4096 8192 FP8-E4M3-to-BF16 5728.30 5735.20 1.001x
Qwen2.5-72B/FP8-E4M3-to-BF16 8192 8192 FP8-E4M3-to-BF16 6587.30 6529.40 0.991x
Qwen2.5-72B/BF16-to-FP8-E5M2 1024 8192 BF16-to-FP8-E5M2 1252.70 1258.20 1.004x
Qwen2.5-72B/BF16-to-FP8-E5M2 2048 8192 BF16-to-FP8-E5M2 2258.40 2267.30 1.004x
Qwen2.5-72B/BF16-to-FP8-E5M2 4096 8192 BF16-to-FP8-E5M2 1813.80 1802.50 0.994x
Qwen2.5-72B/BF16-to-FP8-E5M2 8192 8192 BF16-to-FP8-E5M2 1902.40 1873.90 0.985x
Qwen2.5-72B/FP8-E5M2-to-BF16 1024 8192 FP8-E5M2-to-BF16 2557.20 2488.80 0.973x
Qwen2.5-72B/FP8-E5M2-to-BF16 2048 8192 FP8-E5M2-to-BF16 4714.20 4714.40 1.000x
Qwen2.5-72B/FP8-E5M2-to-BF16 4096 8192 FP8-E5M2-to-BF16 5743.10 5715.60 0.995x
Qwen2.5-72B/FP8-E5M2-to-BF16 8192 8192 FP8-E5M2-to-BF16 6603.50 6591.70 0.998x
benchmark_gemm (median 1.001x, min 0.935x, max 1.094x)
Case | M | N | K | dtype | TE Forward Base | TE Forward PR | TE Forward Speedup | TE Backward Base | TE Backward PR | TE Backward Speedup
Llama3-8B/TP1-QKV 1024 6144 4096 torch.bfloat16 1031.56 1044.11 1.012x 759.28 764.70 1.007x
Llama3-8B/TP1-AttnOut 1024 4096 4096 torch.bfloat16 806.21 817.18 1.014x 474.12 469.24 0.990x
Llama3-8B/TP1-GateUp 1024 28672 4096 torch.bfloat16 1235.92 1218.47 0.986x 1218.56 1232.06 1.011x
Llama3-8B/TP1-Down 1024 4096 14336 torch.bfloat16 1090.59 1084.23 0.994x 1097.33 1091.06 0.994x
Llama3-8B/TP1-QKV 2048 6144 4096 torch.bfloat16 1312.34 1303.96 0.994x 1054.29 1061.04 1.006x
Llama3-8B/TP1-AttnOut 2048 4096 4096 torch.bfloat16 722.97 675.64 0.935x 1035.42 1014.34 0.980x
Llama3-8B/TP1-GateUp 2048 28672 4096 torch.bfloat16 1296.36 1295.50 0.999x 1364.19 1364.80 1.000x
Llama3-8B/TP1-Down 2048 4096 14336 torch.bfloat16 1172.73 1243.84 1.061x 1263.38 1218.60 0.965x
Llama3-8B/TP1-QKV 4096 6144 4096 torch.bfloat16 1347.05 1329.77 0.987x 1243.39 1249.03 1.005x
Llama3-8B/TP1-AttnOut 4096 4096 4096 torch.bfloat16 1458.15 1458.33 1.000x 1377.25 1375.15 0.998x
Llama3-8B/TP1-GateUp 4096 28672 4096 torch.bfloat16 1552.05 1553.90 1.001x 1426.28 1406.21 0.986x
Llama3-8B/TP1-Down 4096 4096 14336 torch.bfloat16 1596.08 1588.77 0.995x 1349.14 1344.52 0.997x
Llama3-8B/TP1-QKV 8192 6144 4096 torch.bfloat16 1533.34 1537.14 1.002x 1261.16 1263.60 1.002x
Llama3-8B/TP1-AttnOut 8192 4096 4096 torch.bfloat16 1531.56 1522.12 0.994x 1440.07 1437.99 0.999x
Llama3-8B/TP1-GateUp 8192 28672 4096 torch.bfloat16 1531.03 1540.89 1.006x 1431.77 1441.42 1.007x
Llama3-8B/TP1-Down 8192 4096 14336 torch.bfloat16 1572.67 1566.68 0.996x 1380.70 1384.38 1.003x
Llama3-8B/TP8-QKV 1024 768 4096 torch.bfloat16 163.76 161.84 0.988x 85.91 88.00 1.024x
Llama3-8B/TP8-AttnOut 1024 4096 512 torch.bfloat16 106.91 107.50 1.006x 58.26 57.43 0.986x
Llama3-8B/TP8-GateUp 1024 3584 4096 torch.bfloat16 730.61 726.25 0.994x 400.66 408.95 1.021x
Llama3-8B/TP8-Down 1024 4096 1792 torch.bfloat16 375.78 373.99 0.995x 200.18 203.59 1.017x
Llama3-8B/TP8-QKV 2048 768 4096 torch.bfloat16 321.54 320.20 0.996x 174.12 174.79 1.004x
Llama3-8B/TP8-AttnOut 2048 4096 512 torch.bfloat16 216.19 213.79 0.989x 113.80 115.11 1.012x
Llama3-8B/TP8-GateUp 2048 3584 4096 torch.bfloat16 917.40 918.18 1.001x 973.01 990.20 1.018x
Llama3-8B/TP8-Down 2048 4096 1792 torch.bfloat16 757.40 745.23 0.984x 406.19 411.89 1.014x
Llama3-8B/TP8-QKV 4096 768 4096 torch.bfloat16 647.06 638.18 0.986x 343.95 347.77 1.011x
Llama3-8B/TP8-AttnOut 4096 4096 512 torch.bfloat16 432.88 439.18 1.015x 232.59 234.34 1.008x
Llama3-8B/TP8-GateUp 4096 3584 4096 torch.bfloat16 1249.15 1246.90 0.998x 1379.01 1378.43 1.000x
Llama3-8B/TP8-Down 4096 4096 1792 torch.bfloat16 1340.86 1351.01 1.008x 780.96 760.32 0.974x
Llama3-8B/TP8-QKV 8192 768 4096 torch.bfloat16 1011.14 1046.41 1.035x 753.62 740.72 0.983x
Llama3-8B/TP8-AttnOut 8192 4096 512 torch.bfloat16 864.57 861.49 0.996x 455.36 456.11 1.002x
Llama3-8B/TP8-GateUp 8192 3584 4096 torch.bfloat16 1331.39 1340.78 1.007x 1410.76 1401.87 0.994x
Llama3-8B/TP8-Down 8192 4096 1792 torch.bfloat16 1384.58 1401.17 1.012x 1118.15 1109.03 0.992x
Llama3-70B/TP8-QKV 1024 1280 8192 torch.bfloat16 490.69 497.37 1.014x 291.37 292.00 1.002x
Llama3-70B/TP8-AttnOut 1024 8192 1024 torch.bfloat16 434.30 432.02 0.995x 223.56 227.47 1.017x
Llama3-70B/TP8-GateUp 1024 7168 8192 torch.bfloat16 1071.37 1090.79 1.018x 1015.08 1024.66 1.009x
Llama3-70B/TP8-Down 1024 8192 3584 torch.bfloat16 1166.13 1168.12 1.002x 851.74 854.46 1.003x
Llama3-70B/TP8-QKV 2048 1280 8192 torch.bfloat16 754.44 762.55 1.011x 630.23 620.15 0.984x
Llama3-70B/TP8-AttnOut 2048 8192 1024 torch.bfloat16 885.39 893.65 1.009x 453.19 453.42 1.001x
Llama3-70B/TP8-GateUp 2048 7168 8192 torch.bfloat16 1405.45 1397.94 0.995x 1292.32 1292.95 1.000x
Llama3-70B/TP8-Down 2048 8192 3584 torch.bfloat16 1457.79 1453.66 0.997x 1163.03 1179.82 1.014x
Llama3-70B/TP8-QKV 4096 1280 8192 torch.bfloat16 1025.43 1005.08 0.980x 1022.05 1010.27 0.988x
Llama3-70B/TP8-AttnOut 4096 8192 1024 torch.bfloat16 1217.94 1220.35 1.002x 787.22 787.92 1.001x
Llama3-70B/TP8-GateUp 4096 7168 8192 torch.bfloat16 1430.36 1430.71 1.000x 1382.28 1389.21 1.005x
Llama3-70B/TP8-Down 4096 8192 3584 torch.bfloat16 1500.80 1505.96 1.003x 1381.77 1368.31 0.990x
Llama3-70B/TP8-QKV 8192 1280 8192 torch.bfloat16 1303.64 1305.61 1.002x 1041.07 1052.20 1.011x
Llama3-70B/TP8-AttnOut 8192 8192 1024 torch.bfloat16 1222.76 1216.39 0.995x 1032.74 1058.58 1.025x
Llama3-70B/TP8-GateUp 8192 7168 8192 torch.bfloat16 1409.23 1393.83 0.989x 1420.60 1409.07 0.992x
Llama3-70B/TP8-Down 8192 8192 3584 torch.bfloat16 1528.03 1538.20 1.007x 1404.21 1406.70 1.002x
Llama3-405B/TP8-QKV 1024 2304 16384 torch.bfloat16 904.21 898.17 0.993x 1060.39 1036.51 0.977x
Llama3-405B/TP8-AttnOut 1024 16384 2048 torch.bfloat16 1374.74 1355.05 0.986x 944.64 946.03 1.001x
Llama3-405B/TP8-GateUp 1024 13312 16384 torch.bfloat16 1247.84 1250.03 1.002x 1266.71 1309.21 1.034x
Llama3-405B/TP8-Down 1024 16384 6656 torch.bfloat16 1557.68 1556.10 0.999x 1104.87 1098.97 0.995x
Llama3-405B/TP8-QKV 2048 2304 16384 torch.bfloat16 942.92 939.66 0.997x 1158.83 1144.05 0.987x
Llama3-405B/TP8-AttnOut 2048 16384 2048 torch.bfloat16 1405.33 1393.94 0.992x 1232.41 1250.84 1.015x
Llama3-405B/TP8-GateUp 2048 13312 16384 torch.bfloat16 1249.54 1235.35 0.989x 1396.91 1373.03 0.983x
Llama3-405B/TP8-Down 2048 16384 6656 torch.bfloat16 1563.50 1572.38 1.006x 1266.70 1275.01 1.007x
Llama3-405B/TP8-QKV 4096 2304 16384 torch.bfloat16 1311.34 1309.76 0.999x 1178.44 1179.30 1.001x
Llama3-405B/TP8-AttnOut 4096 16384 2048 torch.bfloat16 1459.78 1444.47 0.990x 1383.79 1359.37 0.982x
Llama3-405B/TP8-GateUp 4096 13312 16384 torch.bfloat16 1253.45 1254.54 1.001x 1424.14 1430.10 1.004x
Llama3-405B/TP8-Down 4096 16384 6656 torch.bfloat16 1563.16 1553.43 0.994x 1269.96 1281.38 1.009x
Llama3-405B/TP8-QKV 8192 2304 16384 torch.bfloat16 1196.13 1192.27 0.997x 1197.02 1197.63 1.001x
Llama3-405B/TP8-AttnOut 8192 16384 2048 torch.bfloat16 1444.72 1462.17 1.012x 1435.09 1405.15 0.979x
Llama3-405B/TP8-GateUp 8192 13312 16384 torch.bfloat16 1305.87 1295.48 0.992x 1433.19 1434.02 1.001x
Llama3-405B/TP8-Down 8192 16384 6656 torch.bfloat16 1562.92 1548.92 0.991x 1329.06 1312.97 0.988x
Qwen2.5-7B/TP1-QKV 1024 4608 3584 torch.bfloat16 834.45 839.43 1.006x 454.28 461.59 1.016x
Qwen2.5-7B/TP1-AttnOut 1024 3584 3584 torch.bfloat16 658.91 650.06 0.987x 350.44 359.77 1.027x
Qwen2.5-7B/TP1-GateUp 1024 37888 3584 torch.bfloat16 1039.15 1057.32 1.017x 1105.16 1092.41 0.988x
Qwen2.5-7B/TP1-Down 1024 3584 18944 torch.bfloat16 1135.23 1151.80 1.015x 929.26 941.09 1.013x
Qwen2.5-7B/TP1-QKV 2048 4608 3584 torch.bfloat16 1123.78 1145.57 1.019x 948.76 1038.27 1.094x
Qwen2.5-7B/TP1-AttnOut 2048 3584 3584 torch.bfloat16 950.35 993.29 1.045x 778.31 772.29 0.992x
Qwen2.5-7B/TP1-GateUp 2048 37888 3584 torch.bfloat16 1207.31 1177.14 0.975x 1240.33 1205.00 0.972x
Qwen2.5-7B/TP1-Down 2048 3584 18944 torch.bfloat16 1317.45 1314.54 0.998x 1103.15 1116.11 1.012x
Qwen2.5-7B/TP1-QKV 4096 4608 3584 torch.bfloat16 1176.15 1168.29 0.993x 1331.64 1348.98 1.013x
Qwen2.5-7B/TP1-AttnOut 4096 3584 3584 torch.bfloat16 1288.68 1299.03 1.008x 1263.45 1263.49 1.000x
Qwen2.5-7B/TP1-GateUp 4096 37888 3584 torch.bfloat16 1229.85 1236.07 1.005x 1349.78 1346.28 0.997x
Qwen2.5-7B/TP1-Down 4096 3584 18944 torch.bfloat16 1397.97 1397.86 1.000x 1302.88 1297.94 0.996x
Qwen2.5-7B/TP1-QKV 8192 4608 3584 torch.bfloat16 1292.97 1334.65 1.032x 1318.98 1309.79 0.993x
Qwen2.5-7B/TP1-AttnOut 8192 3584 3584 torch.bfloat16 1324.98 1306.60 0.986x 1223.76 1234.15 1.008x
Qwen2.5-7B/TP1-GateUp 8192 37888 3584 torch.bfloat16 1423.73 1427.85 1.003x 1318.67 1314.88 0.997x
Qwen2.5-7B/TP1-Down 8192 3584 18944 torch.bfloat16 1397.80 1397.58 1.000x 1288.21 1289.39 1.001x
Qwen2.5-72B/TP8-QKV 1024 1280 8192 torch.bfloat16 493.81 486.35 0.985x 286.99 293.77 1.024x
Qwen2.5-72B/TP8-AttnOut 1024 8192 1024 torch.bfloat16 432.26 434.02 1.004x 223.05 229.12 1.027x
Qwen2.5-72B/TP8-GateUp 1024 7392 8192 torch.bfloat16 1074.15 1065.03 0.992x 1057.86 1059.16 1.001x
Qwen2.5-72B/TP8-Down 1024 8192 3696 torch.bfloat16 607.86 584.66 0.962x 761.73 784.62 1.030x
Qwen2.5-72B/TP8-QKV 2048 1280 8192 torch.bfloat16 762.10 741.45 0.973x 620.88 645.59 1.040x
Qwen2.5-72B/TP8-AttnOut 2048 8192 1024 torch.bfloat16 878.78 884.70 1.007x 454.10 465.87 1.026x
Qwen2.5-72B/TP8-GateUp 2048 7392 8192 torch.bfloat16 1336.88 1346.57 1.007x 1315.67 1324.50 1.007x
Qwen2.5-72B/TP8-Down 2048 8192 3696 torch.bfloat16 1301.85 1310.54 1.007x 1046.29 1038.17 0.992x
Qwen2.5-72B/TP8-QKV 4096 1280 8192 torch.bfloat16 1032.78 1009.95 0.978x 1017.51 1018.11 1.001x
Qwen2.5-72B/TP8-AttnOut 4096 8192 1024 torch.bfloat16 1213.82 1223.14 1.008x 793.37 791.32 0.997x
Qwen2.5-72B/TP8-GateUp 4096 7392 8192 torch.bfloat16 1390.24 1387.97 0.998x 1360.55 1358.70 0.999x
Qwen2.5-72B/TP8-Down 4096 8192 3696 torch.bfloat16 1345.45 1339.22 0.995x 1263.22 1261.03 0.998x
Qwen2.5-72B/TP8-QKV 8192 1280 8192 torch.bfloat16 1319.24 1309.42 0.993x 1045.93 1056.10 1.010x
Qwen2.5-72B/TP8-AttnOut 8192 8192 1024 torch.bfloat16 1240.01 1249.01 1.007x 1000.39 1012.30 1.012x
Qwen2.5-72B/TP8-GateUp 8192 7392 8192 torch.bfloat16 1294.76 1291.01 0.997x 1362.93 1368.38 1.004x
Qwen2.5-72B/TP8-Down 8192 8192 3696 torch.bfloat16 1373.41 1366.86 0.995x 1336.98 1333.71 0.998x
benchmark_gemm_fp8 (median 0.988x, min 0.401x, max 1.038x)
Case | M | N | K | dtype | FP8 Forward Base | FP8 Forward PR | FP8 Forward Speedup | FP8 Backward Base | FP8 Backward PR | FP8 Backward Speedup
Llama3-8B/TP1-QKV 1024 6144 4096 torch.bfloat16 416.89 407.31 0.977x 365.49 350.64 0.959x
Llama3-8B/TP1-AttnOut 1024 4096 4096 torch.bfloat16 286.54 273.39 0.954x 242.76 234.63 0.967x
Llama3-8B/TP1-GateUp 1024 28672 4096 torch.bfloat16 548.57 550.57 1.004x 1639.83 1702.61 1.038x
Llama3-8B/TP1-Down 1024 4096 14336 torch.bfloat16 446.49 445.96 0.999x 1540.76 1493.60 0.969x
Llama3-8B/TP1-QKV 2048 6144 4096 torch.bfloat16 846.58 809.82 0.957x 1432.76 680.41 0.475x
Llama3-8B/TP1-AttnOut 2048 4096 4096 torch.bfloat16 562.43 548.45 0.975x 1002.77 462.02 0.461x
Llama3-8B/TP1-GateUp 2048 28672 4096 torch.bfloat16 862.47 855.86 0.992x 1796.27 1792.72 0.998x
Llama3-8B/TP1-Down 2048 4096 14336 torch.bfloat16 713.95 715.87 1.003x 1802.56 1776.26 0.985x
Llama3-8B/TP1-QKV 4096 6144 4096 torch.bfloat16 1290.63 1271.35 0.985x 1840.90 1555.63 0.845x
Llama3-8B/TP1-AttnOut 4096 4096 4096 torch.bfloat16 1130.59 1089.03 0.963x 2051.26 905.90 0.442x
Llama3-8B/TP1-GateUp 4096 28672 4096 torch.bfloat16 1460.17 1467.62 1.005x 2154.53 2144.49 0.995x
Llama3-8B/TP1-Down 4096 4096 14336 torch.bfloat16 1014.90 1014.04 0.999x 2200.18 2194.11 0.997x
Llama3-8B/TP1-QKV 8192 6144 4096 torch.bfloat16 1771.48 1764.10 0.996x 1777.98 1780.02 1.001x
Llama3-8B/TP1-AttnOut 8192 4096 4096 torch.bfloat16 1516.59 1516.08 1.000x 1790.38 1780.72 0.995x
Llama3-8B/TP1-GateUp 8192 28672 4096 torch.bfloat16 1904.04 1901.36 0.999x 2185.83 2184.78 1.000x
Llama3-8B/TP1-Down 8192 4096 14336 torch.bfloat16 1287.22 1283.07 0.997x 2400.04 2391.69 0.997x
Llama3-8B/TP8-QKV 1024 768 4096 torch.bfloat16 51.30 49.23 0.960x 93.27 41.38 0.444x
Llama3-8B/TP8-AttnOut 1024 4096 512 torch.bfloat16 34.16 32.75 0.959x 61.68 27.38 0.444x
Llama3-8B/TP8-GateUp 1024 3584 4096 torch.bfloat16 236.53 227.33 0.961x 425.33 189.95 0.447x
Llama3-8B/TP8-Down 1024 4096 1792 torch.bfloat16 119.07 113.56 0.954x 213.74 94.65 0.443x
Llama3-8B/TP8-QKV 2048 768 4096 torch.bfloat16 102.02 97.55 0.956x 184.19 80.43 0.437x
Llama3-8B/TP8-AttnOut 2048 4096 512 torch.bfloat16 67.98 64.81 0.953x 109.52 53.81 0.491x
Llama3-8B/TP8-GateUp 2048 3584 4096 torch.bfloat16 472.76 454.97 0.962x 835.85 376.37 0.450x
Llama3-8B/TP8-Down 2048 4096 1792 torch.bfloat16 234.85 227.06 0.967x 421.16 186.50 0.443x
Llama3-8B/TP8-QKV 4096 768 4096 torch.bfloat16 201.66 194.86 0.966x 366.72 161.88 0.441x
Llama3-8B/TP8-AttnOut 4096 4096 512 torch.bfloat16 135.06 131.60 0.974x 241.29 107.82 0.447x
Llama3-8B/TP8-GateUp 4096 3584 4096 torch.bfloat16 943.94 900.94 0.954x 1673.36 746.75 0.446x
Llama3-8B/TP8-Down 4096 4096 1792 torch.bfloat16 475.53 456.38 0.960x 843.53 369.42 0.438x
Llama3-8B/TP8-QKV 8192 768 4096 torch.bfloat16 397.84 386.86 0.972x 717.45 318.65 0.444x
Llama3-8B/TP8-AttnOut 8192 4096 512 torch.bfloat16 266.86 259.11 0.971x 481.47 211.50 0.439x
Llama3-8B/TP8-GateUp 8192 3584 4096 torch.bfloat16 1287.71 1277.83 0.992x 1730.15 1763.56 1.019x
Llama3-8B/TP8-Down 8192 4096 1792 torch.bfloat16 934.26 899.90 0.963x 1184.55 735.82 0.621x
Llama3-70B/TP8-QKV 1024 1280 8192 torch.bfloat16 158.92 131.89 0.830x 282.81 141.65 0.501x
Llama3-70B/TP8-AttnOut 1024 8192 1024 torch.bfloat16 133.15 129.72 0.974x 232.85 105.77 0.454x
Llama3-70B/TP8-GateUp 1024 7168 8192 torch.bfloat16 406.32 407.55 1.003x 1273.88 1300.77 1.021x
Llama3-70B/TP8-Down 1024 8192 3584 torch.bfloat16 463.08 449.43 0.971x 817.41 362.30 0.443x
Llama3-70B/TP8-QKV 2048 1280 8192 torch.bfloat16 332.69 324.45 0.975x 590.52 262.93 0.445x
Llama3-70B/TP8-AttnOut 2048 8192 1024 torch.bfloat16 264.66 257.66 0.974x 470.04 206.70 0.440x
Llama3-70B/TP8-GateUp 2048 7168 8192 torch.bfloat16 766.83 777.71 1.014x 2069.73 2079.82 1.005x
Llama3-70B/TP8-Down 2048 8192 3584 torch.bfloat16 916.93 898.27 0.980x 1346.79 719.00 0.534x
Llama3-70B/TP8-QKV 4096 1280 8192 torch.bfloat16 626.26 626.45 1.000x 1185.91 518.39 0.437x
Llama3-70B/TP8-AttnOut 4096 8192 1024 torch.bfloat16 520.77 508.45 0.976x 924.05 411.15 0.445x
Llama3-70B/TP8-GateUp 4096 7168 8192 torch.bfloat16 1040.12 1047.30 1.007x 2391.48 2396.51 1.002x
Llama3-70B/TP8-Down 4096 8192 3584 torch.bfloat16 1614.24 1604.80 0.994x 1550.40 1487.51 0.959x
Llama3-70B/TP8-QKV 8192 1280 8192 torch.bfloat16 571.96 565.82 0.989x 1728.83 1740.09 1.007x
Llama3-70B/TP8-AttnOut 8192 8192 1024 torch.bfloat16 1044.13 1023.40 0.980x 690.52 690.23 1.000x
Llama3-70B/TP8-GateUp 8192 7168 8192 torch.bfloat16 1305.54 1309.23 1.003x 2321.50 2334.78 1.006x
Llama3-70B/TP8-Down 8192 8192 3584 torch.bfloat16 1995.64 2010.59 1.007x 1615.81 1619.20 1.002x
Llama3-405B/TP8-QKV 1024 2304 16384 torch.bfloat16 490.70 479.74 0.978x 891.25 470.88 0.528x
Llama3-405B/TP8-AttnOut 1024 16384 2048 torch.bfloat16 514.71 498.74 0.969x 848.12 391.64 0.462x
Llama3-405B/TP8-GateUp 1024 13312 16384 torch.bfloat16 579.98 577.80 0.996x 2366.75 2352.56 0.994x
Llama3-405B/TP8-Down 1024 16384 6656 torch.bfloat16 552.99 548.30 0.992x 1292.16 1300.64 1.007x
Llama3-405B/TP8-QKV 2048 2304 16384 torch.bfloat16 529.44 535.51 1.011x 1768.64 1536.04 0.868x
Llama3-405B/TP8-AttnOut 2048 16384 2048 torch.bfloat16 980.45 978.98 0.999x 965.67 743.78 0.770x
Llama3-405B/TP8-GateUp 2048 13312 16384 torch.bfloat16 901.21 899.02 0.998x 2624.82 2603.50 0.992x
Llama3-405B/TP8-Down 2048 16384 6656 torch.bfloat16 958.46 956.77 0.998x 1603.38 1592.63 0.993x
Llama3-405B/TP8-QKV 4096 2304 16384 torch.bfloat16 696.02 706.30 1.015x 2133.76 2121.95 0.994x
Llama3-405B/TP8-AttnOut 4096 16384 2048 torch.bfloat16 1396.06 1398.51 1.002x 1178.14 1162.43 0.987x
Llama3-405B/TP8-GateUp 4096 13312 16384 torch.bfloat16 1150.06 1151.27 1.001x 2754.81 2744.39 0.996x
Llama3-405B/TP8-Down 4096 16384 6656 torch.bfloat16 1445.70 1444.09 0.999x 1772.21 1764.32 0.996x
Llama3-405B/TP8-QKV 8192 2304 16384 torch.bfloat16 875.19 878.05 1.003x 2450.26 2451.96 1.001x
Llama3-405B/TP8-AttnOut 8192 16384 2048 torch.bfloat16 1895.28 1878.29 0.991x 1463.08 1464.39 1.001x
Llama3-405B/TP8-GateUp 8192 13312 16384 torch.bfloat16 1642.99 1643.30 1.000x 2818.38 2875.81 1.020x
Llama3-405B/TP8-Down 8192 16384 6656 torch.bfloat16 1781.32 1776.56 0.997x 1955.93 1950.84 0.997x
Qwen2.5-7B/TP1-QKV 1024 4608 3584 torch.bfloat16 245.20 239.18 0.975x 432.45 191.51 0.443x
Qwen2.5-7B/TP1-AttnOut 1024 3584 3584 torch.bfloat16 189.76 184.46 0.972x 334.31 148.58 0.444x
Qwen2.5-7B/TP1-GateUp 1024 37888 3584 torch.bfloat16 511.07 509.79 0.997x 1260.04 1254.61 0.996x
Qwen2.5-7B/TP1-Down 1024 3584 18944 torch.bfloat16 420.57 419.81 0.998x 1358.67 1353.00 0.996x
Qwen2.5-7B/TP1-QKV 2048 4608 3584 torch.bfloat16 486.59 472.44 0.971x 858.86 377.57 0.440x
Qwen2.5-7B/TP1-AttnOut 2048 3584 3584 torch.bfloat16 378.81 365.92 0.966x 668.03 291.22 0.436x
Qwen2.5-7B/TP1-GateUp 2048 37888 3584 torch.bfloat16 791.49 790.37 0.999x 1549.38 1545.83 0.998x
Qwen2.5-7B/TP1-Down 2048 3584 18944 torch.bfloat16 611.19 601.36 0.984x 1675.32 1698.05 1.014x
Qwen2.5-7B/TP1-QKV 4096 4608 3584 torch.bfloat16 974.46 935.89 0.960x 1692.07 728.21 0.430x
Qwen2.5-7B/TP1-AttnOut 4096 3584 3584 torch.bfloat16 713.30 727.26 1.020x 1313.12 563.62 0.429x
Qwen2.5-7B/TP1-GateUp 4096 37888 3584 torch.bfloat16 1159.85 1160.52 1.001x 1717.77 1715.36 0.999x
Qwen2.5-7B/TP1-Down 4096 3584 18944 torch.bfloat16 906.83 903.52 0.996x 1976.89 1979.59 1.001x
Qwen2.5-7B/TP1-QKV 8192 4608 3584 torch.bfloat16 1374.94 1364.16 0.992x 1517.27 1528.44 1.007x
Qwen2.5-7B/TP1-AttnOut 8192 3584 3584 torch.bfloat16 1264.20 1249.55 0.988x 1626.07 1194.42 0.735x
Qwen2.5-7B/TP1-GateUp 8192 37888 3584 torch.bfloat16 1896.52 1900.24 1.002x 1792.03 1776.86 0.992x
Qwen2.5-7B/TP1-Down 8192 3584 18944 torch.bfloat16 1112.18 1110.88 0.999x 2114.34 2109.42 0.998x
Qwen2.5-72B/TP8-QKV 1024 1280 8192 torch.bfloat16 146.27 142.46 0.974x 266.87 113.15 0.424x
Qwen2.5-72B/TP8-AttnOut 1024 8192 1024 torch.bfloat16 122.70 118.23 0.964x 214.73 91.38 0.426x
Qwen2.5-72B/TP8-GateUp 1024 7392 8192 torch.bfloat16 385.64 382.08 0.991x 1041.60 1047.27 1.005x
Qwen2.5-72B/TP8-Down 1024 8192 3696 torch.bfloat16 401.29 400.32 0.998x 840.16 336.64 0.401x
Qwen2.5-72B/TP8-QKV 2048 1280 8192 torch.bfloat16 307.29 293.19 0.954x 531.19 228.12 0.429x
Qwen2.5-72B/TP8-AttnOut 2048 8192 1024 torch.bfloat16 245.11 233.32 0.952x 421.54 181.18 0.430x
Qwen2.5-72B/TP8-GateUp 2048 7392 8192 torch.bfloat16 735.21 735.24 1.000x 1758.15 1741.20 0.990x
Qwen2.5-72B/TP8-Down 2048 8192 3696 torch.bfloat16 662.87 662.19 0.999x 1068.35 729.50 0.683x
Qwen2.5-72B/TP8-QKV 4096 1280 8192 torch.bfloat16 596.74 577.85 0.968x 1057.04 447.00 0.423x
Qwen2.5-72B/TP8-AttnOut 4096 8192 1024 torch.bfloat16 483.70 465.44 0.962x 847.84 358.43 0.423x
Qwen2.5-72B/TP8-GateUp 4096 7392 8192 torch.bfloat16 983.66 990.29 1.007x 1957.53 1957.75 1.000x
Qwen2.5-72B/TP8-Down 4096 8192 3696 torch.bfloat16 962.46 963.35 1.001x 1324.31 1329.60 1.004x
Qwen2.5-72B/TP8-QKV 8192 1280 8192 torch.bfloat16 560.82 554.91 0.989x 1485.29 1465.89 0.987x
Qwen2.5-72B/TP8-AttnOut 8192 8192 1024 torch.bfloat16 935.32 924.77 0.989x 698.43 699.77 1.002x
Qwen2.5-72B/TP8-GateUp 8192 7392 8192 torch.bfloat16 1215.17 1220.05 1.004x 2158.54 2146.23 0.994x
Qwen2.5-72B/TP8-Down 8192 8192 3696 torch.bfloat16 1246.18 1242.25 0.997x 1452.22 1446.39 0.996x
benchmark_grouped_gemm (median 0.999x, min 0.912x, max 1.134x)

| Case | B | M | N | K | dtype | TE (CK_Tile) Forward Base | TE (CK_Tile) Forward PR | TE (CK_Tile) Forward Speedup | TE (CK_Tile) Backward Base | TE (CK_Tile) Backward PR | TE (CK_Tile) Backward Speedup |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DSV2-Lite-GateUP | 2 | 512 | 2816 | 2048 | torch.bfloat16 | 241.31 | 241.50 | 1.001x | 165.25 | 174.21 | 1.054x |
| DSV2-Lite-Down | 2 | 512 | 2048 | 1408 | torch.bfloat16 | 157.97 | 157.66 | 0.998x | 164.25 | 163.75 | 0.997x |
| DSV2-Lite-GateUP | 2 | 1024 | 2816 | 2048 | torch.bfloat16 | 470.04 | 472.15 | 1.004x | 317.46 | 300.52 | 0.947x |
| DSV2-Lite-Down | 2 | 1024 | 2048 | 1408 | torch.bfloat16 | 311.59 | 309.83 | 0.994x | 293.90 | 294.03 | 1.000x |
| DSV2-Lite-GateUP | 2 | 2048 | 2816 | 2048 | torch.bfloat16 | 836.18 | 816.91 | 0.977x | 542.45 | 535.42 | 0.987x |
| DSV2-Lite-Down | 2 | 2048 | 2048 | 1408 | torch.bfloat16 | 588.03 | 593.84 | 1.010x | 487.69 | 488.22 | 1.001x |
| DSV2-Lite-GateUP | 2 | 4096 | 2816 | 2048 | torch.bfloat16 | 862.06 | 854.77 | 0.992x | 821.46 | 829.75 | 1.010x |
| DSV2-Lite-Down | 2 | 4096 | 2048 | 1408 | torch.bfloat16 | 804.83 | 913.05 | 1.134x | 535.44 | 535.77 | 1.001x |
| DSV2-Lite-GateUP | 4 | 512 | 2816 | 2048 | torch.bfloat16 | 464.77 | 466.73 | 1.004x | 298.10 | 297.98 | 1.000x |
| DSV2-Lite-Down | 4 | 512 | 2048 | 1408 | torch.bfloat16 | 303.65 | 303.31 | 0.999x | 271.82 | 247.78 | 0.912x |
| DSV2-Lite-GateUP | 4 | 1024 | 2816 | 2048 | torch.bfloat16 | 802.06 | 800.88 | 0.999x | 509.17 | 503.26 | 0.988x |
| DSV2-Lite-Down | 4 | 1024 | 2048 | 1408 | torch.bfloat16 | 584.50 | 588.16 | 1.006x | 457.76 | 459.09 | 1.003x |
| DSV2-Lite-GateUP | 4 | 2048 | 2816 | 2048 | torch.bfloat16 | 841.69 | 837.09 | 0.995x | 758.58 | 760.67 | 1.003x |
| DSV2-Lite-Down | 4 | 2048 | 2048 | 1408 | torch.bfloat16 | 915.53 | 915.39 | 1.000x | 506.16 | 503.30 | 0.994x |
| DSV2-Lite-GateUP | 4 | 4096 | 2816 | 2048 | torch.bfloat16 | 1035.61 | 1032.06 | 0.997x | 814.15 | 814.28 | 1.000x |
| DSV2-Lite-Down | 4 | 4096 | 2048 | 1408 | torch.bfloat16 | 976.74 | 968.55 | 0.992x | 617.04 | 611.33 | 0.991x |
| DSV2-Lite-GateUP | 8 | 512 | 2816 | 2048 | torch.bfloat16 | 811.88 | 803.25 | 0.989x | 497.07 | 496.35 | 0.999x |
| DSV2-Lite-Down | 8 | 512 | 2048 | 1408 | torch.bfloat16 | 562.85 | 551.32 | 0.980x | 437.47 | 435.13 | 0.995x |
| DSV2-Lite-GateUP | 8 | 1024 | 2816 | 2048 | torch.bfloat16 | 811.96 | 819.62 | 1.009x | 765.54 | 779.72 | 1.019x |
| DSV2-Lite-Down | 8 | 1024 | 2048 | 1408 | torch.bfloat16 | 871.38 | 875.18 | 1.004x | 487.58 | 512.78 | 1.052x |
| DSV2-Lite-GateUP | 8 | 2048 | 2816 | 2048 | torch.bfloat16 | 1009.45 | 1011.40 | 1.002x | 840.10 | 847.07 | 1.008x |
| DSV2-Lite-Down | 8 | 2048 | 2048 | 1408 | torch.bfloat16 | 936.19 | 934.18 | 0.998x | 623.49 | 625.78 | 1.004x |
| DSV2-Lite-GateUP | 8 | 4096 | 2816 | 2048 | torch.bfloat16 | 1018.90 | 1015.76 | 0.997x | 874.56 | 873.36 | 0.999x |
| DSV2-Lite-Down | 8 | 4096 | 2048 | 1408 | torch.bfloat16 | 994.19 | 994.77 | 1.001x | 687.93 | 692.68 | 1.007x |
| DSV2-GateUP | 5 | 512 | 3072 | 5120 | torch.bfloat16 | 758.10 | 748.07 | 0.987x | 651.82 | 649.44 | 0.996x |
| DSV2-Down | 5 | 512 | 5120 | 1536 | torch.bfloat16 | 805.76 | 782.73 | 0.971x | 309.70 | 309.18 | 0.998x |
| DSV2-GateUP | 5 | 1024 | 3072 | 5120 | torch.bfloat16 | 1055.65 | 1115.60 | 1.057x | 726.08 | 738.29 | 1.017x |
| DSV2-Down | 5 | 1024 | 5120 | 1536 | torch.bfloat16 | 860.51 | 840.11 | 0.976x | 530.47 | 526.90 | 0.993x |
| DSV2-GateUP | 5 | 2048 | 3072 | 5120 | torch.bfloat16 | 1117.45 | 1107.97 | 0.992x | 791.38 | 788.22 | 0.996x |
| DSV2-Down | 5 | 2048 | 5120 | 1536 | torch.bfloat16 | 862.90 | 864.92 | 1.002x | 801.96 | 794.04 | 0.990x |
| DSV2-GateUP | 5 | 4096 | 3072 | 5120 | torch.bfloat16 | 1150.31 | 1146.38 | 0.997x | 893.54 | 895.39 | 1.002x |
| DSV2-Down | 5 | 4096 | 5120 | 1536 | torch.bfloat16 | 975.09 | 960.36 | 0.985x | 833.25 | 830.31 | 0.996x |
| DSV2-GateUP | 10 | 512 | 3072 | 5120 | torch.bfloat16 | 1005.55 | 983.90 | 0.978x | 643.24 | 639.69 | 0.994x |
| DSV2-Down | 10 | 512 | 5120 | 1536 | torch.bfloat16 | 833.59 | 807.19 | 0.968x | 491.60 | 485.48 | 0.988x |
| DSV2-GateUP | 10 | 1024 | 3072 | 5120 | torch.bfloat16 | 1055.26 | 1053.50 | 0.998x | 751.88 | 751.79 | 1.000x |
| DSV2-Down | 10 | 1024 | 5120 | 1536 | torch.bfloat16 | 831.97 | 826.24 | 0.993x | 760.98 | 759.32 | 0.998x |
| DSV2-GateUP | 10 | 2048 | 3072 | 5120 | torch.bfloat16 | 1117.79 | 1101.13 | 0.985x | 853.62 | 861.91 | 1.010x |
| DSV2-Down | 10 | 2048 | 5120 | 1536 | torch.bfloat16 | 923.67 | 929.06 | 1.006x | 814.38 | 820.78 | 1.008x |
| DSV2-GateUP | 10 | 4096 | 3072 | 5120 | torch.bfloat16 | 1144.04 | 1135.59 | 0.993x | 909.52 | 908.71 | 0.999x |
| DSV2-Down | 10 | 4096 | 5120 | 1536 | torch.bfloat16 | 987.12 | 984.72 | 0.998x | 870.50 | 872.74 | 1.003x |
| DSV2-GateUP | 20 | 512 | 3072 | 5120 | torch.bfloat16 | 969.08 | 977.88 | 1.009x | 632.85 | 635.43 | 1.004x |
| DSV2-Down | 20 | 512 | 5120 | 1536 | torch.bfloat16 | 770.09 | 762.78 | 0.991x | 629.63 | 629.36 | 1.000x |
| DSV2-GateUP | 20 | 1024 | 3072 | 5120 | torch.bfloat16 | 1033.21 | 1038.29 | 1.005x | 794.20 | 791.82 | 0.997x |
| DSV2-Down | 20 | 1024 | 5120 | 1536 | torch.bfloat16 | 876.91 | 872.56 | 0.995x | 750.70 | 742.67 | 0.989x |
| DSV2-GateUP | 20 | 2048 | 3072 | 5120 | torch.bfloat16 | 1087.90 | 1080.25 | 0.993x | 870.51 | 870.66 | 1.000x |
| DSV2-Down | 20 | 2048 | 5120 | 1536 | torch.bfloat16 | 953.93 | 952.25 | 0.998x | 818.09 | 823.26 | 1.006x |
| DSV2-GateUP | 20 | 4096 | 3072 | 5120 | torch.bfloat16 | 1157.31 | 1155.49 | 0.998x | 891.78 | 898.10 | 1.007x |
| DSV2-Down | 20 | 4096 | 5120 | 1536 | torch.bfloat16 | 989.75 | 991.38 | 1.002x | 858.18 | 860.10 | 1.002x |
| DSV3-GateUP | 8 | 512 | 4096 | 7168 | torch.bfloat16 | 1050.53 | 1059.97 | 1.009x | 687.06 | 685.03 | 0.997x |
| DSV3-Down | 8 | 512 | 7168 | 2048 | torch.bfloat16 | 909.32 | 901.57 | 0.991x | 517.39 | 519.84 | 1.005x |
| DSV3-GateUP | 8 | 1024 | 4096 | 7168 | torch.bfloat16 | 1133.68 | 1134.79 | 1.001x | 812.56 | 811.88 | 0.999x |
| DSV3-Down | 8 | 1024 | 7168 | 2048 | torch.bfloat16 | 963.17 | 971.34 | 1.008x | 820.30 | 827.47 | 1.009x |
| DSV3-GateUP | 8 | 2048 | 4096 | 7168 | torch.bfloat16 | 1181.85 | 1179.63 | 0.998x | 926.65 | 921.78 | 0.995x |
| DSV3-Down | 8 | 2048 | 7168 | 2048 | torch.bfloat16 | 1053.68 | 1049.93 | 0.996x | 885.84 | 879.64 | 0.993x |
| DSV3-GateUP | 8 | 4096 | 4096 | 7168 | torch.bfloat16 | 1210.41 | 1201.53 | 0.993x | 960.94 | 962.05 | 1.001x |
| DSV3-Down | 8 | 4096 | 7168 | 2048 | torch.bfloat16 | 1083.47 | 1085.41 | 1.002x | 934.58 | 934.96 | 1.000x |
| DSV3-GateUP | 16 | 512 | 4096 | 7168 | torch.bfloat16 | 1040.65 | 1030.43 | 0.990x | 668.79 | 669.72 | 1.001x |
| DSV3-Down | 16 | 512 | 7168 | 2048 | torch.bfloat16 | 901.13 | 900.01 | 0.999x | 697.83 | 695.37 | 0.996x |
| DSV3-GateUP | 16 | 1024 | 4096 | 7168 | torch.bfloat16 | 1127.35 | 1119.62 | 0.993x | 835.60 | 834.86 | 0.999x |
| DSV3-Down | 16 | 1024 | 7168 | 2048 | torch.bfloat16 | 1005.05 | 1003.60 | 0.999x | 813.51 | 806.73 | 0.992x |
| DSV3-GateUP | 16 | 2048 | 4096 | 7168 | torch.bfloat16 | 1165.20 | 1166.80 | 1.001x | 919.75 | 919.29 | 0.999x |
| DSV3-Down | 16 | 2048 | 7168 | 2048 | torch.bfloat16 | 1053.64 | 1051.81 | 0.998x | 886.32 | 884.69 | 0.998x |
| DSV3-GateUP | 16 | 4096 | 4096 | 7168 | torch.bfloat16 | 1207.24 | 1209.12 | 1.002x | 940.29 | 946.10 | 1.006x |
| DSV3-Down | 16 | 4096 | 7168 | 2048 | torch.bfloat16 | 1078.18 | 1077.07 | 0.999x | 909.47 | 914.80 | 1.006x |
| DSV3-GateUP | 32 | 512 | 4096 | 7168 | torch.bfloat16 | 1016.13 | 1017.20 | 1.001x | 664.62 | 661.71 | 0.996x |
| DSV3-Down | 32 | 512 | 7168 | 2048 | torch.bfloat16 | 927.19 | 906.34 | 0.978x | 665.70 | 671.98 | 1.009x |
| DSV3-GateUP | 32 | 1024 | 4096 | 7168 | torch.bfloat16 | 1106.20 | 1102.44 | 0.997x | 809.68 | 809.12 | 0.999x |
| DSV3-Down | 32 | 1024 | 7168 | 2048 | torch.bfloat16 | 976.05 | 981.99 | 1.006x | 793.37 | 795.26 | 1.002x |
| DSV3-GateUP | 32 | 2048 | 4096 | 7168 | torch.bfloat16 | 1157.14 | 1155.39 | 0.998x | 889.82 | 884.26 | 0.994x |
| DSV3-Down | 32 | 2048 | 7168 | 2048 | torch.bfloat16 | 1018.36 | 1024.81 | 1.006x | 873.57 | 874.36 | 1.001x |
| DSV3-GateUP | 32 | 4096 | 4096 | 7168 | torch.bfloat16 | 1181.02 | 1182.44 | 1.001x | 908.43 | 907.72 | 0.999x |
| DSV3-Down | 32 | 4096 | 7168 | 2048 | torch.bfloat16 | 1049.28 | 1047.24 | 0.998x | 903.45 | 881.54 | 0.976x |
| Grok-V2-GateUP | 1 | 512 | 32768 | 8192 | torch.bfloat16 | 1016.40 | 1012.02 | 0.996x | 885.37 | 898.65 | 1.015x |
| Grok-V2-Down | 1 | 512 | 8192 | 16384 | torch.bfloat16 | 770.72 | 764.61 | 0.992x | 910.54 | 933.63 | 1.025x |
| Grok-V2-GateUP | 1 | 1024 | 32768 | 8192 | torch.bfloat16 | 1400.59 | 1427.92 | 1.020x | 1220.00 | 1232.17 | 1.010x |
| Grok-V2-Down | 1 | 1024 | 8192 | 16384 | torch.bfloat16 | 1132.22 | 1171.32 | 1.035x | 1207.33 | 1225.05 | 1.015x |
| Grok-V2-GateUP | 1 | 2048 | 32768 | 8192 | torch.bfloat16 | 1446.72 | 1448.37 | 1.001x | 1374.03 | 1371.83 | 0.998x |
| Grok-V2-Down | 1 | 2048 | 8192 | 16384 | torch.bfloat16 | 1485.19 | 1483.65 | 0.999x | 1338.63 | 1354.75 | 1.012x |
| Grok-V2-GateUP | 1 | 4096 | 32768 | 8192 | torch.bfloat16 | 1460.68 | 1475.09 | 1.010x | 1415.70 | 1414.40 | 0.999x |
| Grok-V2-Down | 1 | 4096 | 8192 | 16384 | torch.bfloat16 | 1501.04 | 1499.14 | 0.999x | 1401.87 | 1399.71 | 0.998x |

benchmark_normalization (median 0.993x, min 0.399x, max 1.490x)

| Case | M | hidden_size | dtype | TE Forward GB/s Base | TE Forward GB/s PR | TE Forward GB/s Speedup | TE Backward GB/s Base | TE Backward GB/s PR | TE Backward GB/s Speedup |
|---|---|---|---|---|---|---|---|---|---|
| Llama3-8B/RMSNorm | 1024 | 4096 | torch.bfloat16 | 638.60 | 624.90 | 0.979x | 485.60 | 723.50 | 1.490x |
| Llama3-8B/RMSNorm | 2048 | 4096 | torch.bfloat16 | 1305.50 | 1272.80 | 0.975x | 1425.70 | 1455.90 | 1.021x |
| Llama3-8B/RMSNorm | 4096 | 4096 | torch.bfloat16 | 2599.10 | 2535.90 | 0.976x | 2952.10 | 2924.70 | 0.991x |
| Llama3-8B/RMSNorm | 8192 | 4096 | torch.bfloat16 | 5199.00 | 5077.80 | 0.977x | 5496.30 | 5679.60 | 1.033x |
| Llama3-8B/LayerNorm | 1024 | 4096 | torch.bfloat16 | 552.20 | 555.60 | 1.006x | 624.50 | 633.10 | 1.014x |
| Llama3-8B/LayerNorm | 2048 | 4096 | torch.bfloat16 | 1075.00 | 1110.80 | 1.033x | 1271.70 | 1262.10 | 0.992x |
| Llama3-8B/LayerNorm | 4096 | 4096 | torch.bfloat16 | 2223.10 | 2195.00 | 0.987x | 2508.00 | 2549.20 | 1.016x |
| Llama3-8B/LayerNorm | 8192 | 4096 | torch.bfloat16 | 4425.10 | 4423.10 | 1.000x | 5060.20 | 5065.30 | 1.001x |
| Llama3-70B/RMSNorm | 1024 | 8192 | torch.bfloat16 | 1307.50 | 1294.10 | 0.990x | 1448.60 | 1450.50 | 1.001x |
| Llama3-70B/RMSNorm | 2048 | 8192 | torch.bfloat16 | 2637.30 | 2578.10 | 0.978x | 2786.50 | 2957.50 | 1.061x |
| Llama3-70B/RMSNorm | 4096 | 8192 | torch.bfloat16 | 4389.30 | 4381.20 | 0.998x | 4916.80 | 4935.30 | 1.004x |
| Llama3-70B/RMSNorm | 8192 | 8192 | torch.bfloat16 | 4993.60 | 5036.40 | 1.009x | 5392.30 | 5367.10 | 0.995x |
| Llama3-70B/LayerNorm | 1024 | 8192 | torch.bfloat16 | 1100.70 | 1124.10 | 1.021x | 730.40 | 730.30 | 1.000x |
| Llama3-70B/LayerNorm | 2048 | 8192 | torch.bfloat16 | 2210.50 | 2195.90 | 0.993x | 747.90 | 739.50 | 0.989x |
| Llama3-70B/LayerNorm | 4096 | 8192 | torch.bfloat16 | 4224.90 | 4258.50 | 1.008x | 644.60 | 609.30 | 0.945x |
| Llama3-70B/LayerNorm | 8192 | 8192 | torch.bfloat16 | 4865.30 | 4836.00 | 0.994x | 579.90 | 588.60 | 1.015x |
| Llama3-405B/RMSNorm | 1024 | 16384 | torch.bfloat16 | 660.50 | 651.70 | 0.987x | 556.20 | 547.20 | 0.984x |
| Llama3-405B/RMSNorm | 2048 | 16384 | torch.bfloat16 | 705.90 | 707.90 | 1.003x | 541.70 | 541.10 | 0.999x |
| Llama3-405B/RMSNorm | 4096 | 16384 | torch.bfloat16 | 725.70 | 736.10 | 1.014x | 455.20 | 452.10 | 0.993x |
| Llama3-405B/RMSNorm | 8192 | 16384 | torch.bfloat16 | 581.40 | 581.80 | 1.001x | 487.00 | 489.20 | 1.005x |
| Llama3-405B/LayerNorm | 1024 | 16384 | torch.bfloat16 | 675.60 | 691.60 | 1.024x | 643.00 | 623.30 | 0.969x |
| Llama3-405B/LayerNorm | 2048 | 16384 | torch.bfloat16 | 690.30 | 690.80 | 1.001x | 606.70 | 561.50 | 0.925x |
| Llama3-405B/LayerNorm | 4096 | 16384 | torch.bfloat16 | 702.70 | 696.50 | 0.991x | 512.50 | 518.50 | 1.012x |
| Llama3-405B/LayerNorm | 8192 | 16384 | torch.bfloat16 | 562.60 | 563.80 | 1.002x | 576.20 | 563.10 | 0.977x |
| Qwen2.5-7B/RMSNorm | 1024 | 3584 | torch.bfloat16 | 493.10 | 549.00 | 1.113x | 398.80 | 253.20 | 0.635x |
| Qwen2.5-7B/RMSNorm | 2048 | 3584 | torch.bfloat16 | 1142.60 | 1104.60 | 0.967x | 689.00 | 499.00 | 0.724x |
| Qwen2.5-7B/RMSNorm | 4096 | 3584 | torch.bfloat16 | 2252.90 | 2214.20 | 0.983x | 1127.90 | 1013.50 | 0.899x |
| Qwen2.5-7B/RMSNorm | 8192 | 3584 | torch.bfloat16 | 3070.90 | 3066.90 | 0.999x | 1819.80 | 1663.20 | 0.914x |
| Qwen2.5-7B/LayerNorm | 1024 | 3584 | torch.bfloat16 | 481.50 | 479.50 | 0.996x | 348.70 | 218.50 | 0.627x |
| Qwen2.5-7B/LayerNorm | 2048 | 3584 | torch.bfloat16 | 960.40 | 943.90 | 0.983x | 639.10 | 439.40 | 0.688x |
| Qwen2.5-7B/LayerNorm | 4096 | 3584 | torch.bfloat16 | 1900.00 | 1869.40 | 0.984x | 1050.80 | 868.40 | 0.826x |
| Qwen2.5-7B/LayerNorm | 8192 | 3584 | torch.bfloat16 | 2688.60 | 2703.50 | 1.006x | 1581.20 | 1549.50 | 0.980x |
| Qwen2.5-72B/RMSNorm | 1024 | 8192 | torch.bfloat16 | 1284.00 | 1248.70 | 0.973x | 1442.30 | 576.10 | 0.399x |
| Qwen2.5-72B/RMSNorm | 2048 | 8192 | torch.bfloat16 | 2618.00 | 2506.80 | 0.958x | 2917.20 | 1206.50 | 0.414x |
| Qwen2.5-72B/RMSNorm | 4096 | 8192 | torch.bfloat16 | 4393.80 | 4379.10 | 0.997x | 4933.10 | 2493.20 | 0.505x |
| Qwen2.5-72B/RMSNorm | 8192 | 8192 | torch.bfloat16 | 5035.00 | 4997.90 | 0.993x | 5387.60 | 5318.40 | 0.987x |
| Qwen2.5-72B/LayerNorm | 1024 | 8192 | torch.bfloat16 | 1108.30 | 1090.60 | 0.984x | 745.10 | 486.50 | 0.653x |
| Qwen2.5-72B/LayerNorm | 2048 | 8192 | torch.bfloat16 | 2238.60 | 2231.80 | 0.997x | 750.00 | 744.20 | 0.992x |
| Qwen2.5-72B/LayerNorm | 4096 | 8192 | torch.bfloat16 | 4254.00 | 4264.80 | 1.003x | 647.00 | 612.20 | 0.946x |
| Qwen2.5-72B/LayerNorm | 8192 | 8192 | torch.bfloat16 | 4741.10 | 4861.60 | 1.025x | 581.40 | 589.60 | 1.014x |
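
The Speedup columns above are consistent with the PR value divided by the base value (for example, 624.90 / 638.60 ≈ 0.979), and each benchmark heading reports the median, minimum, and maximum of those ratios. A minimal sketch of that arithmetic, assuming higher values are better; the function names and the 0.95 threshold are hypothetical and not taken from compare_results.py:

```python
import statistics


def speedup(base, pr):
    """Ratio of PR to base throughput; values above 1.0 mean the PR is faster."""
    return pr / base


def summarize(rows):
    """Return median/min/max speedup over (case, base, pr) rows."""
    ratios = [speedup(base, pr) for _, base, pr in rows]
    return {
        "median": statistics.median(ratios),
        "min": min(ratios),
        "max": max(ratios),
    }


def regressions(rows, threshold=0.95):
    """List cases whose speedup falls below the (assumed) threshold."""
    return [
        (case, speedup(base, pr))
        for case, base, pr in rows
        if speedup(base, pr) < threshold
    ]


# Three (case, base, pr) rows copied from the normalization results above.
rows = [
    ("Llama3-8B/RMSNorm M=1024 forward", 638.60, 624.90),
    ("Llama3-8B/RMSNorm M=1024 backward", 485.60, 723.50),
    ("Qwen2.5-72B/RMSNorm M=1024 backward", 1442.30, 576.10),
]

for case, base, pr in rows:
    print(f"{case}: {speedup(base, pr):.3f}x")

summary = summarize(rows)
flagged = regressions(rows)
```

With these three rows, only the Qwen2.5-72B backward case falls below the assumed threshold. Since the PR reports regressions without failing CI, a list like `flagged` would feed the PR comment rather than the job's exit status.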

This comment was marked as outdated.

@ROCm ROCm deleted a comment from Copilot AI Mar 12, 2026
@ROCm ROCm deleted a comment from Copilot AI Mar 12, 2026
@ROCm ROCm deleted a comment from Copilot AI Mar 12, 2026
@matthiasdiener matthiasdiener changed the title from "Microbenchmarking, CSV-based" to "Microbenchmarking, Torch+CSV-based" May 11, 2026
Contributor

@Micky774 Micky774 left a comment

A few general comments in addition to the inline:

  1. Regarding copyright, some spots are 2026 only while others are 2025-2026 -- is there a specific reason, or can we be specific and only set 2026?
  2. It seems that dtype=torch.bfloat16 is hard-coded -- can we generalize to allow for e.g. fp16 benchmarks?
  3. Can we document the bench_fn contract so that it's easier for new developers to contribute additional benchmarks?
  4. Can we have a more general RECIPES dict similar to NV (
    RECIPES = {
        "bf16": None,
        "fp8_sub_channel": Float8BlockScaling(),
        "mxfp8": MXFP8BlockScaling(),
        "nvfp4": NVFP4BlockScaling(),
    }
    )
  5. Can we add a README.md to document?
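
For illustration, a name-to-recipe mapping like the one in item 4 could drive a benchmark loop roughly as follows. This is only a sketch under stated assumptions: `StubRecipe`, `stub_autocast`, and `run_with_recipes` are hypothetical stand-ins so the example runs without Transformer Engine installed; a real benchmark would use the library's recipe classes and its autocast context instead.

```python
from contextlib import contextmanager, nullcontext


class StubRecipe:
    """Hypothetical placeholder for a quantization recipe object."""

    def __init__(self, name):
        self.name = name


# Maps a recipe name to a recipe object; None means "no quantization".
RECIPES = {
    "bf16": None,
    "fp8_stub": StubRecipe("fp8_stub"),
}

# Tracks which stub recipe is currently active, for demonstration only.
active = []


@contextmanager
def stub_autocast(recipe):
    """Stand-in for a library autocast context that enables a recipe."""
    active.append(recipe.name)
    try:
        yield
    finally:
        active.pop()


def run_with_recipes(bench_fn, recipes):
    """Call bench_fn once per recipe and collect results keyed by name."""
    results = {}
    for name, recipe in recipes.items():
        ctx = nullcontext() if recipe is None else stub_autocast(recipe)
        with ctx:
            results[name] = bench_fn()
    return results


# The "benchmark" here just reports which recipe was active while it ran.
results = run_with_recipes(lambda: list(active), RECIPES)
print(results)
```

Keeping the mapping in one place means adding a precision mode is a one-line change, and each benchmark script can select entries by name.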

Comment thread benchmarks/microbenchmarks/benchmark_gemm.py Outdated
Comment thread benchmarks/microbenchmarks/benchmark_grouped_gemm.py Outdated
Comment thread benchmarks/microbenchmarks/compare_results.py Outdated
Comment thread benchmarks/microbenchmarks/benchmark_gemm_fp8.py
Comment thread benchmarks/microbenchmarks/utils.py Outdated
Comment thread benchmarks/microbenchmarks/utils.py
Comment thread benchmarks/microbenchmarks/benchmark_grouped_gemm.py
Comment thread benchmarks/microbenchmarks/compare_results.py Outdated
@matthiasdiener
Contributor Author

  1. Regarding copyright, some spots are 2026 only while others are 2025-2026 -- is there a specific reason, or can we be specific and only set 2026?

Changed to 2026 in 284adda.

  2. It seems that dtype=torch.bfloat16 is hard-coded -- can we generalize to allow for e.g. fp16 benchmarks?

Done in 284adda.

  3. Can we document the bench_fn contract so that it's easier for new developers to contribute additional benchmarks?

Added documentation in utils.py and README.

  4. Can we have a more general RECIPES dict similar to NV (
    RECIPES = {
        "bf16": None,
        "fp8_sub_channel": Float8BlockScaling(),
        "mxfp8": MXFP8BlockScaling(),
        "nvfp4": NVFP4BlockScaling(),
    }
    )

Added in 284adda.

  5. Can we add a README.md to document?

Added in ca1f442.

@matthiasdiener matthiasdiener marked this pull request as ready for review May 11, 2026 20:50
]


def _generate_cast_test_cases():
Collaborator

Hmm, we are already in benchmark_casting.py, so _generate_test_cases() should suffice? More generally, I saw that each benchmark script uses different functions or names for test case setup. Is it possible to unify them, or at least follow a single pattern?

Contributor Author

Renamed the cast and norm functions to _generate_test_cases in abefa95.

Comment thread benchmarks/microbenchmarks/utils.py Outdated
Comment on lines +91 to +100
def time_func(fn, method="adaptive", min_run_time=DEFAULT_MIN_RUN_TIME_SECONDS):
    """Time *fn* and return elapsed milliseconds.

    method: "adaptive" uses adaptive_autorange (good for compute-bound),
            "blocked" uses blocked_autorange (good for memory-bound).
    """
    timer = benchmark.Timer(stmt="fn()", globals={"fn": fn})
    if method == "blocked":
        return timer.blocked_autorange(min_run_time=min_run_time).mean * 1e3
    return timer.adaptive_autorange(min_run_time=min_run_time).mean * 1e3
Contributor

For the csv outputs you record only the means -- I think it would be useful to be able to save the underlying samples and individual runtimes for downstream analysis.

Contributor Author

@matthiasdiener matthiasdiener May 29, 2026

Thanks for the suggestion. What do you think of 39b5720, which adds a new argument (--csv-samples), and stores the samples into a separate csv file?
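
As a rough illustration of the idea (not the code in 39b5720), per-sample timings could be written to their own CSV next to the summary file. Everything below is an assumption apart from the `--csv-samples` flag name: the flag is assumed to take an output path, the column names are made up, and `collect_samples_ms` is a plain wall-clock stand-in so the sketch runs without PyTorch. In the PR the samples would presumably come from the `torch.utils.benchmark` measurement, which also handles GPU synchronization that this stand-in does not.

```python
import argparse
import csv
import io
import time


def collect_samples_ms(fn, repeats=5):
    """Stand-in timer: return one wall-clock time in ms per call of fn."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return samples


def write_samples(out, case, samples_ms):
    """Write one row per sample so downstream tools can compute any statistic."""
    writer = csv.writer(out)
    writer.writerow(["case", "sample_index", "time_ms"])
    for index, value in enumerate(samples_ms):
        writer.writerow([case, index, f"{value:.6f}"])


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Assumed semantics: optional path for the per-sample CSV.
    parser.add_argument("--csv-samples", default=None,
                        help="write individual timing samples to this CSV file")
    return parser.parse_args(argv)


args = parse_args(["--csv-samples", "samples.csv"])

# An in-memory buffer keeps the example free of filesystem side effects;
# a real script would open args.csv_samples for writing instead.
buffer = io.StringIO()
write_samples(buffer, "example_case", collect_samples_ms(lambda: sum(range(1000))))
print(buffer.getvalue())
```

Storing one row per sample keeps the summary CSV unchanged while letting later analysis compute medians, percentiles, or variance without re-running the benchmarks.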

Contributor

@Micky774 Micky774 left a comment

One final comment, otherwise LGTM.

@matthiasdiener matthiasdiener requested a review from Micky774 May 29, 2026 20:57
@matthiasdiener matthiasdiener added the ci-level 1 CI test level 1 label May 29, 2026

6 participants