ASV-format microbenchmark suite #487

Open: Micky774 wants to merge 19 commits into dev from zain/asv-demo

Conversation

@Micky774 (Contributor) commented Mar 16, 2026

Description

This PR is a port of #478.

This PR uses a central driver to parse and run individual benchmark-defining scripts. The driver provides a function that the individual scripts can import, making each script self-sufficient and directly runnable. Neither the benchmarks nor the driver has a hard ASV dependency; they simply produce results in an ASV-compatible format for later consumption.

ASV is used only for result tracking, visualization, and publishing. A helper bash script wraps the ASV commands for convenience, and also wraps the main driver script.
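To make that layout concrete, here is a hypothetical sketch of the pattern (names such as run_suite and the class body are illustrative, not the actual driver API):

# bench_example.py -- illustrative only; the real scripts and driver API may differ
from driver import run_suite  # the central driver; note: no ASV import anywhere

class BenchGemm:
    # ASV-style parameterization, consumed by the driver
    params = [[1024, 2048], ["Llama3-8B_TP1-QKV"]]
    param_names = ["M", "shape"]

    def time_forward(self, M, shape):
        ...  # run the op under test

if __name__ == "__main__":
    # Each script is self-sufficient: running it directly invokes the shared
    # driver, which times the benchmarks and writes ASV-compatible JSON results.
    run_suite(__file__)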

Follow-up Work

In future PRs we will:

  • extend benchmarking to new ops
  • re-evaluate bench configs and scope
  • update attention benchmarking to reach parity with the JAX FA benchmarking tool (mainly so we have persistent regression tracking)

Type of change

  • Documentation change (changes only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Adds benchmarks
  • Adds README.md for documentation
  • Adds driver script
  • Adds helper bash script to wrap driver and ASV

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@Micky774 marked this pull request as ready for review March 17, 2026 13:58
@Micky774 mentioned this pull request Mar 17, 2026
@Micky774 (Contributor, Author):

Note: the CI failure is unrelated.

@Micky774 (Contributor, Author):

I've added a helper script, as @alextmagro suggested, along with corresponding documentation in the README.md.

Comment thread .github/workflows/rocm-ci.yml Outdated
EOF
)"

- name: Restore previous ASV results
Collaborator:

I think the benchmarks should go in a separate workflow from CI, i.e. both these microbenchmarks and the ones that already run with CI.

Contributor (Author):

Will doing so require a separate TE build and setup? I added it here so that we'd piggy-back on the already-running CI.

Comment thread .github/workflows/rocm-ci.yml Outdated

# Derive a stable machine name from the runner label
case "${RUNNER_NAME}" in
linux-te-mi325*) MACHINE_NAME="mi325" ;;
Collaborator:

Why do we need this if results are uploaded under just the matrix.runner name?

Contributor (Author):

So, my understanding is that the matrix.runner name is not 1:1 with the underlying system; i.e., different systems with different machine names can be part of a pool under the same runner name. ASV stores results by machine name by default. Here, we manually specify a generic machine name indexed by GPU arch so that, e.g., each mi325 runner stores its results in a compatible way.

Ideally, we'd have dedicated machines for benchmarking (since this would likely run per-commit, or at least nightly), but that's a constraint we'll need to discuss.
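For illustration, a fuller sketch of that mapping might look like this (the additional runner label and fallback branch here are hypothetical, not part of this PR):

# Derive a stable machine name from the runner label
case "${RUNNER_NAME}" in
  linux-te-mi325*) MACHINE_NAME="mi325" ;;            # mi325 pool
  linux-te-mi300*) MACHINE_NAME="mi300" ;;            # hypothetical second pool
  *)               MACHINE_NAME="${RUNNER_NAME}" ;;   # fall back to the raw label
esac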

Comment thread .github/workflows/rocm-ci.yml Outdated
set -ex
pip install asv
cd /workspace
asv machine --yes --machine "$MACHINE_NAME"
Collaborator:

Will it re-register the machine if it already exists?

Contributor (Author):

Yes, but it's registered inside the container, so the registration is transient.
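For context, asv machine writes its metadata to ~/.asv-machine.json in the user's home directory, so in a fresh container the registration disappears with the container. If that file ever did persist, a guard along these lines (a sketch, not part of this PR) would make registration idempotent:

# Only register if this machine name isn't already recorded
if ! grep -q "\"$MACHINE_NAME\"" ~/.asv-machine.json 2>/dev/null; then
    asv machine --yes --machine "$MACHINE_NAME"
fi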

@Micky774 changed the title from "ASV demo" to "ASV-format microbenchmark suite" Mar 24, 2026

Comment thread benchmarks/asv/run_benchmarks.sh Outdated
# Helper script for common ASV benchmark tasks.
set -euo pipefail

cd "$(git rev-parse --show-toplevel)"
Collaborator:

(1) On CI this may fail with a "dubious ownership" error. (2) If the current directory is not part of a git tree, it fails too. So it is better to determine BENCH_DIR as the directory where the current script is located.
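A minimal sketch of that suggestion (the variable name BENCH_DIR follows the comment above; exact placement in the script is illustrative):

# Resolve the directory containing this script, independent of CWD and git
BENCH_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$BENCH_DIR"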

Comment thread benchmarks/asv/run_benchmarks.sh Outdated

case "${1:-}" in
setup)
MACHINE="${2:-$(hostname)}"
Collaborator:

Are the hostnames used for CI runners persistent?

Comment thread benchmarks/asv/driver.py Outdated
Comment on lines +341 to +344
parser.add_argument("-w", "--warmup", type=int, default=3,
help="Number of warmup iterations (default: 3)")
parser.add_argument("-n", "--iters", type=int, default=7,
help="Number of timed iterations (default: 7)")
Contributor:

3/7 iterations is quite low; for microbenchmarks we could probably use much higher numbers (like 50/50) by default.

As an example, for the second shape in bench_gemm I sometimes get 0.127ms as the median value with 3/7, while 50/50 is pretty consistently at 0.111ms.

Contributor (Author):

Updated

Comment thread benchmarks/asv/README.md Outdated
| Command | Description |
|---|---|
| `setup [name]` | Register machine with ASV (defaults to `hostname`) |
| `run [suite] [method]` | Run benchmarks in-process (fast, saves ASV-compatible results) |
| `run --asv [suite]` | Run via ASV subprocess isolation (for CI or statistical rigor) |
Contributor:

If we do keep the functionality to run benchmarks with asv, can we make this a driver.py option?

Contributor (Author):

I've actually now trimmed it so that we only run directly, which simplifies the whole thing a bit. ASV is used strictly for publishing/viewing and regression tracking.
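Under that split, the ASV side reduces to its standard publishing subcommands; a sketch of the flow (the revision placeholders are illustrative):

asv publish                 # render recorded results into a static HTML report
asv preview                 # serve the report locally for viewing
asv compare <rev1> <rev2>   # check for regressions between two recorded commits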

Comment thread benchmarks/asv/driver.py

# Derive throughput from work_* companion
work = {}
wfn = getattr(instance, "work_" + method_name[5:], None)
Contributor:

Are the "work"/TFLOPS values stored anywhere? I could only find them printed to stdout via the direct run, but not stored. The work_ methods seem to be unused for ASV.
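For reference, the snippet above strips the "time_" prefix (method_name[5:]) and looks up a matching work_* companion on the benchmark class; a hypothetical example of the pairing (the shape table and numbers are made up for illustration):

class BenchGemm:
    # Hypothetical (K, N) lookup; the real shapes live in the benchmark file
    SHAPES = {"Llama3-8B_TP1-QKV": (4096, 6144)}

    def time_forward(self, M, shape):
        ...  # timed GEMM

    def work_forward(self, M, shape):
        # Companion to time_forward: reports FLOPs so the driver can derive
        # TFLOPS as flops / median_seconds / 1e12
        K, N = self.SHAPES[shape]
        return {"flops": 2 * M * N * K}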

Comment thread benchmarks/asv/run_benchmarks.sh Outdated
setup Register this machine with ASV
run [-w W] [-n N] [SUITE] [METHOD]
Run benchmarks in-process (fast, saves ASV-compatible results)
run --asv [SUITE] Run benchmarks via ASV (subprocess isolation per benchmark)
Contributor:

The results seem to differ significantly between running with asv and running directly. E.g., the first three shapes in bench_gemm.py return:

[62.50%] ··· bench_gemm.BenchGemm.time_forward                                                                          ok
[62.50%] ··· ====== ========================= =============
               M              shape
             ------ ------------------------- -------------
              1024      Llama3-8B_TP1-QKV       160±0.4μs
              1024    Llama3-8B_TP1-AttnOut      129±1μs
              1024     Llama3-8B_TP1-GateUp      460±1μs

with asv, and

----------------------------------------------------------------------------------------------------------------------------------------
      median        mean       stdev         q25         q75         min         max      TFLOPS  method                          params
----------------------------------------------------------------------------------------------------------------------------------------
     0.147ms     0.147ms     0.002ms     0.146ms     0.148ms     0.145ms     0.151ms       350.2  time_forward                    M=1024, shape=Llama3-8B_TP1-QKV
     0.120ms     0.120ms     0.007ms     0.112ms     0.130ms     0.110ms     0.130ms       287.1  time_forward                    M=1024, shape=Llama3-8B_TP1-AttnOut
     0.405ms     0.406ms     0.008ms     0.397ms     0.410ms     0.395ms     0.422ms       594.6  time_forward                    M=1024, shape=Llama3-8B_TP1-GateUp

with the direct method.

Comment thread benchmarks/asv/bench_attention.py
@@ -0,0 +1,97 @@
#!/usr/bin/env python3
Contributor:

Instead of creating a new attention microbenchmark, should we use the attention microbenchmark(s) already part of TE (in https://github.com/ROCm/TransformerEngine/tree/dev/benchmarks/attention)?

Comment thread benchmarks/asv/driver.py Outdated
"""Return (median, mean, stdev, ci_lo, ci_hi, q25, q75) for *samples*."""
s = sorted(samples)
n = len(s)
mean = sum(s) / n
@matthiasdiener (Contributor) commented Apr 15, 2026:

Why not use statistics.median() etc. here? The median calculated here (s[n//2]) is not correct in general: for an even number of elements it should be the average of the two middle elements.

A similar issue exists for stdev, I think (/n vs. /(n-1)).
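A minimal sketch of the suggested fix using the standard library (the function name summarize is illustrative; the driver's actual return tuple may differ):

import statistics

def summarize(samples):
    s = sorted(samples)
    median = statistics.median(s)   # averages the middle pair for even n
    mean = statistics.fmean(s)
    stdev = statistics.stdev(s)     # sample stdev, divides by (n - 1)
    q25, _, q75 = statistics.quantiles(s, n=4)  # quartile cut points
    return median, mean, stdev, q25, q75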

Comment thread benchmarks/asv/bench_attention.py Outdated
Comment on lines +28 to +29
Forward FLOPs = 4 * batch * num_q_heads * seq_len^2 * head_dim
Backward FLOPs ~ 2x forward
Contributor:

Repeated from lines 17-19 in this file.
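For reference, the docstring formula translates into a small helper that could also be shared to avoid the duplication noted above; a sketch (names are illustrative):

def attention_flops(batch, num_q_heads, seq_len, head_dim):
    # Forward: 4 * batch * heads * seq_len^2 * head_dim
    # (two matmuls, QK^T and P@V, at 2*M*N*K FLOPs each)
    fwd = 4 * batch * num_q_heads * seq_len ** 2 * head_dim
    bwd = 2 * fwd  # backward ~ 2x forward, per the docstring
    return fwd, bwd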

Comment thread benchmarks/asv/driver.py
"stats_ci_99_a", "stats_ci_99_b",
"stats_q_25", "stats_q_75",
"stats_number", "stats_repeat",
"samples",
Contributor:

What's the meaning of samples here? Is it ever written?
