ASV-format microbenchmark suite #487

Open: Micky774 wants to merge 19 commits into dev from zain/asv-demo

Conversation

@Micky774 (Contributor) commented Mar 16, 2026

Description

This PR is a port of #478.

This PR uses a central driver to parse and run individual benchmark-defining scripts. The driver provides a function that the individual scripts can import, making each script self-sufficient and directly runnable. Neither the benchmarks nor the driver has a hard ASV dependency; they simply produce results in an ASV-compatible format for later consumption.

ASV is used only for result tracking, visualization, and publishing. A helper bash script wraps the ASV commands for convenience, and also wraps the main driver script.
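To make that layout concrete, here is a hypothetical sketch of the pattern (names such as run_suite and the class body are illustrative, not the actual driver API):

# bench_example.py -- illustrative only; the real scripts and driver API may differ
from driver import run_suite  # the central driver; note: no ASV import anywhere

class BenchGemm:
    # ASV-style parameterization, consumed by the driver
    params = [[1024, 2048], ["Llama3-8B_TP1-QKV"]]
    param_names = ["M", "shape"]

    def time_forward(self, M, shape):
        ...  # run the op under test

if __name__ == "__main__":
    # Each script is self-sufficient: running it directly invokes the shared
    # driver, which times the benchmarks and writes ASV-compatible JSON results.
    run_suite(__file__)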

Follow-up Work

In future PRs we will:

  • extend benchmarking to new ops
  • re-evaluate bench configs and scope
  • update attention benchmarking to reach parity with the JAX FA benchmarking tool (mainly so we have persistent regression tracking)

Type of change

  • Documentation change (changes only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Adds benchmarks
  • Adds README.md for documentation
  • Adds driver script
  • Adds helper bash script to wrap driver and ASV

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@Micky774 marked this pull request as ready for review March 17, 2026 13:58
@Micky774 mentioned this pull request Mar 17, 2026
@Micky774 (Contributor, Author):

Note: the CI failure is unrelated.

@Micky774 (Contributor, Author):

I've added a helper script, as @alextmagro suggested, along with corresponding documentation in the README.md.

Comment thread .github/workflows/rocm-ci.yml Outdated
EOF
)"

- name: Restore previous ASV results
Collaborator:

I think the benchmarks should go in a separate workflow from CI, i.e. both these microbenchmarks and the ones that already run with CI.

Contributor (Author):

Will doing so require a separate TE build and setup? I added it here so that we'd piggy-back on the already-running CI.

Comment thread .github/workflows/rocm-ci.yml Outdated

# Derive a stable machine name from the runner label
case "${RUNNER_NAME}" in
linux-te-mi325*) MACHINE_NAME="mi325" ;;
Collaborator:

Why do we need this if results are uploaded under just the matrix.runner name?

Contributor (Author):

So, my understanding is that the matrix.runner name is not 1:1 with the underlying system; i.e., different systems with different machine names can be part of a pool under the same runner name. ASV stores results by machine name by default. Here, we manually specify a generic machine name indexed by GPU arch so that, e.g., each mi325 runner stores its results in a compatible way.

Ideally, we'd have dedicated machines for benchmarking (since this would likely run per-commit, or at least nightly), but that's a constraint we'll need to discuss.
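For illustration, a fuller sketch of that mapping might look like this (the additional runner label and fallback branch here are hypothetical, not part of this PR):

# Derive a stable machine name from the runner label
case "${RUNNER_NAME}" in
  linux-te-mi325*) MACHINE_NAME="mi325" ;;            # mi325 pool
  linux-te-mi300*) MACHINE_NAME="mi300" ;;            # hypothetical second pool
  *)               MACHINE_NAME="${RUNNER_NAME}" ;;   # fall back to the raw label
esac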

Comment thread .github/workflows/rocm-ci.yml Outdated
set -ex
pip install asv
cd /workspace
asv machine --yes --machine "$MACHINE_NAME"
Collaborator:

Will it re-register the machine if it already exists?

Contributor (Author):

Yes, but it's registered inside the container, so the registration is transient.
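For context, asv machine writes its metadata to ~/.asv-machine.json in the user's home directory, so in a fresh container the registration disappears with the container. If that file ever did persist, a guard along these lines (a sketch, not part of this PR) would make registration idempotent:

# Only register if this machine name isn't already recorded
if ! grep -q "\"$MACHINE_NAME\"" ~/.asv-machine.json 2>/dev/null; then
    asv machine --yes --machine "$MACHINE_NAME"
fi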

@Micky774 changed the title from "ASV demo" to "ASV-format microbenchmark suite" Mar 24, 2026

Comment thread benchmarks/asv/run_benchmarks.sh Outdated
# Helper script for common ASV benchmark tasks.
set -euo pipefail

cd "$(git rev-parse --show-toplevel)"
Collaborator:

(1) On CI this may fail with a "dubious ownership" error. (2) If the current directory is not part of a git tree, it fails too. So it is better to determine BENCH_DIR as the directory where the current script is located.
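A minimal sketch of that suggestion (the variable name BENCH_DIR follows the comment above; exact placement in the script is illustrative):

# Resolve the directory containing this script, independent of CWD and git
BENCH_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$BENCH_DIR"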

Comment thread benchmarks/asv/run_benchmarks.sh Outdated

case "${1:-}" in
setup)
MACHINE="${2:-$(hostname)}"
Collaborator:

Are the hostnames used for CI runners persistent?

Comment thread benchmarks/asv/driver.py Outdated
Comment on lines +341 to +344
parser.add_argument("-w", "--warmup", type=int, default=3,
help="Number of warmup iterations (default: 3)")
parser.add_argument("-n", "--iters", type=int, default=7,
help="Number of timed iterations (default: 7)")
Contributor:

3/7 iterations is quite low; for microbenchmarks we could probably use much higher numbers (like 50/50) by default.

As an example, for the second shape in bench_gemm I sometimes get 0.127ms as the median value with 3/7, while 50/50 is pretty consistently at 0.111ms.

Contributor (Author):

Updated

Comment thread benchmarks/asv/README.md Outdated
| Command | Description |
|---|---|
| `setup [name]` | Register machine with ASV (defaults to `hostname`) |
| `run [suite] [method]` | Run benchmarks in-process (fast, saves ASV-compatible results) |
| `run --asv [suite]` | Run via ASV subprocess isolation (for CI or statistical rigor) |
Contributor:

If we do keep the functionality to run benchmarks with asv, can we make this a driver.py option?

Contributor (Author):

I've actually now trimmed it so that we only run directly, which simplifies the whole thing a bit. ASV is used strictly for publishing/viewing and regression tracking.
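Under that split, the ASV side reduces to its standard publishing subcommands; a sketch of the flow (the revision placeholders are illustrative):

asv publish                 # render recorded results into a static HTML report
asv preview                 # serve the report locally for viewing
asv compare <rev1> <rev2>   # check for regressions between two recorded commits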

Comment thread benchmarks/asv/driver.py

# Derive throughput from work_* companion
work = {}
wfn = getattr(instance, "work_" + method_name[5:], None)
Contributor:

Are the "work"/TFLOPS values stored anywhere? I could only find them printed to stdout via the direct run, but not stored. The work_ methods seem to be unused for ASV.
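For reference, the snippet above strips the "time_" prefix (method_name[5:]) and looks up a matching work_* companion on the benchmark class; a hypothetical example of the pairing (the shape table and numbers are made up for illustration):

class BenchGemm:
    # Hypothetical (K, N) lookup; the real shapes live in the benchmark file
    SHAPES = {"Llama3-8B_TP1-QKV": (4096, 6144)}

    def time_forward(self, M, shape):
        ...  # timed GEMM

    def work_forward(self, M, shape):
        # Companion to time_forward: reports FLOPs so the driver can derive
        # TFLOPS as flops / median_seconds / 1e12
        K, N = self.SHAPES[shape]
        return {"flops": 2 * M * N * K}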

Comment thread benchmarks/asv/run_benchmarks.sh Outdated
setup Register this machine with ASV
run [-w W] [-n N] [SUITE] [METHOD]
Run benchmarks in-process (fast, saves ASV-compatible results)
run --asv [SUITE] Run benchmarks via ASV (subprocess isolation per benchmark)
Contributor:

The results seem to differ significantly between running with asv and running directly. E.g., the first three shapes in bench_gemm.py return:

[62.50%] ··· bench_gemm.BenchGemm.time_forward                                                                          ok
[62.50%] ··· ====== ========================= =============
               M              shape
             ------ ------------------------- -------------
              1024      Llama3-8B_TP1-QKV       160±0.4μs
              1024    Llama3-8B_TP1-AttnOut      129±1μs
              1024     Llama3-8B_TP1-GateUp      460±1μs

with asv, and

----------------------------------------------------------------------------------------------------------------------------------------
      median        mean       stdev         q25         q75         min         max      TFLOPS  method                          params
----------------------------------------------------------------------------------------------------------------------------------------
     0.147ms     0.147ms     0.002ms     0.146ms     0.148ms     0.145ms     0.151ms       350.2  time_forward                    M=1024, shape=Llama3-8B_TP1-QKV
     0.120ms     0.120ms     0.007ms     0.112ms     0.130ms     0.110ms     0.130ms       287.1  time_forward                    M=1024, shape=Llama3-8B_TP1-AttnOut
     0.405ms     0.406ms     0.008ms     0.397ms     0.410ms     0.395ms     0.422ms       594.6  time_forward                    M=1024, shape=Llama3-8B_TP1-GateUp

with the direct method.

Comment thread benchmarks/asv/bench_attention.py
@@ -0,0 +1,97 @@
#!/usr/bin/env python3
Contributor:

Instead of creating a new attention microbenchmark, should we use the attention microbenchmark(s) already part of TE (in https://github.com/ROCm/TransformerEngine/tree/dev/benchmarks/attention)?

Comment thread benchmarks/asv/driver.py Outdated
"""Return (median, mean, stdev, ci_lo, ci_hi, q25, q75) for *samples*."""
s = sorted(samples)
n = len(s)
mean = sum(s) / n
@matthiasdiener (Contributor) commented Apr 15, 2026:

Why not use statistics.median() etc. here? The median calculated here (s[n//2]) is not correct in general: for an even number of elements it should be the average of the two middle elements.

A similar issue exists for stdev, I think (/n vs. /(n-1)).
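A minimal sketch of the suggested fix using the standard library (the function name summarize is illustrative; the driver's actual return tuple may differ):

import statistics

def summarize(samples):
    s = sorted(samples)
    median = statistics.median(s)   # averages the middle pair for even n
    mean = statistics.fmean(s)
    stdev = statistics.stdev(s)     # sample stdev, divides by (n - 1)
    q25, _, q75 = statistics.quantiles(s, n=4)  # quartile cut points
    return median, mean, stdev, q25, q75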

Comment thread benchmarks/asv/bench_attention.py Outdated
Comment on lines +28 to +29
Forward FLOPs = 4 * batch * num_q_heads * seq_len^2 * head_dim
Backward FLOPs ~ 2x forward
Contributor:

Repeated from lines 17-19 in this file.
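For reference, the docstring formula translates into a small helper that could also be shared to avoid the duplication noted above; a sketch (names are illustrative):

def attention_flops(batch, num_q_heads, seq_len, head_dim):
    # Forward: 4 * batch * heads * seq_len^2 * head_dim
    # (two matmuls, QK^T and P@V, at 2*M*N*K FLOPs each)
    fwd = 4 * batch * num_q_heads * seq_len ** 2 * head_dim
    bwd = 2 * fwd  # backward ~ 2x forward, per the docstring
    return fwd, bwd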

Comment thread benchmarks/asv/driver.py
"stats_ci_99_a", "stats_ci_99_b",
"stats_q_25", "stats_q_75",
"stats_number", "stats_repeat",
"samples",
Contributor:

What's the meaning of samples here? Is it ever written?
