Conversation
Note the CI failure is unrelated.

I've added a helper script like @alextmagro had suggested, as well as corresponding documentation to the
    EOF
    )"

    - name: Restore previous ASV results
I think benchmarks should go in a separate workflow from CI, i.e. both these microbenchmarks and the ones that are already run with CI.

Will doing so require a separate TE build and setup? I added it here so that we'd piggy-back off of the already-running CI.
    # Derive a stable machine name from the runner label
    case "${RUNNER_NAME}" in
        linux-te-mi325*) MACHINE_NAME="mi325" ;;
Why do we need this if results are uploaded under just the matrix.runner name?

So, my understanding is that the matrix.runner name is not 1-1 with the underlying system, i.e. different systems with different machine names can be part of a pool with the same runner name. ASV by default stores results by machine name. Here, we are manually specifying a generic machine name indexed by GPU arch so that, e.g., every mi325 runner will store its results in a compatible way.

Ideally, we would have dedicated machines for benchmarking (since this would likely run on every commit, or even nightly), but that's a constraint we'll need to discuss.
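To illustrate, a sketch of that mapping with a fallback branch added (only the mi325 label comes from this workflow; the fallback is hypothetical):

```bash
# Map pooled runner labels onto a stable, arch-indexed ASV machine name so
# interchangeable runners of the same arch write into the same results bucket
# (ASV files results per machine, e.g. under .asv/results/<machine>/).
case "${RUNNER_NAME}" in
    linux-te-mi325*) MACHINE_NAME="mi325" ;;
    *)               MACHINE_NAME="${RUNNER_NAME}" ;;  # hypothetical fallback
esac
```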
    set -ex
    pip install asv
    cd /workspace
    asv machine --yes --machine "$MACHINE_NAME"
Will it re-register the machine if it already exists?

Yes, but it's registered in the container, so it's transient.
    # Helper script for common ASV benchmark tasks.
    set -euo pipefail

    cd "$(git rev-parse --show-toplevel)"
(1) On CI this may fail with a "dubious ownership" error. (2) If the current directory is not part of a git tree, it also fails. So it is better to determine BENCH_DIR as the directory where the current script is located.
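Something like the following would address both issues; a minimal sketch, assuming the script lives at the top of the benchmarks directory (BENCH_DIR as suggested above):

```bash
# Resolve the benchmarks directory from the script's own location rather than
# from git, so it works under CI ownership checks and from any CWD.
BENCH_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$BENCH_DIR"
```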
    case "${1:-}" in
        setup)
            MACHINE="${2:-$(hostname)}"
Are the hostnames used for CI runners persistent?
| parser.add_argument("-w", "--warmup", type=int, default=3, | ||
| help="Number of warmup iterations (default: 3)") | ||
| parser.add_argument("-n", "--iters", type=int, default=7, | ||
| help="Number of timed iterations (default: 7)") |
3/7 iterations is quite low; for microbenchmarks we could probably use much higher numbers (like 50/50) by default.

As an example, for the second shape in bench_gemm I sometimes get 0.127ms as the median value with 3/7, while 50/50 is pretty consistently at 0.111ms.
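Until the defaults change, higher counts can be requested per invocation via the documented run flags (the script path here is an assumption):

```bash
# 50 warmup + 50 timed iterations for a single suite.
./benchmarks/asv.sh run -w 50 -n 50 bench_gemm
```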
    |---|---|
    | `setup [name]` | Register machine with ASV (defaults to `hostname`) |
    | `run [suite] [method]` | Run benchmarks in-process (fast, saves ASV-compatible results) |
    | `run --asv [suite]` | Run via ASV subprocess isolation (for CI or statistical rigor) |
If we do keep the functionality to run benchmarks with ASV, can we make this a driver.py option?

I've actually now trimmed it so that we only run directly, which simplifies the whole thing a bit. ASV is used strictly for publishing/viewing and regression tracking.
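For reference, the publishing/viewing side then reduces to stock ASV commands (assuming results already exist in the configured results directory):

```bash
# Build the static HTML report from previously collected results...
asv publish
# ...serve it locally for inspection...
asv preview
# ...and compare two revisions for regressions.
asv compare HEAD~1 HEAD
```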
    # Derive throughput from work_* companion
    work = {}
    wfn = getattr(instance, "work_" + method_name[5:], None)
Are the "work"https://github.com/TFLOPS values stored anywhere? I could only find them printed on stdout via the direct run, but not stored. The work_ methods seem to be unused for ASV.
    setup                Register this machine with ASV
    run [-w W] [-n N] [SUITE] [METHOD]
                         Run benchmarks in-process (fast, saves ASV-compatible results)
    run --asv [SUITE]    Run benchmarks via ASV (subprocess isolation per benchmark)
The results seem to differ significantly between running with ASV and running directly. E.g., the first three shapes in bench_gemm.py return:

    [62.50%] ··· bench_gemm.BenchGemm.time_forward                           ok
    [62.50%] ··· ====== ========================= =============
                    M              shape
                 ------ ------------------------- -------------
                  1024      Llama3-8B_TP1-QKV        160±0.4μs
                  1024    Llama3-8B_TP1-AttnOut       129±1μs
                  1024     Llama3-8B_TP1-GateUp       460±1μs

with ASV, and

    ------------------------------------------------------------------------------------------------------------
     median   mean     stdev    q25      q75      min      max      TFLOPS  method        params
    ------------------------------------------------------------------------------------------------------------
     0.147ms  0.147ms  0.002ms  0.146ms  0.148ms  0.145ms  0.151ms  350.2   time_forward  M=1024, shape=Llama3-8B_TP1-QKV
     0.120ms  0.120ms  0.007ms  0.112ms  0.130ms  0.110ms  0.130ms  287.1   time_forward  M=1024, shape=Llama3-8B_TP1-AttnOut
     0.405ms  0.406ms  0.008ms  0.397ms  0.410ms  0.395ms  0.422ms  594.6   time_forward  M=1024, shape=Llama3-8B_TP1-GateUp

with the direct method.
    @@ -0,0 +1,97 @@
    #!/usr/bin/env python3
Instead of creating a new attention microbenchmark, should we use the attention microbenchmark(s) already part of TE (in https://github.com/ROCm/TransformerEngine/tree/dev/benchmarks/attention)?
| """Return (median, mean, stdev, ci_lo, ci_hi, q25, q75) for *samples*.""" | ||
| s = sorted(samples) | ||
| n = len(s) | ||
| mean = sum(s) / n |
Why not use statistics.median() etc. here? The median calculated here (s[n//2]) is probably not correct: for an even number of elements it should be the average of the two middle elements.

A similar issue exists for stdev, I think (/n vs. /(n-1)).
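A minimal sketch of the suggested fix with the statistics module (the CI bounds from the docstring are omitted here):

```python
import statistics

def summarize(samples):
    """Return (median, mean, stdev, q25, q75) for *samples* (sketch)."""
    median = statistics.median(samples)  # averages the two middle elements for even n
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)    # sample stdev, i.e. divides by n-1
    # quantiles(n=4) returns the three quartile cut points: q25, median, q75.
    q25, _, q75 = statistics.quantiles(samples, n=4)
    return median, mean, stdev, q25, q75
```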
    Forward FLOPs = 4 * batch * num_q_heads * seq_len^2 * head_dim
    Backward FLOPs ~ 2x forward
Repeated from lines 17-19 in this file.
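If the formula is needed in both places, it could live in one shared helper; a sketch (function name and the backward treatment are assumptions):

```python
def attention_flops(batch, num_q_heads, seq_len, head_dim, backward=False):
    # Forward FLOPs = 4 * batch * num_q_heads * seq_len^2 * head_dim;
    # backward is taken as ~2x forward, per the docstring above.
    fwd = 4 * batch * num_q_heads * seq_len ** 2 * head_dim
    return 2 * fwd if backward else fwd
```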
| "stats_ci_99_a", "stats_ci_99_b", | ||
| "stats_q_25", "stats_q_75", | ||
| "stats_number", "stats_repeat", | ||
| "samples", |
What's the meaning of samples here? Is it ever written?
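For reference, a sketch of a record carrying these fields (values are illustrative, loosely based on the gemm numbers above; ASV itself only stores raw samples when run with --record-samples):

```python
# Illustrative ASV-style stats record; all timing values in seconds.
record = {
    "stats_ci_99_a": 0.000145,  # lower 99% CI bound
    "stats_ci_99_b": 0.000151,  # upper 99% CI bound
    "stats_q_25": 0.000146,     # first quartile
    "stats_q_75": 0.000148,     # third quartile
    "stats_number": 1,          # timer loop count per sample
    "stats_repeat": 7,          # number of samples taken
    "samples": [0.000147, 0.000146, 0.000148],  # raw timings, if recorded
}
```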
Description
This PR is a port of #478.
This PR uses a central driver to parse and run individual benchmark-defining scripts. The driver provides a function that can be imported and used by the individual scripts to make them self-sufficient and runnable. The benchmarks themselves, and the driver, have no hard ASV dependency. Instead, they simply produce results in an ASV-compatible format for later consumption.
ASV is only used for result tracking, visualization, and publishing. A helper bash script is provided to wrap the ASV commands for convenience, as well as to wrap the main driver script.
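As a sketch of the intended pattern (module and function names here are assumptions, not the PR's actual API):

```python
#!/usr/bin/env python3
"""A self-sufficient benchmark script under the central-driver scheme."""

class BenchExample:
    def time_forward(self):
        ...  # code under test

if __name__ == "__main__":
    # The driver exposes a function the script imports, making the file
    # directly runnable with no hard ASV dependency; results are written
    # in an ASV-compatible format for later publishing.
    from driver import run_benchmarks  # hypothetical import
    run_benchmarks([BenchExample])
```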
Follow-up Work
In future PRs we will: