Avoid DtoH sync from access of nonzero() item in scheduler #11696
yiyixuxu merged 1 commit into huggingface:main
Conversation
a-r-r-o-w
left a comment
Thanks, this is a cool discovery! We discussed the sync that occurs on the first step in the past; it comes from the timestep lookup used to figure out the starting index. For text-to-X pipelines, the self.scheduler.set_begin_index(0) solution looks good and is probably something we can propagate to all pipelines. For pipelines that support things like denoising strength (for example, the img-to-img/vid-to-vid tasks), or more generally any model that starts denoising at a custom timestep index instead of 0, we will still need to do the lookup (at least with the current design); see the sketch below.
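An illustrative sketch (not from this PR) of the strength-based start used by img2img-style pipelines; the names loosely follow the SDXL img2img pattern:

```python
# Denoise only the final `strength` fraction of the schedule.
num_inference_steps = 50
strength = 0.6

init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
t_start = max(num_inference_steps - init_timestep, 0)

# The starting index is still known on the host before the denoising loop, so
# the DtoH lookup could be avoided here too; it just isn't a constant 0:
# scheduler.set_begin_index(t_start * scheduler.order)
print(t_start)  # -> 20
```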
custom timesteps uses …
yes, agree we should apply this change to all text2image pipelines. Test-wise, as long as the current tests pass we are fine. I think this fix is sufficient for all pipelines though; if not, we can look at them case by case.
@yiyixuxu I think there are many instances where we don't do it the same way as it's done in SDXL. These will incur the extra cost from the lookup.
@a-r-r-o-w yeah, I know, we can fix them though. I think unless you need to do the lookup on every step, it should work to set an index at the beginning; the scheduler-side mechanism is sketched below.
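For context, a simplified paraphrase of that mechanism as it appears in the diffusers schedulers (details may differ from the current source):

```python
def _init_step_index(self, timestep):
    if self.begin_index is None:
        # No begin index was set: locate the timestep in the schedule.
        # This is the path that triggers the nonzero()/item() DtoH sync.
        self._step_index = self.index_for_timestep(timestep)
    else:
        # set_begin_index() was called: a plain host-side integer, no sync.
        self._step_index = self._begin_index
```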
@jbschlosser @a-r-r-o-w
@yiyixuxu thank you very much for your treatment of this PR. Appreciate it.
What does this PR do?
(discussed with @sayakpaul in Slack) I have been optimizing inference performance for the Flux model and saw a DtoH sync point in the performance trace, coming from the use of nonzero() followed by item() in the scheduler's index_for_timestep() logic (diffusers/src/diffusers/schedulers/scheduling_flow_match_euler_discrete.py, lines 351 to 363 at b0f7036).
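Paraphrased (simplified; see the linked lines for the exact code), the lookup boils down to:

```python
def index_for_timestep(self, timestep, schedule_timesteps=None):
    if schedule_timesteps is None:
        schedule_timesteps = self.timesteps
    # The equality test and nonzero() run on the GPU when timesteps live there...
    indices = (schedule_timesteps == timestep).nonzero()
    pos = 1 if len(indices) > 1 else 0
    # ...and .item() then forces a blocking device-to-host copy (the sync).
    return indices[pos].item()
```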
This sync causes a minor but non-negligible gap in GPU utilization during the first timestep, especially when torch.compile is used (due to CPU-side Dynamo cache lookup overhead after the sync point), as shown in the profiler trace attached to the PR.
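A minimal standalone repro of this kind of sync (illustrative; requires a CUDA device):

```python
import torch

timesteps = torch.linspace(1000.0, 1.0, 50, device="cuda")
t = timesteps[0]

# nonzero() launches asynchronously; .item() blocks the CPU until the result
# is copied back, which appears as a cudaStreamSynchronize in the trace.
idx = (timesteps == t).nonzero()[0].item()
print(idx)  # 0
```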
AFAICT this can be avoided by manually calling scheduler.set_begin_index(0), so this PR does that for FluxPipeline. Locally, I was able to see the sync point go away after this change. Insights welcome regarding:
- whether a similar change should be applied to pipelines other than FluxPipeline?
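For reference, a minimal runnable sketch of the idea behind the change (assumes diffusers is installed; the actual one-line edit is in FluxPipeline.__call__, see the diff):

```python
import torch
from diffusers import FlowMatchEulerDiscreteScheduler

scheduler = FlowMatchEulerDiscreteScheduler()
scheduler.set_timesteps(num_inference_steps=4)

# Text-to-image denoising always starts at the first timestep, so pin the
# begin index up front; without this, the first step() would locate the start
# via (timesteps == t).nonzero()[...].item(), syncing when timesteps are on GPU.
scheduler.set_begin_index(0)

sample = torch.randn(1, 4, 8, 8)
for t in scheduler.timesteps:
    model_output = torch.zeros_like(sample)  # stand-in for the transformer output
    sample = scheduler.step(model_output, t, sample).prev_sample
```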