Avoid DtoH sync from access of nonzero() item in scheduler #11696
yiyixuxu merged 1 commit into huggingface:main
Conversation
a-r-r-o-w
left a comment
Thanks, this is a cool discovery! We discussed the sync that occurs on the first step in the past; it comes from the timestep lookup used to figure out the starting index. For text-to-X pipelines, the self.scheduler.set_begin_index(0) solution looks good and is probably something we can propagate to all pipelines. For pipelines that support things like denoising strength (for example, the img-to-img/vid-to-vid tasks), or more generally any model that starts denoising at a custom timestep index instead of 0, we will still need to do the lookup (at least with the current design); see the sketch below.
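An illustrative sketch (not from this PR) of the strength-based start used by img2img-style pipelines; the names loosely follow the SDXL img2img pattern:

```python
# Denoise only the final `strength` fraction of the schedule.
num_inference_steps = 50
strength = 0.6

init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
t_start = max(num_inference_steps - init_timestep, 0)

# The starting index is still known on the host before the denoising loop, so
# the DtoH lookup could be avoided here too; it just isn't a constant 0:
# scheduler.set_begin_index(t_start * scheduler.order)
print(t_start)  # -> 20
```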
custom timesteps uses …
yes, agree we should apply this change to all text2image pipelines. Test-wise, as long as the current tests pass we are fine. I think this fix is sufficient for all pipelines though; if not, we can look at them case by case.
@yiyixuxu I think there are many instances where we don't do it the same way as it's done in SDXL. These will incur the extra cost from the lookup.
@a-r-r-o-w yeah, I know, we can fix them though. I think unless you need to do the lookup on every step, it should work to set an index at the beginning; the scheduler-side mechanism is sketched below.
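For context, a simplified paraphrase of that mechanism as it appears in the diffusers schedulers (details may differ from the current source):

```python
def _init_step_index(self, timestep):
    if self.begin_index is None:
        # No begin index was set: locate the timestep in the schedule.
        # This is the path that triggers the nonzero()/item() DtoH sync.
        self._step_index = self.index_for_timestep(timestep)
    else:
        # set_begin_index() was called: a plain host-side integer, no sync.
        self._step_index = self._begin_index
```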
@jbschlosser @a-r-r-o-w
@yiyixuxu thank you very much for your treatment of this PR. Appreciate it.
What does this PR do?
(discussed with @sayakpaul in Slack) I have been optimizing inference performance for the Flux model and saw a DtoH sync point in the performance trace, coming from the use of nonzero() followed by item() in the scheduler's index_for_timestep() logic (diffusers/src/diffusers/schedulers/scheduling_flow_match_euler_discrete.py, lines 351 to 363 at b0f7036).
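Paraphrased (simplified; see the linked lines for the exact code), the lookup boils down to:

```python
def index_for_timestep(self, timestep, schedule_timesteps=None):
    if schedule_timesteps is None:
        schedule_timesteps = self.timesteps
    # The equality test and nonzero() run on the GPU when timesteps live there...
    indices = (schedule_timesteps == timestep).nonzero()
    pos = 1 if len(indices) > 1 else 0
    # ...and .item() then forces a blocking device-to-host copy (the sync).
    return indices[pos].item()
```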
This sync causes a minor but non-negligible gap in GPU utilization during the first timestep, especially when torch.compile is used (due to CPU-side Dynamo cache lookup overhead after the sync point), as shown in the profiler trace attached to the PR.
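A minimal standalone repro of this kind of sync (illustrative; requires a CUDA device):

```python
import torch

timesteps = torch.linspace(1000.0, 1.0, 50, device="cuda")
t = timesteps[0]

# nonzero() launches asynchronously; .item() blocks the CPU until the result
# is copied back, which appears as a cudaStreamSynchronize in the trace.
idx = (timesteps == t).nonzero()[0].item()
print(idx)  # 0
```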
AFAICT this can be avoided by manually calling scheduler.set_begin_index(0), so this PR does that for FluxPipeline. Locally, I was able to see the sync point go away after this change. Insights welcome regarding:
- whether a similar change should be applied to pipelines other than FluxPipeline?
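For reference, a minimal runnable sketch of the idea behind the change (assumes diffusers is installed; the actual one-line edit is in FluxPipeline.__call__, see the diff):

```python
import torch
from diffusers import FlowMatchEulerDiscreteScheduler

scheduler = FlowMatchEulerDiscreteScheduler()
scheduler.set_timesteps(num_inference_steps=4)

# Text-to-image denoising always starts at the first timestep, so pin the
# begin index up front; without this, the first step() would locate the start
# via (timesteps == t).nonzero()[...].item(), syncing when timesteps are on GPU.
scheduler.set_begin_index(0)

sample = torch.randn(1, 4, 8, 8)
for t in scheduler.timesteps:
    model_output = torch.zeros_like(sample)  # stand-in for the transformer output
    sample = scheduler.step(model_output, t, sample).prev_sample
```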