[docs] Quantization + torch.compile + offloading #11703
stevhliu merged 9 commits into huggingface:main from
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
sayakpaul
left a comment
Thanks for starting this. Will get you the numbers.
Code: https://gist.github.com/sayakpaul/0db9d8eeeb3d2a0e5ed7cf0d9ca19b7d Worth mentioning:
```
</hfoption>
<hfoption id="group offloading">
```
Do you think it might be better demonstrated with a more compute-heavy model like Wan? That way, we can show the actual benefits of group offloading.
Sounds good, could you get me the updated numbers for Wan with quantization/group offloading/torch.compile please?
I think it's okay to keep the Flux numbers, but for the sake of the code and discussion, we could use Wan.
Ah ok, don't worry about getting the Wan numbers then!
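For reference, a group offloading setup on a compute-heavier model like Wan might look roughly like this (a minimal sketch; the model ID, dtype, and `num_blocks_per_group` value are illustrative rather than the benchmarked settings):

```py
import torch
from diffusers import WanPipeline

pipeline = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16
)

# Keep the transformer blocks on the CPU and onload them to the GPU in
# groups only when they are needed during the forward pass.
pipeline.transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="block_level",
    num_blocks_per_group=4,
)

video = pipeline("a cat walking on grass").frames[0]
```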
Offloading strategies move layers or models that aren't currently active to the CPU to avoid increasing GPU memory usage. These strategies can be combined with quantization and torch.compile to balance inference speed and memory usage.
Refer to the [Compile and offloading quantized models](./speed-memory-optims) guide for more details.
I think #11731 can be resolved in this PR, where I make a note that offloading can be combined with quantization and torch.compile.
Added your layerwise casting note in here as well :)
Yeah feel free to close those :)
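For anyone skimming the thread, the layerwise casting note amounts to something like the following (a minimal sketch; the model ID and dtype choices are illustrative):

```py
import torch
from diffusers import FluxPipeline

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# Store the transformer weights in fp8 and upcast them to bf16 only for
# computation, trading a little precision for lower memory usage.
pipeline.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)
```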
```py
pipeline.transformer.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="block_level",
    num_blocks_per_group=4
)
```
We should use these args when a component is quantized with bitsandbytes to mitigate device mismatch issues:
But I am curious. Were you able to run the code?
sayakpaul
left a comment
Left some more comments. LMK if they make sense.
Follows up on #11670 and #11672 to document combinations of quantization, torch.compile, and offloading.
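As a rough illustration of what that combination looks like when stacked on one pipeline (a minimal sketch, assuming bitsandbytes 4-bit quantization on Flux; the model ID, prompt, and settings are illustrative):

```py
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

# 1. Quantization: load the transformer in 4-bit with bitsandbytes.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", transformer=transformer, torch_dtype=torch.bfloat16
)

# 2. Offloading: move components to the CPU when they aren't actively running.
pipeline.enable_model_cpu_offload()

# 3. torch.compile: compile the quantized transformer for faster inference.
pipeline.transformer = torch.compile(pipeline.transformer)

image = pipeline("a photo of a cat", num_inference_steps=28).images[0]
```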