PAG variant for AnimateDiff#8789

Merged
DN6 merged 21 commits into
huggingface:mainfrom
a-r-r-o-w:animatediff/pag
Aug 1, 2024

Conversation

@a-r-r-o-w
Contributor

@a-r-r-o-w a-r-r-o-w commented Jul 4, 2024

What does this PR do?

Looking at #8710, I thought it might be interesting to apply PAG to video generation pipelines and see if there's interest in supporting this.

In addition to this, I would also like to propose the addition of AutoPipelineForTextToVideo since we support a few video models now, and this will continue to grow with ongoing research progress. WDYT?

Code
import torch
from diffusers import AnimateDiffPipeline, MotionAdapter, DDIMScheduler
from diffusers.pipelines.pag.pipeline_pag_sd_animatediff import AnimateDiffPAGPipeline
from diffusers.utils import export_to_gif


# model_id = "runwayml/stable-diffusion-v1-5"
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
motion_adapter_id = "guoyww/animatediff-motion-adapter-v1-5-2"

prompt = "car, futuristic cityscape with neon lights, street, no human"
negative_prompt = "low quality, bad quality"
num_inference_steps = 25
guidance_scale = 6
pag_scale = 3.0

motion_adapter = MotionAdapter.from_pretrained(motion_adapter_id)
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler", beta_schedule="linear", steps_offset=1, clip_sample=False)
pipe = AnimateDiffPAGPipeline.from_pretrained(
    model_id,
    motion_adapter=motion_adapter,
    scheduler=scheduler,
    pag_applied_layers=[],
    torch_dtype=torch.float16,
).to("cuda")

configs = [
    dict(pag_scale=0.0, clip_skip=None, free_init=False),
    dict(pag_scale=0.0, clip_skip=2, free_init=False),
    dict(pag_scale=3.0, clip_skip=None, free_init=False),
    dict(pag_scale=3.0, clip_skip=2, free_init=False),
    dict(pag_scale=3.0, clip_skip=None, free_init=True),
    dict(pag_scale=3.0, clip_skip=2, free_init=True),
    dict(pag_scale=0.5, clip_skip=None, free_init=False),
    dict(pag_scale=0.5, clip_skip=2, free_init=False),
    dict(pag_scale=0.5, clip_skip=None, free_init=True),
    dict(pag_scale=0.5, clip_skip=2, free_init=True),
    dict(pag_scale=5.0, clip_skip=None, free_init=False),
    dict(pag_scale=5.0, clip_skip=2, free_init=False),
    dict(pag_scale=5.0, clip_skip=None, free_init=True),
    dict(pag_scale=5.0, clip_skip=2, free_init=True),
]

for config in configs:
    config = dict(config)  # copy so the shared configs list is not mutated and can be reused below
    free_init = config.pop("free_init", False)
    if free_init:
        pipe.enable_free_init(method="butterworth", use_fast_sampling=True)

    video = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        height=512,
        width=512,
        num_frames=16,
        guidance_scale=guidance_scale,
        num_inference_steps=num_inference_steps,
        generator=torch.Generator().manual_seed(42),
        **config,
    ).frames[0]

    if free_init:
        pipe.disable_free_init()

    # free_init was popped from config above, so use the local variable in the filename
    export_to_gif(video, f"animatediff_pag-{config['pag_scale']}_clipskip-{config['clip_skip']}_freeinit-{free_init}.gif")

motion_adapter = MotionAdapter.from_pretrained(motion_adapter_id)
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler", beta_schedule="linear", steps_offset=1, clip_sample=False)
pipe = AnimateDiffPAGPipeline.from_pretrained(
    model_id,
    motion_adapter=motion_adapter,
    scheduler=scheduler,
    pag_applied_layers=["mid"],
    torch_dtype=torch.float16,
).to("cuda")

for config in configs:
    config = dict(config)  # copy so the shared configs list is not mutated
    free_init = config.pop("free_init", False)
    if free_init:
        pipe.enable_free_init(method="butterworth", use_fast_sampling=True)

    video = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        height=512,
        width=512,
        num_frames=16,
        guidance_scale=guidance_scale,
        num_inference_steps=num_inference_steps,
        generator=torch.Generator().manual_seed(42),
        **config,
    ).frames[0]

    if free_init:
        pipe.disable_free_init()

    # free_init was popped from config above, so use the local variable in the filename
    export_to_gif(video, f"animatediff_pag-mid-{config['pag_scale']}_clipskip-{config['clip_skip']}_freeinit-{free_init}.gif")
[Result GIF grid, captions:]
pag 0, clip_skip None, free_init False | pag 0, clip_skip 2, free_init False
pag 3, clip_skip None, free_init False | pag 3, clip_skip 2, free_init False
pag 3, clip_skip None, free_init True | pag 3, clip_skip 2, free_init True

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@yiyixuxu @asomoza @DN6

@a-r-r-o-w a-r-r-o-w changed the title Experimenting with PAG variant for AnimateDiff PAG variant for AnimateDiff Jul 4, 2024
@asomoza
Member

asomoza commented Jul 4, 2024

Nice! I'll do some tests. It looks good.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@DN6 DN6 left a comment


It's looking good to me. Could we add some tests? Also, I think there are some issues with the "Copied from" statements; you'd just need to run make fix-copies.

@DN6 DN6 requested a review from yiyixuxu July 5, 2024 07:28
Collaborator

@yiyixuxu yiyixuxu left a comment


thanks! super cool!

Comment thread src/diffusers/models/attention_processor.py Outdated
in_channels=out_channels,
num_layers=temporal_transformer_layers_per_block[i],
norm_num_groups=temporal_norm_num_groups,
norm_num_groups=resnet_groups,
Contributor Author


Context for why we need this change: #7707 (comment). It is the correct thing to do here. cc @DN6

return f"attentions_{module_name.split('.')[3]}"
elif "attentions" in module_name.split(".")[1]:
return f"attentions_{module_name.split('.')[2]}"
# down_blocks.1.motion_modules.0.transformer_blocks.0.attn1 -> "motion_modules_0"
Contributor Author


Need to support motion modules self attention layers as well in PAGMixin. cc @yiyixuxu
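The identifier parsing discussed in the snippets above can be sketched as a small standalone helper. This is a simplified, hypothetical version for illustration only, not the actual PAGMixin implementation in diffusers:

```python
def get_layer_identifier(module_name: str) -> str:
    """Derive a short layer identifier from a dotted UNet module name.

    Simplified sketch of the parsing shown in the review snippets;
    the real PAGMixin logic handles more cases.
    """
    parts = module_name.split(".")
    if parts[0] in ("down_blocks", "up_blocks"):
        # down_blocks.1.attentions.0.transformer_blocks.0.attn1 -> "attentions_0"
        # down_blocks.1.motion_modules.0.transformer_blocks.0.attn1 -> "motion_modules_0"
        return f"{parts[2]}_{parts[3]}"
    # mid_block.attentions.0.transformer_blocks.0.attn1 -> "attentions_0"
    return f"{parts[1]}_{parts[2]}"
```

The point of the change being discussed is that `motion_modules` names must produce their own identifiers, so temporal self-attention layers can be targeted separately from spatial ones.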

Collaborator

# down_blocks.1.attentions.0.transformer_blocks.0.attn1 -> "block_1"
# mid_block.attentions.0.transformer_blocks.0.attn1 -> "block_0"
if "attentions" in module_name.split(".")[1]:
module_name_splits = module_name.split(".")
Contributor Author


Did a little bit of a refactor as well to make all functions look similar-ish to get_attn_index. Can open a separate PR if this is out of scope. cc @yiyixuxu

Collaborator

ok to include here!

}
return inputs

def test_from_pipe_consistent_config(self):
Contributor Author


Sorry :(

Collaborator

it is probably because we didn't handle the deprecated unet config in __init__
related here: #7564

because this pipeline shares the same checkpoints with sd1.5, technically you need to handle the deprecation too even though it is a new pipeline

if hasattr(scheduler.config, "steps_offset") and scheduler.config.steps_offset != 1:

however in practice, I think maybe no one will use these really old checkpoints with AnimateDiff, so I'm ok to skip the test here
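The deprecation handling being discussed can be sketched like this. The function name and exact behaviour are assumptions for illustration, not the actual diffusers code:

```python
def normalize_scheduler_config(config: dict) -> dict:
    """Sketch of handling a deprecated scheduler config (illustrative,
    not the diffusers implementation).

    Very old SD 1.x checkpoints shipped schedulers with steps_offset != 1;
    a new pipeline sharing those checkpoints would patch the value and
    warn about the deprecation.
    """
    config = dict(config)  # don't mutate the caller's dict
    if config.get("steps_offset", 1) != 1:
        # old checkpoints used steps_offset == 0; current pipelines expect 1
        config["steps_offset"] = 1
    return config
```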

Comment on lines +54 to +69
def get_dummy_components(self):
cross_attention_dim = 8
block_out_channels = (8, 8)

torch.manual_seed(0)
unet = UNet2DConditionModel(
block_out_channels=block_out_channels,
layers_per_block=1,
sample_size=8,
in_channels=4,
out_channels=4,
down_block_types=("CrossAttnDownBlock2D", "DownBlock2D"),
up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
cross_attention_dim=cross_attention_dim,
norm_num_groups=2,
)
Contributor Author

@DN6 Might be of interest to you for AnimateDiff 👀 Makes the tests much faster!

For test_animatediff.py:

real    2m38,020s
user    17m37,497s
sys     1m59,702s

For PAG AnimateDiff with new dummy component sizes:

real    1m36,838s
user    1m43,263s
sys     0m29,261s

@a-r-r-o-w
Contributor Author

After giving it some more thought and observing the outputs, I feel the perturbed path for the motion model has either a lesser or a negative impact on the generations. PAG works great for the spatial self-attention layers of text-to-image models, as we've seen in many other pipelines; however, I think more experiments are needed when dealing with temporal layers like the attention in the motion model. Will report back with some experiments soon.

@a-r-r-o-w
Contributor Author

Btw if we're okay with merging this for the generation improvements with just the spatial attn PAG processors, we can go ahead with that. Can continue experimenting with the motion models separately and revert the pag_utils changes here.

Collaborator

@yiyixuxu yiyixuxu left a comment


thanks!


@yiyixuxu
Collaborator

Btw if we're okay with merging this for the generation improvements with just the spatial attn PAG processors, we can go ahead with that. Can continue experimenting with the motion models separately and revert the pag_utils changes here.

what do you mean here? do you mean applying PAG only on the motion modules does not generate good results, e.g. pag_applied_layers = ["down.block_1.motion_modules_0"]? if you do something like "mid", it will apply to both temporal and spatial, no? would the results be the same as spatial-only? can we see some results?

@a-r-r-o-w
Contributor Author

what do you mean here? do you mean applying PAG only on the motion modules does not generate good results, e.g. pag_applied_layers = ["down.block_1.motion_modules_0"]? if you do something like "mid", it will apply to both temporal and spatial, no? would the results be the same as spatial-only? can we see some results?

Thanks for the review! So, I added the changes for PAG to work with motion module self-attention (temporal layers) later, in this commit. The results posted in the PR description use PAG in only the spatial layers. After my latest changes, I prefer the outputs of spatial-only PAG over spatial-and-temporal PAG, from the few quick experiments I tried. I will look into it more thoroughly soon and report back. cc @asomoza

@a-r-r-o-w
Contributor Author

a-r-r-o-w commented Jul 16, 2024

I took a quick look at the Comfy implementation earlier today to see what Kosinkadink was using: https://github.com/Kosinkadink/ComfyUI-AnimateDiff-Evolved/blob/4dd592e9fce9ac59edadee40cf4d2069165dc226/animatediff/cfg_extras.py#L59

It is applied to both spatial and temporal attn1 layers, which is what we have at the moment too, so I'm okay to roll with this and prepare for merge. But I think we can eventually investigate the dynamics of PAG with temporal layers a bit further (perhaps a community issue with the advanced label, @yiyixuxu?). Also found unofficial implementations of the latest PAGMixin being used in different ways, which makes this interesting (example: https://github.com/pixeli99/Spatio-Temporal-Shuffle-Guidance; demos in README).

@yiyixuxu
Collaborator

The results posted in the PR description utilize PAG in only the spatial layers

I think in the PR description code, you used pag_applied_layers="mid" - that means it's applying to both spatial and temporal layers, no?

the commit you added here in this commit allows you to apply PAG to ONLY temporal layers, that does not yield good results - do I understand this correctly?

@a-r-r-o-w
Contributor Author

a-r-r-o-w commented Jul 16, 2024

I think in the PR description code, you used pag_applied_layers="mid" - that means it's applying to both spatial and temporal layers, no?

Correct. With the latest commits, using pag_applied_layers="mid" will apply it to both the spatial and temporal mid-block layers.

What I'm trying to point out is that the demos that you see in the description are the ones with only spatial PAG because I had not implemented it for motion models yet then.

the commit you added here in this commit allows you to apply PAG to ONLY temporal layers, that does not yield good results - do I understand this correctly?

Nope. Because of that commit, PAG applies to BOTH spatial and temporal layers. I preferred the output of what was before that commit (that is ONLY spatial). Hope it makes sense now 😅

TL;DR:

  • ONLY spatial PAG -> Good (the results posted in this PR's description)
  • Spatial and temporal PAG -> I prefer the quality of spatial-only a little better, but I've done limited testing so far. This is the current behaviour if you clone this branch. Okay to roll with it because the ComfyUI behaviour is similar.
  • ONLY temporal PAG -> Didn't try this and did not mention it.

@yiyixuxu
Collaborator

you can see that here

for name, module in self.unet.named_modules():

when you pass pag_applied_layers="mid", it applies PAG to all self-attentions whose names start with "mid"; that will include both spatial and temporal, with or without the commit
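That matching behaviour can be illustrated with a toy version. The module names and the containment-based rule below are simplified assumptions, not the actual diffusers logic:

```python
# Illustrative self-attention module names, as an AnimateDiff UNet might report
# them via named_modules()
module_names = [
    "down_blocks.1.attentions.0.transformer_blocks.0.attn1",
    "down_blocks.1.motion_modules.0.transformer_blocks.0.attn1",
    "mid_block.attentions.0.transformer_blocks.0.attn1",
    "mid_block.motion_modules.0.transformer_blocks.0.attn1",
    "up_blocks.0.attentions.1.transformer_blocks.0.attn1",
]

def pag_targets(layer_id: str, names: list[str]) -> list[str]:
    # A module is targeted when the identifier appears in its dotted name
    # and the module is a self-attention layer (attn1). Simplified sketch.
    return [n for n in names if layer_id in n and n.endswith("attn1")]

# "mid" matches both the spatial and the temporal mid-block self-attention,
# which is the behaviour being described above.
mid_targets = pag_targets("mid", module_names)
```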

@yiyixuxu
Collaborator

yiyixuxu commented Jul 16, 2024

ohh i see now, we needed the change applied to get_block_index

@a-r-r-o-w
Contributor Author

a-r-r-o-w commented Jul 16, 2024

reply to #8789 (comment)

maybe that is correct, but when I was debugging, I did not notice any of the motion module self-attn layers being set, and that was one reason for adding my changes. The other reason was more fine-grained control to do things like mid_block_0.motion_modules.... Let me try reverting my changes and verify why they were needed first thing when I'm next awake.

@yiyixuxu
Collaborator

ok, we can merge this PR as it is consistent with comfy

@a-r-r-o-w
Contributor Author

Looking good to merge from my end after we merge #8846 since this change is out of scope of the PR.

@a-r-r-o-w
Contributor Author

Fixed the broken tests here. This looks good to merge after #8846 which is now complete too

@a-r-r-o-w a-r-r-o-w requested a review from yiyixuxu July 28, 2024 11:14
@DN6 DN6 merged commit 05b706c into huggingface:main Aug 1, 2024
@a-r-r-o-w a-r-r-o-w deleted the animatediff/pag branch August 1, 2024 07:19
@yiyixuxu yiyixuxu added the PAG label Sep 4, 2024
sayakpaul pushed a commit that referenced this pull request Dec 23, 2024
* add animatediff pag pipeline

* remove unnecessary print

* make fix-copies

* fix ip-adapter bug

* update docs

* add fast tests and fix bugs

* update

* update

* address review comments

* update ip adapter single test expected slice

* implement test_from_pipe_consistent_config; fix expected slice values

* LoraLoaderMixin->StableDiffusionLoraLoaderMixin; add latest freeinit test
