[WIP] Wan2.2 #12004

Merged
13 commits merged into main from wan2.2 on Jul 28, 2025
Conversation

yiyixuxu
Collaborator

@yiyixuxu yiyixuxu commented Jul 28, 2025

Install from this PR:

pip install git+https://github.com/huggingface/diffusers.git@wan2.2

TI2V (only text-to-video is supported for now; adding I2V soon)

import torch
from diffusers import WanPipeline, AutoencoderKLWan
from diffusers.utils import export_to_video

dtype = torch.bfloat16
device = "cuda"

model_id = "Wan-AI/Wan2.2-TI2V-5B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=dtype)
pipe.to(device)

height = 704
width = 1280
num_frames = 121
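# 121 frames at the 24 fps used by export_to_video below is roughly a 5-second clip.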
num_inference_steps = 50
guidance_scale = 5.0


prompt = "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
negative_prompt = "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"

output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=num_frames,
    guidance_scale=guidance_scale,
    num_inference_steps=num_inference_steps,
).frames[0]
export_to_video(output, "5bit2v_output.mp4", fps=24)

14B T2V

import torch
from diffusers import WanPipeline, AutoencoderKLWan
from diffusers.utils import export_to_video

dtype = torch.bfloat16
device = "cuda:2"
vae = AutoencoderKLWan.from_pretrained("Wan-AI/Wan2.2-T2V-A14B-Diffusers", subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained("Wan-AI/Wan2.2-T2V-A14B-Diffusers", vae=vae, torch_dtype=dtype)
pipe.to(device)

height = 720
width = 1280


prompt = "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
negative_prompt = "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"
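# Wan2.2 A14B runs two denoising experts (a high-noise and a low-noise stage);
# guidance_scale and guidance_scale_2 set the CFG scale for the first and second
# stage respectively.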
output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=81,
    guidance_scale=4.0,
    guidance_scale_2=3.0,
    num_inference_steps=40,
).frames[0]
export_to_video(output, "t2v_out.mp4", fps=16)

14B I2V

import torch
import numpy as np
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

model_id = "Wan-AI/Wan2.2-I2V-A14B-Diffusers"
dtype = torch.bfloat16
device = "cuda"

pipe = WanImageToVideoPipeline.from_pretrained(model_id, torch_dtype=dtype)
pipe.to(device)


image = load_image(
    "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/wan_i2v_input.JPG"
)
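# Pick an output size near the 480x832 pixel budget that keeps the input's aspect ratio
# and is a multiple of the VAE spatial downsample factor times the transformer patch size.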
max_area = 480 * 832
aspect_ratio = image.height / image.width
mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
image = image.resize((width, height))
prompt = "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."

negative_prompt = "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"
generator = torch.Generator(device=device).manual_seed(0)
output = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=81,
    guidance_scale=3.5,
    num_inference_steps=40,
    generator=generator,
).frames[0]
export_to_video(output, "i2v_output.mp4", fps=16)
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@@ -656,6 +912,45 @@ def forward(self, x, feat_cache=None, feat_idx=[0]):
return x


def patchify(x, patch_size):
# YiYi TODO: refactor this
from einops import rearrange

This comment was marked as resolved.


Hi, I think it might work for newer versions of torch: https://github.com/arogozhnikov/einops/wiki/Using-torch.compile-with-einops
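For example, a minimal check that einops traces under torch.compile (the rearrange pattern here is illustrative, not the actual patchify in this PR; assumes a recent torch/einops as described on that wiki page):

import torch
from einops import rearrange

def patchify(x, patch_size=2):
    # Illustrative stand-in for a patchify helper: fold spatial patches into tokens.
    return rearrange(x, "b c f (h p) (w q) -> b (f h w) (c p q)", p=patch_size, q=patch_size)

compiled = torch.compile(patchify)
print(compiled(torch.randn(1, 3, 2, 8, 8)).shape)  # torch.Size([1, 32, 12])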

Contributor

thanks for the insight!

@okaris
Contributor

okaris commented Jul 28, 2025

@yiyixuxu thanks for releasing this so quickly! We are having some issues trying to get 5B I2V to work. As far as I understand, the 5B model covers both T2V and I2V. I tried a naive hack of copying the model_index.json from the 14B I2V, but it didn't quite help.

@yiyixuxu
Collaborator Author

@okaris 5B I2V is not supported yet; I'll look into adding it today.

@okaris
Contributor

okaris commented Jul 28, 2025

@yiyixuxu Thanks for the quick reply. Happy to contribute if you can point me in the right direction.

@yiyixuxu yiyixuxu requested a review from a-r-r-o-w July 28, 2025 18:09
Co-authored-by: bagheera <59658056+bghira@users.noreply.github.com>
Member

@a-r-r-o-w a-r-r-o-w left a comment


Thanks YiYi! Just nits. Will add docs in a follow-up as discussed. I think we should remove the changes to the test files here (the Wan2.2 dual-transformer setup should be tested separately rather than being combined with the Wan2.1 tests, so that both are fully covered).

@@ -34,6 +34,103 @@
CACHE_T = 2


class AvgDown3D(nn.Module):
Member


Maybe prefix these classes with Wan to follow the same naming convention.

@@ -713,21 +1038,47 @@ def __init__(
2.8251,
1.9160,
],
is_residual: bool = False,
Member


LGTM for now, but ideally we should make a separate AutoencoderKLWan2_2, since the structure and internal blocks are different, and standardize on single-file implementations per model type, similar to transformers. All the if-branching makes things a little harder to reverse engineer and raises the barrier to entry for someone wanting to read the implementation for study purposes, IMO.

shift_msa, scale_msa, gate_msa, c_shift_msa, c_scale_msa, c_gate_msa = (
self.scale_shift_table + temb.float()
).chunk(6, dim=1)
if temb.ndim == 4:
Member


Same comment as for the VAE: ideally this should live in a separate transformer implementation, transformer_wan_2_2.py, if we want to adopt the single-file convention properly.

Collaborator Author


sounds good

I think the VAE can have its own class; feel free to refactor it if you prefer!
The transformer change is really minimal, and we could refactor further so there is only a single code path, i.e. we just need to always expand the timestep inputs to be 2D. (I did not have time to test that, so I kept the if/else here.)
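For illustration, a rough, untested sketch of that single code path (the function name and shapes here are illustrative, not the actual implementation):

import torch

def expand_timesteps(timestep: torch.Tensor, num_frames: int) -> torch.Tensor:
    # If the caller passes a 1D per-batch timestep (Wan2.1 style), expand it to the
    # 2D per-frame layout used by Wan2.2, so downstream modulation code only needs
    # the expanded branch instead of an if/else on temb.ndim.
    if timestep.ndim == 1:  # (batch,)
        timestep = timestep[:, None].expand(-1, num_frames)  # (batch, num_frames)
    return timestep

# e.g. expand_timesteps(torch.tensor([999]), num_frames=21).shape -> torch.Size([1, 21])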

@yiyixuxu yiyixuxu merged commit a6d9f6a into main Jul 28, 2025
14 of 15 checks passed
@yiyixuxu yiyixuxu deleted the wan2.2 branch July 29, 2025 01:49
@jingw193

Hello, @yiyixuxu, I generated a video (https://github.com/user-attachments/assets/ce6ebaf1-8478-4c29-9170-57d5ae854a7d) using the code below and noticed a slight grainy texture. Is this expected behavior, and does it match the results you observed during your testing?

import torch
import numpy as np
from diffusers import WanPipeline, AutoencoderKLWan, WanTransformer3DModel, UniPCMultistepScheduler
from diffusers.utils import export_to_video, load_image

dtype = torch.bfloat16
device = "cuda"

model_id = "Wan-AI/Wan2.2-TI2V-5B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=dtype)
pipe.to(device)

height = 704
width = 1280
num_frames = 121
num_inference_steps = 50
guidance_scale = 5.0

prompt = "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
negative_prompt = "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"

output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=num_frames,
    guidance_scale=guidance_scale,
    num_inference_steps=num_inference_steps,
).frames[0]
export_to_video(output, "5bit2v_output.mp4", fps=24)

@agneet42

agneet42 commented Aug 2, 2025

Hi,
Thanks for the great work as always.
I wanted to understand whether Wan 2.2 14B in Diffusers currently supports multi-GPU inference. I had a quick stab at the code here: https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B-Diffusers

It does not seem to work. The checklist mentions multi-GPU support, but I'm not sure whether that applies to the Diffusers version.

@yiyixuxu yiyixuxu mentioned this pull request Aug 3, 2025