Add SkyReels V2: Infinite-Length Film Generative Model #11518

tolgacangoz · 2025-05-07T18:58:53Z

Thanks for the opportunity to fix #11374!

Original Work

Original repo: https://github.com/SkyworkAI/SkyReels-V2
Paper: https://huggingface.co/papers/2504.13074

SkyReels V2's main contributions are summarized as follows:
• Comprehensive video captioner that understands the shot language while capturing the general description of the video, which dramatically improves prompt adherence.
• Motion-specific preference optimization enhances motion dynamics with a semi-automatic data collection pipeline.
• Effective Diffusion-forcing adaptation enables the generation of ultra-long videos and story generation capabilities, providing a robust framework for extending temporal coherence and narrative depth.
• SkyCaptioner-V1 and SkyReels-V2 series models including diffusion-forcing, text2video, image2video, camera director and elements2video models with various sizes (1.3B, 5B, 14B) are open-sourced.

TODOs:
✅ FlowMatchUniPCMultistepScheduler: just copy-pasted from the original repo
✅ SkyReelsV2Transformer3DModel: 90% WanTransformer3DModel
✅ SkyReelsV2DiffusionForcingPipeline
✅ SkyReelsV2DiffusionForcingImageToVideoPipeline: Includes FLF2V.
✅ SkyReelsV2DiffusionForcingVideoToVideoPipeline: Extends a given video.
✅ SkyReelsV2Pipeline
✅ SkyReelsV2ImageToVideoPipeline
✅ scripts/convert_skyreelsv2_to_diffusers.py

tolgacangoz/SkyReels-V2-Diffusers

⏳ Did you make sure to update the documentation with your changes? Did you write any new necessary tests?: We will construct these during review.

T2V with Diffusion Forcing (OLD)

Skywork/SkyReels-V2-DF-1.3B-540P
seed 0 and num_frames 97
Original repo	`diffusers` integration
original_0_short.mp4	diffusers_0_short.mp4

seed 37 and num_frames 97
Original repo	`diffusers` integration
original_37_short.mp4	diffusers_37_short.mp4

seed 0 and num_frames 257
Original repo	`diffusers` integration
original_0_long.mp4	diffusers_0_long.mp4

seed 37 and num_frames 257
Original repo	`diffusers` integration
original_37_long.mp4	diffusers_37_long.mp4

!pip install git+https://github.com/tolgacangoz/diffusers.git@skyreels-v2 ftfy -q
import torch
from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingPipeline
from diffusers.utils import export_to_video

vae = AutoencoderKLWan.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    subfolder="vae",
    torch_dtype=torch.float32,
)
pipe = SkyReelsV2DiffusionForcingPipeline.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")
pipe.transformer.set_ar_attention(causal_block_size=5)

prompt = "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window."

output = pipe(
    prompt=prompt,
    num_inference_steps=30,
    height=544,
    width=960,
    num_frames=97,
    ar_step=5,  # Controls asynchronous inference (0 for synchronous mode)
    generator=torch.Generator(device="cpu").manual_seed(0),
    overlap_history=None,  # Number of frames to overlap for smooth transitions in long videos; 17 for long
    addnoise_condition=20,  # Improves consistency in long video generation
).frames[0]
export_to_video(output, "T2V.mp4", fps=24, quality=8)

"""
You can set `ar_step=5` to enable asynchronous inference. For asynchronous inference,
`causal_block_size=5` is recommended; it should not be set for synchronous generation.
Asynchronous inference takes more steps to diffuse the whole sequence, so it is
SLOWER than synchronous mode. In our experiments, asynchronous inference may improve
instruction following and visual consistency.
"""

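The note above says asynchronous inference takes more steps than synchronous mode. A minimal sketch of why, under a simplifying assumption that is mine and not taken from the pipeline code: latent frames are grouped into blocks of `causal_block_size`, and each block starts denoising `ar_step` iterations after the previous one. The helper name `total_denoising_iterations` is hypothetical, and the real schedule in the pipeline may differ in detail.

```python
import math


def total_denoising_iterations(num_latent_frames, num_inference_steps, ar_step, causal_block_size):
    """Toy model of a staggered denoising schedule (an assumption, not the pipeline's code)."""
    if ar_step == 0:
        # Synchronous mode: every latent frame shares the same timestep.
        return num_inference_steps
    num_blocks = math.ceil(num_latent_frames / causal_block_size)
    # Assumed: block k starts k * ar_step iterations after block 0,
    # so the last block finishes (num_blocks - 1) * ar_step iterations later.
    return num_inference_steps + (num_blocks - 1) * ar_step


sync_iters = total_denoising_iterations(25, 30, ar_step=0, causal_block_size=1)
async_iters = total_denoising_iterations(25, 30, ar_step=5, causal_block_size=5)
print(sync_iters, async_iters)
```

With 25 latent frames and 30 steps, this toy model gives 30 iterations in synchronous mode and 50 with `ar_step=5` and `causal_block_size=5`, which is consistent with asynchronous inference being slower.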
I2V with Diffusion Forcing (OLD)

`prompt`="A penguin dances."	`diffusers` integration
	i2v-short.mp4

#!pip uninstall diffusers -yq
#!pip install git+https://github.com/tolgacangoz/diffusers.git@skyreels-v2 ftfy -q
import torch
from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

vae = AutoencoderKLWan.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    subfolder="vae",
    torch_dtype=torch.float32,
)
pipe = SkyReelsV2DiffusionForcingImageToVideoPipeline.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")
#pipe.transformer.set_ar_attention(causal_block_size=5)

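# Placeholder: replace the argument with a local path or URL to the penguin image from https://huggingface.co/tasks/image-to-video
image = load_image("Penguin from https://huggingface.co/tasks/image-to-video")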
prompt = "A penguin dances."

output = pipe(
    image=image,
    prompt=prompt,
    num_inference_steps=50,
    height=544,
    width=960,
    num_frames=97,
    #ar_step=5,  # Controls asynchronous inference (0 for synchronous mode)
    generator=torch.Generator(device="cpu").manual_seed(0),
    overlap_history=None,  # Number of frames to overlap for smooth transitions in long videos; 17 for long
    addnoise_condition=20,  # Improves consistency in long video generation
).frames[0]
export_to_video(output, "I2V.mp4", fps=24, quality=8)

"""
When I set `ar_step=5` and `causal_block_size=5`, the results look really bad.
"""

FLF2V with Diffusion Forcing (OLD)

Now, Houston, we have a problem.
I have been unable to produce good results with this task. I tried many hyperparameter combinations with the original code.
The first frame's latent (torch.Size([1, 16, 1, 68, 120])) overwrites the first of the 25 frame latents in latents (torch.Size([1, 16, 25, 68, 120])). The last frame's latent is then concatenated, so latents becomes torch.Size([1, 16, 26, 68, 120]). After the denoising process, the appended last-frame latent is discarded and the remaining latents are decoded by the VAE. I also tried overwriting the final frame of latents with the last frame's latent instead of concatenating it, without discarding anything at the end, but still got bad results. Here are some results:

First Frame	Last Frame

0.mp4	1.mp4
2.mp4	3.mp4
4.mp4	5.mp4
6.mp4	7.mp4
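The latent shapes quoted above can be reproduced with simple arithmetic. This sketch assumes a temporal stride of 4 and a spatial stride of 8 for the VAE, inferred from the quoted shapes (97 frames to 25 latents, 544x960 to 68x120). `latent_shape` and `with_frame_count` are hypothetical helpers, not `diffusers` functions, and they track shapes only, not tensors.

```python
def latent_shape(num_frames, height, width, batch=1, channels=16,
                 temporal_stride=4, spatial_stride=8):
    """Latent shape for a video; the strides are assumptions inferred from the shapes in the text."""
    latent_frames = (num_frames - 1) // temporal_stride + 1
    return (batch, channels, latent_frames, height // spatial_stride, width // spatial_stride)


def with_frame_count(shape, latent_frames):
    """Return `shape` with its temporal axis (index 2) set to `latent_frames`."""
    return shape[:2] + (latent_frames,) + shape[3:]


latents = latent_shape(num_frames=97, height=544, width=960)
single_frame = latent_shape(num_frames=1, height=544, width=960)

# Overwriting the first latent frame keeps the shape; concatenating the
# last-frame latent adds one entry along the temporal axis.
conditioned = with_frame_count(latents, latents[2] + single_frame[2])

# After denoising, the appended last-frame latent is dropped before VAE decoding.
decoded_input = with_frame_count(conditioned, conditioned[2] - 1)

print(latents)        # (1, 16, 25, 68, 120)
print(conditioned)    # (1, 16, 26, 68, 120)
print(decoded_input)  # (1, 16, 25, 68, 120)
```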

#!pip uninstall diffusers -yq
#!pip install git+https://github.com/tolgacangoz/diffusers.git@skyreels-v2 ftfy -q
import torch
from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

vae = AutoencoderKLWan.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    subfolder="vae",
    torch_dtype=torch.float32,
)
pipe = SkyReelsV2DiffusionForcingImageToVideoPipeline.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")
#pipe.transformer.set_ar_attention(causal_block_size=5)

prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
first_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png")
last_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png")

output = pipe(
    image=first_frame,
    last_image=last_frame,
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=50,
    height=544,
    width=960,
    num_frames=97,
    #ar_step=5,  # Controls asynchronous inference (0 for synchronous mode)
    generator=torch.Generator(device="cpu").manual_seed(0),
    overlap_history=None,  # Number of frames to overlap for smooth transitions in long videos; 17 for long
    addnoise_condition=20,  # Improves consistency in long video generation
).frames[0]
export_to_video(output, "FLF2V.mp4", fps=24, quality=8)

V2V with Diffusion Forcing (OLD)

This pipeline extends a given video.

Input Video	`diffusers` integration
video1.mp4	v2v.mp4

#!pip uninstall diffusers -yq
#!pip install git+https://github.com/tolgacangoz/diffusers.git@skyreels-v2 ftfy -q
import torch
from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingVideoToVideoPipeline
from diffusers.utils import export_to_video, load_video

vae = AutoencoderKLWan.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    subfolder="vae",
    torch_dtype=torch.float32,
)
pipe = SkyReelsV2DiffusionForcingVideoToVideoPipeline.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")
#pipe.transformer.set_ar_attention(causal_block_size=5)

prompt = "CG animation style, a small blue bird flaps its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its continuing flight and the vastness of the sky from a close-up, low-angle perspective."
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
video = load_video("Input video.mp4")

output = pipe(
    video=video,
    prompt=prompt,
    num_inference_steps=50,
    height=544,
    width=960,
    num_frames=120,
    base_num_frames=97,
    ar_step=0,  # Controls asynchronous inference (0 for synchronous mode)
    generator=torch.Generator(device="cpu").manual_seed(0),
    overlap_history=17,  # Number of frames to overlap for smooth transitions in long videos
    addnoise_condition=20,  # Improves consistency in long video generation
).frames[0]
export_to_video(output, "V2V.mp4", fps=24, quality=8)
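The `base_num_frames` and `overlap_history` arguments suggest a sliding-window scheme for long videos. A minimal sketch of such a plan, assuming each window covers at most `base_num_frames` frames and reuses the last `overlap_history` frames of the previous window as context. `plan_windows` is a hypothetical helper, and the pipeline's actual iteration logic may differ.

```python
def plan_windows(num_frames, base_num_frames, overlap_history):
    """Return (start, end) frame ranges for a simple overlapping-window plan (illustrative only)."""
    if overlap_history >= base_num_frames:
        raise ValueError("overlap_history must be smaller than base_num_frames")
    if num_frames <= base_num_frames:
        # Short video: a single window is enough.
        return [(0, num_frames)]
    windows = [(0, base_num_frames)]
    end = base_num_frames
    while end < num_frames:
        # Each new window starts overlap_history frames before the previous end.
        start = end - overlap_history
        end = min(start + base_num_frames, num_frames)
        windows.append((start, end))
    return windows


print(plan_windows(num_frames=257, base_num_frames=97, overlap_history=17))
print(plan_windows(num_frames=120, base_num_frames=97, overlap_history=17))
```

With `num_frames=257`, `base_num_frames=97`, and `overlap_history=17`, this plan yields three windows, and each window after the first contributes 80 new frames.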

Firstly, I want to congratulate you on this great work, and thanks for open-sourcing it, SkyReels Team! This PR proposes an integration of your model.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@yiyixuxu @a-r-r-o-w @linoytsaban @yjp999 @Howe2018 @RoseRollZhu @pftq @Langdx @guibinchen @qiudi0127 @nitinmukesh @tin2tin @ukaprch @okaris

ukaprch · 2025-05-08T15:47:38Z

It's about time. Thanks.

Replaces custom attention implementations with `SkyReelsV2AttnProcessor2_0` and the standard `Attention` module. Updates `WanAttentionBlock` to use `FP32LayerNorm` and `FeedForward`. Removes the `model_type` parameter, simplifying model architecture and attention block initialization.

Introduces new classes `SkyReelsV2ImageEmbedding` and `SkyReelsV2TimeTextImageEmbedding` for enhanced image and time-text processing. Refactors the `SkyReelsV2Transformer3DModel` to integrate these embeddings, updating the constructor parameters for better clarity and functionality. Removes unused classes and methods to streamline the codebase.

DN6 · 2025-06-09T03:29:22Z

Thank you @tolgacangoz. @a-r-r-o-w, could you take a look, please?

tolgacangoz · 2025-06-10T07:52:08Z

Hi @nitinmukesh @tin2tin. Feel free to test and review this PR, as you have done for other PRs, if you want.

nitinmukesh · 2025-06-10T08:15:50Z

Thank you @tolgacangoz for making the feature available in diffusers.

I will test it now.

tolgacangoz and others added 6 commits May 7, 2025 21:55

Merge branch 'main' into skyreels-v2

899f41c

up

607b5ba

second draft

3ccf201

Merge branch 'main' into skyreels-v2

959ca1f

up

37ca14f

Merge branch 'main' into skyreels-v2

d80b505

tolgacangoz added 23 commits May 8, 2025 20:01

3rd draft

95d0621

4th draft

6f8a945

upup

e781084

style

4806660

up

0986e81

up

6a300f5

fix fn name

45e1680

update import structure for SkyReelsV2

c8a0c14

add SkyreelsV2 pipeline classes with backend requirements

47306b6

up

c5b8da9

up

5835eaa

add draft transformer_skyreels_v2.py with a custom WanModel and attention mechanisms

9d2880e

up

2c0586e

split i2v and t2v pipes for diffusion forcing

52590ea

up

f318efa

Refactors the SkyReelsV2Transformer3DModel by removing unused methods and begin reorganizing the forward pass.

9688a82

Refactors SkyReelsV2TransformerBlock to integrate its forward() method

825c2c1

Refactors SkyReelsV2AttnProcessor2_0 to enhance the forward() method, integrating rotary embeddings and improving attention handling. Removes the deprecated `rope_apply` function and streamlines the attention mechanism for better integration and clarity.

d848500

Refactors SkyReelsV2Transformer3DModel to enhance the forward() method by updating parameter names for clarity, integrating attention masks, and improving the handling of encoder hidden states.

2f5a4e2

Refactors SkyReelsV2Transformer3DModel to improve the forward() method by enhancing the handling of time embeddings and encoder hidden states. Updates parameter names for clarity and integrates rotary embeddings, ensuring better compatibility with the model's architecture.

e5870dd

Refactors SkyReelsV2Transformer3DModel forward pass

d54e3e1

tolgacangoz added 22 commits June 7, 2025 14:38

convert i2v 1.3b

7759617

Update transformer configuration to include image_dim for SkyReels V2 models and refactor imports to use `SkyReelsV2Transformer3DModel`.

943cd3e

Refactor transformer import in SkyReels V2 pipeline to use `SkyReelsV2Transformer3DModel` for consistency.

993d19d

Update transformer configuration in SkyReels V2 to increase `in_channels` from 16 to 36 for i2v conf.

7387e52

Update transformer configuration in SkyReels V2 to set `added_kv_proj_dim` values for different model types.

96af7eb

up

a6a7337

up

72ad13c

up

d069905

Add SkyReelsV2Pipeline support for T2V model type in conversion script

8142720

upp

326b6ed

Refactor model type checks in conversion script to use substring matching for improved flexibility

a462222

upp

a8c057f

Fix shard path formatting in conversion script to accommodate varying model types by dynamically adjusting zero padding.

6bdfbcf

Update sharded safetensors loading logic in conversion script to use substring matching for model directory checks

db74f87

Update scheduler parameters in SkyReels V2 test files for consistency across image and video pipelines

cc698b6

Refactor conversion script to initialize text encoder, tokenizer, and scheduler for SkyReels pipelines, enhancing model integration

9a269a2

style

9fd9dba

Update documentation for SkyReels-V2, introducing the Infinite-length Film Generative model, enhancing text-to-video generation examples, and updating model references throughout the API documentation.

bc9eb42

Add SkyReelsV2Transformer3DModel and FlowMatchUniPCMultistepScheduler documentation, updating TOC and introducing new model and scheduler files.

de446ad

style

f2f6613

Update documentation for SkyReelsV2DiffusionForcingPipeline to correct flow matching scheduler parameter for I2V from 3.0 to 5.0, ensuring clarity in usage examples.

b707a6c

Add documentation for causal_block_size parameter in SkyReelsV2DF pipelines, clarifying its role in asynchronous inference.

dc73267

tolgacangoz marked this pull request as ready for review June 8, 2025 18:01

tolgacangoz added 3 commits June 9, 2025 10:24

Simplify min_ar_step calculation in SkyReelsV2DiffusionForcingPipeline to improve clarity.

c2aab89

style and fix-copies

7ce7a96

style

32a6520

Merge branch 'main' into skyreels-v2

ca1a5f4
