Wan VACE #11582
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Force-pushed from 4a4b058 to 50b1216
Thank you x 100
In the examples, flow_shift should be 3.0 since the resolution is 832 x 480.

Resolution = 832 x 480, flow_shift = 3.0: 01-output_T2V_0.0_2.mp4
Resolution = 832 x 480, flow_shift = 5.0: 01-output_T2V_0.0_1.mp4
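For reference, a minimal sketch of how flow_shift can be set, assuming a Wan pipeline `pipe` has already been loaded and uses UniPCMultistepScheduler as in the other Wan examples:

```python
from diffusers import UniPCMultistepScheduler

# 3.0 is recommended for 480p-class outputs (e.g. 832x480),
# 5.0 for 720p-class outputs.
flow_shift = 3.0
pipe.scheduler = UniPCMultistepScheduler.from_config(
    pipe.scheduler.config, flow_shift=flow_shift
)
```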
tests/pipelines/wan/test_wan_vace.py
Outdated
generated_video = video[0]
self.assertEqual(generated_video.shape, (17, 3, 16, 16))
expected_video = torch.randn(17, 3, 16, 16)
I think we decided to replace these with expected slices right? If there is a change in numerical output, we wouldn't catch it right?
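For illustration, a sketch of what the expected-slice check could look like in place of the `torch.randn` comparison, reusing `generated_video` from the snippet above (the reference values are placeholders, not outputs from an actual run):

```python
# Placeholder values; in the real test they would come from a verified run.
expected_slice = torch.tensor(
    [0.4531, 0.5078, 0.4980, 0.5117, 0.4883, 0.4766, 0.5273, 0.5039,
     0.4922, 0.5156, 0.4844, 0.5000, 0.5195, 0.4727, 0.5312, 0.4961]
)
generated_slice = generated_video.flatten()
# Taking the first and last 8 values keeps the reference small while still
# being sensitive to numerical changes anywhere in the output.
generated_slice = torch.cat([generated_slice[:8], generated_slice[-8:]])
self.assertTrue(torch.allclose(generated_slice, expected_slice, atol=1e-3))
```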
Hello Aryan, is there any possibility to use multiple control videos together (e.g. OpenPose and depth) in a single call? Here is an interesting use case covered in video:

open_pose_video = load_video("openpose-man-contemporary-dance.mp4")[:num_frames]
depth_video = load_video(......)  # just an example
output = pipe(
It should be possible with some modifications to the pipeline, by accepting multiple control inputs and strengths, similar to how it's done in the controlnets of other models. I can take a look at supporting it next week.

If someone wants to take this up as a contribution in the meantime, note that this implementation is different from other controlnet implementations. Usually, you would load a separate model per control type and do a weighted sum of their embeddings after a forward pass through each controlnet. Here, the controlnet layers are part of the main denoiser itself instead of being separated out as another module. Although I did consider separating out the controlnet layers, it is done this way to keep the implementation close to the original, so that any future updates are easier to add.

To support multiple videos, you would have to:
cc @DN6 in case you have the bandwidth to take a look
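To make the "usual" controlnet-style combination concrete, here is a minimal, purely illustrative sketch of a weighted sum over per-control conditioning tensors (hypothetical helper and names; not the VACE wiring described above, where the control layers live inside the denoiser):

```python
import torch

def combine_control_states(control_states: list[torch.Tensor],
                           strengths: list[float]) -> torch.Tensor:
    # Weighted sum of the conditioning tensors produced for each control
    # input (e.g. one for OpenPose, one for depth).
    assert len(control_states) == len(strengths) and len(control_states) > 0
    combined = torch.zeros_like(control_states[0])
    for states, strength in zip(control_states, strengths):
        combined = combined + strength * states
    return combined

# Illustrative usage (hypothetical names, not the actual pipeline API):
# pose_states = encode_control(open_pose_video)
# depth_states = encode_control(depth_video)
# control_states = combine_control_states([pose_states, depth_states], [1.0, 0.8])
```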
Thank you @a-r-r-o-w
Checkpoints (temporary, until the official weights are hosted). Results per mode are listed below, with a rough usage sketch after the list:
- T2V: output2.mp4, output.mp4
- I2V: output.mp4
- V2LF: output.mp4
- FLF2V: output.mp4
- Random-to-V: output.mp4 (ideally, you should use similar-looking images for consistent video generation; the example here uses completely random images, just for testing)
- Inpaint: peter-dance.mp4, output2.mp4 (ideally, use a mask prepared with segmentation models for best editing)
- Outpaint: output.mp4
- OpenPose: openpose-man-contemporary-dance.mp4, output.mp4
- Inpaint with reference image: output.mp4
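As a reference, a rough T2V usage sketch under the assumption that the pipeline lands as WanVACEPipeline (the checkpoint id is a placeholder and the argument values are illustrative; the final API may differ):

```python
import torch
from diffusers import AutoencoderKLWan, UniPCMultistepScheduler, WanVACEPipeline
from diffusers.utils import export_to_video

# Placeholder: substitute one of the temporary checkpoints above, or the
# official weights once they are hosted.
model_id = "<path-or-repo-id-of-vace-checkpoint>"

vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanVACEPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
# flow_shift=3.0 for 480p-class resolutions, 5.0 for 720p-class.
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=3.0)
pipe.to("cuda")

output = pipe(
    prompt="A cat walks on the grass, realistic",
    negative_prompt="blurry, low quality, static, worst quality",
    height=480,
    width=832,
    num_frames=81,
    guidance_scale=5.0,
    num_inference_steps=30,
).frames[0]
export_to_video(output, "output.mp4", fps=16)
```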