Infinite (not literally) length video creation using LTX-Video? #11590

Closed
@nitinmukesh

Description

First of all, thanks to Aryan (0.9.7 integration) and DN6 (adding GGUF). The model is quite good and the output is promising.

I need help creating a continuous video by chaining on the last frame. One trick is to generate a video, extract its last frame, and run inference again conditioned on that frame. Is there an easy way to do this in a loop?

My thinking is:

  1. Use the text encoder to generate the prompt embeddings once, then remove the text encoders from memory.
  2. Run the inference code in a loop. After each pass, extract the last frame, preferably as a latent (so I can upscale it with LTXLatentUpsamplePipeline) or otherwise as an image, set image1 to it, condition the next pass on that frame, and repeat for n iterations.
  3. Save the video locally after each inference; otherwise I will run out of memory (OOM).

Any thoughts / suggestions?
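
A minimal sketch of steps 2 and 3 above, not a tested implementation: the loop chains segments by feeding the last frame of one segment in as the conditioning image of the next, and writes each segment to disk as soon as it is produced. `chain_segments`, `generate_segment`, and `save_segment` are hypothetical names introduced here for illustration. The two callables stand in for the actual pipeline call and video writer, so the loop itself does not depend on diffusers.

```python
from pathlib import Path
from typing import Any, Callable, List


def chain_segments(
    generate_segment: Callable[[Any, int], List[Any]],
    save_segment: Callable[[List[Any], Path], None],
    first_image: Any,
    n_segments: int,
    out_dir: str = "segments",
) -> List[Path]:
    """Generate n_segments clips, each conditioned on the previous clip's last frame."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    paths: List[Path] = []
    cond_image = first_image
    for i in range(n_segments):
        # One inference pass, conditioned on the current image.
        frames = generate_segment(cond_image, i)

        # Write the clip out immediately so frames never accumulate in memory.
        path = out / f"segment_{i:03d}.mp4"
        save_segment(frames, path)
        paths.append(path)

        # The last frame becomes the condition for the next pass.
        cond_image = frames[-1]
        del frames
    return paths


# Possible wiring with the objects from this issue (an assumption, untested;
# it decodes to PIL frames instead of returning latents):
#
#   def generate_segment(cond_image, i):
#       cond = LTXVideoCondition(image=cond_image, frame_index=0)
#       return pipe(prompt=prompt, conditions=[cond], width=width, height=height,
#                   num_frames=num_frames, output_type="pil").frames[0]
#
#   def save_segment(frames, path):
#       export_to_video(frames, str(path), fps=24)
```

Since each later segment starts from the previous segment's last frame, its first frame will roughly repeat that frame. Whether to drop it when joining the clips is left open here.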

import torch
import gc
from diffusers import GGUFQuantizationConfig
from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline, LTXVideoTransformer3DModel
from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
from diffusers.utils import export_to_video, load_video, load_image

transformer_path = "https://huggingface.co/wsbagnsv1/ltxv-13b-0.9.7-distilled-GGUF/blob/main/ltxv-13b-0.9.7-distilled-Q3_K_S.gguf"
# transformer_path = "https://huggingface.co/wsbagnsv1/ltxv-13b-0.9.7-distilled-GGUF/blob/main/ltxv-13b-0.9.7-distilled-Q8_0.gguf"
transformer_gguf = LTXVideoTransformer3DModel.from_single_file(
    transformer_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

pipe = LTXConditionPipeline.from_pretrained(
    "Lightricks/LTX-Video-0.9.7-distilled", 
    transformer=transformer_gguf,
    torch_dtype=torch.bfloat16
)
# pipe.to("cuda")
# pipe.enable_sequential_cpu_offload()
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

# height, width = 480, 832
# num_frames = 151
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

prompt = "hyperrealistic digital artwork of a young woman walking confidently down a garden pathway, wearing white button-up blouse with puffed sleeves and blue denim miniskirt, long flowing light brown hair caught in gentle breeze, carrying a small black handbag, bright sunny day with blue sky and fluffy white clouds, lush green hedges and ornamental plants lining the stone pathway, traditional Asian-inspired architecture in background, photorealistic style with perfect lighting, unreal engine 5, ray tracing, 16K UHD. camera follows subject from front as she walks forward with elegant confidence"
image1 = load_image("assets/ltx/00039.png")
condition1 = LTXVideoCondition(
    image=image1,
    frame_index=0,
)
width = 512
height = 768
num_frames = 161

# TODO: run the call below in a loop, once per video segment
latents = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    conditions=[condition1],
    width=width,
    height=height,
    num_frames=num_frames,
    guidance_scale=1.0,
    num_inference_steps=4,
    decode_timestep=0.05,
    decode_noise_scale=0.025,
    image_cond_noise_scale=0.0,
    guidance_rescale=0.7,
    generator=torch.Generator().manual_seed(42),
    output_type="latent",
).frames
# TODO: save this segment's video to disk
# TODO: set image1 to the last frame (latent or image) of this segment and rebuild condition1 for the next inference
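
For step 1, one possible sketch is to call the pipeline's `encode_prompt` once and pass the cached tensors to every later call in place of `prompt` and `negative_prompt`. It assumes `encode_prompt` returns the four values in the order unpacked below and that `pipe(...)` accepts them under the same keyword names. As far as I can tell that matches the LTX pipelines in diffusers, but it is worth checking against the installed version. `encode_prompt_once` is a hypothetical helper name.

```python
from typing import Any, Dict


def encode_prompt_once(pipe: Any, prompt: str, negative_prompt: str = "") -> Dict[str, Any]:
    """Encode the prompts a single time and return kwargs for later pipe(...) calls."""
    # Assumed return order: embeds, mask, negative embeds, negative mask.
    (
        prompt_embeds,
        prompt_attention_mask,
        negative_prompt_embeds,
        negative_prompt_attention_mask,
    ) = pipe.encode_prompt(
        prompt=prompt,
        negative_prompt=negative_prompt,
        do_classifier_free_guidance=True,
    )
    return {
        "prompt_embeds": prompt_embeds,
        "prompt_attention_mask": prompt_attention_mask,
        "negative_prompt_embeds": negative_prompt_embeds,
        "negative_prompt_attention_mask": negative_prompt_attention_mask,
    }


# Intended use (needs torch and the loaded pipeline from this issue):
#
#   with torch.no_grad():
#       prompt_kwargs = encode_prompt_once(pipe, prompt, negative_prompt)
#
#   # Inside the loop, pass prompt_kwargs instead of prompt / negative_prompt,
#   # keeping the other arguments as in the call above:
#   latents = pipe(conditions=[condition1], width=width, height=height,
#                  num_frames=num_frames, output_type="latent",
#                  **prompt_kwargs).frames
```

Releasing the text encoder afterwards is not shown. With `enable_model_cpu_offload()` the offload hooks may still hold a reference to it, so how to free it safely depends on the setup.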
