
Conversation

@KohakuBlueleaf
Contributor

In this PR I propose a modification so that the trainer and sampler preparation mechanism supports forced full offloading.

VRAM usage during training is hard to estimate, and offloading needs some care: because PyTorch autograd caches all intermediate states for the backward pass, temporary GPU tensors cannot be released correctly.
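As a minimal illustration (not part of this PR; it assumes a CUDA device is available), the snippet below shows how autograd's activation caching keeps GPU memory alive during a grad-enabled forward pass until `backward()` runs:

```python
import torch
import torch.nn as nn

# Build a small stack of layers so there are intermediate activations between them.
model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)]).cuda()
x = torch.randn(64, 4096, device="cuda")

with torch.no_grad():
    y = model(x)                      # inference: intermediates can be freed layer by layer
torch.cuda.synchronize()
print("no_grad:", torch.cuda.memory_allocated() // 2**20, "MiB")

y = model(x)                          # training: every intermediate is cached for backward
loss = y.sum()
print("grad:   ", torch.cuda.memory_allocated() // 2**20, "MiB")
loss.backward()                       # the cached intermediates are only released here
```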

This implementation reuses ComfyUI's native offloading, but with a bypass-forward mode: the LoRA weights and optimizer state always stay on the GPU, while the base model is offloaded correctly.
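As a rough, hypothetical sketch of the idea only (the class and parameter names below are made up and this is not ComfyUI's actual offloading code), keeping the trainable LoRA matrices resident on the GPU while the frozen base weight lives on the CPU and is only copied over for the matmul could look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffloadedLoraLinear(nn.Module):
    """Hypothetical sketch: frozen base weight offloaded to CPU, LoRA kept on GPU."""

    def __init__(self, base: nn.Linear, rank: int = 16):
        super().__init__()
        self.base = base.cpu()                      # frozen base layer stays on CPU
        self.base.weight.requires_grad_(False)
        self.lora_down = nn.Linear(base.in_features, rank, bias=False, device="cuda")
        self.lora_up = nn.Linear(rank, base.out_features, bias=False, device="cuda")
        nn.init.zeros_(self.lora_up.weight)         # LoRA starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Temporary GPU copy of the frozen weight; note autograd still holds it
        # until backward(), which is exactly why offloading during training
        # needs special handling.
        w = self.base.weight.to(x.device, non_blocking=True)
        b = self.base.bias.to(x.device) if self.base.bias is not None else None
        out = F.linear(x, w, b)                        # frozen base path
        return out + self.lora_up(self.lora_down(x))   # trainable LoRA path
```

Wrapping the base model's linears this way keeps only the small LoRA parameters (and their optimizer state) resident on the GPU, at the cost of extra weight transfers per forward pass.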

In testing, training SDXL at 1024x1024 resolution with full offloading and batch size 1 reaches only ~4GB of VRAM usage, which is lower than inference without offloading.

Some other updates in this PR:

  1. bypass adapter device/dtype sanitization before moving
  2. an in_training context in model_management for specialized implementations, to ensure things are still able to .backward() (see the sketch below)
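A minimal, hypothetical sketch of what such an in_training context could look like (the actual flag and how it interacts with comfy.model_management may differ):

```python
import contextlib

_IN_TRAINING = False  # hypothetical module-level flag

@contextlib.contextmanager
def in_training():
    """Mark a region of code as training so device/offload logic can pick
    grad-safe paths (e.g. avoid in-place casts that would break .backward())."""
    global _IN_TRAINING
    prev = _IN_TRAINING
    _IN_TRAINING = True
    try:
        yield
    finally:
        _IN_TRAINING = prev

def is_in_training() -> bool:
    return _IN_TRAINING

# Usage sketch: wrap the training step so offloading code can check the flag.
# with in_training():
#     loss = model(batch).mean()
#     loss.backward()
```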