[Trainer] training with proper offloading #12189

KohakuBlueleaf · 2026-01-31T09:04:13Z

This PR modifies the trainer and the sampler preparation mechanism so that they support forced full offloading.

VRAM usage during training is hard to estimate, and offloading needs a special mechanism: PyTorch autograd caches all the intermediate states for the backward pass, so temporary GPU tensors cannot be released correctly.

This implementation uses ComfyUI's native offloading together with bypass forward mode, so the LoRA and optimizer parts always stay on the GPU while the base model is offloaded correctly.
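To illustrate what "bypass forward" means here, the following is a minimal PyTorch sketch, not the code from this PR. The class name `BypassLoRALinear` and the `rank`/`alpha` parameters are hypothetical. It only shows the split between a frozen base layer and a trainable low-rank branch whose outputs are summed; the offloading of the base weights themselves is not simulated.

```python
import torch
import torch.nn as nn


class BypassLoRALinear(nn.Module):
    """Hypothetical wrapper: frozen base layer plus a trainable low-rank bypass."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # base weights are never trained
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # the bypass starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Bypass mode: the base output and the LoRA output are computed
        # separately and summed, so the base weight is never patched.
        return self.base(x) + self.up(self.down(x)) * self.scale


torch.manual_seed(0)
layer = BypassLoRALinear(nn.Linear(8, 8))

# Only the LoRA parameters are handed to the optimizer.
trainable = sorted(n for n, p in layer.named_parameters() if p.requires_grad)
optimizer = torch.optim.SGD([p for p in layer.parameters() if p.requires_grad], lr=0.1)

loss = layer(torch.randn(2, 8)).pow(2).mean()
loss.backward()
optimizer.step()
```

Because the base parameters never receive gradients or optimizer state, they are the natural candidates for offloading, while the small `down`/`up` matrices and their optimizer state can stay resident on the GPU.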

In testing, training SDXL at 1024x1024 resolution with full offloading and batch size 1 reaches only ~4GB of VRAM usage, which is lower than inference without offloading.

Other updates in this PR:

Sanitize the bypass adapter's device/dtype before moving it
Add an in_training context in model_management for specialized implementations (to ensure things are still able to call .backward())
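As a rough idea of how such a training context could be exposed, here is a minimal sketch built on Python's standard `contextvars` module. All names (`_IN_TRAINING`, `training_context`, `in_training`, `pick_forward_path`) are hypothetical; the actual names and behavior in `model_management` may differ.

```python
import contextvars
from contextlib import contextmanager

# Hypothetical flag; the PR's real variable name and location may differ.
_IN_TRAINING = contextvars.ContextVar("in_training", default=False)


@contextmanager
def training_context(enabled: bool = True):
    """Mark the enclosed code as running inside a training step."""
    token = _IN_TRAINING.set(enabled)
    try:
        yield
    finally:
        _IN_TRAINING.reset(token)  # restore the previous value, even on error


def in_training() -> bool:
    return _IN_TRAINING.get()


def pick_forward_path() -> str:
    # Illustrative only: a real implementation would choose a code path
    # that keeps the autograd graph intact while training.
    return "autograd-safe" if in_training() else "inference"


before = pick_forward_path()
with training_context():
    during = pick_forward_path()
after = pick_forward_path()
```

A context variable keeps the flag scoped to the current thread or async task, so inference running elsewhere in the same process would not see the training flag.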

KohakuBlueleaf added 4 commits January 31, 2026 16:58

Fix bypass dtype/device moving

40c7737

Force offloading mode for training

3593628

training context var

ec61c02

offloading implementation in training node

20bd2c0

KohakuBlueleaf requested review from Kosinkadink, comfyanonymous and guill as code owners January 31, 2026 09:04
