Closed
Description
Describe the bug
Under https://github.com/huggingface/diffusers/tree/main/examples/flux-control there are two files showing how to fine-tune flux-control:
- train_control_flux.py
- train_control_lora_flux.py
Both of them have a bug when args.max_train_steps is None.
Starting at line 905, we have the following code:
    if args.max_train_steps is None:
        len_train_dataloader_after_sharding = math.ceil(len(train_dataloader) / accelerator.num_processes)
        num_update_steps_per_epoch = math.ceil(len_train_dataloader_after_sharding / args.gradient_accumulation_steps)
        num_training_steps_for_scheduler = (
            args.num_train_epochs * num_update_steps_per_epoch * accelerator.num_processes
        )
    else:
        num_training_steps_for_scheduler = args.max_train_steps * accelerator.num_processes
    lr_scheduler = get_scheduler(
        args.lr_scheduler,
        optimizer=optimizer,
        num_warmup_steps=args.lr_warmup_steps * accelerator.num_processes,
        num_training_steps=args.max_train_steps * accelerator.num_processes,
        num_cycles=args.lr_num_cycles,
        power=args.lr_power,
    )
Note that the if branch checks whether args.max_train_steps is None and, in that case, prepares num_training_steps_for_scheduler. However, line 918 still uses args.max_train_steps:
num_training_steps=args.max_train_steps * accelerator.num_processes,
instead of the prepared num_training_steps_for_scheduler, which causes the following error:
num_training_steps=args.max_train_steps * accelerator.num_processes,
~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
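A possible fix is to pass the prepared num_training_steps_for_scheduler to get_scheduler instead of recomputing the value from args.max_train_steps. The sketch below isolates that arithmetic in a standalone helper so it can be checked without launching a training run. The helper name scheduler_step_counts and its flat parameters are hypothetical stand-ins for the args.* and accelerator.* values in the script; this is not code from the repository.

```python
import math
from typing import Optional, Tuple


def scheduler_step_counts(
    max_train_steps: Optional[int],
    num_train_epochs: int,
    len_train_dataloader: int,
    num_processes: int,
    gradient_accumulation_steps: int,
    lr_warmup_steps: int,
) -> Tuple[int, int]:
    """Return (num_warmup_steps, num_training_steps) for the LR scheduler.

    Hypothetical helper: mirrors the branch quoted above, but always
    returns the prepared value, so a None max_train_steps is never multiplied.
    """
    if max_train_steps is None:
        # Derive the step count from the number of epochs instead.
        len_after_sharding = math.ceil(len_train_dataloader / num_processes)
        num_update_steps_per_epoch = math.ceil(
            len_after_sharding / gradient_accumulation_steps
        )
        num_training_steps_for_scheduler = (
            num_train_epochs * num_update_steps_per_epoch * num_processes
        )
    else:
        num_training_steps_for_scheduler = max_train_steps * num_processes

    num_warmup_steps = lr_warmup_steps * num_processes
    return num_warmup_steps, num_training_steps_for_scheduler


# max_train_steps unset: 100 batches, 2 processes, 4 accumulation steps, 10 epochs
warmup, total = scheduler_step_counts(None, 10, 100, 2, 4, 0)
print(warmup, total)
```

In the training scripts, the equivalent change would presumably be a single line in the get_scheduler call: num_training_steps=num_training_steps_for_scheduler.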
Reproduction
Any training run where --max_train_steps is not set, e.g.:
accelerate launch train_control_lora_flux.py \
--pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
--dataset_name="raulc0399/open_pose_controlnet" \
--output_dir="pose-control-lora" \
--mixed_precision="bf16" \
--train_batch_size=1 \
--rank=64 \
--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--use_8bit_adam \
--learning_rate=1e-4 \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--num_train_epochs=10 \
--validation_image="openpose.png" \
--validation_prompt="A couple, 4k photo, highly detailed" \
--offload \
--seed="0" \
--push_to_hub
Logs
System Info
Not relevant to this bug.
Who can help?
No response