
[BUG]: Using args.max_train_steps even if it is None in diffusers/examples/flux-control #11661

Closed
@Markus-Pobitzer


Describe the bug

Under https://github.com/huggingface/diffusers/tree/main/examples/flux-control there are two scripts showing how to fine-tune Flux Control. Both contain the following logic:

if args.max_train_steps is None:
    len_train_dataloader_after_sharding = math.ceil(len(train_dataloader) / accelerator.num_processes)
    num_update_steps_per_epoch = math.ceil(len_train_dataloader_after_sharding / args.gradient_accumulation_steps)
    num_training_steps_for_scheduler = (
        args.num_train_epochs * num_update_steps_per_epoch * accelerator.num_processes
    )
else:
    num_training_steps_for_scheduler = args.max_train_steps * accelerator.num_processes

lr_scheduler = get_scheduler(
    args.lr_scheduler,
    optimizer=optimizer,
    num_warmup_steps=args.lr_warmup_steps * accelerator.num_processes,
    num_training_steps=args.max_train_steps * accelerator.num_processes,
    num_cycles=args.lr_num_cycles,
    power=args.lr_power,
)

Note that the if branch checks whether args.max_train_steps is None and, in that case, prepares num_training_steps_for_scheduler as a fallback. However, in line 918 we still use args.max_train_steps

 num_training_steps=args.max_train_steps * accelerator.num_processes,

instead of the prepared num_training_steps_for_scheduler, which causes the following error:

num_training_steps=args.max_train_steps * accelerator.num_processes,
                       ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
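The fix would presumably be to pass the already-prepared num_training_steps_for_scheduler to get_scheduler instead of recomputing from args.max_train_steps. Below is a minimal, self-contained sketch of the selection logic (the helper name resolve_num_training_steps is hypothetical, not from the script); its return value is what should be handed to get_scheduler's num_training_steps argument:

```python
import math

def resolve_num_training_steps(max_train_steps, num_train_epochs,
                               dataloader_len, num_processes,
                               gradient_accumulation_steps):
    """Mirror the script's intent: when --max_train_steps is not set,
    fall back to an epoch-based estimate instead of multiplying None."""
    if max_train_steps is None:
        # Dataloader is sharded across processes before gradient accumulation.
        len_after_sharding = math.ceil(dataloader_len / num_processes)
        updates_per_epoch = math.ceil(len_after_sharding / gradient_accumulation_steps)
        return num_train_epochs * updates_per_epoch * num_processes
    return max_train_steps * num_processes

# With --num_train_epochs=10 and no --max_train_steps, this no longer raises:
steps = resolve_num_training_steps(
    max_train_steps=None,
    num_train_epochs=10,
    dataloader_len=100,
    num_processes=2,
    gradient_accumulation_steps=4,
)
# steps would then be passed as num_training_steps=steps to get_scheduler(...)
```

In the actual scripts the equivalent one-line change is num_training_steps=num_training_steps_for_scheduler in the get_scheduler call.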

Reproduction

Any training run where --max_train_steps is not set, e.g.:

accelerate launch train_control_lora_flux.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
  --dataset_name="raulc0399/open_pose_controlnet" \
  --output_dir="pose-control-lora" \
  --mixed_precision="bf16" \
  --train_batch_size=1 \
  --rank=64 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --use_8bit_adam \
  --learning_rate=1e-4 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_train_epochs=10 \
  --validation_image="openpose.png" \
  --validation_prompt="A couple, 4k photo, highly detailed" \
  --offload \
  --seed="0" \
  --push_to_hub

Logs

System Info

Not relevant for this bug.

Who can help?

No response
