Releases: huggingface/trl

v0.27.1

24 Jan 03:42

What's Changed

  • Fix: undefined current_gradient_accumulation_steps by @qgallouedec in #4852
  • fix(DeepSeek OPSM): passing correct (vLLM) logprobs by @casinca in #4857
  • Fix SFT training for prompt-completion type and transformers v5 by @qgallouedec in #4880
  • Bugfix: Logprob drift in vLLM serving mode (compared to colocate mode) by @kdubovikov in #4873
  • Fix RewardTrainer's results not reproducible by @liyc-ai in #4887

Full Changelog: v0.27.0...v0.27.1

v0.27.0

16 Jan 02:34
17acd61

Features

  • Add vllm_group_port argument to GRPO, RLOO and OnlineDPO configuration by @pointerhacker in #4545
  • Preserve truncated tokens in BFD packing by @qgallouedec in #4632
  • Support async reward functions and parallelize call to reward functions. by @pramodith in #4567
  • RLOO supports async rewards. by @pramodith in #4718
  • Support vLLM 0.12.0 by @jiqing-feng in #4117
  • feat: DeepSeek V3.2 Off-policy sequence masking by @casinca in #4689
  • 🎭 Up to 50% less VRAM during forward with forward_masked_logits function by @qgallouedec in #4729
  • [GRPO] Add a config to limit the number of tool calling iterations by @pramodith in #4761
  • Switch gradient checkpointing default to use_reentrant=False (PyTorch recommended) by @qgallouedec in #4811 (see the sketch after this list)
  • Add support for GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization by @nbasyl in #4785
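
A minimal sketch of how to opt back into the previous gradient-checkpointing behaviour if you need it; gradient_checkpointing and gradient_checkpointing_kwargs are the standard transformers TrainingArguments fields that SFTConfig and the other TRL configs inherit:

from trl import SFTConfig

# With this release the default is use_reentrant=False; pass the kwargs
# explicitly to restore the old reentrant checkpointing if required.
training_args = SFTConfig(
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": True},
)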

Experimental

  • Move AutoModelForCausalLMWithValueHead and AutoModelForSeq2SeqLMWithValueHead to experimental by @qgallouedec in #4654
  • Move DPODataCollatorWithPadding to experimental.utils by @qgallouedec in #4667
  • Move DataCollatorForChatML to experimental.utils by @qgallouedec in #4668
  • Move add_bos_token_if_needed and add_eos_token_if_needed to experimental.utils by @qgallouedec in #4674
  • Move truncate_right and SIMPLE_CHAT_TEMPLATE to experimental.utils by @qgallouedec in #4677
  • Move prepare_model_for_kbit_training, enable_gradient_checkpointing, prepare_peft_model to experimental.utils by @qgallouedec in #4704
  • Move get_reward function to experimental.utils by @qgallouedec in #4683
  • Remove experimental imports from testing_utils by @albertvillanova in #4727
  • ORPO: Avoid catastrophic cancellation in loss function by @hartmans in #4763
  • Refactor KTO [1/N]: Modernize model initialization by @albertvillanova in #4783
  • [GOLD] add probability merging fix to implement chain rule by @kashif in #4765
  • Refactor KTO coordinated with DPO [a/N]: Remove encoder-decoder support by @albertvillanova in #4792
  • Refactor KTO coordinated with DPO [b/N]: Simplify truncation logic by @albertvillanova in #4808

Fixes

  • Accounting for case num_generations_eval=1 in the calculation of the advantage by @qgallouedec in #4662
  • Fix vLLM error for tools usage not supported when running GRPO training by @apalmas-saifh in #4663
  • Fix GRPO config validation in case num_generations_eval is specified and different than num_generations by @apalmas-saifh in #4682
  • Fix top_k default value to 0 for disabling top-k filtering by @albertvillanova in #4695
  • Include generation_config for tiny model uploads by @qgallouedec in #4643
  • Fix KeyError with transformers 5.0.0+ where push_to_hub_token is removed by @Manodeepray in #4691
  • Overwrite model default generation config used by model.generate by @albertvillanova in #4647
  • Fix: handle multiple tool calls in qwen3_schema by @mattbui in #4709
  • Fix bugs when using multi-gpu: dataset streaming for offline trainers + dtype initialization by @kaixuanliu in #3950
  • Ensure llm-blender is importable with transformers >= v5 by @albertvillanova in #4781
  • Monkey patch for HybridCache in Liger-Kernel with transformers v5 by @qgallouedec in #4798
  • [fix] GRPOTrainer: proper access args by @carlyou in #4801
  • Fix vllm compat patches to be applied only to affected versions by @albertvillanova in #4815
  • fix bug when sft calc outputs.token_accuracy by @kaixuanliu in #4814
  • fix xpu vllm client server by @jiqing-feng in #4780

v0.26.2

18 Dec 15:55
8c26b7d

What's Changed

Full Changelog: v0.26.1...v0.26.2

v0.26.1

12 Dec 17:50

What's Changed

  • Fix vLLM error for tools usage not supported when running GRPO training by @apalmas-saifh in #4663
  • Fix GRPO config validation in case num_generations_eval is specified and different than num_generations by @apalmas-saifh in #4682

Full Changelog: v0.26.0...v0.26.1

v0.26.0

09 Dec 20:51
84794a7

Features

🕵️‍♂️ GRPO: Agent training

GRPOTrainer now supports training agents using tools. This allows language models to interact with external functions or APIs during training.

from datasets import Dataset
from trl import GRPOTrainer

def multiply(a: int, b: int) -> int:
    """
    Multiplies two integers.

    Args:
        a: The first integer.
        b: The second integer.

    Returns:
        The product of the two integers.
    """
    return a * b


dataset = Dataset.from_list(
    [
        {"prompt": [{"role": "user", "content": "What is 3 multiplied by 4?"}], "answer": 12},
        {"prompt": [{"role": "user", "content": "Calculate 7 times 8."}], "answer": 56},
        {"prompt": [{"role": "user", "content": "Find the product of 5 and 6."}], "answer": 30},
        {"prompt": [{"role": "user", "content": "What do you get when you multiply 9 by 9?"}], "answer": 81},
        {"prompt": [{"role": "user", "content": "Compute 12 multiplied by 11."}], "answer": 132},
        {"prompt": [{"role": "user", "content": "What is 15 times 14?"}], "answer": 210},
    ]
)

def accuracy(completions, answer, **kwargs):
    predictions = [completion[-1]["content"] for completion in completions]
    rewards = [float(str(ans) in pred) for pred, ans in zip(predictions, answer)]
    return rewards

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=dataset,
    tools=[multiply],
    reward_funcs=accuracy,
)
trainer.train()

by @qgallouedec in #4300

ScaleRL: Add CISPO Loss

CISPO loss was first introduced in the Minimax-M1 paper; the ScaleRL paper subsequently showed that CISPO loss scales best in terms of performance and efficiency as models are trained for longer.

GRPOTrainer now supports the CISPO loss using loss_type="cispo" in the GRPOConfig.
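
For example, a minimal sketch (only loss_type="cispo" comes from this release; all other GRPOConfig arguments are left at their defaults):

from trl import GRPOConfig

training_args = GRPOConfig(loss_type="cispo")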

by @pramodith in #4495

Add vLLM quantization option for colocate

When the input model is quantized with bitsandbytes, vLLM now also uses quantization in colocate mode.
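
A hedged sketch of one possible setup: the model is loaded in 4-bit via bitsandbytes through model_init_kwargs, and GRPO runs vLLM in colocate mode (the exact loading arguments are illustrative, not prescribed by this release):

from transformers import BitsAndBytesConfig
from trl import GRPOConfig

# The policy model is quantized with bitsandbytes; in colocate mode the
# vLLM engine now also runs quantized instead of in full precision.
training_args = GRPOConfig(
    use_vllm=True,
    vllm_mode="colocate",
    model_init_kwargs={"quantization_config": BitsAndBytesConfig(load_in_4bit=True)},
)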

by @sergiopaniego in #4496

Reasoning reward

TRL now includes a reasoning reward function, reasoning_accuracy_reward:

from trl.rewards import reasoning_accuracy_reward

solutions = [r"\frac{1}{3}", r"\frac{1}{3}", r"\frac{1}{3}"]
completions = [
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{3}}",
        }
    ],
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{2}}",
        }
    ],
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content with partial answers \boxed{\frac{1}{3}} but no final answer",
        }
    ],
]
reasoning_accuracy_reward(completions, solutions)  # [1.0, 0.0, 0.0] 

Like any other reward function, it can be used with GRPOTrainer or RLOOTrainer.

from trl import GRPOTrainer
from trl.rewards import reasoning_accuracy_reward

trainer = GRPOTrainer(
    ...,
    reward_funcs=reasoning_accuracy_reward,
)

by @lewtun in #4563

Add shuffle_dataset option to SFTTrainer

You can now shuffle the dataset in SFTTrainer by setting the shuffle_dataset argument to True in SFTConfig. This is useful when the dataset features high similarity between consecutive samples.

from trl import SFTConfig

training_args = SFTConfig(shuffle_dataset=True)

by @qgallouedec in #4564

Add SAPO Loss in GRPO

Soft Adaptive Policy Optimization (SAPO) replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard-clipping band used in GSPO.

You can now use SAPO loss in GRPOTrainer by setting loss_type="sapo" in the GRPOConfig.
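
As with CISPO above, a minimal sketch (only loss_type="sapo" comes from this release):

from trl import GRPOConfig

training_args = GRPOConfig(loss_type="sapo")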

by @pramodith in #4600

v0.25.1

12 Nov 16:51

What's Changed

  • Replace accelerate logging with stdlib in CLI by @lewtun in #4512
  • Add temporary workaround for lr_scheduler_kwargs dtype issue in Transformers 4.57.0 by @qgallouedec in #4513

Full Changelog: v0.25.0...v0.25.1

v0.25.0

06 Nov 00:18
55f5433

Features

  • 💤 Switch to sleep level=2 and split wake-ups in GRPO and RLOO trainers by @xxrjun in #4296
  • Added custom prepare_model_for_kbit_training to save VRAM by @sergiopaniego in #4335
  • Add add_generation_prompt to processor_kwargs in GRPO and RLOO trainer by @qgallouedec in #4361
  • Add support for Trackio completions logging in GRPOTrainer by @taha-yassine in #4359
  • Support chat_template_kwargs by @pramodith in #4350
  • GRPO: ScaleRL -> Support casting LM Head to FP32 by @pramodith in #4303
  • Support casting to fp32 when word embeddings are tied to lm_head by @pramodith in #4446
  • 💬 Add chat to vLLM client and server, update trainer calls by @qgallouedec in #4450

v0.24.0

16 Oct 00:29
04fd120

v0.23.1

02 Oct 05:20

What's Changed

  • ♨️ [GRPO] Fix potential hang in get_high_entropy_mask by @akakakakakaa in #4041
  • Aux loss is already included in the loss returned by Transformers by @pramodith in #4078
  • Fix get_peft_model() so that prepare_model_for_kbit_training does not reapply to an instance of PeftModel, thus freezing all the layers by @Hoesu in #4081
  • 🐯 fix: use_liger_kernel with IterableDataset by @jue-jue-zi in #4087
  • [SFTrainer]: Fix DFT Loss by @pramodith in #4112
  • ⚡ Fix Flash Attention x Padding-Free loss by @qgallouedec in #4170

Full Changelog: v0.23.0...v0.23.1

v0.23.0

10 Sep 04:39
6adfd13

Major

🥓 Context Parallelism

SFT now supports Context Parallelism (CP) for training large language models on very long sequences. You can now train with an arbitrarily long sequence length.

by @kashif in #3994

🧨 Dynamic Fine-Tuning

Dynamic Fine-Tuning (DFT) is now supported in TRL.

from trl import SFTConfig

training_args = SFTConfig(
    ...,
    loss_type="dft",
)

by @qgallouedec in #4042

🪵 Truncated Importance Sampling (TIS) to address rollout-training mismatch

Different implementations are used for rollout generation (vLLM) and model training. This implementation gap implicitly turns on-policy RL into off-policy RL. Truncated Importance Sampling (TIS) is a simple yet effective importance sampling technique for handling this discrepancy, and it is now implemented in GRPO.

from trl import GRPOConfig

training_args = GRPOConfig(
    ...
    use_vllm=True,
    vllm_importance_sampling_correction=True, # default True
    vllm_importance_sampling_cap=2.0, # hyper-parameter C
)

by @LeonEricsson in #3867

🥣 [SFTTrainer]: Add Aux Loss for MoE models

Mixture of Experts (MoE) models require an auxiliary loss to ensure that the different experts are used evenly. This auxiliary loss is now supported in SFTTrainer.

from trl import SFTConfig

training_args = SFTConfig(
    ...,
    model_init_kwargs={"output_router_logits": True},
)

by @pramodith in #4012

💤 [GRPO/RLOO] Adds an option to sleep vllm when running in colocated mode

When running GRPO (or RLOO) with vLLM in colocated mode, the vLLM engine consumes VRAM during optimization while not being used. There is now an option to put vLLM to sleep during optimization to free up that VRAM.

from trl import GRPOConfig

training_args = GRPOConfig(..., vllm_sleep_enabled=True)

by @edbeeching in #3968

⚖️ Add vLLM server mode and VLM support to OnlineDPOTrainer

You can now use vLLM server mode with OnlineDPOTrainer. Additionally, vision-language models (VLMs) are now supported.
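
A sketch of what this could look like, assuming OnlineDPOConfig mirrors GRPOConfig's use_vllm / vllm_mode arguments (check the OnlineDPO docs for the exact names):

from trl import OnlineDPOConfig

# Server mode assumes a vLLM server is already running, e.g. started with
# the `trl vllm-serve` command.
training_args = OnlineDPOConfig(
    use_vllm=True,
    vllm_mode="server",
)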

by @vaelev in #3783

Comprehensive Paper Index Enhancement with 9 New Algorithm Implementations

The paper index has been expanded with nine new algorithm implementations, making it a more comprehensive resource for users.

by @behroozazarkhalili in #3990

Full Changelog: v0.22.0...v0.23.0