
Model Card for psp-dada/Qwen2.5-Math-7B-Uni-DPO

Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs (ICLR 2026)

🎊 News

  • [2026.02.16] πŸ“– Code, data, and models are released!
  • [2026.01.26] πŸŽ‰ Our Uni-DPO is accepted by ICLR 2026!

πŸš€ Overview

Uni-DPO introduces a unified dynamic preference optimization paradigm for training large language models (LLMs) from preference data. Unlike prior DPO-based methods that treat all preference pairs equally, Uni-DPO jointly considers intrinsic data quality and model learning dynamics, enabling more effective and robust preference learning.

Key advantages:

  • Quality-aware: Adaptively prioritizes high-quality preference pairs while down-weighting ambiguous ones.
  • Dynamics-aware: Shifts training focus toward under-fitted samples to mitigate overfitting.
  • Unified & lightweight: Seamlessly integrates dual-perspective weighting and calibrated NLL into standard DPO with minimal overhead.
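In loss form, this integration can be sketched as a per-pair dynamic weight on the standard DPO objective plus a length-calibrated NLL term on the preferred response. The exact weight definitions and coefficients below are illustrative assumptions, not the authors' precise formulation; see the paper for details:

$$
\mathcal{L}_i = -\, w_i \,\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right) \;-\; \lambda \,\frac{\log \pi_\theta(y_w \mid x)}{|y_w|},
\qquad w_i = w_i^{\text{quality}} \cdot w_i^{\text{perf}}
$$

Here $y_w$ and $y_l$ are the preferred and rejected responses, $\pi_{\mathrm{ref}}$ is the reference policy, and $w_i$ combines the quality-aware and performance-aware weights described below.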

πŸ”‘ Key Features

  • Dual-perspective dynamic weighting for preference optimization. Uni-DPO jointly models what data is worth learning (intrinsic quality) and what the model still struggles with (learning dynamics). By combining a quality-aware weight and a performance-aware weight, Uni-DPO dynamically reallocates training focus throughout optimization.

  • Quality-aware weighting filters ambiguous preference pairs. Preference data varies widely in reliability. Uni-DPO leverages score margins between preferred and rejected responses to assign higher weights to clear, high-quality pairs while suppressing noisy or ambiguous ones.

  • Performance-aware weighting mitigates overfitting during training. High-quality samples are not always the most informative once the model has already mastered them. Uni-DPO introduces a stabilized focal-style performance weight that down-weights well-fitted pairs and emphasizes hard-but-informative examples, effectively reducing overfitting.

  • Decoupling data quality from learning difficulty. Empirical analysis reveals that data quality (score margin) and learning difficulty (reward margin) are weakly correlated. Uni-DPO explicitly models this mismatch, ensuring that optimization is guided by both dimensions rather than relying on either alone.

  • State-of-the-art performance across text, math, and multimodal benchmarks. Uni-DPO consistently outperforms DPO and SimPO across diverse settings.
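The dual-perspective weighting described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: the specific functional forms (a sigmoid of the score margin for quality, a focal-style factor on the reward margin for performance, and a multiplicative combination) are assumptions for exposition; the paper gives the exact definitions.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def quality_weight(score_chosen: float, score_rejected: float, tau: float = 1.0) -> float:
    # Quality-aware weight: a larger score margin between the preferred and
    # rejected responses signals a clearer, more reliable pair -> higher weight;
    # ambiguous pairs (small margin) are suppressed.
    return sigmoid((score_chosen - score_rejected) / tau)


def performance_weight(reward_margin: float, gamma: float = 2.0) -> float:
    # Performance-aware (focal-style) weight: pairs the model already fits well
    # (large reward margin, p -> 1) are down-weighted, shifting focus to
    # under-fitted, hard-but-informative examples.
    p = sigmoid(reward_margin)  # implicit probability the model prefers chosen
    return (1.0 - p) ** gamma


def unidpo_weight(score_chosen: float, score_rejected: float, reward_margin: float) -> float:
    # Combined dual-perspective weight (multiplicative combination is an
    # illustrative assumption). Since score margin and reward margin are only
    # weakly correlated, the two factors carry complementary information.
    return quality_weight(score_chosen, score_rejected) * performance_weight(reward_margin)
```

Because the two margins are computed from different sources (annotator or reward-model scores vs. the policy's own implicit rewards), a clean, high-quality pair can still receive a large overall weight if the model has not yet fit it.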

How to use

For details about this model, please refer to the documentation in the GitHub repository.
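As a quick start, the model can be loaded with the standard Hugging Face `transformers` API. This is a minimal sketch: the system prompt and generation settings are illustrative conventions for Qwen2.5-Math-style models, not prescriptions from the authors.

```python
MODEL_ID = "psp-dada/Qwen2.5-Math-7B-Uni-DPO"


def build_messages(question: str) -> list[dict]:
    # Qwen2.5-Math models are typically prompted via a chat template with a
    # step-by-step system instruction (illustrative, not the authors' exact prompt).
    return [
        {
            "role": "system",
            "content": "Please reason step by step, and put your final answer within \\boxed{}.",
        },
        {"role": "user", "content": question},
    ]


def generate(question: str, max_new_tokens: int = 512) -> str:
    # Imported lazily so the helper above can be used without downloading the model.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    text = tokenizer.apply_chat_template(
        build_messages(question), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```

Example usage: `generate("What is the sum of the first 100 positive integers?")`.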

πŸ“ Citation

If you find our model, code, data, or paper helpful, please consider citing our paper 📝 and starring us ⭐️!

@inproceedings{peng2026unidpo,
  title     = {Uni-{DPO}: A Unified Paradigm for Dynamic Preference Optimization of {LLM}s},
  author    = {Shangpin Peng and Weinong Wang and Zhuotao Tian and Senqiao Yang and Xing W and Haotian Xu and Chengquan Zhang and Takashi Isobe and Baotian Hu and Min Zhang},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year      = {2026},
  url       = {https://openreview.net/forum?id=G7DBGlgjjp}
}

πŸ“§ Contact us

If you have any questions, comments, or suggestions, please do not hesitate to submit an issue or PR to help advance research in this area.
