Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following • Paper • 2508.02150 • Published Aug 4, 2025 • 36
mlx-community/DeepSeek-R1-0528-Qwen3-8B-4bit-DWQ • Text Generation • 1B • Updated May 29, 2025 • 194 • 8
argilla/ultrafeedback-binarized-preferences-cleaned • Updated Dec 11, 2023 • 60.9k • 2.72k • 159