In a Training Loop 🔄

Stefano Fiorucci

anakin87

AI & ML interests

Language Models: orchestration, post-training, GRPO, synthetic data... Contributing to the Haystack LLM framework 🏗️

Recent Activity

posted an update about 1 hour ago
💭 Do thinking traces make Language Models learn better? Curious what others think.

𝗦𝗰𝗲𝗻𝗮𝗿𝗶𝗼
You take an instruction-following LM. You want to train it with a GRPO-style RL algorithm on a task like Tic Tac Toe. Rewards are outcome-based, applied only at the end of each episode: win/loss/draw, format adherence... During training, the model could just output answers, but a common choice is to make it also output thinking traces.

𝗧𝗵𝗲 𝗾𝘂𝗲𝘀𝘁𝗶𝗼𝗻
Does forcing the model to produce thinking traces during training actually improve learning❓

💬 I'd like to hear your thoughts. Share ideas and links to relevant papers and resources.

From what I've understood so far, the answer seems to be 𝘆𝗲𝘀.

1️⃣ If you force the model to think during training, it becomes a model that thinks at inference time. It naturally allocates more budget (tokens) to a problem, which tends to improve performance.

2️⃣ While the model's "reasoning" already exists in its activation space, using explicit thinking traces as a scratchpad allows training to steer and shape that reasoning.

3️⃣ As the model produces more traces during training, the RL algorithm can progressively give higher rewards to the reasoning patterns that lead to better outcomes.
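To make the scenario concrete, here is a minimal sketch (not from the post) of outcome-based rewards in the style TRL's GRPOTrainer expects: each reward function receives a batch of completions plus any extra columns as keyword arguments and returns one score per completion. The <think>/<answer> tag format and the `result` field are illustrative assumptions, as is the idea that the Tic Tac Toe episode has already been rolled out so its final outcome is available.

```python
import re

# Sketch of outcome-based rewards for a GRPO-style Tic Tac Toe setup.
# Convention (as in TRL's GRPOTrainer): each function takes a batch of
# completions and returns one float per completion.
# The tag format and the `result` column are assumptions for illustration.

TAGGED = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)

def format_reward(completions, **kwargs):
    """Reward format adherence: a thinking trace followed by an answer."""
    return [1.0 if TAGGED.fullmatch(c.strip()) else 0.0 for c in completions]

def outcome_reward(completions, result=None, **kwargs):
    """Outcome-based reward, applied only at the end of each episode."""
    # `result` is assumed to hold the final game outcome per completion,
    # filled in after rolling out the full Tic Tac Toe episode.
    scores = {"win": 1.0, "draw": 0.5, "loss": -1.0}
    return [scores.get(r, 0.0) for r in (result or [])]
```

Under these assumptions, both functions could be passed together (e.g. `reward_funcs=[format_reward, outcome_reward]`); the multi-turn game rollout that produces `result` would need a separate environment loop, which this sketch leaves out.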
liked a model about 23 hours ago
NousResearch/nomos-1
upvoted a collection 12 days ago
INTELLECT-3

Organizations

deepset · Blog-explorers · ZeroGPU Explorers · Hugging Face Discord Community