Post 7
💭 Do thinking traces make Language Models learn better? Curious what others think
𝗦𝗰𝗲𝗻𝗮𝗿𝗶𝗼
You take an instruction-following LM.
You want to train it with a GRPO-style RL algorithm on a task like Tic Tac Toe.
Rewards are outcome-based and applied only at the end of each episode: win/loss/draw, format adherence, and so on.
During training, the model could just output answers, but a common choice is to make it also output thinking traces.
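To make the setup concrete, here is a minimal sketch of what such an outcome-based episode reward could look like. The <think>/<answer> tag format, the weights, and the function name are assumptions for illustration, not taken from any specific framework.

```python
import re

# Hypothetical sketch: an outcome-based reward for a Tic Tac Toe episode,
# scored only once the game ends. Tags and weights are illustrative assumptions.
THINK_ANSWER = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)

def episode_reward(completion: str, outcome: str) -> float:
    """Score the final completion of an episode.

    outcome: "win", "draw", or "loss", reported by the Tic Tac Toe environment.
    """
    reward = 0.0
    # Small bonus for adhering to the <think>...</think><answer>...</answer> format.
    if THINK_ANSWER.search(completion):
        reward += 0.1
    # Outcome-based component, applied only at the end of the episode.
    reward += {"win": 1.0, "draw": 0.2, "loss": -1.0}[outcome]
    return reward

# Example: a well-formatted winning episode scores 1.1.
print(episode_reward("<think>center is open</think><answer>B2</answer>", "win"))
```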
𝗧𝗵𝗲 𝗾𝘂𝗲𝘀𝘁𝗶𝗼𝗻
Does forcing the model to produce thinking traces during training actually improve learning?
💬 I'd like to hear your thoughts. Share ideas and links to relevant papers and resources.
From what I've understood so far, the answer seems to be 𝗬𝗲𝘀.
1️⃣ If you force the model to think during training, it becomes a model that thinks at inference time. It naturally allocates more budget (tokens) to a problem, which tends to improve performance.
2️⃣ While the model's "reasoning" already exists in its activation space, using explicit thinking traces as a scratchpad allows training to steer and shape that reasoning.
3️⃣ As the model produces more traces during training, the RL algorithm can progressively give higher rewards to the reasoning patterns that lead to better outcomes (see the sketch after this list).
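To illustrate point 3️⃣, here is a minimal sketch of GRPO-style group-relative credit assignment, assuming episode rewards like the ones above. The helper name and the numbers are made up for illustration; this is not a full training loop.

```python
import statistics

# Sketch of GRPO-style credit assignment: several completions are sampled for
# the same prompt, and each one's advantage is its reward relative to the group.
# Traces that lead to better outcomes get positive advantages and are reinforced.

def group_relative_advantages(rewards: list[float]) -> list[float]:
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Four sampled episodes for one prompt: two wins, a draw, a loss.
rewards = [1.1, 1.0, 0.2, -1.0]
print(group_relative_advantages(rewards))
# The winning episodes get positive advantages, so the policy gradient pushes
# probability mass toward the reasoning patterns that produced them.
```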