# RLVR Training of Apertus 8B with GRPO on GSM8K
This model is a fine-tuned version of Apertus 8B Instruct, further trained with GRPO under the RLVR (Reinforcement Learning with Verifiable Rewards) framework on the GSM8K dataset. The base Apertus models are introduced in the paper *Apertus: Democratizing Open and Compliant LLMs for Global Language Environments*.
- Project page: https://www.swiss-ai.org/apertus
- Code repository: https://github.com/swiss-ai/apertus-tech-report
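For context, GRPO replaces a learned value baseline with a group-relative one: each prompt is sampled several times and every completion's reward is normalized against its own group. Below is a minimal NumPy sketch of that advantage computation (illustrative only; the actual training used the Open Instruct implementation credited in the acknowledgements):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages for one prompt's rollout group.

    rewards: shape (num_samples_per_prompt_rollout,), e.g. the 8
    verifiable rewards in {0, 1} for the 8 samples drawn per prompt.
    """
    # Each completion is scored relative to its own group,
    # so no separate value network is needed.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 3 of 8 samples solved the problem.
print(grpo_advantages(np.array([1, 0, 0, 1, 0, 1, 0, 0], dtype=float)))
```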
## Results
Validation accuracy on GSM8K improved from 46.41% to 66.23%.
## Compute
Training was performed on a single GPU node with 4× NVIDIA H100 (95 GB) GPUs and ran for approximately 5 hours.
## Hyperparameters
| Rollouts | |
|---|---|
| `num_unique_prompts_rollout` | 32 |
| `num_samples_per_prompt_rollout` | 8 |
| `temperature` | 0.8 |

| Optimization | |
|---|---|
| `learning_rate` | 3.0e-7 |
| `beta` | 0.01 |
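Taken together, the rollout settings fix the number of completions scored per optimization step (in standard GRPO, `beta` weights the KL penalty toward the reference policy). A quick check of the implied rollout volume:

```python
# Rollout volume implied by the table above (simple arithmetic).
num_unique_prompts_rollout = 32
num_samples_per_prompt_rollout = 8

rollouts_per_step = num_unique_prompts_rollout * num_samples_per_prompt_rollout
print(rollouts_per_step)  # 256 completions per GRPO step, in 32 groups of 8
```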
## Notes
- Note: a format reward was not applied, because neither the instruct nor the base model could produce a correct answer under the format constraint; as a result, the model does not use `<think> </think>` tags, and the reward signal is answer correctness alone (see the sketch after this list).
- Funny observation: the model has memorized the dataset. In one attempt it answered the question, but because the prompt format was unfamiliar it went on to recite another question from the same dataset; another time it output HTML code, presumably from the page where it originally saw the question.
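Since the reward reduces to answer correctness, a verifiable GSM8K reward can be as simple as the sketch below (the extraction logic here is an assumption for illustration, not necessarily what the training code used):

```python
import re

def gsm8k_reward(completion: str, gold_answer: str) -> float:
    """Binary verifiable reward: 1.0 if the last number in the
    completion equals the GSM8K gold answer, else 0.0."""
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", completion)
    if not numbers:
        return 0.0
    try:
        predicted = float(numbers[-1].replace(",", ""))
        gold = float(gold_answer.replace(",", "").strip())
    except ValueError:
        return 0.0
    return float(predicted == gold)

# Example: gold GSM8K answer "18" vs. a model completion.
print(gsm8k_reward("She sells 16 - 7 = 9 eggs, so she makes $18.", "18"))  # 1.0
```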
## Acknowledgements
This work builds upon and was inspired by the following contributions:
- RLVR: Verifiable Rewards for Reasoning Models — for introducing the verifiable reward framework used in this experiment.
- Allen Institute for AI — Open Instruct — for providing open-source infrastructure for RLHF/RLVR training.
- Apertus Project — for releasing the Apertus-8B base and instruct models used in this work.
## Model tree for ABaroian/Apertus-8B-RLVR-GSM

- Base model: swiss-ai/Apertus-8B-2509
- Finetuned from: swiss-ai/Apertus-8B-Instruct-2509