# RLVR Training of Apertus 8B with GRPO on GSM8K
This model is a fine-tuned version of Apertus 8B Instruct, further trained with GRPO under the RLVR (Reinforcement Learning with Verifiable Rewards) framework on the GSM8K dataset. The base Apertus models are introduced in the paper *Apertus: Democratizing Open and Compliant LLMs for Global Language Environments*.
- Project page: https://www.swiss-ai.org/apertus
- Code repository: https://github.com/swiss-ai/apertus-tech-report
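For context, GRPO replaces a learned value baseline with a group-relative one: each prompt is sampled several times and every completion's reward is normalized against its own group. Below is a minimal NumPy sketch of that advantage computation (illustrative only; the actual training used the Open Instruct implementation credited in the acknowledgements):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages for one prompt's rollout group.

    rewards: shape (num_samples_per_prompt_rollout,), e.g. the 8
    verifiable rewards in {0, 1} for the 8 samples drawn per prompt.
    """
    # Each completion is scored relative to its own group,
    # so no separate value network is needed.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 3 of 8 samples solved the problem.
print(grpo_advantages(np.array([1, 0, 0, 1, 0, 1, 0, 0], dtype=float)))
```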
## Results
Validation accuracy on GSM8K improved from 46.41% to 66.23%.
## Compute
Training was performed on a single GPU node with 4× NVIDIA H100 (95 GB) GPUs and ran for approximately 5 hours.
## Hyperparameters
| Rollouts | |
|---|---|
| `num_unique_prompts_rollout` | 32 |
| `num_samples_per_prompt_rollout` | 8 |
| `temperature` | 0.8 |

| Optimization | |
|---|---|
| `learning_rate` | 3.0e-7 |
| `beta` | 0.01 |
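Taken together, the rollout settings fix the number of completions scored per optimization step (in standard GRPO, `beta` weights the KL penalty toward the reference policy). A quick check of the implied rollout volume:

```python
# Rollout volume implied by the table above (simple arithmetic).
num_unique_prompts_rollout = 32
num_samples_per_prompt_rollout = 8

rollouts_per_step = num_unique_prompts_rollout * num_samples_per_prompt_rollout
print(rollouts_per_step)  # 256 completions per GRPO step, in 32 groups of 8
```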
## Notes
- Note: a format reward was not applied, because neither the instruct nor the base model could produce a correct answer under the format constraint; as a result, the model does not use `<think> </think>` tags, and the reward signal is answer correctness alone (see the sketch after this list).
- Funny observation: the model has memorized the dataset. In one attempt it answered the question, but because the prompt format was unfamiliar it went on to recite another question from the same dataset; another time it output HTML code, presumably from the page where it originally saw the question.
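Since the reward reduces to answer correctness, a verifiable GSM8K reward can be as simple as the sketch below (the extraction logic here is an assumption for illustration, not necessarily what the training code used):

```python
import re

def gsm8k_reward(completion: str, gold_answer: str) -> float:
    """Binary verifiable reward: 1.0 if the last number in the
    completion equals the GSM8K gold answer, else 0.0."""
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", completion)
    if not numbers:
        return 0.0
    try:
        predicted = float(numbers[-1].replace(",", ""))
        gold = float(gold_answer.replace(",", "").strip())
    except ValueError:
        return 0.0
    return float(predicted == gold)

# Example: gold GSM8K answer "18" vs. a model completion.
print(gsm8k_reward("She sells 16 - 7 = 9 eggs, so she makes $18.", "18"))  # 1.0
```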
## Acknowledgements
This work builds upon and was inspired by the following contributions:
- RLVR: Verifiable Rewards for Reasoning Models — for introducing the verifiable reward framework used in this experiment.
- Allen Institute for AI — Open Instruct — for providing open-source infrastructure for RLHF/RLVR training.
- Apertus Project — for releasing the Apertus-8B base and instruct models used in this work.
## Model tree for ABaroian/Apertus-8B-RLVR-GSM

- Base model: swiss-ai/Apertus-8B-2509
- Finetuned from: swiss-ai/Apertus-8B-Instruct-2509