## Model Description
**DeepMath** is a 4B parameter mathematical reasoning model that combines a fine-tuned LLM with a sandboxed Python executor. Built on [Qwen3-4B Thinking](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507) and trained with **GRPO (Group Relative Policy Optimization)**, DeepMath generates concise Python snippets for computational steps instead of verbose text explanations, significantly reducing errors and output length.
- **Developed by:** Intel AI Labs
- **Model type:** Causal language model with agent capabilities
- **Language:** English
- **Base model:** [Qwen3-4B Thinking](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507)
- **License:** Apache 2.0
- **Repository:** [https://github.com/IntelLabs/DeepMath](https://github.com/IntelLabs/DeepMath)
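The snippet below is a minimal loading sketch using `transformers` + `peft`. It assumes the released weights are a PEFT (LoRA) adapter and uses a hypothetical `IntelLabs/DeepMath` Hugging Face repo id; check the repository above for the actual artifact. Note that this only loads the weights: the full agent loop (executing Python during reasoning) requires the executor from the GitHub repository.

```python
# Minimal loading sketch. The adapter id "IntelLabs/DeepMath" is an assumption;
# the base model id comes from this model card. Weights only: the sandboxed
# executor that powers the agent loop lives in the GitHub repository.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Qwen/Qwen3-4B-Thinking-2507"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(model, "IntelLabs/DeepMath")  # hypothetical adapter id

messages = [{"role": "user", "content": "What is the remainder when 7**2025 is divided by 1000?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=2048)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```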
## Key Features
✅ **Code-driven reasoning:** Generates short Python snippets for intermediate computational steps (illustrated after this list)
✅ **Sandboxed execution:** No file I/O, no network calls, strict timeouts
✅ **Improved accuracy:** Offloading computation reduces arithmetic errors
✅ **Reduced verbosity:** Up to 66% shorter outputs compared to baseline
✅ **Safe and auditable:** Deterministic execution with readable code snippets
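To make the first point concrete, here is a hypothetical example of the kind of short snippet the model emits for an intermediate step, replacing several lines of written-out arithmetic with one auditable call:

```python
# Hypothetical intermediate step: offload modular exponentiation to the
# executor instead of carrying out the arithmetic in text.
remainder = pow(7, 2025, 1000)  # 7**2025 mod 1000, computed exactly
print(remainder)
```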
## Model Architecture
DeepMath uses a LoRA adapter fine-tuned on top of Qwen3-4B Thinking with the following components:
- **Agent Interface:** Outputs special tokens for Python code execution during reasoning
- **Executor:** Sandboxed Python environment with allow-listed modules (sketched after this list)
- **Safety Constraints:** Per-snippet timeouts, no file/network access
- **Training Method:** GRPO with accuracy and code generation rewards
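The actual executor ships with the GitHub repository; the sketch below only illustrates the constraints listed above (allow-listed imports, no file access, per-snippet timeout). The allow-list contents and the 5-second default timeout are assumptions for illustration. Network access is blocked indirectly here because modules such as `socket` and `urllib` are not on the allow-list.

```python
# Sketch of the guardrails described above, not the shipped executor.
# Runs each snippet in a subprocess so a hung snippet can be killed on timeout.
import subprocess
import sys
import textwrap

ALLOWED_MODULES = {"math", "fractions", "itertools"}  # assumed allow-list

# Prelude prepended to every snippet: replaces __import__ with an allow-listed
# version and disables file I/O before any model-generated code runs.
GUARD = textwrap.dedent("""
    import builtins
    _real_import = builtins.__import__
    def _guarded_import(name, *args, **kwargs):
        if name.split('.')[0] not in {allowed!r}:
            raise ImportError(f"module {{name}} is not allow-listed")
        return _real_import(name, *args, **kwargs)
    builtins.__import__ = _guarded_import
    builtins.open = None  # no file I/O
""")

def run_snippet(code: str, timeout: float = 5.0) -> str:
    """Run one model-generated snippet under the guard and return its stdout."""
    program = GUARD.format(allowed=ALLOWED_MODULES) + code
    try:
        result = subprocess.run(
            [sys.executable, "-I", "-c", program],  # -I: isolated interpreter
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "error: snippet exceeded the per-snippet timeout"
    return result.stdout if result.returncode == 0 else result.stderr

print(run_snippet("import math\nprint(math.factorial(10))"))  # -> 3628800
```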
Figure 1: The vLLM client and server were modified so that candidate generation runs through the DeepMath agent while still using the vLLM backend.
## Evaluation
**Key Findings:**
- **Accuracy:** Improved performance on challenging datasets (AIME, HMMT, HLE)
- **Efficiency:** Up to **66% reduction** in output length
- **Robustness:** Consistent improvements when combining agent + GRPO training
### Evaluation Datasets
- **MATH500:** A 500-problem subset of the MATH dataset
- **AIME:** American Invitational Mathematics Examination problems
- **HMMT:** Harvard-MIT Mathematics Tournament problems
- **HLE:** Humanity's Last Exam problems
Figure 2: Example output where Python code is generated, evaluated, and the result is inserted into the reasoning trace.
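A schematic of the loop that Figure 2 depicts is sketched below. The `<python>`/`</python>` and `<output>` delimiters and the helper names are assumptions for illustration; the actual special tokens are defined by the released model and agent code.

```python
# Schematic of the generate -> execute -> resume loop shown in Figure 2.
# Delimiters and the generate/execute callables are assumed for illustration.
import re

CODE_BLOCK = re.compile(r"<python>(.*?)</python>", re.DOTALL)

def agent_loop(generate, execute, prompt: str, max_rounds: int = 8) -> str:
    """Alternate between LLM generation and sandboxed execution.

    `generate(trace)` continues the reasoning trace until it finishes or emits
    a closed <python>...</python> block; `execute(code)` runs the snippet in
    the sandbox and returns its stdout.
    """
    trace = prompt
    for _ in range(max_rounds):
        chunk = generate(trace)
        trace += chunk
        blocks = CODE_BLOCK.findall(chunk)
        if not blocks:
            break  # no code requested: the model finished its reasoning
        result = execute(blocks[-1])
        trace += f"\n<output>{result}</output>\n"  # splice result into the trace
    return trace
```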