|
|
--- |
|
|
language: |
|
|
- en |
|
|
library_name: transformers |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- veronica |
|
|
- polymorphic-mlp |
|
|
- mixture-of-branches |
|
|
- entropy-regularized-routing |
|
|
- decoder-only |
|
|
- causal-lm |
|
|
- rope |
|
|
- expandable-architecture |
|
|
- research |
|
|
pipeline_tag: text-generation |
|
|
datasets: |
|
|
- codelion/finepdfs-1B |
|
|
- codelion/dclm-baseline-1B |
|
|
- codelion/fineweb-edu-1B |
|
|
model-index: |
|
|
- name: Veronica-Polymorphic 24L (551M) |
|
|
results: [] |
|
|
--- |
|
|
|
|
|
# Veronica-Polymorphic 24L (551M) |
|
|
|
|
|
Veronica-Polymorphic is a **decoder-only language model (≈551M params)** with a **polymorphic MLP**: |
|
|
each block contains multiple MLP branches (SwiGLU, GLU, Depthwise Causal Conv) and a **soft router** that blends them per-token. |
|
|
|
|
|
The goal is **adaptive capacity** and **incremental expansion** (adding new branches later, e.g. translation), while keeping the rest of the backbone stable. |
|
|
|
|
|
> ⚠️ **Status:** research preview, **pre-training only**, **no external benchmarks yet**. |
|
|
> Do **not** treat this as a production-ready model. |
|
|
|
|
|
--- |
|
|
|
|
|
## 1. TL;DR |
|
|
|
|
|
| Aspect | Value / Description | |
|
|
|---------------------|----------------------------------------------------------------| |
|
|
| Type | Decoder-only causal LM | |
|
|
| Params | ~551M | |
|
|
| Layers | 24 | |
|
|
| Hidden size | 768 | |
|
|
| Heads | 12 | |
|
|
| Positional encoding | RoPE (rotary) | |
|
|
| MLP | Polymorphic (SwiGLU • GLU • DepthwiseConv) per block | |
|
|
| Routing | Entropy-regularized soft routing, depth-scaled temperature | |
|
|
| Precision | bf16 weights, fp32 LayerNorm | |
|
|
| Context length | 1024 → 2048 (curriculum; 512 discouraged on 24L) | |
|
|
| Data mix | FinePDFs-1B 50% • DCLM Baseline-1B 30% • FineWeb-Edu 20% | |
|
|
| Intended use | Research on routing / branch specialization | |
|
|
| Not included | Instruction tuning, RLHF, safety fine-tuning, eval suite | |
|
|
|
|
|
--- |
|
|
|
|
|
## 2. Intended use & scope |
|
|
|
|
|
### Primary intent |
|
|
|
|
|
This checkpoint is meant for: |
|
|
|
|
|
- Researchers interested in: |
|
|
- **Mixture-of-branches / soft routing** in MLPs |
|
|
- Stability of routers on deeper (24L) architectures |
|
|
- Incremental model growth via **adding branches post-pretrain** |
|
|
- Practitioners who want a **small, hackable codebase** to experiment with: |
|
|
- Polymorphic MLPs |
|
|
- Entropy-regularized routing |
|
|
- Context-length curricula |
|
|
|
|
|
### Out of scope |
|
|
|
|
|
This model is **not** designed or evaluated (yet) for: |
|
|
|
|
|
- General-purpose assistant use |
|
|
- Safety-critical or high-stakes decisions |
|
|
- Deployment to end-users without additional filtering, alignment, and evaluation |
|
|
|
|
|
--- |
|
|
|
|
|
## 3. Model details |
|
|
|
|
|
### 3.1 Architecture (high-level) |
|
|
|
|
|
```
Input tokens
  ↓
Token & position embeddings (RoPE on Q/K)
  ↓
[ VeronicaBlock × 24 ]
  VeronicaBlock:
    x → Pre-LN → Multi-Head Self-Attention (RoPE) → Residual
      → Pre-LN → Polymorphic MLP (router + branches) → Residual
  ↓
Untied LM head → logits
```
|
|
|
|
|
Key design choices: |
|
|
|
|
|
- Decoder-only Transformer (causal LM)
- Pre-LayerNorm blocks
- RoPE positional encoding (no learned absolute positions)
- Untied input embeddings / LM head
- Gradient checkpointing used in training runs for memory efficiency
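To make these choices concrete, here is a minimal sketch of the pre-LN residual block layout. It is illustrative only: `attn` and `mlp` stand in for the RoPE self-attention and the polymorphic MLP described in section 3.2, and the class and argument names are assumptions rather than the repository's API.

```python
# Minimal sketch of the pre-LN residual block layout described above.
# `attn` and `mlp` are placeholders for the RoPE self-attention and the
# polymorphic MLP (section 3.2); names do not match the repository's API.
import torch.nn as nn

class VeronicaBlockSketch(nn.Module):
    def __init__(self, d_model: int, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)  # LayerNorm kept in fp32 in the reference run
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = attn                  # multi-head self-attention with RoPE on Q/K
        self.mlp = mlp                    # routed polymorphic MLP (returns the blended output)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))    # Pre-LN → attention → residual
        x = x + self.mlp(self.ln2(x))     # Pre-LN → polymorphic MLP → residual
        return x
```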
|
|
|
|
|
|
|
|
### 3.2 Polymorphic MLP & routing
|
|
|
|
|
Each block’s MLP is replaced by a polymorphic MLP: |
|
|
|
|
|
```
router_logits = Router(x)            # Linear → GELU → Linear
alpha = softmax(router_logits / tau)

branches = [
    SwiGLU(x),
    GLU(x),
    DepthwiseConvMLP(x),
]

output = sum(alpha_i * branch_i for alpha_i, branch_i in zip(alpha, branches))
```
|
|
|
|
|
Branches: |
|
|
|
|
|
| Branch | Role | Sketch |
|--------------|--------------------------------|-----------------------------------------------|
| SwiGLU | Default gated MLP | Linear(up) → split → SiLU×gate → Linear(down) |
| GLU | Alternative gating dynamics | Linear(up) → split → Sigmoid×gate → Linear(down) |
| DepthwiseConv | Local token patterns / n-grams | Depthwise causal conv (k=3) → MLP |
|
|
|
|
|
|
|
|
Routing controls: |
|
|
|
|
|
- **Temperature schedule** `tau_start → tau_end` (higher early = softer mixing)
- **Entropy-max aux loss**: encourages non-collapsed branch usage
- **Depth-scaled parameters**: router temperature and aux-loss weight are scaled by ≈ √(depth_ratio) when going from shallower (12L) to deeper (24L) models
|
|
|
|
|
|
|
|
|
|
|
The key property is that routing remains soft: typical healthy distributions have a dominant branch (~55–65%) and minority branches (~15–25%) instead of hard one-hot selection. |
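For concreteness, here is a minimal, self-contained PyTorch sketch of the soft router and the entropy-max auxiliary loss. It is illustrative, not the repository implementation: the class name, the router hidden size, and the aux-loss formulation (1 minus normalized entropy) are assumptions; only the Linear → GELU → Linear router and softmax-over-tau blending come from the description above.

```python
# Illustrative sketch of soft routing + entropy-max aux loss.
# Names (PolymorphicMLP, d_model // 4 router width) are assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolymorphicMLP(nn.Module):
    def __init__(self, d_model: int, branches: list[nn.Module], tau: float = 2.0):
        super().__init__()
        self.branches = nn.ModuleList(branches)       # e.g. SwiGLU, GLU, DepthwiseConv
        self.router = nn.Sequential(                  # Linear → GELU → Linear
            nn.Linear(d_model, d_model // 4),
            nn.GELU(),
            nn.Linear(d_model // 4, len(branches)),
        )
        self.tau = tau                                # annealed tau_start → tau_end during training

    def forward(self, x: torch.Tensor):
        logits = self.router(x)                       # (B, T, num_funcs)
        alpha = F.softmax(logits / self.tau, dim=-1)  # soft, per-token branch weights
        out = sum(alpha[..., i:i + 1] * b(x) for i, b in enumerate(self.branches))

        # Entropy-max aux loss: push normalized entropy of alpha towards 1
        # (minimize 1 - H(alpha)/log(num_funcs)) to discourage collapse.
        entropy = -(alpha * alpha.clamp_min(1e-9).log()).sum(-1).mean()
        entropy_norm = entropy / math.log(len(self.branches))
        aux_loss = 1.0 - entropy_norm
        return out, alpha, aux_loss
```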
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## 4. Training data

The pre-training data follows the codelion / DataComp LM mixture guidelines:
|
|
|
|
|
| Dataset | Share | Description |
|----------------------------|-------|-------------------------------------------------|
| codelion/finepdfs-1B | 50% | Technical/academic PDFs (high semantic density) |
| codelion/dclm-baseline-1B | 30% | General web corpus baseline |
| codelion/fineweb-edu-1B | 20% | Educational / explanatory web data |
|
|
|
|
|
|
|
|
Target token budget for this configuration: ~60B tokens (example setting). |
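For reference, a minimal sketch of building the 50/30/20 mixture with Hugging Face `datasets`. The exact loading code used for the reference run is not shown here; the `train` split and streaming mode are assumptions.

```python
# Illustrative 50/30/20 interleaving of the three corpora listed above.
# Streaming mode and the "train" split are assumptions, not taken from the run.
from datasets import load_dataset, interleave_datasets

finepdfs = load_dataset("codelion/finepdfs-1B", split="train", streaming=True)
dclm     = load_dataset("codelion/dclm-baseline-1B", split="train", streaming=True)
fineweb  = load_dataset("codelion/fineweb-edu-1B", split="train", streaming=True)

mix = interleave_datasets(
    [finepdfs, dclm, fineweb],
    probabilities=[0.5, 0.3, 0.2],      # FinePDFs 50% • DCLM 30% • FineWeb-Edu 20%
    seed=42,
    stopping_strategy="all_exhausted",  # keep sampling until every source is exhausted
)
```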
|
|
|
|
|
For licensing and detailed descriptions, please refer to each dataset on Hugging Face. |
|
|
|
|
|
|
|
|
If you reuse this mixture, please also cite: |
|
|
|
|
|
```
@article{sharma2025billion,
  title  = {The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author = {Sharma, Asankhaya},
  year   = {2025},
  url    = {https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}
```
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## 5. Training procedure
|
|
|
|
|
> Note: the numbers below describe the reference run configuration used to train this checkpoint.
> You can adapt them for your own experiments.
|
|
|
|
|
|
|
|
|
|
|
### 5.1 Core hyperparameters
|
|
|
|
|
| Hyperparameter | Value / Notes |
|--------------------------|--------------------------------|
| Layers | 24 |
| Hidden size | 768 |
| Attention heads | 12 |
| MLP expansion | 4× |
| Per-device batch size | 4 |
| Grad accumulation | 8 (effective batch 32) |
| Optimizer / LR schedule | AdamW, lr=1.2e-4, cosine decay |
| Warmup | 10% of total steps |
| Weight decay | 0.01 |
| Label smoothing | 0.01 |
| Precision | bf16 + fp32 LayerNorm |
| Max steps | 60k (example target) |
|
|
|
|
|
|
|
|
Example launch: |
|
|
|
|
|
```bash
python scripts/train_veronica.py \
  --config configs/veronica-pretrain-24L.json \
  --dataset_paths data/mix_optimal_50_30_20 \
  --output_dir runs/veronica-pretrain-24L \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --max_steps 60000 \
  --learning_rate 1.2e-4 \
  --warmup_ratio 0.10 \
  --weight_decay 0.01 \
  --max_seq_len 1024 \
  --router_tau_start 2.2 --router_tau_end 1.4 --router_tau_freeze_steps 6000 \
  --router_aux_start 0.008 --router_aux_end 0.016 \
  --router_force_prob 0.10 --router_force_warmup_steps 5000 \
  --rep_alpha 0.05 \
  --seed 42
```
|
|
|
|
|
### 5.2 Context-length curriculum & “512-token trap”
|
|
|
|
|
Empirical findings on 24-layer models:

- Starting at 512 tokens caused router collapse around step ~3k: one branch dominated (>70%), entropy dropped, and the other branches starved.
- Starting directly at 1024 tokens avoided collapse and produced stable, soft routing.
|
|
|
|
|
|
|
|
Recommended curriculum for 24L: |
|
|
|
|
|
```
Steps 0–20k   : 1024 tokens
Steps 20k–60k : 2048 tokens
```
|
|
|
|
|
For shallower (~12L) models, a 512→1024→2048 curriculum can work; for ≥20L, starting at 1024 is strongly recommended. |
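As a minimal sketch of the recommended 24L schedule (the helper name is illustrative and not part of the training script):

```python
# Illustrative helper mirroring the 24L curriculum above.
def seq_len_for_step(step: int) -> int:
    """Return the training sequence length for a given optimizer step."""
    if step < 20_000:
        return 1024   # phase 1: steps 0–20k
    return 2048       # phase 2: steps 20k–60k
```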
|
|
|
|
|
### 5.3 Router health during training
|
|
|
|
|
Training logs include entries like: |
|
|
|
|
|
```
[router] alpha=[a0, a1, a2] entropy_norm=E
```
|
|
|
|
|
Healthy targets (rough guideline): |
|
|
|
|
|
| Phase | Steps | Entropy (norm) | Min branch share |
|-------------|---------|----------------|------------------|
| Warmup | 0–5k | ≥ 0.90 | ≥ 0.25 |
| Post-freeze | 5k–10k | ≥ 0.75 | ≥ 0.12 |
| Stable | 10k+ | ≥ 0.70 | ≥ 0.15 |
|
|
|
|
|
|
|
|
Collapsed routing typically shows up as: |
|
|
|
|
|
- Entropy < 0.65
- One branch > 80% usage for many thousands of steps
- Other branches stuck < 5–10%
|
|
|
|
|
|
|
|
The provided training script (`scripts/train_veronica.py`) implements the entropy-max aux loss and router schedules out of the box.
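If you want to turn the logged `alpha` values into an automatic check against the thresholds above, a small helper could look like the following sketch (the function name and exact thresholds mirror this card's guideline table; it is not part of the codebase):

```python
# Illustrative health check matching the guideline table above.
import math

def router_is_healthy(alpha: list[float], step: int) -> bool:
    """alpha: mean per-branch routing weights logged as [a0, a1, a2]."""
    entropy = -sum(a * math.log(a) for a in alpha if a > 0)
    entropy_norm = entropy / math.log(len(alpha))
    if step < 5_000:                                   # warmup
        return entropy_norm >= 0.90 and min(alpha) >= 0.25
    if step < 10_000:                                  # post-freeze
        return entropy_norm >= 0.75 and min(alpha) >= 0.12
    return entropy_norm >= 0.70 and min(alpha) >= 0.15  # stable phase
```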
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## 6. Evaluation
|
|
|
|
|
### 6.1 Current evaluation status
|
|
|
|
|
At the time of this release: |
|
|
|
|
|
- No standardized benchmarks (e.g. lm-eval-harness) have been run yet.
- There are no public numbers for:
  - MMLU (5-shot / 0-shot)
  - ARC-e / ARC-c
  - HellaSwag, PIQA, GSM8K, etc.
|
|
|
|
|
|
|
|
|
|
|
Internal training logs show sensible LM loss curves and stable routing, but this is not a substitute for external evaluation. |
|
|
|
|
|
> 🔎 Interpretation: This checkpoint should be treated as a router / architecture experiment, not as a drop-in replacement for existing small LMs like Llama-3.2-1B, Gemma-2B, SmolLM, etc. |
|
|
|
|
|
|
|
|
|
|
|
### 6.2 Planned evaluation (suggested)
|
|
|
|
|
If you adopt or extend Veronica-Polymorphic, consider running: |
|
|
|
|
|
- **lm-eval-harness** on `mmlu`, `arc_challenge`, `arc_easy`, `hellaswag`, `piqa` (example command below)
- **Instruction / SFT** (if you fine-tune): Alpaca-style or OpenAssistant subsets
- **Ablations**:
  - Polymorphic MLP vs vanilla SwiGLU MLP with the same depth/width
  - With / without entropy-max routing
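An example of what such a harness run might look like, assuming the checkpoint can be loaded through `transformers` (see section 7.1 on registering the config/model); the task names follow lm-eval-harness conventions and `trust_remote_code=True` is an assumption about how the custom architecture would be exposed:

```bash
# Illustrative only: assumes the checkpoint loads via the transformers "hf" model path.
lm_eval --model hf \
  --model_args pretrained=MhaWay/Veronica,trust_remote_code=True \
  --tasks mmlu,arc_easy,arc_challenge,hellaswag,piqa \
  --batch_size 8
```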
|
|
|
|
|
|
|
|
|
|
|
Contributions of evaluation scripts and reported metrics are very welcome. |
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## 7. How to use
|
|
|
|
|
### 7.1 Loading from code
|
|
|
|
|
If you’re using the Veronica codebase directly: |
|
|
|
|
|
```python
from veronica import VeronicaConfig, VeronicaForCausalLM

cfg = VeronicaConfig(
    n_layer=24,
    num_funcs=3,  # SwiGLU, GLU, DepthwiseConv
)
model = VeronicaForCausalLM(cfg)
model.eval()
```
|
|
|
|
|
You can also integrate via `transformers` if you register the config/model, or load the checkpoint from this repo if exported.
|
|
|
|
|
### 7.2 Simple generation example
|
|
|
|
|
```python
from transformers import AutoTokenizer
from veronica import VeronicaForCausalLM, VeronicaConfig

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # or your own tokenizer
config = VeronicaConfig.from_pretrained("MhaWay/Veronica")
model = VeronicaForCausalLM.from_pretrained("MhaWay/Veronica", config=config)

prompt = "The theory of relativity states that"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,   # needed for temperature / top_p sampling to take effect
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
|
|
|
|
|
> Note: this is a raw pre-train checkpoint. Expect unaligned, sometimes incoherent generations. |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## 8. Extensibility: adding new branches
|
|
|
|
|
One motivation for polymorphic MLPs is incremental expansion: |
|
|
|
|
|
You can increase capacity or add a specialized branch (e.g. translation, code, a domain-specific MLP) as follows (a minimal sketch follows the list):

- Expanding `num_funcs`
- Initializing the new branch + router output slice
- Running a short fine-tune with:
  - Router + new branch trainable
  - Optionally freezing the rest of the backbone during warmup
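A minimal sketch of what the expansion step could look like, assuming the `PolymorphicMLP` layout sketched in section 3.2 (router as Linear → GELU → Linear). The repository's actual utilities may differ; function and attribute names here are assumptions.

```python
# Illustrative branch-expansion sketch; not the repository's exact API.
import torch
import torch.nn as nn

@torch.no_grad()
def add_branch(mlp, new_branch: nn.Module):
    # 1) Register the new branch module.
    mlp.branches.append(new_branch)

    # 2) Grow the router's final Linear by one output column: copy the old
    #    weights and give the new row a low initial routing prior so the
    #    existing branches are not disrupted at the start of fine-tuning.
    old = mlp.router[-1]  # Linear(hidden, num_funcs)
    new = nn.Linear(old.in_features, old.out_features + 1, bias=old.bias is not None)
    new.weight[: old.out_features].copy_(old.weight)
    new.weight[old.out_features:].zero_()
    if old.bias is not None:
        new.bias[: old.out_features].copy_(old.bias)
        new.bias[old.out_features:].fill_(-2.0)  # small initial share for the new branch
    mlp.router[-1] = new
```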
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The repository includes utilities and example code for: |
|
|
|
|
|
- Adding a new branch type
- Copying router weights and initializing the new column
- Scheduling a short specialization fine-tune
|
|
|
|
|
|
|
|
For details, see the “Incremental Expansion” and “Translation Branch” sections in the source code and examples. |
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## 9. Limitations & risks
|
|
|
|
|
This model: |
|
|
|
|
|
- May generate inaccurate or nonsensical text
- May reproduce biases present in the underlying datasets
- Is not instruction-tuned:
  - Does not follow natural-language instructions reliably
  - Can ignore prompts, hallucinate, or switch topics
- Has no safety layer:
  - No explicit filtering of harmful/toxic content
  - No RLHF / preference optimization
|
|
|
|
|
|
|
|
|
|
|
Do not use Veronica-Polymorphic for: |
|
|
|
|
|
- Safety-critical systems
- Medical, legal, or financial advice
- Content moderation without extensive additional work
- Any setting where unfiltered, biased generations would cause harm
|
|
|
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## 10. Roadmap
|
|
|
|
|
Planned / desired directions: |
|
|
|
|
|
| Version | Goal |
|---------|------------------------------------------------------|
| v0.1 | Core polymorphic MLP + tests |
| v0.2 | Stable router schedules + logging |
| v0.3 | Configurable attention variants / FlashAttention |
| v0.4 | Public evaluation scripts (lm-eval-harness) |
| v0.5 | Reference instruction-tuned variant |
| v0.6 | Example specialization branches (e.g. translation) |
|
|
|
|
|
|
|
|
Community PRs are welcome, especially for: |
|
|
|
|
|
- Evaluation & ablations vs vanilla MLP baselines
- New branch types and routing strategies
- Practical recipes for SFT / alignment on top of Veronica
|
|
|
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## 11. License
|
|
|
|
|
This model and code are released under the Apache-2.0 license. |
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## 12. Citation
|
|
|
|
|
If you use Veronica-Polymorphic in your work, please cite: |
|
|
|
|
|
``` |
|
|
@misc{veronica-2025, |
|
|
title = {Veronica: Entropy-Regularized Polymorphic Branching for Adaptive Language Modeling}, |
|
|
author = {Emanuele D'Angelo}, |
|
|
year = {2025}, |
|
|
howpublished = {\url{https://huggingface.co/MhaWay/Veronica}} |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## 13. Acknowledgments
|
|
|
|
|
- Mixture / routing inspiration from Switch Transformer, GLaM, and the broader MoE literature.
- Dataset mixture ratios guided by codelion’s DataComp LM work.
- RoPE implementation adapted from GPT-NeoX-style implementations.
|
|
|