---
language:
- en
library_name: transformers
license: apache-2.0
tags:
- veronica
- polymorphic-mlp
- mixture-of-branches
- entropy-regularized-routing
- decoder-only
- causal-lm
- rope
- expandable-architecture
- research
pipeline_tag: text-generation
datasets:
- codelion/finepdfs-1B
- codelion/dclm-baseline-1B
- codelion/fineweb-edu-1B
model-index:
- name: Veronica-Polymorphic 24L (551M)
results: []
---
# Veronica-Polymorphic 24L (551M)
Veronica-Polymorphic is a **decoder-only language model (≈551M params)** with a **polymorphic MLP**:
each block contains multiple MLP branches (SwiGLU, GLU, Depthwise Causal Conv) and a **soft router** that blends them per-token.
The goal is **adaptive capacity** and **incremental expansion** (adding new branches later, e.g. translation), while keeping the rest of the backbone stable.
> ⚠️ **Status:** research preview, **pre-training only**, **no external benchmarks yet**.
> Do **not** treat this as a production-ready model.
---
## 1. TL;DR
| Aspect | Value / Description |
|---------------------|----------------------------------------------------------------|
| Type | Decoder-only causal LM |
| Params | ~551M |
| Layers | 24 |
| Hidden size | 768 |
| Heads | 12 |
| Positional encoding | RoPE (rotary) |
| MLP | Polymorphic (SwiGLU • GLU • DepthwiseConv) per block |
| Routing | Entropy-regularized soft routing, depth-scaled temperature |
| Precision | bf16 weights, fp32 LayerNorm |
| Context length | 1024 → 2048 (curriculum; 512 discouraged on 24L) |
| Data mix | FinePDFs-1B 50% • DCLM Baseline-1B 30% • FineWeb-Edu 20% |
| Intended use | Research on routing / branch specialization |
| Not included | Instruction tuning, RLHF, safety fine-tuning, eval suite |
---
## 2. Intended use & scope
### Primary intent
This checkpoint is meant for:
- Researchers interested in:
- **Mixture-of-branches / soft routing** in MLPs
- Stability of routers on deeper (24L) architectures
- Incremental model growth via **adding branches post-pretrain**
- Practitioners who want a **small, hackable codebase** to experiment with:
- Polymorphic MLPs
- Entropy-regularized routing
- Context-length curricula
### Out of scope
This model is **not** designed or evaluated (yet) for:
- General-purpose assistant use
- Safety-critical or high-stakes decisions
- Deployment to end-users without additional filtering, alignment, and evaluation
---
## 3. Model details
### 3.1 Architecture (high-level)
```
Input tokens
  → Token embeddings (positions via RoPE on Q/K)
  → [ VeronicaBlock × 24 ]
        VeronicaBlock:
          x → Pre-LN → Multi-Head Self-Attention (RoPE) → Residual
            → Pre-LN → Polymorphic MLP (router + branches) → Residual
  → Untied LM head → logits
```
Key design choices:

- Decoder-only Transformer (causal LM)
- Pre-LayerNorm blocks
- RoPE positional encoding (no learned absolute positions)
- Untied input embeddings / LM head
- Gradient checkpointing used in training runs for memory efficiency
### 3.2 Polymorphic MLP & routing

Each block’s MLP is replaced by a polymorphic MLP:

```python
router_logits = Router(x)  # Linear → GELU → Linear
alpha = softmax(router_logits / tau)

branches = [
    SwiGLU(x),
    GLU(x),
    DepthwiseConvMLP(x),
]

output = sum(alpha_i * branch_i for alpha_i, branch_i in zip(alpha, branches))
```
Branches:

| Branch | Role | Sketch |
|---------------|--------------------------------|------------------------------------------------|
| SwiGLU | Default gated MLP | Linear(up) → split → SiLU×gate → Linear(down) |
| GLU | Alternative gating dynamics | Linear(up) → split → Sigmoid×gate → Linear(down) |
| DepthwiseConv | Local token patterns / n-grams | Depthwise causal conv (k=3) → MLP |
Routing controls:

- **Temperature schedule** `tau_start → tau_end` (higher early = softer mixing)
- **Entropy-max aux-loss**: encourages non-collapsed branch usage
- **Depth-scaled parameters**: router temperature and aux-loss weight scaled by ≈√(depth_ratio) when going from shallower (12L) to deeper (24L) models

The key property is that routing remains **soft**: typical healthy distributions have a dominant branch (~55–65%) and minority branches (~15–25%) instead of hard one-hot selection.
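For concreteness, here is a minimal, self-contained PyTorch sketch of the per-token blend described above. It is illustrative only: the class names (`SwiGLUBranch`, `GLUBranch`, `DepthwiseConvBranch`, `PolymorphicMLP`), the `d_ff` expansion, and the fixed `tau` are assumptions rather than the repository's actual API, and the real training code additionally applies the temperature schedule and the entropy-max aux-loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUBranch(nn.Module):
    """Linear(up) → split → SiLU×gate → Linear(down)."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.up = nn.Linear(d_model, 2 * d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        a, gate = self.up(x).chunk(2, dim=-1)
        return self.down(F.silu(a) * gate)

class GLUBranch(nn.Module):
    """Linear(up) → split → Sigmoid×gate → Linear(down)."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.up = nn.Linear(d_model, 2 * d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        a, gate = self.up(x).chunk(2, dim=-1)
        return self.down(torch.sigmoid(a) * gate)

class DepthwiseConvBranch(nn.Module):
    """Depthwise causal conv (k=3) over the sequence, followed by a small MLP."""
    def __init__(self, d_model, d_ff, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)
        self.left_pad = kernel_size - 1  # pad only on the left (causal)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):                                 # x: (batch, seq, d_model)
        h = F.pad(x.transpose(1, 2), (self.left_pad, 0))  # (batch, d_model, seq + pad)
        h = self.conv(h).transpose(1, 2)                  # back to (batch, seq, d_model)
        return self.mlp(h)

class PolymorphicMLP(nn.Module):
    """Soft, per-token mixture of the three branches above."""
    def __init__(self, d_model=768, d_ff=3072, tau=1.4):
        super().__init__()
        self.branches = nn.ModuleList([
            SwiGLUBranch(d_model, d_ff),
            GLUBranch(d_model, d_ff),
            DepthwiseConvBranch(d_model, d_ff),
        ])
        self.router = nn.Sequential(                  # Linear → GELU → Linear
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.Linear(d_model, len(self.branches)),
        )
        self.tau = tau                                # routing temperature

    def forward(self, x):
        alpha = F.softmax(self.router(x) / self.tau, dim=-1)       # (batch, seq, n_branches)
        outs = torch.stack([b(x) for b in self.branches], dim=-1)  # (batch, seq, d_model, n)
        mixed = (outs * alpha.unsqueeze(-2)).sum(dim=-1)           # weighted per-token blend
        return mixed, alpha
```

Returning `alpha` alongside the output makes it easy to log branch usage and compute the entropy-based health metrics described in section 5.3.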
---
## 4. Training data

The pre-train data follows the codelion / DataComp LM mixture guidelines:

| Dataset | Share | Description |
|----------------------------|-------|-----------------------------------------------|
| codelion/finepdfs-1B | 50% | Technical/academic PDFs (high semantic density) |
| codelion/dclm-baseline-1B | 30% | General web corpus baseline |
| codelion/fineweb-edu-1B | 20% | Educational / explanatory web data |
Target token budget for this configuration: ~60B tokens (example setting).
For licensing and detailed descriptions, please refer to each dataset on Hugging Face.
If you reuse this mixture, please also cite:
```
@article{sharma2025billion,
  title  = {The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author = {Sharma, Asankhaya},
  year   = {2025},
  url    = {https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}
```
---
## 5. Training procedure

> Note: the numbers below describe the reference run configuration used to train this checkpoint.
> You can adapt them for your own experiments.
### 5.1 Core hyperparameters

| Hyperparameter | Value / Notes |
|--------------------------|---------------------------------|
| Layers | 24 |
| Hidden size | 768 |
| Attention heads | 12 |
| MLP expansion | 4× |
| Per-device batch size | 4 |
| Grad accumulation | 8 (effective batch 32) |
| Optimizer / LR schedule | AdamW, lr=1.2e-4, cosine decay |
| Warmup | 10% of total steps |
| Weight decay | 0.01 |
| Label smoothing | 0.01 |
| Precision | bf16 + fp32 LayerNorm |
| Max steps | 60k (example target) |
Example launch:
```bash
python scripts/train_veronica.py \
  --config configs/veronica-pretrain-24L.json \
  --dataset_paths data/mix_optimal_50_30_20 \
  --output_dir runs/veronica-pretrain-24L \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --max_steps 60000 \
  --learning_rate 1.2e-4 \
  --warmup_ratio 0.10 \
  --weight_decay 0.01 \
  --max_seq_len 1024 \
  --router_tau_start 2.2 --router_tau_end 1.4 --router_tau_freeze_steps 6000 \
  --router_aux_start 0.008 --router_aux_end 0.016 \
  --router_force_prob 0.10 --router_force_warmup_steps 5000 \
  --rep_alpha 0.05 \
  --seed 42
```
### 5.2 Context-length curriculum & “512-token trap”

Empirical finding on 24-layer models:

- Starting at 512 tokens caused router collapse around step ~3k: one branch dominated (>70%), entropy dropped, and the other branches starved.
- Starting directly at 1024 tokens avoided collapse and produced stable, soft routing.

Recommended curriculum for 24L:

- Steps 0–20k: 1024 tokens
- Steps 20k–60k: 2048 tokens

For shallower (~12L) models, a 512→1024→2048 curriculum can work; for ≥20L, starting at 1024 is strongly recommended.
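As a sketch, the recommended 24L curriculum can be expressed as a simple step → sequence-length lookup. The actual training script drives context length through its own configuration (`--max_seq_len` plus the run schedule), so the helper below is purely illustrative:

```python
def seq_len_for_step(step: int) -> int:
    """Context-length curriculum for 24L runs: 1024 tokens up to 20k steps, then 2048.

    Deliberately never returns 512: on 24-layer runs that starting length led to
    router collapse around step ~3k (see above).
    """
    return 1024 if step < 20_000 else 2048

# Example: detect curriculum switches during a 60k-step run.
previous = None
for step in (0, 10_000, 20_000, 40_000, 59_999):
    current = seq_len_for_step(step)
    if current != previous:
        print(f"step {step}: max_seq_len = {current}")
        previous = current
```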
### 5.3 Router health during training

Training logs include entries like:

```
[router] alpha=[a0, a1, a2] entropy_norm=E
```

Healthy targets (rough guideline):

| Phase | Steps | Entropy (norm) | Min branch share |
|-------------|---------|----------------|------------------|
| Warmup | 0–5k | ≥ 0.90 | ≥ 0.25 |
| Post-freeze | 5k–10k | ≥ 0.75 | ≥ 0.12 |
| Stable | 10k+ | ≥ 0.70 | ≥ 0.15 |

Collapsed routing typically shows up as:

- Entropy < 0.65
- One branch > 80% usage for many thousands of steps
- Other branches stuck below 5–10%

The provided training script (`scripts/train_veronica.py`) implements the entropy-max aux-loss and router schedules out of the box.
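For reference, the normalized entropy in these logs can be reproduced from the routing weights `alpha`. The sketch below computes it over the mean branch usage and derives an entropy-max auxiliary term from it; whether the repository averages per token or per batch, and how it schedules the weight, are implementation details, so treat the exact formulas here as assumptions:

```python
import torch

def router_entropy_norm(alpha: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Normalized entropy of mean branch usage, in [0, 1] (1.0 = perfectly uniform).

    alpha: (..., n_branches) soft routing weights, rows summing to 1.
    """
    usage = alpha.reshape(-1, alpha.shape[-1]).mean(dim=0)   # mean share per branch
    entropy = -(usage * (usage + eps).log()).sum()
    max_entropy = torch.log(torch.tensor(float(alpha.shape[-1])))
    return entropy / max_entropy

def entropy_aux_loss(alpha: torch.Tensor, weight: float) -> torch.Tensor:
    """Entropy-max aux-loss: maximizing entropy = minimizing its negative."""
    return -weight * router_entropy_norm(alpha)

# A healthy, soft distribution (dominant branch ~60%, minorities ~20%):
alpha = torch.tensor([[0.60, 0.22, 0.18]])
print(f"[router] alpha={[round(a, 2) for a in alpha[0].tolist()]} "
      f"entropy_norm={router_entropy_norm(alpha).item():.2f}")   # ≈ 0.86
```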
---
## 6. Evaluation

### 6.1 Current evaluation status

At the time of this release:

- No standardized benchmarks (e.g. lm-eval-harness) have been run yet.
- There are no public numbers for:
  - MMLU (5-shot / 0-shot)
  - ARC-e / ARC-c
  - HellaSwag, PIQA, GSM8K, etc.
- Internal training logs show sensible LM loss curves and stable routing, but this is not a substitute for external evaluation.
> 🔎 Interpretation: This checkpoint should be treated as a router / architecture experiment, not as a drop-in replacement for existing small LMs like Llama-3.2-1B, Gemma-2B, SmolLM, etc.
### 6.2 Planned evaluation (suggested)

If you adopt or extend Veronica-Polymorphic, consider running:

- **lm-eval-harness** on: `mmlu`, `arc_challenge`, `arc_easy`, `hellaswag`, `piqa`
- **Instruction / SFT** (if you fine-tune): Alpaca-style or OpenAssistant subsets
- **Ablations**:
  - Polymorphic MLP vs vanilla SwiGLU MLP with the same depth/width
  - With / without entropy-max routing

Contributions of evaluation scripts and reported metrics are very welcome.
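If the checkpoint is exported in a `transformers`-loadable format (e.g. with the Veronica classes registered or shipped as remote code), a minimal lm-eval-harness run could look like the sketch below. The repo id, `trust_remote_code`, dtype and task list are assumptions to adapt to your setup:

```python
# pip install lm-eval
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=MhaWay/Veronica,trust_remote_code=True,dtype=bfloat16",
    tasks=["arc_easy", "arc_challenge", "hellaswag", "piqa", "mmlu"],
    batch_size=8,
)

# Print one line of metrics per task.
for task, metrics in results["results"].items():
    print(task, metrics)
```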
---
## 7. How to use

### 7.1 Loading from code

If you’re using the Veronica codebase directly:

```python
from veronica import VeronicaConfig, VeronicaForCausalLM

cfg = VeronicaConfig(
    n_layer=24,
    num_funcs=3,  # SwiGLU, GLU, DepthwiseConv
)
model = VeronicaForCausalLM(cfg)
model.eval()
```
You can also integrate via transformers if you register the config/model, or load the checkpoint from this repo if exported.
### 7.2 Simple generation example

```python
from transformers import AutoTokenizer
from veronica import VeronicaForCausalLM, VeronicaConfig

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # or your own tokenizer

config = VeronicaConfig.from_pretrained("MhaWay/Veronica")
model = VeronicaForCausalLM.from_pretrained("MhaWay/Veronica", config=config)

prompt = "The theory of relativity states that"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,  # required for temperature / top_p to take effect
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
> Note: this is a raw pre-train checkpoint. Expect unaligned, sometimes incoherent generations.
---
## 8. Extensibility: adding new branches

One motivation for polymorphic MLPs is **incremental expansion**: you can increase capacity or add a specialized branch (e.g. translation, code, or a domain-specific MLP) by:

1. Expanding `num_funcs`
2. Initializing the new branch and the corresponding router output slice
3. Running a short fine-tune with:
   - Router + new branch trainable
   - Optionally the rest of the backbone frozen during warmup

The repository includes utilities and example code for:

- Adding a new branch type
- Copying router weights and initializing the new column
- Scheduling a short specialization fine-tune

For details, see the “Incremental Expansion” and “Translation Branch” sections in the source code and examples.
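As an illustration of the router-expansion step, the sketch below grows the router's final linear layer by one output (one new branch) while preserving the trained slice. It assumes the router head is a plain `nn.Linear` with one logit per branch; the attribute names and the warm-start bias are hypothetical, so defer to the repository utilities for the real procedure:

```python
import torch
import torch.nn as nn

def expand_router_head(old_head: nn.Linear, n_new: int = 1) -> nn.Linear:
    """Add `n_new` router outputs (new branches), copying the existing slice.

    The new logits start with zero weights and a negative bias so the fresh
    branch initially gets a small routing share and can warm up without
    disturbing the already-trained branches.
    """
    new_head = nn.Linear(old_head.in_features, old_head.out_features + n_new)
    with torch.no_grad():
        new_head.weight[: old_head.out_features].copy_(old_head.weight)
        new_head.bias[: old_head.out_features].copy_(old_head.bias)
        nn.init.zeros_(new_head.weight[old_head.out_features :])
        new_head.bias[old_head.out_features :].fill_(-2.0)
    return new_head

# Typical specialization fine-tune: train only the router and the new branch,
# optionally keeping the rest of the backbone frozen during warmup.
# (`model.router_head` / `model.new_branch` are illustrative attribute names.)
# for p in model.parameters():
#     p.requires_grad = False
# for p in [*model.router_head.parameters(), *model.new_branch.parameters()]:
#     p.requires_grad = True
```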
---
## 9. Limitations & risks

This model:

- May generate inaccurate or nonsensical text
- May reproduce biases present in the underlying datasets
- Is **not instruction-tuned**:
  - Does not follow natural-language instructions reliably
  - Can ignore prompts, hallucinate, or switch topics
- Has **no safety layer**:
  - No explicit filtering of harmful/toxic content
  - No RLHF / preference optimization

Do **not** use Veronica-Polymorphic for:

- Safety-critical systems
- Medical, legal, or financial advice
- Content moderation without extensive additional work
- Any setting where unfiltered, biased generations would cause harm
---
## 10. Roadmap

Planned / desired directions:

| Version | Goal |
|---------|------------------------------------------------------|
| v0.1 | Core polymorphic MLP + tests |
| v0.2 | Stable router schedules + logging |
| v0.3 | Configurable attention variants / FlashAttention |
| v0.4 | Public evaluation scripts (lm-eval-harness) |
| v0.5 | Reference instruction-tuned variant |
| v0.6 | Example specialization branches (e.g. translation) |

Community PRs are welcome, especially for:

- Evaluation & ablations vs vanilla MLP baselines
- New branch types and routing strategies
- Practical recipes for SFT / alignment on top of Veronica
---
## 11. License
This model and code are released under the Apache-2.0 license.
---
## 12. Citation
If you use Veronica-Polymorphic in your work, please cite:
```
@misc{veronica-2025,
title = {Veronica: Entropy-Regularized Polymorphic Branching for Adaptive Language Modeling},
author = {Emanuele D'Angelo},
year = {2025},
howpublished = {\url{https://huggingface.co/MhaWay/Veronica}}
}
```
---
## 13. Acknowledgments

- Mixture / routing inspiration from Switch Transformer, GLaM, and the broader MoE literature.
- Dataset mixture ratios guided by codelion’s DataComp LM work.
- RoPE implementation adapted from GPT-NeoX-style code.