---
language:
- en
library_name: transformers
license: apache-2.0
tags:
- veronica
- polymorphic-mlp
- mixture-of-branches
- entropy-regularized-routing
- decoder-only
- causal-lm
- rope
- expandable-architecture
- research
pipeline_tag: text-generation
datasets:
- codelion/finepdfs-1B
- codelion/dclm-baseline-1B
- codelion/fineweb-edu-1B
model-index:
- name: Veronica-Polymorphic 24L (551M)
  results: []
---

# Veronica-Polymorphic 24L (551M)

Veronica-Polymorphic is a **decoder-only language model (≈551M params)** with a **polymorphic MLP**: each block contains multiple MLP branches (SwiGLU, GLU, depthwise causal conv) and a **soft router** that blends them per token. The goal is **adaptive capacity** and **incremental expansion** (adding new branches later, e.g. a translation branch) while keeping the rest of the backbone stable.

> ⚠️ **Status:** research preview, **pre-training only**, **no external benchmarks yet**.
> Do **not** treat this as a production-ready model.

---

## 1. TL;DR

| Aspect              | Value / Description                                              |
|---------------------|------------------------------------------------------------------|
| Type                | Decoder-only causal LM                                           |
| Params              | ~551M                                                            |
| Layers              | 24                                                               |
| Hidden size         | 768                                                              |
| Heads               | 12                                                               |
| Positional encoding | RoPE (rotary)                                                    |
| MLP                 | Polymorphic (SwiGLU • GLU • DepthwiseConv) per block             |
| Routing             | Entropy-regularized soft routing, depth-scaled temperature       |
| Precision           | bf16 weights, fp32 LayerNorm                                     |
| Context length      | 1024 → 2048 (curriculum; 512 discouraged on 24L)                 |
| Data mix            | FinePDFs-1B 50% • DCLM Baseline-1B 30% • FineWeb-Edu 20%         |
| Intended use        | Research on routing / branch specialization                      |
| Not included        | Instruction tuning, RLHF, safety fine-tuning, eval suite         |

---

## 2. Intended use & scope

### Primary intent

This checkpoint is meant for:

- Researchers interested in:
  - **Mixture-of-branches / soft routing** in MLPs
  - Stability of routers on deeper (24L) architectures
  - Incremental model growth via **adding branches post-pretrain**
- Practitioners who want a **small, hackable codebase** to experiment with:
  - Polymorphic MLPs
  - Entropy-regularized routing
  - Context-length curricula

### Out of scope

This model is **not** designed or evaluated (yet) for:

- General-purpose assistant use
- Safety-critical or high-stakes decisions
- Deployment to end users without additional filtering, alignment, and evaluation

---

## 3. Model details

### 3.1 Architecture (high-level)

```
Input tokens
  ↓
Token embeddings (RoPE applied to Q/K inside attention)
  ↓
[ VeronicaBlock × 24 ]
  VeronicaBlock:
    x → Pre-LN → Multi-Head Self-Attention (RoPE) → Residual
      → Pre-LN → Polymorphic MLP (router + branches) → Residual
  ↓
Untied LM head → logits
```

Key design choices:

- Decoder-only Transformer (causal LM)
- Pre-LayerNorm blocks
- RoPE positional encoding, no learned absolute positions (see the sketch below)
- Untied input embeddings / LM head
- Gradient checkpointing used in training runs for memory efficiency
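For reference, here is a minimal sketch of GPT-NeoX-style rotary embeddings applied to queries and keys. This is illustrative only: the function names and the `base=10000.0` default are assumptions, not the repository's exact code.

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # NeoX-style (non-interleaved) rotation: split the head dim in halves and swap with a sign flip.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, positions, base: float = 10000.0):
    # q, k: (batch, heads, seq, head_dim); positions: (seq,) integer token positions.
    head_dim = q.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    freqs = torch.outer(positions.float(), inv_freq)          # (seq, head_dim // 2)
    emb = torch.cat((freqs, freqs), dim=-1)                   # (seq, head_dim)
    cos, sin = emb.cos()[None, None], emb.sin()[None, None]   # broadcast over batch and heads
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin
```

The rotation encodes absolute position directly into Q/K, so attention scores depend only on relative offsets and no learned position table is needed.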
### 3.2 Polymorphic MLP & routing

Each block's MLP is replaced by a polymorphic MLP:

```python
router_logits = Router(x)              # Linear → GELU → Linear
alpha = softmax(router_logits / tau)   # per-token mixing weights

branches = [
    SwiGLU(x),
    GLU(x),
    DepthwiseConvMLP(x),
]

output = sum(alpha_i * branch_i for alpha_i, branch_i in zip(alpha, branches))
```

Branches:

| Branch        | Role                           | Sketch                                             |
|---------------|--------------------------------|----------------------------------------------------|
| SwiGLU        | Default gated MLP              | Linear(up) → split → SiLU × gate → Linear(down)    |
| GLU           | Alternative gating dynamics    | Linear(up) → split → Sigmoid × gate → Linear(down) |
| DepthwiseConv | Local token patterns / n-grams | Depthwise causal conv (k=3) → MLP                  |

Routing controls:

- **Temperature schedule** `tau_start → tau_end` (higher early = softer mixing)
- **Entropy-max aux loss:** encourages non-collapsed branch usage
- **Depth-scaled parameters:** router temperature and aux-loss weight are scaled by ≈ √(depth ratio) when going from shallower (12L) to deeper (24L) models

The key property is that routing remains **soft**: typical healthy distributions have a dominant branch (~55–65%) and minority branches (~15–25%) instead of hard one-hot selection.
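For concreteness, below is a minimal, self-contained PyTorch sketch of this routing pattern. It is **not** the repository's actual module: the class names (`GatedMLP`, `DepthwiseConvMLP`, `PolymorphicMLP`), the fixed temperature, and the constant aux-loss weight are illustrative assumptions; the real implementation schedules `tau` and the aux weight over training and applies the depth scaling described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMLP(nn.Module):
    """Gated MLP branch; covers both SwiGLU (SiLU gate) and GLU (sigmoid gate)."""
    def __init__(self, d_model: int, gate_act, expansion: int = 4):
        super().__init__()
        self.up = nn.Linear(d_model, 2 * expansion * d_model)   # up-project, then split into value/gate
        self.down = nn.Linear(expansion * d_model, d_model)
        self.gate_act = gate_act

    def forward(self, x):
        a, g = self.up(x).chunk(2, dim=-1)
        return self.down(a * self.gate_act(g))

class DepthwiseConvMLP(nn.Module):
    """Depthwise causal conv (k=3) followed by a small MLP, for local n-gram patterns."""
    def __init__(self, d_model: int, kernel_size: int = 3, expansion: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)
        self.pad = kernel_size - 1                               # pad only on the left to stay causal
        self.mlp = nn.Sequential(
            nn.Linear(d_model, expansion * d_model), nn.GELU(),
            nn.Linear(expansion * d_model, d_model),
        )

    def forward(self, x):                                        # x: (batch, seq, d_model)
        h = F.pad(x.transpose(1, 2), (self.pad, 0))              # (batch, d_model, seq + pad)
        h = self.conv(h).transpose(1, 2)
        return self.mlp(h)

class PolymorphicMLP(nn.Module):
    """Soft mixture of MLP branches with an entropy-max auxiliary loss."""
    def __init__(self, d_model: int, tau: float = 1.4, aux_weight: float = 0.016):
        super().__init__()
        self.branches = nn.ModuleList([
            GatedMLP(d_model, F.silu),           # SwiGLU-style branch
            GatedMLP(d_model, torch.sigmoid),    # GLU-style branch
            DepthwiseConvMLP(d_model),
        ])
        self.router = nn.Sequential(             # Linear → GELU → Linear
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.Linear(d_model, len(self.branches)),
        )
        self.tau, self.aux_weight = tau, aux_weight

    def forward(self, x):
        alpha = F.softmax(self.router(x) / self.tau, dim=-1)         # (batch, seq, n_branches)
        outs = torch.stack([b(x) for b in self.branches], dim=-1)    # (batch, seq, d_model, n_branches)
        y = (outs * alpha.unsqueeze(-2)).sum(dim=-1)                 # per-token soft blend
        # Entropy-max regularizer: adding `aux_loss` to the LM loss discourages router collapse.
        entropy = -(alpha * (alpha + 1e-9).log()).sum(dim=-1).mean()
        aux_loss = -self.aux_weight * entropy
        return y, aux_loss
```

As a sanity check, `PolymorphicMLP(768)(torch.randn(2, 16, 768))` should return a `(2, 16, 768)` tensor plus a scalar aux loss to be added to the language-modeling loss.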
---

## 4. Training data

The pre-training data follows the codelion / DataComp-LM mixture guidelines:

| Dataset                    | Share | Description                                       |
|----------------------------|-------|---------------------------------------------------|
| codelion/finepdfs-1B       | 50%   | Technical/academic PDFs (high semantic density)   |
| codelion/dclm-baseline-1B  | 30%   | General web corpus baseline                       |
| codelion/fineweb-edu-1B    | 20%   | Educational / explanatory web data                |

Target token budget for this configuration: ~60B tokens (example setting).

For licensing and detailed descriptions, please refer to each dataset on Hugging Face. If you reuse this mixture, please also cite:

```
@article{sharma2025billion,
  title  = {The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author = {Sharma, Asankhaya},
  year   = {2025},
  url    = {https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}
```

---

## 5. Training procedure

> Note: the numbers below describe the reference run configuration used to train this checkpoint. You can adapt them for your own experiments.

### 5.1 Core hyperparameters

| Hyperparameter          | Value / Notes                      |
|-------------------------|------------------------------------|
| Layers                  | 24                                 |
| Hidden size             | 768                                |
| Attention heads         | 12                                 |
| MLP expansion           | 4×                                 |
| Per-device batch size   | 4                                  |
| Grad accumulation       | 8 (effective batch 32)             |
| Optimizer / LR schedule | AdamW, lr = 1.2e-4, cosine decay   |
| Warmup                  | 10% of total steps                 |
| Weight decay            | 0.01                               |
| Label smoothing         | 0.01                               |
| Precision               | bf16 + fp32 LayerNorm              |
| Max steps               | 60k (example target)               |

Example launch:

```bash
python scripts/train_veronica.py \
  --config configs/veronica-pretrain-24L.json \
  --dataset_paths data/mix_optimal_50_30_20 \
  --output_dir runs/veronica-pretrain-24L \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --max_steps 60000 \
  --learning_rate 1.2e-4 \
  --warmup_ratio 0.10 \
  --weight_decay 0.01 \
  --max_seq_len 1024 \
  --router_tau_start 2.2 --router_tau_end 1.4 --router_tau_freeze_steps 6000 \
  --router_aux_start 0.008 --router_aux_end 0.016 \
  --router_force_prob 0.10 --router_force_warmup_steps 5000 \
  --rep_alpha 0.05 \
  --seed 42
```

### 5.2 Context-length curriculum & the "512-token trap"

Empirical findings on 24-layer models:

- Starting at 512 tokens caused router collapse around step ~3k: one branch dominated (>70%), entropy dropped, and the other branches starved.
- Starting directly at 1024 tokens avoided collapse and produced stable, soft routing.

Recommended curriculum for 24L:

- Steps 0–20k: 1024 tokens
- Steps 20k–60k: 2048 tokens

For shallower (~12L) models, a 512 → 1024 → 2048 curriculum can work; for ≥ 20L, starting at 1024 is strongly recommended.

### 5.3 Router health during training

Training logs include entries like:

```
[router] alpha=[a0, a1, a2] entropy_norm=E
```

Healthy targets (rough guideline):

| Phase       | Steps   | Entropy (norm) | Min branch share |
|-------------|---------|----------------|------------------|
| Warmup      | 0–5k    | ≥ 0.90         | ≥ 0.25           |
| Post-freeze | 5k–10k  | ≥ 0.75         | ≥ 0.12           |
| Stable      | 10k+    | ≥ 0.70         | ≥ 0.15           |

Collapsed routing typically shows up as:

- Normalized entropy < 0.65
- One branch > 80% usage for many thousands of steps
- Other branches stuck below 5–10%

The provided training script (`scripts/train_veronica.py`) implements the entropy-max aux loss and router schedules out of the box.
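As a quick illustration of these thresholds, the snippet below (a standalone helper, not part of `scripts/train_veronica.py`) computes the normalized entropy and minimum branch share from a logged `alpha` distribution:

```python
import math

def router_health(alpha):
    """Normalized routing entropy (0-1) and minimum branch share for one logged alpha."""
    entropy = -sum(a * math.log(a + 1e-9) for a in alpha)
    entropy_norm = entropy / math.log(len(alpha))
    return entropy_norm, min(alpha)

# A healthy stable-phase distribution vs. a collapsed one:
print(router_health([0.60, 0.22, 0.18]))   # approx (0.86, 0.18) -> above the 0.70 / 0.15 targets
print(router_health([0.88, 0.07, 0.05]))   # approx (0.41, 0.05) -> collapsed (entropy < 0.65)
```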
---

## 6. Evaluation

### 6.1 Current evaluation status

At the time of this release:

- No standardized benchmarks (e.g. lm-eval-harness) have been run yet.
- There are no public numbers for:
  - MMLU (5-shot / 0-shot)
  - ARC-e / ARC-c
  - HellaSwag, PIQA, GSM8K, etc.
- Internal training logs show sensible LM loss curves and stable routing, but this is not a substitute for external evaluation.

> 🔎 **Interpretation:** this checkpoint should be treated as a router / architecture experiment, not as a drop-in replacement for existing small LMs like Llama-3.2-1B, Gemma-2B, SmolLM, etc.

### 6.2 Planned evaluation (suggested)

If you adopt or extend Veronica-Polymorphic, consider running:

- **lm-eval-harness** on: `mmlu`, `arc_challenge`, `arc_easy`, `hellaswag`, `piqa`
- **Instruction / SFT** (if you fine-tune): Alpaca-style or OpenAssistant subsets
- **Ablations:**
  - Polymorphic MLP vs. vanilla SwiGLU MLP with the same depth/width
  - With / without entropy-max routing

Contributions of evaluation scripts and reported metrics are very welcome.

---

## 7. How to use

### 7.1 Loading from code

If you are using the Veronica codebase directly:

```python
from veronica import VeronicaConfig, VeronicaForCausalLM

cfg = VeronicaConfig(
    n_layer=24,
    num_funcs=3,  # SwiGLU, GLU, DepthwiseConv
)
model = VeronicaForCausalLM(cfg)
model.eval()
```

You can also integrate via transformers if you register the config/model, or load the checkpoint from this repo if exported.

### 7.2 Simple generation example

```python
from transformers import AutoTokenizer
from veronica import VeronicaForCausalLM, VeronicaConfig

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # or your own tokenizer
config = VeronicaConfig.from_pretrained("MhaWay/Veronica")
model = VeronicaForCausalLM.from_pretrained("MhaWay/Veronica", config=config)

prompt = "The theory of relativity states that"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,   # required for temperature / top_p to take effect
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

> Note: this is a raw pre-training checkpoint. Expect unaligned, sometimes incoherent generations.

---

## 8. Extensibility: adding new branches

One motivation for polymorphic MLPs is incremental expansion. You can increase capacity or add a specialized branch (e.g. a translation, code, or domain-specific MLP) by:

- Expanding `num_funcs`
- Initializing the new branch and the corresponding router output slice
- Running a short fine-tune with:
  - The router and the new branch trainable
  - Optionally the rest of the backbone frozen during warmup

The repository includes utilities and example code for:

- Adding a new branch type
- Copying router weights and initializing the new column
- Scheduling a short specialization fine-tune

For details, see the "Incremental Expansion" and "Translation Branch" sections in the source code and examples.

---

## 9. Limitations & risks

This model:

- May generate inaccurate or nonsensical text
- May reproduce biases present in the underlying datasets
- Is **not instruction-tuned**:
  - Does not follow natural-language instructions reliably
  - Can ignore prompts, hallucinate, or switch topics
- Has **no safety layer**:
  - No explicit filtering of harmful/toxic content
  - No RLHF / preference optimization

Do **not** use Veronica-Polymorphic for:

- Safety-critical systems
- Medical, legal, or financial advice
- Content moderation without extensive additional work
- Any setting where unfiltered, biased generations would cause harm

---

## 10. Roadmap

Planned / desired directions:

| Version | Goal                                                 |
|---------|------------------------------------------------------|
| v0.1    | Core polymorphic MLP + tests                         |
| v0.2    | Stable router schedules + logging                    |
| v0.3    | Configurable attention variants / FlashAttention     |
| v0.4    | Public evaluation scripts (lm-eval-harness)          |
| v0.5    | Reference instruction-tuned variant                  |
| v0.6    | Example specialization branches (e.g. translation)   |

Community PRs are welcome, especially for:

- Evaluation & ablations vs. vanilla MLP baselines
- New branch types and routing strategies
- Practical recipes for SFT / alignment on top of Veronica

---

## 11. License

This model and code are released under the Apache-2.0 license.

---

## 12. Citation

If you use Veronica-Polymorphic in your work, please cite:

```
@misc{veronica-2025,
  title        = {Veronica: Entropy-Regularized Polymorphic Branching for Adaptive Language Modeling},
  author       = {Emanuele D'Angelo},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/MhaWay/Veronica}}
}
```

---

## 13. Acknowledgments

- Mixture / routing inspiration from Switch Transformer, GLaM, and the broader MoE literature.
- Dataset mixture ratios guided by codelion's DataComp-LM work.
- RoPE implementation adapted from GPT-NeoX-style implementations.