---
language:
- en
library_name: transformers
license: apache-2.0
tags:
- veronica
- polymorphic-mlp
- mixture-of-branches
- entropy-regularized-routing
- decoder-only
- causal-lm
- rope
- expandable-architecture
- research
pipeline_tag: text-generation
datasets:
- codelion/finepdfs-1B
- codelion/dclm-baseline-1B
- codelion/fineweb-edu-1B
model-index:
- name: Veronica-Polymorphic 24L (551M)
  results: []
---

# Veronica-Polymorphic 24L (551M)

Veronica-Polymorphic is a **decoder-only language model (≈551M params)** with a **polymorphic MLP**: each block contains multiple MLP branches (SwiGLU, GLU, depthwise causal conv) and a **soft router** that blends them per token. The goal is **adaptive capacity** and **incremental expansion** (adding new branches later, e.g. a translation branch) while keeping the rest of the backbone stable.

> ⚠️ **Status:** research preview, **pre-training only**, **no external benchmarks yet**.
> Do **not** treat this as a production-ready model.

---

## 1. TL;DR

| Aspect              | Value / Description                                              |
|---------------------|------------------------------------------------------------------|
| Type                | Decoder-only causal LM                                           |
| Params              | ~551M                                                            |
| Layers              | 24                                                               |
| Hidden size         | 768                                                              |
| Heads               | 12                                                               |
| Positional encoding | RoPE (rotary)                                                    |
| MLP                 | Polymorphic (SwiGLU • GLU • DepthwiseConv) per block             |
| Routing             | Entropy-regularized soft routing, depth-scaled temperature       |
| Precision           | bf16 weights, fp32 LayerNorm                                     |
| Context length      | 1024 → 2048 (curriculum; 512 discouraged on 24L)                 |
| Data mix            | FinePDFs-1B 50% • DCLM Baseline-1B 30% • FineWeb-Edu 20%         |
| Intended use        | Research on routing / branch specialization                      |
| Not included        | Instruction tuning, RLHF, safety fine-tuning, eval suite         |

---

## 2. Intended use & scope

### Primary intent

This checkpoint is meant for:

- Researchers interested in:
  - **Mixture-of-branches / soft routing** in MLPs
  - Stability of routers on deeper (24L) architectures
  - Incremental model growth via **adding branches post-pretrain**
- Practitioners who want a **small, hackable codebase** to experiment with:
  - Polymorphic MLPs
  - Entropy-regularized routing
  - Context-length curricula

### Out of scope

This model is **not** designed or evaluated (yet) for:

- General-purpose assistant use
- Safety-critical or high-stakes decisions
- Deployment to end users without additional filtering, alignment, and evaluation

---

## 3. Model details

### 3.1 Architecture (high-level)

```
Input tokens
  ↓
Token embeddings (RoPE applied to Q/K inside attention)
  ↓
[ VeronicaBlock × 24 ]
  VeronicaBlock:
    x → Pre-LN → Multi-Head Self-Attention (RoPE) → Residual
      → Pre-LN → Polymorphic MLP (router + branches) → Residual
  ↓
Untied LM head → logits
```

Key design choices:

- Decoder-only Transformer (causal LM)
- Pre-LayerNorm blocks
- RoPE positional encoding, no learned absolute positions (see the sketch below)
- Untied input embeddings / LM head
- Gradient checkpointing used in training runs for memory efficiency
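For reference, here is a minimal sketch of GPT-NeoX-style rotary embeddings applied to queries and keys. This is illustrative only: the function names and the `base=10000.0` default are assumptions, not the repository's exact code.

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # NeoX-style (non-interleaved) rotation: split the head dim in halves and swap with a sign flip.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, positions, base: float = 10000.0):
    # q, k: (batch, heads, seq, head_dim); positions: (seq,) integer token positions.
    head_dim = q.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    freqs = torch.outer(positions.float(), inv_freq)          # (seq, head_dim // 2)
    emb = torch.cat((freqs, freqs), dim=-1)                   # (seq, head_dim)
    cos, sin = emb.cos()[None, None], emb.sin()[None, None]   # broadcast over batch and heads
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin
```

The rotation encodes absolute position directly into Q/K, so attention scores depend only on relative offsets and no learned position table is needed.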
### 3.2 Polymorphic MLP & routing

Each block's MLP is replaced by a polymorphic MLP:

```python
router_logits = Router(x)              # Linear → GELU → Linear
alpha = softmax(router_logits / tau)   # per-token mixing weights

branches = [
    SwiGLU(x),
    GLU(x),
    DepthwiseConvMLP(x),
]

output = sum(alpha_i * branch_i for alpha_i, branch_i in zip(alpha, branches))
```

Branches:

| Branch        | Role                           | Sketch                                             |
|---------------|--------------------------------|----------------------------------------------------|
| SwiGLU        | Default gated MLP              | Linear(up) → split → SiLU × gate → Linear(down)    |
| GLU           | Alternative gating dynamics    | Linear(up) → split → Sigmoid × gate → Linear(down) |
| DepthwiseConv | Local token patterns / n-grams | Depthwise causal conv (k=3) → MLP                  |

Routing controls:

- **Temperature schedule** `tau_start → tau_end` (higher early = softer mixing)
- **Entropy-max aux loss:** encourages non-collapsed branch usage
- **Depth-scaled parameters:** router temperature and aux-loss weight are scaled by ≈ √(depth ratio) when going from shallower (12L) to deeper (24L) models

The key property is that routing remains **soft**: typical healthy distributions have a dominant branch (~55–65%) and minority branches (~15–25%) instead of hard one-hot selection.
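For concreteness, below is a minimal, self-contained PyTorch sketch of this routing pattern. It is **not** the repository's actual module: the class names (`GatedMLP`, `DepthwiseConvMLP`, `PolymorphicMLP`), the fixed temperature, and the constant aux-loss weight are illustrative assumptions; the real implementation schedules `tau` and the aux weight over training and applies the depth scaling described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMLP(nn.Module):
    """Gated MLP branch; covers both SwiGLU (SiLU gate) and GLU (sigmoid gate)."""
    def __init__(self, d_model: int, gate_act, expansion: int = 4):
        super().__init__()
        self.up = nn.Linear(d_model, 2 * expansion * d_model)   # up-project, then split into value/gate
        self.down = nn.Linear(expansion * d_model, d_model)
        self.gate_act = gate_act

    def forward(self, x):
        a, g = self.up(x).chunk(2, dim=-1)
        return self.down(a * self.gate_act(g))

class DepthwiseConvMLP(nn.Module):
    """Depthwise causal conv (k=3) followed by a small MLP, for local n-gram patterns."""
    def __init__(self, d_model: int, kernel_size: int = 3, expansion: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)
        self.pad = kernel_size - 1                               # pad only on the left to stay causal
        self.mlp = nn.Sequential(
            nn.Linear(d_model, expansion * d_model), nn.GELU(),
            nn.Linear(expansion * d_model, d_model),
        )

    def forward(self, x):                                        # x: (batch, seq, d_model)
        h = F.pad(x.transpose(1, 2), (self.pad, 0))              # (batch, d_model, seq + pad)
        h = self.conv(h).transpose(1, 2)
        return self.mlp(h)

class PolymorphicMLP(nn.Module):
    """Soft mixture of MLP branches with an entropy-max auxiliary loss."""
    def __init__(self, d_model: int, tau: float = 1.4, aux_weight: float = 0.016):
        super().__init__()
        self.branches = nn.ModuleList([
            GatedMLP(d_model, F.silu),           # SwiGLU-style branch
            GatedMLP(d_model, torch.sigmoid),    # GLU-style branch
            DepthwiseConvMLP(d_model),
        ])
        self.router = nn.Sequential(             # Linear → GELU → Linear
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.Linear(d_model, len(self.branches)),
        )
        self.tau, self.aux_weight = tau, aux_weight

    def forward(self, x):
        alpha = F.softmax(self.router(x) / self.tau, dim=-1)         # (batch, seq, n_branches)
        outs = torch.stack([b(x) for b in self.branches], dim=-1)    # (batch, seq, d_model, n_branches)
        y = (outs * alpha.unsqueeze(-2)).sum(dim=-1)                 # per-token soft blend
        # Entropy-max regularizer: adding `aux_loss` to the LM loss discourages router collapse.
        entropy = -(alpha * (alpha + 1e-9).log()).sum(dim=-1).mean()
        aux_loss = -self.aux_weight * entropy
        return y, aux_loss
```

As a sanity check, `PolymorphicMLP(768)(torch.randn(2, 16, 768))` should return a `(2, 16, 768)` tensor plus a scalar aux loss to be added to the language-modeling loss.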
---

## 4. Training data

The pre-training data follows the codelion / DataComp-LM mixture guidelines:

| Dataset                    | Share | Description                                       |
|----------------------------|-------|---------------------------------------------------|
| codelion/finepdfs-1B       | 50%   | Technical/academic PDFs (high semantic density)   |
| codelion/dclm-baseline-1B  | 30%   | General web corpus baseline                       |
| codelion/fineweb-edu-1B    | 20%   | Educational / explanatory web data                |

Target token budget for this configuration: ~60B tokens (example setting).

For licensing and detailed descriptions, please refer to each dataset on Hugging Face. If you reuse this mixture, please also cite:

```
@article{sharma2025billion,
  title  = {The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author = {Sharma, Asankhaya},
  year   = {2025},
  url    = {https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}
```

---

## 5. Training procedure

> Note: the numbers below describe the reference run configuration used to train this checkpoint. You can adapt them for your own experiments.

### 5.1 Core hyperparameters

| Hyperparameter          | Value / Notes                      |
|-------------------------|------------------------------------|
| Layers                  | 24                                 |
| Hidden size             | 768                                |
| Attention heads         | 12                                 |
| MLP expansion           | 4×                                 |
| Per-device batch size   | 4                                  |
| Grad accumulation       | 8 (effective batch 32)             |
| Optimizer / LR schedule | AdamW, lr = 1.2e-4, cosine decay   |
| Warmup                  | 10% of total steps                 |
| Weight decay            | 0.01                               |
| Label smoothing         | 0.01                               |
| Precision               | bf16 + fp32 LayerNorm              |
| Max steps               | 60k (example target)               |

Example launch:

```bash
python scripts/train_veronica.py \
  --config configs/veronica-pretrain-24L.json \
  --dataset_paths data/mix_optimal_50_30_20 \
  --output_dir runs/veronica-pretrain-24L \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --max_steps 60000 \
  --learning_rate 1.2e-4 \
  --warmup_ratio 0.10 \
  --weight_decay 0.01 \
  --max_seq_len 1024 \
  --router_tau_start 2.2 --router_tau_end 1.4 --router_tau_freeze_steps 6000 \
  --router_aux_start 0.008 --router_aux_end 0.016 \
  --router_force_prob 0.10 --router_force_warmup_steps 5000 \
  --rep_alpha 0.05 \
  --seed 42
```

### 5.2 Context-length curriculum & the "512-token trap"

Empirical findings on 24-layer models:

- Starting at 512 tokens caused router collapse around step ~3k: one branch dominated (>70%), entropy dropped, and the other branches starved.
- Starting directly at 1024 tokens avoided collapse and produced stable, soft routing.

Recommended curriculum for 24L:

- Steps 0–20k: 1024 tokens
- Steps 20k–60k: 2048 tokens

For shallower (~12L) models, a 512 → 1024 → 2048 curriculum can work; for ≥ 20L, starting at 1024 is strongly recommended.

### 5.3 Router health during training

Training logs include entries like:

```
[router] alpha=[a0, a1, a2] entropy_norm=E
```

Healthy targets (rough guideline):

| Phase       | Steps   | Entropy (norm) | Min branch share |
|-------------|---------|----------------|------------------|
| Warmup      | 0–5k    | ≥ 0.90         | ≥ 0.25           |
| Post-freeze | 5k–10k  | ≥ 0.75         | ≥ 0.12           |
| Stable      | 10k+    | ≥ 0.70         | ≥ 0.15           |

Collapsed routing typically shows up as:

- Normalized entropy < 0.65
- One branch > 80% usage for many thousands of steps
- Other branches stuck below 5–10%

The provided training script (`scripts/train_veronica.py`) implements the entropy-max aux loss and router schedules out of the box.
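As a quick illustration of these thresholds, the snippet below (a standalone helper, not part of `scripts/train_veronica.py`) computes the normalized entropy and minimum branch share from a logged `alpha` distribution:

```python
import math

def router_health(alpha):
    """Normalized routing entropy (0-1) and minimum branch share for one logged alpha."""
    entropy = -sum(a * math.log(a + 1e-9) for a in alpha)
    entropy_norm = entropy / math.log(len(alpha))
    return entropy_norm, min(alpha)

# A healthy stable-phase distribution vs. a collapsed one:
print(router_health([0.60, 0.22, 0.18]))   # approx (0.86, 0.18) -> above the 0.70 / 0.15 targets
print(router_health([0.88, 0.07, 0.05]))   # approx (0.41, 0.05) -> collapsed (entropy < 0.65)
```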
---

## 6. Evaluation

### 6.1 Current evaluation status

At the time of this release:

- No standardized benchmarks (e.g. lm-eval-harness) have been run yet.
- There are no public numbers for:
  - MMLU (5-shot / 0-shot)
  - ARC-e / ARC-c
  - HellaSwag, PIQA, GSM8K, etc.
- Internal training logs show sensible LM loss curves and stable routing, but this is not a substitute for external evaluation.

> 🔎 **Interpretation:** this checkpoint should be treated as a router / architecture experiment, not as a drop-in replacement for existing small LMs like Llama-3.2-1B, Gemma-2B, SmolLM, etc.

### 6.2 Planned evaluation (suggested)

If you adopt or extend Veronica-Polymorphic, consider running:

- **lm-eval-harness** on: `mmlu`, `arc_challenge`, `arc_easy`, `hellaswag`, `piqa`
- **Instruction / SFT** (if you fine-tune): Alpaca-style or OpenAssistant subsets
- **Ablations:**
  - Polymorphic MLP vs. vanilla SwiGLU MLP with the same depth/width
  - With / without entropy-max routing

Contributions of evaluation scripts and reported metrics are very welcome.

---

## 7. How to use

### 7.1 Loading from code

If you are using the Veronica codebase directly:

```python
from veronica import VeronicaConfig, VeronicaForCausalLM

cfg = VeronicaConfig(
    n_layer=24,
    num_funcs=3,  # SwiGLU, GLU, DepthwiseConv
)
model = VeronicaForCausalLM(cfg)
model.eval()
```

You can also integrate via transformers if you register the config/model, or load the checkpoint from this repo if exported.

### 7.2 Simple generation example

```python
from transformers import AutoTokenizer
from veronica import VeronicaForCausalLM, VeronicaConfig

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # or your own tokenizer
config = VeronicaConfig.from_pretrained("MhaWay/Veronica")
model = VeronicaForCausalLM.from_pretrained("MhaWay/Veronica", config=config)

prompt = "The theory of relativity states that"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,   # required for temperature / top_p to take effect
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

> Note: this is a raw pre-training checkpoint. Expect unaligned, sometimes incoherent generations.

---

## 8. Extensibility: adding new branches

One motivation for polymorphic MLPs is incremental expansion. You can increase capacity or add a specialized branch (e.g. a translation, code, or domain-specific MLP) by:

- Expanding `num_funcs`
- Initializing the new branch and the corresponding router output slice
- Running a short fine-tune with:
  - The router and the new branch trainable
  - Optionally the rest of the backbone frozen during warmup

The repository includes utilities and example code for:

- Adding a new branch type
- Copying router weights and initializing the new column
- Scheduling a short specialization fine-tune

For details, see the "Incremental Expansion" and "Translation Branch" sections in the source code and examples.

---

## 9. Limitations & risks

This model:

- May generate inaccurate or nonsensical text
- May reproduce biases present in the underlying datasets
- Is **not instruction-tuned**:
  - Does not follow natural-language instructions reliably
  - Can ignore prompts, hallucinate, or switch topics
- Has **no safety layer**:
  - No explicit filtering of harmful/toxic content
  - No RLHF / preference optimization

Do **not** use Veronica-Polymorphic for:

- Safety-critical systems
- Medical, legal, or financial advice
- Content moderation without extensive additional work
- Any setting where unfiltered, biased generations would cause harm

---

## 10. Roadmap

Planned / desired directions:

| Version | Goal                                                 |
|---------|------------------------------------------------------|
| v0.1    | Core polymorphic MLP + tests                         |
| v0.2    | Stable router schedules + logging                    |
| v0.3    | Configurable attention variants / FlashAttention     |
| v0.4    | Public evaluation scripts (lm-eval-harness)          |
| v0.5    | Reference instruction-tuned variant                  |
| v0.6    | Example specialization branches (e.g. translation)   |

Community PRs are welcome, especially for:

- Evaluation & ablations vs. vanilla MLP baselines
- New branch types and routing strategies
- Practical recipes for SFT / alignment on top of Veronica

---

## 11. License

This model and code are released under the Apache-2.0 license.

---

## 12. Citation

If you use Veronica-Polymorphic in your work, please cite:

```
@misc{veronica-2025,
  title        = {Veronica: Entropy-Regularized Polymorphic Branching for Adaptive Language Modeling},
  author       = {Emanuele D'Angelo},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/MhaWay/Veronica}}
}
```

---

## 13. Acknowledgments

- Mixture / routing inspiration from Switch Transformer, GLaM, and the broader MoE literature.
- Dataset mixture ratios guided by codelion's DataComp-LM work.
- RoPE implementation adapted from GPT-NeoX-style implementations.