|
|
--- |
|
|
language: |
|
|
- en |
|
|
library_name: transformers |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- veronica |
|
|
- polymorphic-mlp |
|
|
- mixture-of-branches |
|
|
- entropy-regularized-routing |
|
|
- decoder-only |
|
|
- causal-lm |
|
|
- rope |
|
|
- expandable-architecture |
|
|
- research |
|
|
pipeline_tag: text-generation |
|
|
datasets: |
|
|
- codelion/finepdfs-1B |
|
|
- codelion/dclm-baseline-1B |
|
|
- codelion/fineweb-edu-1B |
|
|
model-index: |
|
|
- name: Veronica-Polymorphic 24L (551M) |
|
|
results: [] |
|
|
--- |
|
|
|
|
|
# Veronica-Polymorphic 24L (551M) |
|
|
|
|
|
Veronica-Polymorphic is a **decoder-only language model (≈551M params)** with a **polymorphic MLP**: |
|
|
each block contains multiple MLP branches (SwiGLU, GLU, Depthwise Causal Conv) and a **soft router** that blends them per-token. |
|
|
|
|
|
The goal is **adaptive capacity** and **incremental expansion** (adding new branches later, e.g. translation), while keeping the rest of the backbone stable. |
|
|
|
|
|
> ⚠️ **Status:** research preview, **pre-training only**, **no external benchmarks yet**. |
|
|
> Do **not** treat this as a production-ready model. |
|
|
|
|
|
--- |
|
|
|
|
|
## 1. TL;DR |
|
|
|
|
|
| Aspect | Value / Description | |
|
|
|---------------------|----------------------------------------------------------------| |
|
|
| Type | Decoder-only causal LM | |
|
|
| Params | ~551M | |
|
|
| Layers | 24 | |
|
|
| Hidden size | 768 | |
|
|
| Heads | 12 | |
|
|
| Positional encoding | RoPE (rotary) | |
|
|
| MLP | Polymorphic (SwiGLU • GLU • DepthwiseConv) per block | |
|
|
| Routing | Entropy-regularized soft routing, depth-scaled temperature | |
|
|
| Precision | bf16 weights, fp32 LayerNorm | |
|
|
| Context length | 1024 → 2048 (curriculum; 512 discouraged on 24L) | |
|
|
| Data mix | FinePDFs-1B 50% • DCLM Baseline-1B 30% • FineWeb-Edu 20% | |
|
|
| Intended use | Research on routing / branch specialization | |
|
|
| Not included | Instruction tuning, RLHF, safety fine-tuning, eval suite | |
|
|
|
|
|
--- |
|
|
|
|
|
## 2. Intended use & scope |
|
|
|
|
|
### Primary intent |
|
|
|
|
|
This checkpoint is meant for: |
|
|
|
|
|
- Researchers interested in: |
|
|
- **Mixture-of-branches / soft routing** in MLPs |
|
|
- Stability of routers on deeper (24L) architectures |
|
|
- Incremental model growth via **adding branches post-pretrain** |
|
|
- Practitioners who want a **small, hackable codebase** to experiment with: |
|
|
- Polymorphic MLPs |
|
|
- Entropy-regularized routing |
|
|
- Context-length curricula |
|
|
|
|
|
### Out of scope |
|
|
|
|
|
This model is **not** designed or evaluated (yet) for: |
|
|
|
|
|
- General-purpose assistant use |
|
|
- Safety-critical or high-stakes decisions |
|
|
- Deployment to end-users without additional filtering, alignment, and evaluation |
|
|
|
|
|
--- |
|
|
|
|
|
## 3. Model details |
|
|
|
|
|
### 3.1 Architecture (high-level) |
|
|
|
|
|
```
Input tokens
  ↓
Token & position embeddings (RoPE on Q/K)
  ↓
[ VeronicaBlock × 24 ]
  VeronicaBlock:
    x → Pre-LN → Multi-Head Self-Attention (RoPE) → Residual
      → Pre-LN → Polymorphic MLP (router + branches) → Residual
  ↓
Untied LM head → logits
```
|
|
|
|
|
Key design choices: |
|
|
|
|
|
- Decoder-only Transformer (causal LM)
- Pre-LayerNorm blocks
- RoPE positional encoding (no learned absolute positions)
- Untied input embeddings / LM head
- Gradient checkpointing used in training runs for memory efficiency
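To make these choices concrete, here is a minimal sketch of the pre-LN residual block layout. It is illustrative only: `attn` and `mlp` stand in for the RoPE self-attention and the polymorphic MLP described in section 3.2, and the class and argument names are assumptions rather than the repository's API.

```python
# Minimal sketch of the pre-LN residual block layout described above.
# `attn` and `mlp` are placeholders for the RoPE self-attention and the
# polymorphic MLP (section 3.2); names do not match the repository's API.
import torch.nn as nn

class VeronicaBlockSketch(nn.Module):
    def __init__(self, d_model: int, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)  # LayerNorm kept in fp32 in the reference run
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = attn                  # multi-head self-attention with RoPE on Q/K
        self.mlp = mlp                    # routed polymorphic MLP (returns the blended output)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))    # Pre-LN → attention → residual
        x = x + self.mlp(self.ln2(x))     # Pre-LN → polymorphic MLP → residual
        return x
```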
|
|
|
|
|
|
|
|
### 3.2 Polymorphic MLP & routing
|
|
|
|
|
Each block’s MLP is replaced by a polymorphic MLP: |
|
|
|
|
|
```
router_logits = Router(x)            # Linear → GELU → Linear
alpha = softmax(router_logits / tau)

branches = [
    SwiGLU(x),
    GLU(x),
    DepthwiseConvMLP(x),
]

output = sum(alpha_i * branch_i for alpha_i, branch_i in zip(alpha, branches))
```
|
|
|
|
|
Branches: |
|
|
|
|
|
| Branch | Role | Sketch |
|--------------|--------------------------------|-----------------------------------------------|
| SwiGLU | Default gated MLP | Linear(up) → split → SiLU×gate → Linear(down) |
| GLU | Alternative gating dynamics | Linear(up) → split → Sigmoid×gate → Linear(down) |
| DepthwiseConv | Local token patterns / n-grams | Depthwise causal conv (k=3) → MLP |
|
|
|
|
|
|
|
|
Routing controls: |
|
|
|
|
|
- **Temperature schedule** `tau_start → tau_end` (higher early = softer mixing)
- **Entropy-max aux loss**: encourages non-collapsed branch usage
- **Depth-scaled parameters**: router temperature and aux-loss weight are scaled by ≈ √(depth_ratio) when going from shallower (12L) to deeper (24L) models
|
|
|
|
|
|
|
|
|
|
|
The key property is that routing remains soft: typical healthy distributions have a dominant branch (~55–65%) and minority branches (~15–25%) instead of hard one-hot selection. |
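For concreteness, here is a minimal, self-contained PyTorch sketch of the soft router and the entropy-max auxiliary loss. It is illustrative, not the repository implementation: the class name, the router hidden size, and the aux-loss formulation (1 minus normalized entropy) are assumptions; only the Linear → GELU → Linear router and softmax-over-tau blending come from the description above.

```python
# Illustrative sketch of soft routing + entropy-max aux loss.
# Names (PolymorphicMLP, d_model // 4 router width) are assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolymorphicMLP(nn.Module):
    def __init__(self, d_model: int, branches: list[nn.Module], tau: float = 2.0):
        super().__init__()
        self.branches = nn.ModuleList(branches)       # e.g. SwiGLU, GLU, DepthwiseConv
        self.router = nn.Sequential(                  # Linear → GELU → Linear
            nn.Linear(d_model, d_model // 4),
            nn.GELU(),
            nn.Linear(d_model // 4, len(branches)),
        )
        self.tau = tau                                # annealed tau_start → tau_end during training

    def forward(self, x: torch.Tensor):
        logits = self.router(x)                       # (B, T, num_funcs)
        alpha = F.softmax(logits / self.tau, dim=-1)  # soft, per-token branch weights
        out = sum(alpha[..., i:i + 1] * b(x) for i, b in enumerate(self.branches))

        # Entropy-max aux loss: push normalized entropy of alpha towards 1
        # (minimize 1 - H(alpha)/log(num_funcs)) to discourage collapse.
        entropy = -(alpha * alpha.clamp_min(1e-9).log()).sum(-1).mean()
        entropy_norm = entropy / math.log(len(self.branches))
        aux_loss = 1.0 - entropy_norm
        return out, alpha, aux_loss
```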
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## 4. Training data

The pre-training data follows the codelion / DataComp LM mixture guidelines:
|
|
|
|
|
| Dataset | Share | Description |
|----------------------------|-------|-------------------------------------------------|
| codelion/finepdfs-1B | 50% | Technical/academic PDFs (high semantic density) |
| codelion/dclm-baseline-1B | 30% | General web corpus baseline |
| codelion/fineweb-edu-1B | 20% | Educational / explanatory web data |
|
|
|
|
|
|
|
|
Target token budget for this configuration: ~60B tokens (example setting). |
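For reference, a minimal sketch of building the 50/30/20 mixture with Hugging Face `datasets`. The exact loading code used for the reference run is not shown here; the `train` split and streaming mode are assumptions.

```python
# Illustrative 50/30/20 interleaving of the three corpora listed above.
# Streaming mode and the "train" split are assumptions, not taken from the run.
from datasets import load_dataset, interleave_datasets

finepdfs = load_dataset("codelion/finepdfs-1B", split="train", streaming=True)
dclm     = load_dataset("codelion/dclm-baseline-1B", split="train", streaming=True)
fineweb  = load_dataset("codelion/fineweb-edu-1B", split="train", streaming=True)

mix = interleave_datasets(
    [finepdfs, dclm, fineweb],
    probabilities=[0.5, 0.3, 0.2],      # FinePDFs 50% • DCLM 30% • FineWeb-Edu 20%
    seed=42,
    stopping_strategy="all_exhausted",  # keep sampling until every source is exhausted
)
```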
|
|
|
|
|
For licensing and detailed descriptions, please refer to each dataset on Hugging Face. |
|
|
|
|
|
|
|
|
If you reuse this mixture, please also cite: |
|
|
|
|
|
```
@article{sharma2025billion,
  title  = {The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author = {Sharma, Asankhaya},
  year   = {2025},
  url    = {https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}
```
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## 5. Training procedure
|
|
|
|
|
> Note: the numbers below describe the reference run configuration used to train this checkpoint.
> You can adapt them for your own experiments.
|
|
|
|
|
|
|
|
|
|
|
### 5.1 Core hyperparameters
|
|
|
|
|
| Hyperparameter | Value / Notes |
|--------------------------|--------------------------------|
| Layers | 24 |
| Hidden size | 768 |
| Attention heads | 12 |
| MLP expansion | 4× |
| Per-device batch size | 4 |
| Grad accumulation | 8 (effective batch 32) |
| Optimizer / LR schedule | AdamW, lr=1.2e-4, cosine decay |
| Warmup | 10% of total steps |
| Weight decay | 0.01 |
| Label smoothing | 0.01 |
| Precision | bf16 + fp32 LayerNorm |
| Max steps | 60k (example target) |
|
|
|
|
|
|
|
|
Example launch: |
|
|
|
|
|
```bash
python scripts/train_veronica.py \
  --config configs/veronica-pretrain-24L.json \
  --dataset_paths data/mix_optimal_50_30_20 \
  --output_dir runs/veronica-pretrain-24L \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --max_steps 60000 \
  --learning_rate 1.2e-4 \
  --warmup_ratio 0.10 \
  --weight_decay 0.01 \
  --max_seq_len 1024 \
  --router_tau_start 2.2 --router_tau_end 1.4 --router_tau_freeze_steps 6000 \
  --router_aux_start 0.008 --router_aux_end 0.016 \
  --router_force_prob 0.10 --router_force_warmup_steps 5000 \
  --rep_alpha 0.05 \
  --seed 42
```
|
|
|
|
|
### 5.2 Context-length curriculum & “512-token trap”
|
|
|
|
|
Empirical findings on 24-layer models:

- Starting at 512 tokens caused router collapse around step ~3k: one branch dominated (>70%), entropy dropped, and the other branches starved.
- Starting directly at 1024 tokens avoided collapse and produced stable, soft routing.
|
|
|
|
|
|
|
|
Recommended curriculum for 24L: |
|
|
|
|
|
```
Steps 0–20k   : 1024 tokens
Steps 20k–60k : 2048 tokens
```
|
|
|
|
|
For shallower (~12L) models, a 512→1024→2048 curriculum can work; for ≥20L, starting at 1024 is strongly recommended. |
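As a minimal sketch of the recommended 24L schedule (the helper name is illustrative and not part of the training script):

```python
# Illustrative helper mirroring the 24L curriculum above.
def seq_len_for_step(step: int) -> int:
    """Return the training sequence length for a given optimizer step."""
    if step < 20_000:
        return 1024   # phase 1: steps 0–20k
    return 2048       # phase 2: steps 20k–60k
```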
|
|
|
|
|
### 5.3 Router health during training
|
|
|
|
|
Training logs include entries like: |
|
|
|
|
|
```
[router] alpha=[a0, a1, a2] entropy_norm=E
```
|
|
|
|
|
Healthy targets (rough guideline): |
|
|
|
|
|
| Phase | Steps | Entropy (norm) | Min branch share |
|-------------|---------|----------------|------------------|
| Warmup | 0–5k | ≥ 0.90 | ≥ 0.25 |
| Post-freeze | 5k–10k | ≥ 0.75 | ≥ 0.12 |
| Stable | 10k+ | ≥ 0.70 | ≥ 0.15 |
|
|
|
|
|
|
|
|
Collapsed routing typically shows up as: |
|
|
|
|
|
- Entropy < 0.65
- One branch > 80% usage for many thousands of steps
- Other branches stuck < 5–10%
|
|
|
|
|
|
|
|
The provided training script (`scripts/train_veronica.py`) implements the entropy-max aux loss and router schedules out of the box.
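If you want to turn the logged `alpha` values into an automatic check against the thresholds above, a small helper could look like the following sketch (the function name and exact thresholds mirror this card's guideline table; it is not part of the codebase):

```python
# Illustrative health check matching the guideline table above.
import math

def router_is_healthy(alpha: list[float], step: int) -> bool:
    """alpha: mean per-branch routing weights logged as [a0, a1, a2]."""
    entropy = -sum(a * math.log(a) for a in alpha if a > 0)
    entropy_norm = entropy / math.log(len(alpha))
    if step < 5_000:                                   # warmup
        return entropy_norm >= 0.90 and min(alpha) >= 0.25
    if step < 10_000:                                  # post-freeze
        return entropy_norm >= 0.75 and min(alpha) >= 0.12
    return entropy_norm >= 0.70 and min(alpha) >= 0.15  # stable phase
```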
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## 6. Evaluation
|
|
|
|
|
### 6.1 Current evaluation status
|
|
|
|
|
At the time of this release: |
|
|
|
|
|
- No standardized benchmarks (e.g. lm-eval-harness) have been run yet.
- There are no public numbers for:
  - MMLU (5-shot / 0-shot)
  - ARC-e / ARC-c
  - HellaSwag, PIQA, GSM8K, etc.
|
|
|
|
|
|
|
|
|
|
|
Internal training logs show sensible LM loss curves and stable routing, but this is not a substitute for external evaluation. |
|
|
|
|
|
> 🔎 Interpretation: This checkpoint should be treated as a router / architecture experiment, not as a drop-in replacement for existing small LMs like Llama-3.2-1B, Gemma-2B, SmolLM, etc. |
|
|
|
|
|
|
|
|
|
|
|
### 6.2 Planned evaluation (suggested)
|
|
|
|
|
If you adopt or extend Veronica-Polymorphic, consider running: |
|
|
|
|
|
- **lm-eval-harness** on `mmlu`, `arc_challenge`, `arc_easy`, `hellaswag`, `piqa` (example command below)
- **Instruction / SFT** (if you fine-tune): Alpaca-style or OpenAssistant subsets
- **Ablations**:
  - Polymorphic MLP vs vanilla SwiGLU MLP with the same depth/width
  - With / without entropy-max routing
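An example of what such a harness run might look like, assuming the checkpoint can be loaded through `transformers` (see section 7.1 on registering the config/model); the task names follow lm-eval-harness conventions and `trust_remote_code=True` is an assumption about how the custom architecture would be exposed:

```bash
# Illustrative only: assumes the checkpoint loads via the transformers "hf" model path.
lm_eval --model hf \
  --model_args pretrained=MhaWay/Veronica,trust_remote_code=True \
  --tasks mmlu,arc_easy,arc_challenge,hellaswag,piqa \
  --batch_size 8
```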
|
|
|
|
|
|
|
|
|
|
|
Contributions of evaluation scripts and reported metrics are very welcome. |
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## 7. How to use
|
|
|
|
|
### 7.1 Loading from code
|
|
|
|
|
If you’re using the Veronica codebase directly: |
|
|
|
|
|
```python
from veronica import VeronicaConfig, VeronicaForCausalLM

cfg = VeronicaConfig(
    n_layer=24,
    num_funcs=3,  # SwiGLU, GLU, DepthwiseConv
)
model = VeronicaForCausalLM(cfg)
model.eval()
```
|
|
|
|
|
You can also integrate via `transformers` if you register the config/model, or load the checkpoint from this repo if exported.
|
|
|
|
|
### 7.2 Simple generation example
|
|
|
|
|
```python
from transformers import AutoTokenizer
from veronica import VeronicaForCausalLM, VeronicaConfig

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # or your own tokenizer
config = VeronicaConfig.from_pretrained("MhaWay/Veronica")
model = VeronicaForCausalLM.from_pretrained("MhaWay/Veronica", config=config)

prompt = "The theory of relativity states that"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,   # needed for temperature / top_p sampling to take effect
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
|
|
|
|
|
> Note: this is a raw pre-train checkpoint. Expect unaligned, sometimes incoherent generations. |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## 8. Extensibility: adding new branches
|
|
|
|
|
One motivation for polymorphic MLPs is incremental expansion: |
|
|
|
|
|
You can increase capacity or add a specialized branch (e.g. translation, code, a domain-specific MLP) as follows (a minimal sketch follows the list):

- Expanding `num_funcs`
- Initializing the new branch + router output slice
- Running a short fine-tune with:
  - Router + new branch trainable
  - Optionally freezing the rest of the backbone during warmup
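A minimal sketch of what the expansion step could look like, assuming the `PolymorphicMLP` layout sketched in section 3.2 (router as Linear → GELU → Linear). The repository's actual utilities may differ; function and attribute names here are assumptions.

```python
# Illustrative branch-expansion sketch; not the repository's exact API.
import torch
import torch.nn as nn

@torch.no_grad()
def add_branch(mlp, new_branch: nn.Module):
    # 1) Register the new branch module.
    mlp.branches.append(new_branch)

    # 2) Grow the router's final Linear by one output column: copy the old
    #    weights and give the new row a low initial routing prior so the
    #    existing branches are not disrupted at the start of fine-tuning.
    old = mlp.router[-1]  # Linear(hidden, num_funcs)
    new = nn.Linear(old.in_features, old.out_features + 1, bias=old.bias is not None)
    new.weight[: old.out_features].copy_(old.weight)
    new.weight[old.out_features:].zero_()
    if old.bias is not None:
        new.bias[: old.out_features].copy_(old.bias)
        new.bias[old.out_features:].fill_(-2.0)  # small initial share for the new branch
    mlp.router[-1] = new
```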
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The repository includes utilities and example code for: |
|
|
|
|
|
- Adding a new branch type
- Copying router weights and initializing the new column
- Scheduling a short specialization fine-tune
|
|
|
|
|
|
|
|
For details, see the “Incremental Expansion” and “Translation Branch” sections in the source code and examples. |
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## 9. Limitations & risks
|
|
|
|
|
This model: |
|
|
|
|
|
- May generate inaccurate or nonsensical text
- May reproduce biases present in the underlying datasets
- Is not instruction-tuned:
  - Does not follow natural-language instructions reliably
  - Can ignore prompts, hallucinate, or switch topics
- Has no safety layer:
  - No explicit filtering of harmful/toxic content
  - No RLHF / preference optimization
|
|
|
|
|
|
|
|
|
|
|
Do not use Veronica-Polymorphic for: |
|
|
|
|
|
- Safety-critical systems
- Medical, legal, or financial advice
- Content moderation without extensive additional work
- Any setting where unfiltered, biased generations would cause harm
|
|
|
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## 10. Roadmap
|
|
|
|
|
Planned / desired directions: |
|
|
|
|
|
| Version | Goal |
|---------|------------------------------------------------------|
| v0.1 | Core polymorphic MLP + tests |
| v0.2 | Stable router schedules + logging |
| v0.3 | Configurable attention variants / FlashAttention |
| v0.4 | Public evaluation scripts (lm-eval-harness) |
| v0.5 | Reference instruction-tuned variant |
| v0.6 | Example specialization branches (e.g. translation) |
|
|
|
|
|
|
|
|
Community PRs are welcome, especially for: |
|
|
|
|
|
- Evaluation & ablations vs vanilla MLP baselines
- New branch types and routing strategies
- Practical recipes for SFT / alignment on top of Veronica
|
|
|
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## 11. License
|
|
|
|
|
This model and code are released under the Apache-2.0 license. |
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## 12. Citation
|
|
|
|
|
If you use Veronica-Polymorphic in your work, please cite: |
|
|
|
|
|
``` |
|
|
@misc{veronica-2025, |
|
|
title = {Veronica: Entropy-Regularized Polymorphic Branching for Adaptive Language Modeling}, |
|
|
author = {Emanuele D'Angelo}, |
|
|
year = {2025}, |
|
|
howpublished = {\url{https://huggingface.co/MhaWay/Veronica}} |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## 13. Acknowledgments
|
|
|
|
|
- Mixture / routing inspiration from Switch Transformer, GLaM, and the broader MoE literature.
- Dataset mixture ratios guided by codelion’s DataComp LM work.
- RoPE implementation adapted from GPT-NeoX-style implementations.
|
|
|