Update README.md
README.md
tags:
- causal-lm
- frozen-embeddings
- conceptual-demo
- transformer
pipeline_tag: text-generation
library_name: transformers
---

# demo_bvv_ru

This repository contains the model and associated resources from the papers

[💻 Code](https://github.com/AVBochkov/Embeddings)

---

## Model summary

**Proof-of-concept Transformer LM with frozen, non-semantic token embeddings trained on a small English-Russian corpus.**

**This model is part of a series of models designed to demonstrate:**
- The viability of transformer language models where the embedding layer is precomputed from non-semantic (Unicode/visual) features and entirely _frozen_ during training.
- The possibility of modular/federated model fusion (MoE) by combining models with a shared token embedding matrix, without any additional retraining or alignment.

- **Parameters:** 0.5B
- **Architecture:** 16-layer transformer, rotary attention, 1024 context, 32 heads.
- **Embedding:** Precomputed, _frozen_, visual/Unicode-based (see the sketch after this list).
- **Training corpus:** Small-scale, <10B tokens, ~10% SFT-mixed (for metric tracking, not strong performance).
- **Languages:** Russian, English.
- **MoE compatibility:** The embedding space is shared with other `bvv` models (e.g. `Bochkov/demo_bvv_zh`), enabling seamless MoE or model fusion at the output-head level.
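
The papers describe the actual visual/Unicode embedding construction; the snippet below is only a minimal sketch of the general idea, with a made-up `unicode_embedding` helper and a toy vocabulary, showing how precomputed, code-point-derived vectors can be loaded into a frozen embedding layer.

```python
import torch
import torch.nn as nn

def unicode_embedding(token: str, dim: int = 64) -> torch.Tensor:
    """Toy non-semantic embedding built purely from Unicode code points.

    Illustrative stand-in, not the exact visual/Unicode scheme from the papers.
    """
    vec = torch.zeros(dim)
    for i, ch in enumerate(token):
        cp = ord(ch)
        # Deterministically spread code-point information across the vector.
        vec[(i * 7 + cp) % dim] += (cp % 251) / 251.0 - 0.5
    return vec

# Hypothetical mini-vocabulary, just to show the mechanics.
vocab = ["Hello", ",", " мир", "!"]
weights = torch.stack([unicode_embedding(t) for t in vocab])

# freeze=True keeps the precomputed embeddings fixed for the entire training run.
embedding = nn.Embedding.from_pretrained(weights, freeze=True)
print(embedding(torch.tensor([0, 2])).shape)  # torch.Size([2, 64])
```

Because such vectors never receive gradients, every model built on the same matrix keeps an identical token geometry, which is what makes the cross-model fusion described below possible.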
|
## Key points

This model was trained on a small corpus and is intended only to demonstrate the viability of frozen, visual/Unicode-derived embeddings for training and transfer between languages.

Performance is not comparable to SOTA, but the model shows competitive compositional skills versus a fully trainable embedding baseline.

For direct benchmarking, see also [Bochkov/demo_bvv_unfrozen_ru] — an identical architecture and dataset, but with standard trainable token embeddings.

The shared embedding space enables seamless fusion/MoE with Bochkov/demo_bvv_zh and Bochkov/demo_bvv_moe (the merged model); a sketch of this idea follows below.
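
`Bochkov/demo_bvv_moe` is the ready-made merged model. Purely as an illustration of what output-head-level fusion can look like when checkpoints share one frozen embedding matrix (and therefore one token space), the sketch below naively averages next-token logits from two compatible models. The uniform 0.5/0.5 mixture and greedy decoding are simplifying assumptions, not how `demo_bvv_moe` was built, and the code assumes the repos' custom modeling code returns standard causal-LM logits.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Both checkpoints share the same frozen embedding matrix and vocabulary,
# so their output logits index the same token ids and can be mixed directly.
tok = AutoTokenizer.from_pretrained('Bochkov/demo_bvv_ru')
m_ru = AutoModelForCausalLM.from_pretrained('Bochkov/demo_bvv_ru', trust_remote_code=True).eval()
m_zh = AutoModelForCausalLM.from_pretrained('Bochkov/demo_bvv_zh', trust_remote_code=True).eval()

ids = tok("Hello, мир! ", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):
        logits_ru = m_ru(ids).logits[:, -1, :]
        logits_zh = m_zh(ids).logits[:, -1, :]
        mixed = 0.5 * logits_ru + 0.5 * logits_zh     # naive uniform "expert" mixture
        next_id = mixed.argmax(dim=-1, keepdim=True)  # greedy decoding for simplicity
        ids = torch.cat([ids, next_id], dim=-1)

print(tok.decode(ids[0]))
```

A real MoE setup would replace the fixed 0.5/0.5 weights with a router or per-token gate; the point here is only that no retraining or alignment step is needed before mixing.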

Main evaluation:
- MMLU avg: 22.3% ± 0.1
- ARC-e: 23.0%
- CommonsenseQA: 20.1%
- SQuAD: 14.8%
- BLEU [en-ru]: 6.4%
- BLEU [ru-en]: 8.8%
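
For a rough idea of how a BLEU-style score could be reproduced, the sketch below uses `sacrebleu` over model-generated translations. The prompt template and the tiny parallel "test set" are placeholders, not the evaluation protocol behind the figures above.

```python
import sacrebleu
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained('Bochkov/demo_bvv_ru')
model = AutoModelForCausalLM.from_pretrained('Bochkov/demo_bvv_ru', trust_remote_code=True)

# Placeholder parallel data; swap in a real en-ru test set.
pairs = [
    ("The cat sits on the mat.", "Кот сидит на коврике."),
    ("I like tea.", "Я люблю чай."),
]

hyps, refs = [], []
for src, ref in pairs:
    prompt = f"Translate to Russian: {src}\n"  # hypothetical prompt format
    enc = tok(prompt, return_tensors="pt")
    out = model.generate(**enc, max_new_tokens=64, do_sample=False)
    hyps.append(tok.decode(out[0][enc.input_ids.shape[1]:], skip_special_tokens=True))
    refs.append(ref)

print(sacrebleu.corpus_bleu(hyps, [refs]).score)  # corpus BLEU, in percent
```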

This work demonstrates that transformer blocks, not token embeddings, carry the semantic burden in LLMs — a step toward modular, fusable, multilingual LMs.

## Example Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer (the custom modeling code lives in the repo).
model = AutoModelForCausalLM.from_pretrained('Bochkov/demo_bvv_ru', trust_remote_code=True).to('cuda')
tokenizer = AutoTokenizer.from_pretrained('Bochkov/demo_bvv_ru')

# Mixed English/Russian prompt; sampled continuation.
inputs = tokenizer("Hello, мир! ", return_tensors="pt").to('cuda')
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    do_sample=True
)
print(tokenizer.decode(outputs[0]))
```

## 🧑‍🔬 Citation & Concept

If you find this work helpful or inspiring, please consider citing the associated papers:

```
@article{bochkov2025emergent,
url={https://arxiv.org/abs/2507.07129},
}
```