Emergent Semantics: Model_16_FLOAT (269M)
This repository provides Model_16_FLOAT (269M), an ablation model from the paper "Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations" (TMLR, 2025).
This checkpoint is designed to study the effect of normalization / PCA-style processing in a minimal frozen embedding setting.
Unlike Model_UNI_GLYPH, this model does not use glyph-based embeddings. Instead, it uses a frozen 16-dimensional float embedding per token.
Key idea (what this ablation tests)
This model isolates the impact of frozen float embeddings (with PCA + normalization) versus the strictly binary token-ID variant (Model_16_BIT):
- n_embed = 16 per token (float components, not binary)
- Embedding vectors are precomputed (PCA + L2 normalization) and then frozen
- The embedding layer is never updated (requires_grad=False)
- To match the Transformer hidden size, the 16-dim embedding is expanded to 1024 via a non-trainable repetition: repeat_interleave(64), i.e. 16 * 64 = 1024
This lets you test whether the model's behavior changes when the frozen token "identifier" is:
- discrete + purely ID-like (16-bit), vs.
- continuous + normalized (16-float)
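A minimal sketch of the frozen float-embedding path, assuming a precomputed PCA table (the random `pca_vectors` below is only a stand-in for the real precomputed vectors, not the published weights):

```python
import torch
import torch.nn as nn

vocab_size, n_embed, d_model = 65536, 16, 1024

# Stand-in for the precomputed PCA-reduced vectors; L2-normalize each row.
pca_vectors = torch.randn(vocab_size, n_embed)
pca_vectors = pca_vectors / pca_vectors.norm(dim=-1, keepdim=True)

embed = nn.Embedding(vocab_size, n_embed)
embed.weight.data.copy_(pca_vectors)
embed.weight.requires_grad = False           # frozen: never updated during training

token_ids = torch.tensor([[101, 2045, 7]])   # example token IDs
x = embed(token_ids)                                  # (batch, seq, 16)
x = x.repeat_interleave(d_model // n_embed, dim=-1)   # (batch, seq, 1024) = 16 * 64
```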
Important: parameter count difference (vs 335M models)
This checkpoint has ~269M parameters, while models with a standard n_embed=1024 embedding table (e.g. UNI_GLYPH / unfrozen baselines) are ~335M.
This difference is expected and comes primarily from the embedding matrix size:
- Standard embedding params: vocab_size * 1024 = 65536 * 1024 ≈ 67.1M
- This model's embedding params: vocab_size * 16 = 65536 * 16 ≈ 1.0M
So the Transformer backbone is the same (layers/heads/d_model), but the embedding table is much smaller, reducing total parameters.
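For a quick check of where the ~66M-parameter gap comes from (arithmetic only, no model loading):

```python
vocab_size = 65536
standard_embed_params = vocab_size * 1024   # 67,108,864 ≈ 67.1M
this_model_embed_params = vocab_size * 16   # 1,048,576  ≈ 1.0M
print(f"gap ≈ {(standard_embed_params - this_model_embed_params) / 1e6:.1f}M")  # ≈ 66.1M
```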
Model summary
- Architecture: decoder-only Transformer (GPT-like)
- Hidden size (d_model): 1024
- Layers: 16
- Heads: 32
- Positional encoding: rotary embeddings
- Activation: GELU
- Tokenizer / vocab size: 65,536 (bvv241-2-3 compatible)
- Input embeddings: frozen, n_embed=16 (float, PCA + L2 normalized), expanded to 1024 by repetition (non-trainable)
- Output head: not tied to the input embeddings (trained separately)
Tokenizer
The intended tokenizer is bvv241-2-3 (same vocab size and indexing): https://huggingface.co/Bochkov/bvv241-2-3
You may load the tokenizer either from this model repo (if included) or from the standalone tokenizer repo. The key requirement is exact vocab alignment.
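A hedged sketch of loading the standalone tokenizer and sanity-checking vocab alignment (the check below is an assumption-level illustration, not part of the released code):

```python
from transformers import AutoTokenizer

# Standalone tokenizer repo; the model repo may also ship the same files.
tok = AutoTokenizer.from_pretrained("Bochkov/bvv241-2-3")

# The model expects exactly 65,536 token IDs with matching indexing.
assert len(tok) == 65536, f"unexpected vocab size: {len(tok)}"
```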
How to use (Transformers)
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("Bochkov/emergent-semantics-model-16-float-269m")
model = AutoModelForCausalLM.from_pretrained(
    "Bochkov/emergent-semantics-model-16-float-269m", trust_remote_code=True
).to(device)

# Greedy decoding on a short factual prompt
inputs = torch.tensor(
    [tokenizer.encode("Question: What is the capital of Japan?\nAnswer:")],
    dtype=torch.long, device=device
)
outputs = model.generate(inputs, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(outputs[0].tolist()))
```
Intended use
Research only, especially for:
- Comparing Model_16_FLOAT vs Model_16_BIT (effect of continuous normalized vectors vs binary ID)
- Comparing Model_16_FLOAT vs Model_UNI_GLYPH (effect of glyph-derived structure vs minimal vectors)
- Studying emergent semantics when embeddings are frozen and non-semantic
Not intended for production deployment.
Related links
- Model collection (paper artifacts):
https://huggingface.co/collections/Bochkov/emergent-semantics-beyond-token-embeddings - UNI_GLYPH main model:
https://huggingface.co/Bochkov/emergent-semantics-model-uni-glyph-335m - 16-bit ablation:
https://huggingface.co/Bochkov/emergent-semantics-model-16-bit-269m - Tokenizer:
https://huggingface.co/Bochkov/bvv241-2-3 - Code (GitHub):
https://github.com/AVBochkov/Embeddings
🧑‍🔬 Citation & Concept
If you use this model or the underlying concepts in your research, please cite our work:
```bibtex
@article{bochkov2025emergent,
  title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
  author={Andrey Bochkov},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2025},
  url={https://openreview.net/forum?id=Odh8IynO1o},
  note={}
}

@misc{bochkov2025growingtransformersmodularcomposition,
  title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate},
  author={A. Bochkov},
  year={2025},
  eprint={2507.07129},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.07129},
}
```