Emergent Semantics β€” Model_64_BIT (272M)

This repository provides Model_64_BIT (272M) β€” an ablation model from the papers:

πŸ“š Paper (Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations) - https://openreview.net/forum?id=Odh8IynO1o

πŸ“š Paper (Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate) - https://arxiv.org/abs/2507.07129

This checkpoint is designed to test whether a Transformer can learn robust language behavior when the entire input embedding layer is frozen and contains no semantic or visual signal.

Compared to Model_16_BIT, this model uses a larger frozen binary code (n_embed=64), but the codes are randomly generated rather than encoding the token index directly.


Key idea (what this ablation tests)

  • Each token is assigned a frozen 64-dimensional binary vector (n_embed=64).
  • These vectors are randomly generated, but constructed to guarantee a unique ID per token (no collisions by design).
  • The embedding layer is frozen throughout training (requires_grad = False).

To match the Transformer hidden size, the 64-dim embedding is expanded to 1024 via a non-trainable repetition: repeat_interleave(16) β†’ 64 * 16 = 1024.

This makes the input compatible with the same d_model=1024 Transformer backbone while ensuring the embedding table itself is purely a fixed identifier space.
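As an illustration, here is a minimal sketch of how such a frozen identifier table and its non-trainable expansion can be built. The exact construction used for this checkpoint is not reproduced here; the collision-free resampling loop and variable names below are assumptions for illustration only.

import torch
import torch.nn as nn

vocab_size, n_embed, d_model = 65536, 64, 1024
scale = d_model // n_embed  # 16

# Draw random 0/1 codes and redraw any duplicated rows until every token id
# has a unique 64-bit code (the checkpoint guarantees no collisions by design).
codes = torch.randint(0, 2, (vocab_size, n_embed), dtype=torch.float32)
while True:
    _, inverse, counts = torch.unique(codes, dim=0, return_inverse=True, return_counts=True)
    dup = counts[inverse] > 1
    if not dup.any():
        break
    codes[dup] = torch.randint(0, 2, (int(dup.sum()), n_embed), dtype=torch.float32)

emb = nn.Embedding(vocab_size, n_embed)
emb.weight.data.copy_(codes)
emb.weight.requires_grad = False  # frozen for the entire training run

token_ids = torch.tensor([[65, 66]])          # toy batch of two tokens
x64 = emb(token_ids)                          # (1, 2, 64)
x1024 = x64.repeat_interleave(scale, dim=-1)  # (1, 2, 1024), input to the d_model=1024 backbone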


Important: parameter count difference (vs 335M models)

This checkpoint has ~272M parameters, while models with a standard n_embed=1024 embedding table (e.g. UNI_GLYPH / unfrozen baselines) are ~335M.

The reduction is primarily due to the smaller embedding matrix:

  • Standard embedding params: vocab_size * 1024 = 65536 * 1024 β‰ˆ 67.1M
  • This model’s embedding params: vocab_size * 64 = 65536 * 64 β‰ˆ 4.19M

So the Transformer backbone is the same, but the embedding table is much smaller, lowering total parameter count.
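The arithmetic can be checked in a couple of lines, using only the figures quoted above:

vocab_size = 65536

std_embed_params = vocab_size * 1024  # 67,108,864 β‰ˆ 67.1M
this_embed_params = vocab_size * 64   #  4,194,304 β‰ˆ 4.19M
saving = std_embed_params - this_embed_params

print(f"standard table: {std_embed_params / 1e6:.1f}M")
print(f"this model:     {this_embed_params / 1e6:.2f}M")
print(f"difference:     {saving / 1e6:.1f}M  (consistent with ~335M - ~272M)")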


Model summary

  • Architecture: decoder-only Transformer (GPT-like)
  • Hidden size (d_model): 1024
  • Layers: 16
  • Heads: 32
  • Positional encoding: rotary embeddings
  • Activation: GELU
  • Tokenizer / vocab size: 65,536 (bvv241-2-3 compatible)
  • Input embeddings: frozen, binary, n_embed=64, expanded to 1024 by repetition (non-trainable)
  • Embedding initialization: random binary codes with unique per-token assignment (no collisions)
  • Output head: not tied to the input embeddings (trained separately)
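A note on the last point: with a 65536 x 64 input table and an output head that must map d_model=1024 back to the 65,536-way vocabulary, weight tying is ruled out by shape alone. A minimal check, assuming the head is reachable through the standard get_output_embeddings() accessor (custom model classes may not expose it):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Bochkov/emergent-semantics-model-64-bit-272m", trust_remote_code=True
)

emb_w = model.token_embeddings.weight
print("input table:", tuple(emb_w.shape), "requires_grad:", emb_w.requires_grad)  # (65536, 64), False

head = model.get_output_embeddings()  # may be None for custom classes
if head is not None:
    print("output head:", tuple(head.weight.shape))  # expected (65536, 1024)
    print("tied to input table:", head.weight is model.token_embeddings.weight)  # expected False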

Files in this repo (embedding reference)

For transparency and reproducibility, the explicit frozen embedding values are included in this repository.

  • embeddings.txt (human-readable reference; token β†’ 64-bit vector):
    https://huggingface.co/Bochkov/emergent-semantics-model-64-float-272m/blob/main/embeddings.txt

Note: Embeddings are shipped in this model repo (even though the tokenizer exists separately) to keep the model+embedding mapping self-contained and unambiguous.
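To cross-check the shipped reference file against the checkpoint, something like the sketch below can be used. The file format is assumed here (one token per line: token id followed by 64 whitespace-separated bits); adjust the parsing to the actual layout of embeddings.txt.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Bochkov/emergent-semantics-model-64-bit-272m", trust_remote_code=True
)
W = model.token_embeddings.weight.detach().cpu()  # (65536, 64)

# Hypothetical parser: "<token_id> <b0> <b1> ... <b63>" per line.
rows = {}
with open("embeddings.txt") as f:
    for line in f:
        parts = line.split()
        if len(parts) >= 65:
            rows[int(parts[0])] = torch.tensor([float(b) for b in parts[1:65]])

for tid in list(rows)[:5]:  # spot-check a few entries
    assert torch.equal(rows[tid], W[tid]), f"mismatch for token id {tid}"
print("spot-checked", min(5, len(rows)), "entries against the checkpoint: OK")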


Tokenizer

The intended tokenizer is bvv241-2-3 (same vocab size and indexing).

You may load the tokenizer either from this model repo (if included) or from the standalone tokenizer repo. The key requirement is exact vocab alignment.
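Whichever source you use, it is worth asserting the alignment up front; with a pure identifier embedding, a vocabulary mismatch silently maps tokens to the wrong frozen codes.

from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "Bochkov/emergent-semantics-model-64-bit-272m"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

# Both sides must agree on the 65,536-entry vocabulary.
assert tokenizer.vocab_size == model.config.vocab_size == 65536, (
    tokenizer.vocab_size,
    model.config.vocab_size,
)
print("vocab alignment OK:", tokenizer.vocab_size)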


How to use (Transformers)


import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Bochkov/emergent-semantics-model-64-bit-272m")
model = AutoModelForCausalLM.from_pretrained("Bochkov/emergent-semantics-model-64-bit-272m", trust_remote_code=True).to('cuda')

inputs = torch.tensor([tokenizer.encode("Question: What is the capital of Japan?\nAnswer:")], dtype=torch.long, device='cuda')

outputs = model.generate(
    inputs, 
    max_new_tokens=10,
    do_sample=False
)
print(tokenizer.decode(outputs[0].tolist()))

#Question: What is the capital of Japan?
#Answer:ζ—₯ζœ¬εœ‹
#    </s><|
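Continuing from the snippet above, sampled generation should also work, assuming the custom generate implementation accepts the standard sampling arguments (the values below are illustrative, not tuned for this checkpoint):

outputs = model.generate(
    inputs,
    max_new_tokens=50,
    do_sample=True,   # sample instead of greedy decoding
    temperature=0.8,  # illustrative, not tuned
    top_k=50,
)
print(tokenizer.decode(outputs[0].tolist()))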

Verify the 64-bit frozen binary embeddings (sanity check)

The model uses a frozen nn.Embedding table of shape (vocab_size=65536, n_embed=64) whose values are strictly binary (0/1). Each 64-dim vector is then deterministically expanded to d_model=1024 via repeat_interleave(scale=16).

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "Bochkov/emergent-semantics-model-64-bit-272m"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

print("vocab_size:", tokenizer.vocab_size)
print("config:", {k: getattr(model.config, k) for k in ["vocab_size", "n_embed", "d_model", "n_layer", "n_head", "scale"]})

# --- 1) Show embedding matrix shape (should be 65536 x 64) ---
W = model.token_embeddings.weight.detach().cpu()
print("token_embeddings.weight shape:", tuple(W.shape))  # (65536, 64)

# --- 2) Tokenize 'A' and show its token id  ---
text = "A"
ids = tokenizer.encode(text, add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(ids)

print(f"text={text!r}")
print("ids:", ids)
print("tokens:", tokens)

tid = ids[0]

# --- 3) Print the 64 dim vector and verify it is binary (0/1) ---
e64 = W[tid]  # shape: (64,)
print("64-dim embedding for token id", tid, ":", e64.tolist())

uniq = torch.unique(e64)
print("unique values in e64", uniq.tolist())

is_binary = torch.all((e64 == 0) | (e64 == 1)).item()
print("is strictly binary (0/1):", is_binary)

# --- 4) Show deterministic expansion to d_model=1024 via repeat_interleave ---
scale = model.config.scale  # should be 1024 // 64 = 16
e1024 = e64.repeat_interleave(scale)  # shape: (1024,)
print("expanded embedding shape:", tuple(e1024.shape))
print("expanded embedding first 128 values:", e1024[:128].tolist())

# --- 5) Global check: all embedding weights are exactly 0/1 ---
is_binary_global = torch.all((W == 0) | (W == 1)).item()
num_non_binary = torch.numel(W) - torch.sum((W == 0) | (W == 1)).item()
print("is binary globally (0/1):", is_binary_global)
print("non-binary entries:", int(num_non_binary))

Expected output highlights (example):

  • vocab_size: 65536
  • config: {'vocab_size': 65536, 'n_embed': 64, 'd_model': 1024, 'n_layer': 16, 'n_head': 32, 'scale': 16}
  • token_embeddings.weight shape: (65536, 64)
  • text='A'
  • ids: [65]
  • tokens: ['A']
  • 64-dim embedding for token id 65 : [0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
  • unique values in e64 [0.0, 1.0]
  • is strictly binary (0/1): True
  • expanded embedding shape: (1024,)
  • expanded embedding first 128 values: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
  • is binary globally (0/1): True
  • non-binary entries: 0
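The checks above confirm binarity but not the no-collision claim. A short way to verify uniqueness of all 65,536 codes, reusing the model loaded in the sanity check:

import torch

W = model.token_embeddings.weight.detach().cpu()  # (65536, 64), frozen 0/1 codes
unique_rows = torch.unique(W, dim=0)
print("unique codes:", unique_rows.shape[0], "of", W.shape[0])
assert unique_rows.shape[0] == W.shape[0], "collision detected in the frozen code table"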

Intended use

This model is intended for research only, especially for:

  • Comparisons vs Model_UNI_GLYPH (glyph/PCA frozen embeddings) and vs trainable-embedding baselines
  • Studying whether semantic structure emerges in Transformer blocks when the input embedding space is a random-but-unique identifier code
  • Ablations on embedding dimensionality (n_embed) while keeping the Transformer backbone fixed
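As a simple illustration of the second point above, one can probe whether the input space itself encodes any similarity: with random codes, the Hamming distance between any two tokens' 64-bit vectors should hover around 32 bits, related or not. This is a hypothetical probe, not an experiment from the paper; model and tokenizer are assumed loaded as in the usage example.

import torch

W = model.token_embeddings.weight.detach().cpu()  # frozen 0/1 codes

def hamming(a_id: int, b_id: int) -> int:
    # Number of differing bits between two tokens' frozen 64-bit codes.
    return int((W[a_id] != W[b_id]).sum())

id_A = tokenizer.encode("A", add_special_tokens=False)[0]
id_a = tokenizer.encode("a", add_special_tokens=False)[0]
id_B = tokenizer.encode("B", add_special_tokens=False)[0]

print("d('A','a') =", hamming(id_A, id_a))  # expected around 32: no case relation in the codes
print("d('A','B') =", hamming(id_A, id_B))  # expected around 32: no alphabetic relation either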

Not intended for production deployment (no instruction tuning, safety tuning, or factuality guarantees).


πŸ§‘β€πŸ”¬ Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

@article{bochkov2025emergent,
      title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
      author={Andrey Bochkov},
      journal={Transactions on Machine Learning Research},
      issn={2835-8856},
      year={2025},
      url={https://openreview.net/forum?id=Odh8IynO1o},
      note={}
}
@misc{bochkov2025growingtransformersmodularcomposition,
      title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate}, 
      author={A. Bochkov},
      year={2025},
      eprint={2507.07129},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.07129}, 
}