# mmBERT-Embed-32K-2D-Matryoshka

A multilingual embedding model with a 32K-token context window and 2D Matryoshka support for flexible efficiency-quality tradeoffs.
## Model Highlights

| Feature | Value |
|---|---|
| Parameters | 307M |
| Context Length | 32,768 tokens |
| Languages | 1800+ (via Glot500) |
| Embedding Dim | 768 (supports 64-768 via Matryoshka) |
| Architecture | ModernBERT encoder with YaRN scaling |
## Key Results

| Metric | Score |
|---|---|
| MTEB Mean (24 tasks) | 61.4 |
| STS Benchmark | 80.5 (exceeds Qwen3-0.6B's 76.17) |
| Dimension Retention | 99% @ 256d, 98% @ 64d |
| Layer Speedup | 3.3× @ 6L, 5.8× @ 3L |
| Latency vs BGE-M3 | 1.6-3.1× faster (FA2 advantage) |
## What is 2D Matryoshka?

This model supports two dimensions of flexibility:

- Dimension Reduction (Matryoshka): Truncate embeddings to smaller dimensions with minimal quality loss
- Layer Reduction (Adaptive): Use intermediate layer outputs for faster inference

| Config | Quality | Speedup | Storage |
|---|---|---|---|
| 22L, 768d | 100% | 1.0× | 100% |
| 22L, 256d | 99% | 1.0× | 33% |
| 22L, 64d | 98% | 1.0× | 8% |
| 6L, 768d | 56% | 3.3× | 100% |
| 6L, 256d | 56% | 3.3× | 33% |
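To make the Storage column concrete, here is a quick back-of-the-envelope calculation for one million float32 vectors (the vector count and dtype are illustrative assumptions, not benchmark settings):

```python
# Storage for 1M float32 vectors at each Matryoshka dimension
n_vectors, bytes_per_float = 1_000_000, 4
for dim in (768, 256, 64):
    gb = n_vectors * dim * bytes_per_float / 1e9
    print(f"{dim}d: {gb:.2f} GB")  # 768d: 3.07 GB, 256d: 1.02 GB, 64d: 0.26 GB
```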
## Usage

### Basic Usage (Sentence Transformers)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("llm-semantic-router/mmbert-embed-32k-2d-matryoshka")

sentences = [
    "This is a test sentence.",
    "这是一个测试句子。",
    "Dies ist ein Testsatz.",
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)
```
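As a quick sanity check, semantically equivalent sentences in different languages should embed close together; a minimal sketch using the `similarity` helper (available in sentence-transformers >= 3.0):

```python
# Pairwise cosine similarities; the three translations should all score high
similarities = model.similarity(embeddings, embeddings)
print(similarities)
```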
### Matryoshka Dimension Reduction

```python
import torch.nn.functional as F

embeddings = model.encode(sentences, convert_to_tensor=True)

# Truncate to 256 dimensions, then re-normalize
embeddings_256d = embeddings[:, :256]
embeddings_256d = F.normalize(embeddings_256d, p=2, dim=1)

# Truncate to 64 dimensions, then re-normalize
embeddings_64d = embeddings[:, :64]
embeddings_64d = F.normalize(embeddings_64d, p=2, dim=1)
```
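Alternatively, recent sentence-transformers releases (>= 2.7) can truncate for you via the `truncate_dim` argument; a minimal sketch assuming that version:

```python
# Let the library handle truncation and normalization
model_256 = SentenceTransformer(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
    truncate_dim=256,
)
embeddings_256d = model_256.encode(sentences, normalize_embeddings=True)
```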
### Long Context (up to 32K tokens)

```python
# max_seq_length can be raised up to the model's 32,768-token limit
model.max_seq_length = 8192

long_document = "..." * 10000
embedding = model.encode(long_document)
```
### Layer Reduction (Advanced)

For latency-critical applications, you can extract embeddings from intermediate layers:

```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

model = AutoModel.from_pretrained(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
    trust_remote_code=True,
    output_hidden_states=True,
)
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert-embed-32k-2d-matryoshka")

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# hidden_states[0] is the embedding output, so index 6 is the output of layer 6
hidden = outputs.hidden_states[6]
hidden = model.final_norm(hidden)

# Mean-pool over valid tokens, then L2-normalize
mask = inputs["attention_mask"].unsqueeze(-1).float()
pooled = (hidden * mask).sum(1) / mask.sum(1)
embeddings = F.normalize(pooled, p=2, dim=1)
```
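The two axes compose: continuing from the snippet above, the 6L/256d configuration from the table is just a truncation of the layer-6 pooled output before normalization.

```python
# Combined 2D configuration: layer-6 output truncated to 256 dimensions
pooled_256d = pooled[:, :256]
embeddings_6l_256d = F.normalize(pooled_256d, p=2, dim=1)
```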
## Evaluation Results

### MTEB Benchmark (24 tasks)

| Category | Score |
|---|---|
| STS (7 tasks) | 79.3 |
| Classification (6 tasks) | 62.4 |
| Pair Classification (2 tasks) | 76.2 |
| Reranking (2 tasks) | 64.4 |
| Clustering (4 tasks) | 36.9 |
| Retrieval (3 tasks) | 38.2 |
| Overall Mean | 61.4 |
### STS Benchmark

| Model | Parameters | STS Score |
|---|---|---|
| Qwen3-Embed-0.6B | 600M | 76.17 |
| mmBERT-Embed | 307M | 80.5 |
| Qwen3-Embed-8B | 8B | 81.08 |
### 2D Matryoshka Quality Matrix (STS)

| Layers | 768d | 256d | 64d |
|---|---|---|---|
| 22L | 80.5 | 79.9 | 78.5 |
| 11L | 53.7 | 48.0 | 44.4 |
| 6L | 45.2 | 45.2 | 43.5 |
| 3L | 44.0 | 44.1 | 41.8 |
### Long-Context Retrieval (4K tokens)

| Metric | Score |
|---|---|
| R@1 | 68.8% |
| R@10 | 81.2% |
| MRR | 71.9% |
### Throughput (AMD MI300X)

| Layers | Throughput | Speedup |
|---|---|---|
| 22L | 477/s | 1.0× |
| 11L | 916/s | 1.9× |
| 6L | 1573/s | 3.3× |
| 3L | 2761/s | 5.8× |
### Latency Comparison vs BGE-M3 and Qwen3-Embedding-0.6B

mmBERT-Embed is significantly faster due to:

- Flash Attention 2 - BGE-M3 lacks FA2, so its attention memory traffic grows as O(n²) with sequence length instead of O(n)
- Encoder architecture - Qwen3 uses a decoder with causal masking
- Smaller model - 307M params vs 569M (BGE-M3) and 600M (Qwen3-0.6B)
#### Batch Size = 1

| Seq Len | mmBERT-Embed | Qwen3-0.6B | BGE-M3 | mmBERT Speedup vs BGE-M3 |
|---|---|---|---|---|
| 512 | 17.6ms (57/s) | 20.7ms (48/s) | 10.8ms (93/s) | 0.6× |
| 1024 | 18.6ms (54/s) | 21.2ms (47/s) | 16.3ms (61/s) | 0.9× |
| 2048 | 19.5ms (51/s) | 24.1ms (42/s) | 31.1ms (32/s) | 1.6× |
| 4096 | 21.3ms (47/s) | 43.5ms (23/s) | 60.5ms (17/s) | 2.8× |
#### Batch Size = 8

| Seq Len | mmBERT-Embed | Qwen3-0.6B | BGE-M3 | mmBERT Speedup vs BGE-M3 |
|---|---|---|---|---|
| 512 | 21.1ms (379/s) | 33.0ms (243/s) | 40.0ms (200/s) | 1.9× |
| 1024 | 34.5ms (232/s) | 58.5ms (137/s) | 77.4ms (103/s) | 2.2× |
| 2048 | 65.2ms (123/s) | 117.0ms (68/s) | 162.9ms (49/s) | 2.5× |
| 4096 | 130.7ms (61/s) | 254.9ms (31/s) | 411.3ms (19/s) | 3.1× |
Key insight: The FA2 advantage grows with sequence length and batch size:
- At short sequences (512), BGE-M3 is faster (no FA2 overhead)
- At 2K+ tokens, mmBERT pulls ahead significantly
- At 4K batch=8: mmBERT is 3.1× faster than BGE-M3
Benchmarked on AMD MI300X, bf16 precision.
## Training

### Data

Trained on BAAI/bge-m3-data (73GB, 279 JSONL files) with:

- Multilingual triplets (query, positive, negative); see the illustrative example below
- Diverse domains and languages
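For illustration, each training example pairs a query with positive and negative passages, roughly of this shape (the field names follow the common query/pos/neg convention and are an assumption, not the exact schema of the dataset):

```python
# Illustrative triplet structure (field names are an assumption)
example = {
    "query": "What is the capital of France?",
    "pos": ["Paris is the capital and largest city of France."],
    "neg": ["Berlin is the capital of Germany."],
}
```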
### Configuration

- Base Model: llm-semantic-router/mmbert-32k-yarn
- Loss: Matryoshka2dLoss (combines AdaptiveLayerLoss + MatryoshkaLoss); see the sketch after this list
- Matryoshka Dimensions: [768, 512, 256, 128, 64]
- Epochs: 1
- Batch Size: 16 (effective 32 with gradient accumulation)
- Learning Rate: 2e-5
- Max Sequence Length: 32,768
- Hardware: AMD Instinct MI300X
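A minimal sketch of how such a loss can be wired up with sentence-transformers (the inner ranking loss and the rest of the training loop are assumptions for illustration, not the exact training script):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import Matryoshka2dLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("llm-semantic-router/mmbert-32k-yarn", trust_remote_code=True)

# Matryoshka2dLoss wraps a base loss with both layer-wise (AdaptiveLayerLoss)
# and dimension-wise (MatryoshkaLoss) objectives
base_loss = MultipleNegativesRankingLoss(model)
loss = Matryoshka2dLoss(model, base_loss, matryoshka_dims=[768, 512, 256, 128, 64])
```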
## Use Cases

### When to Use mmBERT-Embed

- Multilingual RAG across 1800+ languages (especially low-resource languages not covered by Qwen3 or BGE-M3)
- Long-document retrieval where chunking loses cross-section relationships
- Edge deployment where 307M params matter vs 600M+
- Flexible inference where you need to trade quality for speed/storage at runtime

### When to Use Alternatives

- Maximum quality on major languages: Qwen3-Embed-8B
- Production stability: BGE-M3 (more battle-tested)
- Very short texts only: smaller models may suffice
## Limitations

- Layer-reduced quality (56% at 6L) is well below the full model; use layer reduction only where moderate quality is acceptable in exchange for lower latency
- MTEB mean (61.4) is slightly below BGE-M3 (64.5), but with 4× longer context and 2D flexibility
- Optimized for retrieval tasks; may need fine-tuning for other downstream tasks
## Citation

```bibtex
@misc{mmbert-embed-2d-matryoshka,
  title={mmBERT-Embed: Multilingual Embedding Model with 2D Matryoshka Training},
  author={vLLM Semantic Router Team},
  year={2025},
  url={https://huggingface.co/llm-semantic-router/mmbert-embed-32k-2d-matryoshka}
}
```
## ONNX Models for Production Deployment

Pre-exported ONNX models are available for production deployment with ONNX Runtime. Each per-layer model enables true early-exit speedup.

### Available Models

| Layer | Size | Latency | Throughput | Speedup | Quality |
|---|---|---|---|---|---|
| `onnx/layer-6` | 454 MB | 2.56ms | 390/sec | 4.44× | ~56% |
| `onnx/layer-11` | 505 MB | 4.87ms | 205/sec | 2.33× | ~75% |
| `onnx/layer-16` | 555 MB | 7.64ms | 131/sec | 1.49× | ~90% |
| `onnx/layer-22` | 616 MB | 11.37ms | 88/sec | 1.0× | 100% |

Benchmarked on AMD MI300X with ROCm, fp16 precision, batch=1, dynamic sequence length.
### Batch Performance (batch=8)

| Layer | Throughput | Speedup |
|---|---|---|
| 6 | 634/sec | 2.97× |
| 11 | 428/sec | 2.00× |
| 16 | 286/sec | 1.34× |
| 22 | 214/sec | 1.0× |
### Download ONNX Models

```python
from huggingface_hub import hf_hub_download

# Download the layer-6 graph and its external weights file
model_path = hf_hub_download(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
    "onnx/layer-6/model.onnx",
)
data_path = hf_hub_download(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
    "onnx/layer-6/model.onnx.data",
)
```
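Both files land in the same snapshot directory of the local Hugging Face cache, so the returned `model_path` can be passed straight to `onnxruntime.InferenceSession` as in the next section (ONNX Runtime resolves the external `.data` file relative to the model file).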
### Usage with ONNX Runtime (Python)

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka"
)

session = ort.InferenceSession(
    "onnx/layer-6/model.onnx",
    providers=["ROCMExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)

inputs = tokenizer("Hello world", return_tensors="np", padding=True)
outputs = session.run(None, {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"],
})
hidden_state = outputs[0]

# Mean-pool over valid tokens, then L2-normalize
mask = inputs["attention_mask"][..., np.newaxis]
embeddings = (hidden_state * mask).sum(1) / mask.sum(1)
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
```
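The Matryoshka truncation from earlier applies unchanged to the ONNX output; for example, to keep 256 dimensions:

```python
# Optional Matryoshka truncation of the layer-6 embedding to 256 dimensions
embeddings_256d = embeddings[:, :256]
embeddings_256d = embeddings_256d / np.linalg.norm(embeddings_256d, axis=1, keepdims=True)
```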
### Usage with Rust (ort crate)

```rust
use ort::{Session, execution_providers::ROCmExecutionProvider};

// Load a fast (layer-6) and a full-quality (layer-22) session up front
let fast_model = Session::builder()?
    .with_execution_providers([ROCmExecutionProvider::default().build()])?
    .commit_from_file("onnx/layer-6/model.onnx")?;

let full_model = Session::builder()?
    .with_execution_providers([ROCmExecutionProvider::default().build()])?
    .commit_from_file("onnx/layer-22/model.onnx")?;

// Pick a session per request based on the latency budget
// (`inputs` and `need_fast_response` come from your application)
let embedding = if need_fast_response {
    fast_model.run(inputs)?
} else {
    full_model.run(inputs)?
};
```
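Keeping both sessions resident trades roughly a gigabyte of extra memory (454 MB + 616 MB per the table above) for the ability to switch quality levels per request with no model-loading latency.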
### Recommended Layer Selection

| Use Case | Layer | Why |
|---|---|---|
| Real-time routing/classification | 6 | Lowest latency (2.56ms) |
| Balanced speed/quality | 11 | Good tradeoff (4.87ms) |
| High-accuracy tasks | 16 | Near-full quality (7.64ms) |
| Search/RAG | 22 | Maximum quality (11.37ms) |
### Why Separate ONNX Models?

Unlike PyTorch, where you can pass `output_hidden_states=True` and pick a layer at runtime, ONNX graphs are static DAGs: all nodes execute regardless of which output you read. Separate model files are therefore required for a true early-exit speedup.

## License

Apache 2.0