A multilingual embedding model with 32K context window and 2D Matryoshka support for flexible efficiency-quality tradeoffs.
## Model Highlights

| Feature | Value |
|---|---|
| Parameters | 307M |
| Context Length | 32,768 tokens |
| Languages | 1800+ (via Glot500) |
| Embedding Dim | 768 (supports 64-768 via Matryoshka) |
| Architecture | ModernBERT encoder with YaRN scaling |
## Key Results

| Metric | Score |
|---|---|
| MTEB Mean (24 tasks) | 61.4 |
| STS Benchmark | 80.5 (exceeds Qwen3-0.6B's 76.17) |
| Dimension Retention | 99% @ 256d, 98% @ 64d |
| Layer Speedup | 3.3× @ 6L, 5.8× @ 3L |
| Latency vs BGE-M3 | 1.6-3.1× faster (FA2 advantage) |
## What is 2D Matryoshka?

This model supports two dimensions of flexibility:

- **Dimension Reduction (Matryoshka):** Truncate embeddings to smaller dimensions with minimal quality loss
- **Layer Reduction (Adaptive):** Use intermediate-layer outputs for faster inference

| Config | Quality | Speedup | Storage |
|---|---|---|---|
| 22L, 768d | 100% | 1.0× | 100% |
| 22L, 256d | 99% | 1.0× | 33% |
| 22L, 64d | 98% | 1.0× | 8% |
| 6L, 768d | 56% | 3.3× | 100% |
| 6L, 256d | 56% | 3.3× | 33% |
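The storage column above follows directly from the embedding width. A quick sketch of the arithmetic, assuming float32 storage (4 bytes per dimension; the relative savings are dtype-independent):

```python
# Storage cost of truncated embeddings relative to the full 768-dim vector.
FULL_DIM = 768

def storage_fraction(dim: int, full_dim: int = FULL_DIM) -> float:
    """Fraction of full-size storage needed at a truncated dimension."""
    return dim / full_dim

for dim in (768, 256, 64):
    bytes_per_vec = dim * 4  # float32: 4 bytes per dimension
    print(f"{dim}d: {storage_fraction(dim):.0%} of full storage, {bytes_per_vec} bytes/vector")
```

This reproduces the table's 33% at 256d and 8% at 64d.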
## Usage

### Basic Usage (Sentence Transformers)

```python
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer("llm-semantic-router/mmbert-embed-32k-2d-matryoshka")

# Encode sentences in multiple languages
sentences = [
    "This is a test sentence.",
    "这是一个测试句子。",
    "Dies ist ein Testsatz.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)
```
### Matryoshka Dimension Reduction

```python
import torch.nn.functional as F

# Encode with full dimensions
embeddings = model.encode(sentences, convert_to_tensor=True)

# Truncate to a smaller dimension (e.g., 256), then re-normalize
embeddings_256d = embeddings[:, :256]
embeddings_256d = F.normalize(embeddings_256d, p=2, dim=1)

# Or truncate to 64 dimensions for maximum compression
embeddings_64d = embeddings[:, :64]
embeddings_64d = F.normalize(embeddings_64d, p=2, dim=1)
```
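The re-normalization step matters: slicing a unit vector yields a vector with norm below 1, so cosine similarities on raw slices would be mis-scaled. A self-contained check on random stand-in embeddings (no model download needed):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-in for model output: 3 unit-normalized 768-dim embeddings.
embeddings = F.normalize(torch.randn(3, 768), p=2, dim=1)

# Truncate, then re-normalize: slicing breaks the unit norm.
truncated = embeddings[:, :256]
renormalized = F.normalize(truncated, p=2, dim=1)

print(truncated.norm(dim=1))     # norms below 1.0: slices of unit vectors
print(renormalized.norm(dim=1))  # unit norms restored
```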
### Long Context (up to 32K tokens)

```python
# For long documents, raise the sequence length cap
model.max_seq_length = 8192  # or up to 32768

long_document = "..." * 10000  # very long text
embedding = model.encode(long_document)
```
### Layer Reduction (Advanced)

For latency-critical applications, you can extract embeddings from intermediate layers:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
    trust_remote_code=True,
    output_hidden_states=True,
)
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert-embed-32k-2d-matryoshka")

# Encode
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Use layer 6 for a 3.3× speedup (56% STS quality)
hidden = outputs.hidden_states[6]
hidden = model.final_norm(hidden)

# Mean pooling over non-padding tokens
mask = inputs["attention_mask"].unsqueeze(-1).float()
pooled = (hidden * mask).sum(1) / mask.sum(1)
embeddings = F.normalize(pooled, p=2, dim=1)
```
## Evaluation Results

### MTEB Benchmark (24 tasks)

| Category | Score |
|---|---|
| STS (7 tasks) | 79.3 |
| Classification (6 tasks) | 62.4 |
| Pair Classification (2 tasks) | 76.2 |
| Reranking (2 tasks) | 64.4 |
| Clustering (4 tasks) | 36.9 |
| Retrieval (3 tasks) | 38.2 |
| **Overall Mean** | **61.4** |
### STS Benchmark

| Model | Parameters | STS Score |
|---|---|---|
| Qwen3-Embed-0.6B | 600M | 76.17 |
| mmBERT-Embed | 307M | 80.5 |
| Qwen3-Embed-8B | 8B | 81.08 |
### 2D Matryoshka Quality Matrix (STS)

| Layers | 768d | 256d | 64d |
|---|---|---|---|
| 22L | 80.5 | 79.9 | 78.5 |
| 11L | 53.7 | 48.0 | 44.4 |
| 6L | 45.2 | 45.2 | 43.5 |
| 3L | 44.0 | 44.1 | 41.8 |
### Long-Context Retrieval (4K tokens)

| Metric | Score |
|---|---|
| R@1 | 68.8% |
| R@10 | 81.2% |
| MRR | 71.9% |
### Throughput (AMD MI300X)

| Layers | Throughput | Speedup |
|---|---|---|
| 22L | 477/s | 1.0× |
| 11L | 916/s | 1.9× |
| 6L | 1573/s | 3.3× |
| 3L | 2761/s | 5.8× |
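The speedup column is just each configuration's throughput divided by the full 22-layer baseline; a one-liner confirms the rounding:

```python
# Speedup derived from raw throughput (items/s) relative to the 22-layer model.
throughput = {22: 477, 11: 916, 6: 1573, 3: 2761}
speedup = {layers: tps / throughput[22] for layers, tps in throughput.items()}

for layers in (22, 11, 6, 3):
    print(f"{layers}L: {speedup[layers]:.1f}×")
```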
### Latency Comparison vs BGE-M3 and Qwen3-Embedding-0.6B

```rust
use ort::{Session, execution_providers::ROCmExecutionProvider};

// Load models at startup
let fast_model = Session::builder()?
    .with_execution_providers([ROCmExecutionProvider::default().build()])?
    .commit_from_file("onnx/layer-6/model.onnx")?;
let full_model = Session::builder()?
    .with_execution_providers([ROCmExecutionProvider::default().build()])?
    .commit_from_file("onnx/layer-22/model.onnx")?;

// Runtime selection based on latency/quality needs
let embedding = if need_fast_response {
    fast_model.run(inputs)? // ~2.6ms
} else {
    full_model.run(inputs)? // ~11ms
};
```
### Recommended Layer Selection

| Use Case | Layer | Why |
|---|---|---|
| Real-time routing/classification | 6 | Lowest latency (2.56ms) |
| Balanced speed/quality | 11 | Good tradeoff (4.87ms) |
| High accuracy tasks | 16 | Near-full quality (7.64ms) |
| Search/RAG | 22 | Maximum quality (11.37ms) |
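This selection policy can be encoded as a small helper. A hypothetical sketch (the `pick_layer` function and `LAYER_LATENCY_MS` table are illustrative, using the per-layer latencies above):

```python
# Hypothetical helper: pick the deepest layer that fits a latency budget,
# using the per-layer latencies (ms) from the table above.
LAYER_LATENCY_MS = {6: 2.56, 11: 4.87, 16: 7.64, 22: 11.37}

def pick_layer(budget_ms: float) -> int:
    """Deepest (highest-quality) layer whose latency fits the budget;
    falls back to the fastest layer if none fits."""
    viable = [layer for layer, ms in LAYER_LATENCY_MS.items() if ms <= budget_ms]
    return max(viable) if viable else min(LAYER_LATENCY_MS)

print(pick_layer(5.0))   # 11: deepest layer under a 5 ms budget
print(pick_layer(20.0))  # 22: full model fits comfortably
```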
### Why Separate ONNX Models?

Unlike PyTorch, where `output_hidden_states=True` allows selecting a layer at runtime, ONNX graphs are static DAGs: every node executes regardless of which output you read. Separate model files, each ending at the desired depth, are therefore required for a true early-exit speedup.