mmBERT-Embed-32K-2D-Matryoshka

A multilingual embedding model with 32K context window and 2D Matryoshka support for flexible efficiency-quality tradeoffs.

Model Highlights

| Feature | Value |
|---|---|
| Parameters | 307M |
| Context Length | 32,768 tokens |
| Languages | 1800+ (via Glot500) |
| Embedding Dim | 768 (supports 64-768 via Matryoshka) |
| Architecture | ModernBERT encoder with YaRN scaling |

Key Results

| Metric | Score |
|---|---|
| MTEB Mean (24 tasks) | 61.4 |
| STS Benchmark | 80.5 (exceeds Qwen3-0.6B's 76.17) |
| Dimension Retention | 99% @ 256d, 98% @ 64d |
| Layer Speedup | 3.3× @ 6L, 5.8× @ 3L |
| Latency vs BGE-M3 | 1.6-3.1× faster (FA2 advantage) |

What is 2D Matryoshka?

This model supports two dimensions of flexibility:

  1. Dimension Reduction (Matryoshka): Truncate embeddings to smaller dimensions with minimal quality loss
  2. Layer Reduction (Adaptive): Use intermediate layer outputs for faster inference

| Config | Quality | Speedup | Storage |
|---|---|---|---|
| 22L, 768d | 100% | 1.0× | 100% |
| 22L, 256d | 99% | 1.0× | 33% |
| 22L, 64d | 98% | 1.0× | 8% |
| 6L, 768d | 56% | 3.3× | 100% |
| 6L, 256d | 56% | 3.3× | 33% |

Usage

Basic Usage (Sentence Transformers)

from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer("llm-semantic-router/mmbert-embed-32k-2d-matryoshka")

# Encode sentences
sentences = [
    "This is a test sentence.",
    "这是一个测试句子。",  # Chinese translation of the sentence above
    "Dies ist ein Testsatz.",  # German translation of the sentence above
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)
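
A quick sanity check: since the three example sentences are translations of one another, their pairwise cosine similarities should all be high. util.cos_sim accepts the NumPy output of encode() directly:

from sentence_transformers import util

# Pairwise cosine similarity between the three sentences encoded above
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)  # (3, 3) matrix; off-diagonal scores should be high, since all three are translations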

Matryoshka Dimension Reduction

import torch.nn.functional as F

# Encode with full dimensions
embeddings = model.encode(sentences, convert_to_tensor=True)

# Truncate to smaller dimension (e.g., 256)
embeddings_256d = embeddings[:, :256]
embeddings_256d = F.normalize(embeddings_256d, p=2, dim=1)

# Or truncate to 64 dimensions for maximum compression
embeddings_64d = embeddings[:, :64]
embeddings_64d = F.normalize(embeddings_64d, p=2, dim=1)
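
Recent sentence-transformers releases can also apply the truncation for you via the truncate_dim argument, a convenience equivalent to the manual slicing above (check that your installed version supports it):

# Load with a fixed 256-dim output; encode() then returns already-truncated embeddings
model_256 = SentenceTransformer(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
    truncate_dim=256,
)
embeddings_256d = model_256.encode(sentences, normalize_embeddings=True)
print(embeddings_256d.shape)  # (3, 256)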

Long Context (up to 32K tokens)

# For long documents, set max_seq_length
model.max_seq_length = 8192  # or up to 32768

long_document = "..." * 10000  # Very long text
embedding = model.encode(long_document)
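
If you are unsure whether a document fits, you can count tokens with the underlying Hugging Face tokenizer (exposed as model.tokenizer) before encoding:

# Anything beyond max_seq_length is silently truncated, so check the token count first
n_tokens = len(model.tokenizer(long_document)["input_ids"])
print(n_tokens, "tokens vs. max_seq_length =", model.max_seq_length)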

Layer Reduction (Advanced)

For latency-critical applications, you can extract embeddings from intermediate layers:

from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

model = AutoModel.from_pretrained(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
    trust_remote_code=True,
    output_hidden_states=True
)
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert-embed-32k-2d-matryoshka")

# Encode
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
    
    # Use layer 6 for 3.3× speedup (56% quality)
    hidden = outputs.hidden_states[6]
    hidden = model.final_norm(hidden)
    
    # Mean pooling
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(1) / mask.sum(1)
    embeddings = F.normalize(pooled, p=2, dim=1)
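
The two axes combine freely: after pooling an intermediate layer you can additionally truncate and re-normalize, e.g. 6 layers with 256 dimensions (the 6L/256d cell in the quality matrix below):

# 2D operating point: layer 6 + 256 dims (truncate first, then re-normalize)
embeddings_6l_256d = F.normalize(pooled[:, :256], p=2, dim=1)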

Evaluation Results

MTEB Benchmark (24 tasks)

| Category | Score |
|---|---|
| STS (7 tasks) | 79.3 |
| Classification (6) | 62.4 |
| Pair Classification (2) | 76.2 |
| Reranking (2) | 64.4 |
| Clustering (4) | 36.9 |
| Retrieval (3) | 38.2 |
| Overall Mean | 61.4 |

STS Benchmark

| Model | Parameters | STS Score |
|---|---|---|
| Qwen3-Embed-0.6B | 600M | 76.17 |
| mmBERT-Embed | 307M | 80.5 |
| Qwen3-Embed-8B | 8B | 81.08 |

2D Matryoshka Quality Matrix (STS)

| Layers | 768d | 256d | 64d |
|---|---|---|---|
| 22L | 80.5 | 79.9 | 78.5 |
| 11L | 53.7 | 48.0 | 44.4 |
| 6L | 45.2 | 45.2 | 43.5 |
| 3L | 44.0 | 44.1 | 41.8 |

Long-Context Retrieval (4K tokens)

| Metric | Score |
|---|---|
| R@1 | 68.8% |
| R@10 | 81.2% |
| MRR | 71.9% |

Throughput (AMD MI300X)

| Layers | Throughput | Speedup |
|---|---|---|
| 22L | 477/s | 1.0× |
| 11L | 916/s | 1.9× |
| 6L | 1573/s | 3.3× |
| 3L | 2761/s | 5.8× |

Latency Comparison vs BGE-M3 and Qwen3-Embedding-0.6B

mmBERT-Embed is significantly faster due to:

  1. Flash Attention 2 - BGE-M3 lacks FA2, so it materializes the full O(n²) attention matrix, while FA2 keeps attention memory linear in sequence length (see the loading sketch after this list)
  2. Encoder architecture - Qwen3-Embedding is a decoder-only model with causal masking
  3. Smaller model - 307M params vs 569M (BGE-M3) / 600M (Qwen3-0.6B)
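
To actually get this advantage, make sure Flash Attention 2 is enabled when loading the model. A minimal sketch, assuming flash-attn is installed and your GPU supports it (model_kwargs is forwarded to transformers' from_pretrained):

import torch
from sentence_transformers import SentenceTransformer

# Request the FA2 kernel explicitly and load in bf16, matching the benchmark setup
model = SentenceTransformer(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
    model_kwargs={"attn_implementation": "flash_attention_2", "torch_dtype": torch.bfloat16},
)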

Batch Size = 1

| Seq Len | mmBERT-Embed | Qwen3-0.6B | BGE-M3 | mmBERT Speedup vs BGE-M3 |
|---|---|---|---|---|
| 512 | 17.6ms (57/s) | 20.7ms (48/s) | 10.8ms (93/s) | 0.6× |
| 1024 | 18.6ms (54/s) | 21.2ms (47/s) | 16.3ms (61/s) | 0.9× |
| 2048 | 19.5ms (51/s) | 24.1ms (42/s) | 31.1ms (32/s) | 1.6× |
| 4096 | 21.3ms (47/s) | 43.5ms (23/s) | 60.5ms (17/s) | 2.8× |

Batch Size = 8

| Seq Len | mmBERT-Embed | Qwen3-0.6B | BGE-M3 | mmBERT Speedup vs BGE-M3 |
|---|---|---|---|---|
| 512 | 21.1ms (379/s) | 33.0ms (243/s) | 40.0ms (200/s) | 1.9× |
| 1024 | 34.5ms (232/s) | 58.5ms (137/s) | 77.4ms (103/s) | 2.2× |
| 2048 | 65.2ms (123/s) | 117.0ms (68/s) | 162.9ms (49/s) | 2.5× |
| 4096 | 130.7ms (61/s) | 254.9ms (31/s) | 411.3ms (19/s) | 3.1× |

Key insight: The FA2 advantage grows with sequence length and batch size:

  • At short sequences (512 tokens, batch=1), BGE-M3 is faster, since attention is cheap at that length and FA2 brings little benefit
  • At 2K+ tokens, mmBERT pulls ahead significantly
  • At 4K tokens with batch=8, mmBERT is 3.1× faster than BGE-M3

Benchmarked on AMD MI300X, bf16 precision.

Training

Data

Trained on BAAI/bge-m3-data (73GB, 279 JSONL files) with:

  • Multilingual triplets (query, positive, negative)
  • Diverse domains and languages

Configuration

  • Base Model: llm-semantic-router/mmbert-32k-yarn
  • Loss: Matryoshka2dLoss (combines AdaptiveLayerLoss + MatryoshkaLoss; see the training sketch below)
  • Matryoshka Dimensions: [768, 512, 256, 128, 64]
  • Epochs: 1
  • Batch Size: 16 (effective 32 with gradient accumulation)
  • Learning Rate: 2e-5
  • Max Sequence Length: 32,768
  • Hardware: AMD Instinct MI300X
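
The loss setup above can be reproduced with the Matryoshka2dLoss built into sentence-transformers. A minimal sketch, assuming a triplet dataset with anchor/positive/negative columns (the triplets.jsonl path is a hypothetical stand-in for the bge-m3 data; hyperparameters follow the configuration above):

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import Matryoshka2dLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("llm-semantic-router/mmbert-32k-yarn", trust_remote_code=True)

# Hypothetical triplet file with "anchor", "positive", "negative" columns
train_dataset = load_dataset("json", data_files="triplets.jsonl", split="train")

# Matryoshka2dLoss wraps a base loss with AdaptiveLayerLoss + MatryoshkaLoss
base_loss = MultipleNegativesRankingLoss(model)
loss = Matryoshka2dLoss(model, base_loss, matryoshka_dims=[768, 512, 256, 128, 64])

args = SentenceTransformerTrainingArguments(
    output_dir="mmbert-embed-2d",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,  # effective batch size 32
    learning_rate=2e-5,
)
trainer = SentenceTransformerTrainer(model=model, args=args, train_dataset=train_dataset, loss=loss)
trainer.train()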

Use Cases

When to Use mmBERT-Embed

  1. Multilingual RAG for 1800+ languages (especially low-resource languages not covered by Qwen3 or BGE-M3)
  2. Long-document retrieval where chunking loses cross-section relationships
  3. Edge deployment where a 307M-parameter footprint matters vs 600M+
  4. Flexible inference where you need to trade quality for speed/storage at runtime

When to Use Alternatives

  • Maximum quality on major languages: Qwen3-Embed-8B
  • Production stability: BGE-M3 (more battle-tested)
  • Very short texts only: Smaller models may suffice

Limitations

  • Layer-reduced quality drops sharply (~56% of full quality at 6 layers); use it only in latency-critical applications where moderate quality is acceptable
  • MTEB mean (61.4) is slightly below BGE-M3 (64.5), but mmBERT offers a 4× longer context window and 2D flexibility
  • Optimized for retrieval tasks; may need fine-tuning for other downstream tasks

Citation

@misc{mmbert-embed-2d-matryoshka,
  title={mmBERT-Embed: Multilingual Embedding Model with 2D Matryoshka Training},
  author={vLLM Semantic Router Team},
  year={2025},
  url={https://huggingface.co/llm-semantic-router/mmbert-embed-32k-2d-matryoshka}
}

ONNX Models for Production Deployment

Pre-exported ONNX models are available for production deployment with ONNX Runtime; each per-layer export gives a true early-exit speedup (see "Why Separate ONNX Models?" below).

Available Models

| Layer | Size | Latency | Throughput | Speedup | Quality |
|---|---|---|---|---|---|
| onnx/layer-6 | 454 MB | 2.56ms | 390/sec | 4.44× | ~56% |
| onnx/layer-11 | 505 MB | 4.87ms | 205/sec | 2.33× | ~75% |
| onnx/layer-16 | 555 MB | 7.64ms | 131/sec | 1.49× | ~90% |
| onnx/layer-22 | 616 MB | 11.37ms | 88/sec | 1.0× | 100% |

Benchmarked on AMD MI300X with ROCm, fp16 precision, batch=1, dynamic sequence length.

Batch Performance (batch=8)

| Layer | Throughput | Speedup |
|---|---|---|
| 6 | 634/sec | 2.97× |
| 11 | 428/sec | 2.00× |
| 16 | 286/sec | 1.34× |
| 22 | 214/sec | 1.0× |

Download ONNX Models

from huggingface_hub import hf_hub_download

# Download layer-6 for fast inference
model_path = hf_hub_download(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
    "onnx/layer-6/model.onnx"
)
data_path = hf_hub_download(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
    "onnx/layer-6/model.onnx.data"
)
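
To fetch an entire per-layer directory in one call (the .onnx file plus its external .data file), snapshot_download with allow_patterns also works:

from huggingface_hub import snapshot_download

# Downloads onnx/layer-6/model.onnx and model.onnx.data into a local cache directory
local_dir = snapshot_download(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
    allow_patterns=["onnx/layer-6/*"],
)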

Usage with ONNX Runtime (Python)

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka"
)

# Load ONNX model (use ROCMExecutionProvider on AMD GPUs); you can pass the
# model_path returned by hf_hub_download above instead of a relative path
session = ort.InferenceSession(
    "onnx/layer-6/model.onnx",
    providers=["ROCMExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
)

# Inference
inputs = tokenizer("Hello world", return_tensors="np", padding=True)
outputs = session.run(None, {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"],
})
hidden_state = outputs[0]  # Shape: (batch, seq_len, 768)

# Mean pooling over non-padding tokens
mask = inputs["attention_mask"][..., np.newaxis]
embeddings = (hidden_state * mask).sum(1) / mask.sum(1)
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

Usage with Rust (ort crate)

use ort::{Session, execution_providers::ROCmExecutionProvider};

// Load models at startup
let fast_model = Session::builder()?
    .with_execution_providers([ROCmExecutionProvider::default().build()])?
    .commit_from_file("onnx/layer-6/model.onnx")?;

let full_model = Session::builder()?
    .with_execution_providers([ROCmExecutionProvider::default().build()])?
    .commit_from_file("onnx/layer-22/model.onnx")?;

// Runtime selection based on latency/quality needs
// (`inputs` and `need_fast_response` are application-specific)
let embedding = if need_fast_response {
    fast_model.run(inputs)?   // ~2.6ms at layer 6
} else {
    full_model.run(inputs)?   // ~11ms at layer 22
};

Recommended Layer Selection

| Use Case | Layer | Why |
|---|---|---|
| Real-time routing/classification | 6 | Lowest latency (2.56ms) |
| Balanced speed/quality | 11 | Good tradeoff (4.87ms) |
| High accuracy tasks | 16 | Near-full quality (7.64ms) |
| Search/RAG | 22 | Maximum quality (11.37ms) |
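
If you choose a layer programmatically (for example per request), a simple budget-based lookup over the latencies above is enough; pick_layer below is a hypothetical helper, not part of the model package:

# Latencies from the batch=1 ONNX benchmark above (ms)
LAYER_LATENCY_MS = {6: 2.56, 11: 4.87, 16: 7.64, 22: 11.37}

def pick_layer(budget_ms: float) -> int:
    """Return the deepest (highest-quality) layer whose latency fits the budget."""
    for layer in sorted(LAYER_LATENCY_MS, reverse=True):
        if LAYER_LATENCY_MS[layer] <= budget_ms:
            return layer
    return 6  # fall back to the fastest export

print(pick_layer(5.0))  # 11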

Why Separate ONNX Models?

Unlike PyTorch, where output_hidden_states=True lets you pick a layer at runtime, an exported ONNX graph is a static DAG: the whole graph runs regardless of which output you read. Separate per-layer model files are therefore required to get a true early-exit speedup.

License

Apache 2.0
