---
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- embeddings
- multilingual
- matryoshka
- 2d-matryoshka
- long-context
- modernbert
base_model: llm-semantic-router/mmbert-32k-yarn
datasets:
- BAAI/bge-m3-data
language:
- multilingual
license: apache-2.0
pipeline_tag: sentence-similarity
model-index:
- name: mmbert-embed-32k-2d-matryoshka
  results:
  - task:
      type: STS
    dataset:
      name: STS Benchmark
      type: mteb/stsbenchmark-sts
    metrics:
    - type: spearman
      value: 80.5
---

# mmBERT-Embed-32K-2D-Matryoshka

A **multilingual embedding model** with a 32K context window and **2D Matryoshka** support for flexible efficiency-quality tradeoffs.

## Model Highlights

| Feature | Value |
|---------|-------|
| **Parameters** | 307M |
| **Context Length** | 32,768 tokens |
| **Languages** | 1800+ (via Glot500) |
| **Embedding Dim** | 768 (supports 64-768 via Matryoshka) |
| **Architecture** | ModernBERT encoder with YaRN scaling |

### Key Results

| Metric | Score |
|--------|-------|
| **MTEB Mean (24 tasks)** | **61.4** |
| **STS Benchmark** | **80.5** (exceeds Qwen3-0.6B's 76.17) |
| **Dimension Retention** | 99% @ 256d, 98% @ 64d |
| **Layer Speedup** | 3.3× @ 6L, 5.8× @ 3L |
| **Latency vs BGE-M3** | **1.6-3.1× faster** (FA2 advantage) |

## What is 2D Matryoshka?

This model supports **two dimensions of flexibility**:

1. **Dimension Reduction** (Matryoshka): Truncate embeddings to smaller dimensions with minimal quality loss
2. **Layer Reduction** (Adaptive): Use intermediate layer outputs for faster inference

| Config | Quality | Speedup | Storage |
|--------|---------|---------|---------|
| 22L, 768d | 100% | 1.0× | 100% |
| 22L, 256d | 99% | 1.0× | 33% |
| 22L, 64d | 98% | 1.0× | 8% |
| 6L, 768d | 56% | 3.3× | 100% |
| 6L, 256d | 56% | 3.3× | 33% |

## Usage

### Basic Usage (Sentence Transformers)

```python
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer("llm-semantic-router/mmbert-embed-32k-2d-matryoshka")

# Encode sentences
sentences = [
    "This is a test sentence.",
    "这是一个测试句子。",
    "Dies ist ein Testsatz.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)
```

### Matryoshka Dimension Reduction

```python
import torch.nn.functional as F

# Encode with full dimensions
embeddings = model.encode(sentences, convert_to_tensor=True)

# Truncate to a smaller dimension (e.g., 256)
embeddings_256d = embeddings[:, :256]
embeddings_256d = F.normalize(embeddings_256d, p=2, dim=1)

# Or truncate to 64 dimensions for maximum compression
embeddings_64d = embeddings[:, :64]
embeddings_64d = F.normalize(embeddings_64d, p=2, dim=1)
```
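To see what the dimension-retention numbers above mean in practice, a quick check like the following compares cosine similarities at the three sizes. It reuses `model`, `sentences`, and the truncated tensors from the snippet above; `util.cos_sim` is the standard Sentence Transformers helper, and the exact scores depend on your sentences.

```python
from sentence_transformers import util

# Compare the first two sentences at full and truncated dimensions
sim_768 = util.cos_sim(embeddings[0], embeddings[1]).item()
sim_256 = util.cos_sim(embeddings_256d[0], embeddings_256d[1]).item()
sim_64 = util.cos_sim(embeddings_64d[0], embeddings_64d[1]).item()

print(f"768d: {sim_768:.4f}")
print(f"256d: {sim_256:.4f}")  # typically very close to the 768d score
print(f" 64d: {sim_64:.4f}")   # a slightly larger drift is expected
```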
### Long Context (up to 32K tokens)

```python
# For long documents, set max_seq_length
model.max_seq_length = 8192  # or up to 32768

long_document = "..." * 10000  # Very long text
embedding = model.encode(long_document)
```

### Layer Reduction (Advanced)

For latency-critical applications, you can extract embeddings from intermediate layers:

```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

model = AutoModel.from_pretrained(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
    trust_remote_code=True,
    output_hidden_states=True
)
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert-embed-32k-2d-matryoshka")

# Encode
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Use layer 6 for 3.3× speedup (56% quality)
hidden = outputs.hidden_states[6]
hidden = model.final_norm(hidden)

# Mean pooling
mask = inputs["attention_mask"].unsqueeze(-1).float()
pooled = (hidden * mask).sum(1) / mask.sum(1)
embeddings = F.normalize(pooled, p=2, dim=1)
```
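The two axes compose. To reproduce a "2D" configuration such as 6L/256d from the table above, truncate the pooled intermediate-layer embedding before normalizing. A minimal sketch, reusing `model`, `inputs`, and `outputs` from the previous block:

```python
# Combine both axes: layer-6 hidden states + 256-dim Matryoshka truncation
hidden = model.final_norm(outputs.hidden_states[6])

# Mean pooling over non-padding tokens
mask = inputs["attention_mask"].unsqueeze(-1).float()
pooled = (hidden * mask).sum(1) / mask.sum(1)

# Truncate first, then L2-normalize, so cosine similarity stays well-defined
embeddings_6l_256d = F.normalize(pooled[:, :256], p=2, dim=1)
print(embeddings_6l_256d.shape)  # (batch_size, 256)
```

This corresponds to the 6L/256d row of the configuration table: roughly 3.3× faster and one third of the storage, at about 56% of full quality.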
## Evaluation Results

### MTEB Benchmark (24 tasks)

| Category | Score |
|----------|-------|
| STS (7 tasks) | **79.3** |
| Classification (6) | 62.4 |
| Pair Classification (2) | 76.2 |
| Reranking (2) | 64.4 |
| Clustering (4) | 36.9 |
| Retrieval (3) | 38.2 |
| **Overall Mean** | **61.4** |

### STS Benchmark

| Model | Parameters | STS Score |
|-------|------------|-----------|
| Qwen3-Embed-0.6B | 600M | 76.17 |
| **mmBERT-Embed** | **307M** | **80.5** |
| Qwen3-Embed-8B | 8B | 81.08 |

### 2D Matryoshka Quality Matrix (STS)

| Layers | 768d | 256d | 64d |
|--------|------|------|-----|
| 22L | **80.5** | 79.9 | 78.5 |
| 11L | 53.7 | 48.0 | 44.4 |
| 6L | 45.2 | 45.2 | 43.5 |
| 3L | 44.0 | 44.1 | 41.8 |

### Long-Context Retrieval (4K tokens)

| Metric | Score |
|--------|-------|
| R@1 | 68.8% |
| R@10 | 81.2% |
| MRR | 71.9% |

### Throughput (AMD MI300X)

| Layers | Throughput | Speedup |
|--------|------------|---------|
| 22L | 477/s | 1.0× |
| 11L | 916/s | 1.9× |
| 6L | 1573/s | 3.3× |
| 3L | 2761/s | 5.8× |

### Latency Comparison vs BGE-M3 and Qwen3-Embedding-0.6B

mmBERT-Embed is significantly faster due to:

1. **Flash Attention 2** - BGE-M3 lacks FA2, so its attention memory scales as O(n²) rather than O(n)
2. **Encoder architecture** - Qwen3 uses a decoder with causal masking
3. **Smaller model** - 307M vs 569M/600M params

#### Batch Size = 1

| Seq Len | mmBERT-Embed | Qwen3-0.6B | BGE-M3 | mmBERT Speedup |
|---------|--------------|------------|--------|----------------|
| 512 | **17.6ms** (57/s) | 20.7ms (48/s) | 10.8ms (93/s) | 0.6× |
| 1024 | **18.6ms** (54/s) | 21.2ms (47/s) | 16.3ms (61/s) | 0.9× |
| 2048 | **19.5ms** (51/s) | 24.1ms (42/s) | 31.1ms (32/s) | **1.6×** |
| 4096 | **21.3ms** (47/s) | 43.5ms (23/s) | 60.5ms (17/s) | **2.8×** |

#### Batch Size = 8

| Seq Len | mmBERT-Embed | Qwen3-0.6B | BGE-M3 | mmBERT Speedup |
|---------|--------------|------------|--------|----------------|
| 512 | **21.1ms** (379/s) | 33.0ms (243/s) | 40.0ms (200/s) | **1.9×** |
| 1024 | **34.5ms** (232/s) | 58.5ms (137/s) | 77.4ms (103/s) | **2.2×** |
| 2048 | **65.2ms** (123/s) | 117.0ms (68/s) | 162.9ms (49/s) | **2.5×** |
| 4096 | **130.7ms** (61/s) | 254.9ms (31/s) | 411.3ms (19/s) | **3.1×** |

**Key insight**: The FA2 advantage grows with sequence length and batch size:

- At short sequences (512), BGE-M3 is faster (FA2 yields little benefit at this length)
- At 2K+ tokens, mmBERT pulls ahead significantly
- At 4K tokens, batch=8: mmBERT is **3.1× faster** than BGE-M3

*Benchmarked on AMD MI300X, bf16 precision.*

## Training

### Data

Trained on [BAAI/bge-m3-data](https://huggingface.co/datasets/BAAI/bge-m3-data) (73GB, 279 JSONL files) with:

- Multilingual triplets (query, positive, negative)
- Diverse domains and languages

### Configuration

- **Base Model**: [llm-semantic-router/mmbert-32k-yarn](https://huggingface.co/llm-semantic-router/mmbert-32k-yarn)
- **Loss**: Matryoshka2dLoss (combines AdaptiveLayerLoss + MatryoshkaLoss; see the sketch below)
- **Matryoshka Dimensions**: [768, 512, 256, 128, 64]
- **Epochs**: 1
- **Batch Size**: 16 (effective 32 with gradient accumulation)
- **Learning Rate**: 2e-5
- **Max Sequence Length**: 32,768
- **Hardware**: AMD Instinct MI300X
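`Matryoshka2dLoss` ships with Sentence Transformers and wraps an inner contrastive loss with both the adaptive-layer and Matryoshka objectives. The sketch below shows how a run with the configuration above could be wired; it is illustrative only (the file name, dataset columns, and the bare trainer arguments are placeholders, not the actual training script).

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import Matryoshka2dLoss, MultipleNegativesRankingLoss

# Start from the 32K-context base checkpoint
model = SentenceTransformer("llm-semantic-router/mmbert-32k-yarn", trust_remote_code=True)

# Placeholder: (anchor, positive, negative) triplets in JSONL form
train_dataset = load_dataset("json", data_files="triplets.jsonl", split="train")

# Inner contrastive loss, wrapped with the 2D (layers + dimensions) objective
inner_loss = MultipleNegativesRankingLoss(model)
loss = Matryoshka2dLoss(model, inner_loss, matryoshka_dims=[768, 512, 256, 128, 64])

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```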
## Use Cases

### When to Use mmBERT-Embed

1. **Multilingual RAG** for 1800+ languages (especially low-resource languages not covered by Qwen3 or BGE-M3)
2. **Long-document retrieval** where chunking loses cross-section relationships
3. **Edge deployment** where 307M parameters matter vs 600M+
4. **Flexible inference** where you need to trade quality for speed/storage at runtime

### When to Use Alternatives

- **Maximum quality on major languages**: Qwen3-Embed-8B
- **Production stability**: BGE-M3 (more battle-tested)
- **Very short texts only**: Smaller models may suffice

## Limitations

- Layer-reduction quality (56% at 6L) is substantially lower than the full model's; use it for latency-critical applications where moderate quality is acceptable
- MTEB mean (61.4) is slightly below BGE-M3 (64.5), but with 4× longer context and 2D flexibility
- Optimized for retrieval tasks; may need fine-tuning for other downstream tasks

## Citation

```bibtex
@misc{mmbert-embed-2d-matryoshka,
  title={mmBERT-Embed: Multilingual Embedding Model with 2D Matryoshka Training},
  author={vLLM Semantic Router Team},
  year={2025},
  url={https://huggingface.co/llm-semantic-router/mmbert-embed-32k-2d-matryoshka}
}
```

## ONNX Models for Production Deployment

Pre-exported ONNX models are available for production deployment with ONNX Runtime. Each per-layer model enables a true early-exit speedup.

### Available Models

| Layer | Size | Latency | Throughput | Speedup | Quality |
|-------|------|---------|------------|---------|---------|
| `onnx/layer-6` | 454 MB | **2.56ms** | **390/sec** | **4.44×** | ~56% |
| `onnx/layer-11` | 505 MB | 4.87ms | 205/sec | 2.33× | ~75% |
| `onnx/layer-16` | 555 MB | 7.64ms | 131/sec | 1.49× | ~90% |
| `onnx/layer-22` | 616 MB | 11.37ms | 88/sec | 1.0× | 100% |

*Benchmarked on AMD MI300X with ROCm, fp16 precision, batch=1, dynamic sequence length.*

### Batch Performance (batch=8)

| Layer | Throughput | Speedup |
|-------|------------|---------|
| 6 | **634/sec** | **2.97×** |
| 11 | 428/sec | 2.00× |
| 16 | 286/sec | 1.34× |
| 22 | 214/sec | 1.0× |

### Download ONNX Models

```python
from huggingface_hub import hf_hub_download

# Download layer-6 for fast inference
model_path = hf_hub_download(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
    "onnx/layer-6/model.onnx"
)
data_path = hf_hub_download(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
    "onnx/layer-6/model.onnx.data"
)
```

### Usage with ONNX Runtime (Python)

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka"
)

# Load ONNX model (use ROCMExecutionProvider for AMD GPUs)
session = ort.InferenceSession(
    "onnx/layer-6/model.onnx",
    providers=["ROCMExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
)

# Inference
inputs = tokenizer("Hello world", return_tensors="np", padding=True)
outputs = session.run(None, {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"],
})
hidden_state = outputs[0]  # Shape: (batch, seq_len, 768)

# Mean pooling + L2 normalization
mask = inputs["attention_mask"][..., np.newaxis]
embeddings = (hidden_state * mask).sum(1) / mask.sum(1)
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
```

### Usage with Rust (ort crate)

```rust
use ort::{Session, execution_providers::ROCmExecutionProvider};

// Load models at startup
let fast_model = Session::builder()?
    .with_execution_providers([ROCmExecutionProvider::default().build()])?
    .commit_from_file("onnx/layer-6/model.onnx")?;

let full_model = Session::builder()?
    .with_execution_providers([ROCmExecutionProvider::default().build()])?
    .commit_from_file("onnx/layer-22/model.onnx")?;

// Runtime selection based on latency/quality needs
let embedding = if need_fast_response {
    fast_model.run(inputs)?  // ~2.6ms
} else {
    full_model.run(inputs)?  // ~11ms
};
```

### Recommended Layer Selection

| Use Case | Layer | Why |
|----------|-------|-----|
| Real-time routing/classification | 6 | Lowest latency (2.56ms) |
| Balanced speed/quality | 11 | Good tradeoff (4.87ms) |
| High accuracy tasks | 16 | Near-full quality (7.64ms) |
| Search/RAG | 22 | Maximum quality (11.37ms) |

### Why Separate ONNX Models?

Unlike PyTorch, where you can pass `output_hidden_states=True` and pick a layer at runtime, ONNX graphs are **static DAGs**: all nodes execute regardless of which output you read. Separate model files are therefore required for a true early-exit speedup.

## License

Apache 2.0