A multilingual embedding model with 32K context window and 2D Matryoshka support for flexible efficiency-quality tradeoffs.
## Model Highlights

| Feature | Value |
|---|---|
| Parameters | 307M |
| Context Length | 32,768 tokens |
| Languages | 1800+ (via Glot500) |
| Embedding Dim | 768 (supports 64-768 via Matryoshka) |
| Architecture | ModernBERT encoder with YaRN scaling |
## Key Results

| Metric | Score |
|---|---|
| MTEB Mean (24 tasks) | 61.4 |
| STS Benchmark | 80.5 (exceeds Qwen3-0.6B's 76.17) |
| Dimension Retention | 99% @ 256d, 98% @ 64d |
| Layer Speedup | 3.3× @ 6L, 5.8× @ 3L |
| Latency vs BGE-M3 | 1.6-3.1× faster (FA2 advantage) |
## What is 2D Matryoshka?

This model supports two dimensions of flexibility:

- **Dimension Reduction (Matryoshka):** Truncate embeddings to smaller dimensions with minimal quality loss
- **Layer Reduction (Adaptive):** Use intermediate-layer outputs for faster inference

| Config | Quality | Speedup | Storage |
|---|---|---|---|
| 22L, 768d | 100% | 1.0× | 100% |
| 22L, 256d | 99% | 1.0× | 33% |
| 22L, 64d | 98% | 1.0× | 8% |
| 6L, 768d | 56% | 3.3× | 100% |
| 6L, 256d | 56% | 3.3× | 33% |
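The storage column above follows directly from the embedding width. A quick sketch of the arithmetic, assuming float32 storage (4 bytes per dimension; the relative savings are dtype-independent):

```python
# Storage cost of truncated embeddings relative to the full 768-dim vector.
FULL_DIM = 768

def storage_fraction(dim: int, full_dim: int = FULL_DIM) -> float:
    """Fraction of full-size storage needed at a truncated dimension."""
    return dim / full_dim

for dim in (768, 256, 64):
    bytes_per_vec = dim * 4  # float32: 4 bytes per dimension
    print(f"{dim}d: {storage_fraction(dim):.0%} of full storage, {bytes_per_vec} bytes/vector")
```

This reproduces the table's 33% at 256d and 8% at 64d.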
## Usage

### Basic Usage (Sentence Transformers)

```python
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer("llm-semantic-router/mmbert-embed-32k-2d-matryoshka")

# Encode sentences in multiple languages
sentences = [
    "This is a test sentence.",
    "这是一个测试句子。",
    "Dies ist ein Testsatz.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)
```
### Matryoshka Dimension Reduction

```python
import torch.nn.functional as F

# Encode with full dimensions
embeddings = model.encode(sentences, convert_to_tensor=True)

# Truncate to a smaller dimension (e.g., 256), then re-normalize
embeddings_256d = embeddings[:, :256]
embeddings_256d = F.normalize(embeddings_256d, p=2, dim=1)

# Or truncate to 64 dimensions for maximum compression
embeddings_64d = embeddings[:, :64]
embeddings_64d = F.normalize(embeddings_64d, p=2, dim=1)
```
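The re-normalization step matters: slicing a unit vector yields a vector with norm below 1, so cosine similarities on raw slices would be mis-scaled. A self-contained check on random stand-in embeddings (no model download needed):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-in for model output: 3 unit-normalized 768-dim embeddings.
embeddings = F.normalize(torch.randn(3, 768), p=2, dim=1)

# Truncate, then re-normalize: slicing breaks the unit norm.
truncated = embeddings[:, :256]
renormalized = F.normalize(truncated, p=2, dim=1)

print(truncated.norm(dim=1))     # norms below 1.0: slices of unit vectors
print(renormalized.norm(dim=1))  # unit norms restored
```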
### Long Context (up to 32K tokens)

```python
# For long documents, raise the sequence length cap
model.max_seq_length = 8192  # or up to 32768

long_document = "..." * 10000  # very long text
embedding = model.encode(long_document)
```
### Layer Reduction (Advanced)

For latency-critical applications, you can extract embeddings from intermediate layers:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
    trust_remote_code=True,
    output_hidden_states=True,
)
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert-embed-32k-2d-matryoshka")

# Encode
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Use layer 6 for a 3.3× speedup (56% STS quality)
hidden = outputs.hidden_states[6]
hidden = model.final_norm(hidden)

# Mean pooling over non-padding tokens
mask = inputs["attention_mask"].unsqueeze(-1).float()
pooled = (hidden * mask).sum(1) / mask.sum(1)
embeddings = F.normalize(pooled, p=2, dim=1)
```
## Evaluation Results

### MTEB Benchmark (24 tasks)

| Category | Score |
|---|---|
| STS (7 tasks) | 79.3 |
| Classification (6 tasks) | 62.4 |
| Pair Classification (2 tasks) | 76.2 |
| Reranking (2 tasks) | 64.4 |
| Clustering (4 tasks) | 36.9 |
| Retrieval (3 tasks) | 38.2 |
| **Overall Mean** | **61.4** |
### STS Benchmark

| Model | Parameters | STS Score |
|---|---|---|
| Qwen3-Embed-0.6B | 600M | 76.17 |
| mmBERT-Embed | 307M | 80.5 |
| Qwen3-Embed-8B | 8B | 81.08 |
### 2D Matryoshka Quality Matrix (STS)

| Layers | 768d | 256d | 64d |
|---|---|---|---|
| 22L | 80.5 | 79.9 | 78.5 |
| 11L | 53.7 | 48.0 | 44.4 |
| 6L | 45.2 | 45.2 | 43.5 |
| 3L | 44.0 | 44.1 | 41.8 |
### Long-Context Retrieval (4K tokens)

| Metric | Score |
|---|---|
| R@1 | 68.8% |
| R@10 | 81.2% |
| MRR | 71.9% |
### Throughput (AMD MI300X)

| Layers | Throughput | Speedup |
|---|---|---|
| 22L | 477/s | 1.0× |
| 11L | 916/s | 1.9× |
| 6L | 1573/s | 3.3× |
| 3L | 2761/s | 5.8× |
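The speedup column is just each configuration's throughput divided by the full 22-layer baseline; a one-liner confirms the rounding:

```python
# Speedup derived from raw throughput (items/s) relative to the 22-layer model.
throughput = {22: 477, 11: 916, 6: 1573, 3: 2761}
speedup = {layers: tps / throughput[22] for layers, tps in throughput.items()}

for layers in (22, 11, 6, 3):
    print(f"{layers}L: {speedup[layers]:.1f}×")
```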
### Latency Comparison vs BGE-M3 and Qwen3-Embedding-0.6B

```rust
use ort::{Session, execution_providers::ROCmExecutionProvider};

// Load models at startup
let fast_model = Session::builder()?
    .with_execution_providers([ROCmExecutionProvider::default().build()])?
    .commit_from_file("onnx/layer-6/model.onnx")?;
let full_model = Session::builder()?
    .with_execution_providers([ROCmExecutionProvider::default().build()])?
    .commit_from_file("onnx/layer-22/model.onnx")?;

// Runtime selection based on latency/quality needs
let embedding = if need_fast_response {
    fast_model.run(inputs)? // ~2.6ms
} else {
    full_model.run(inputs)? // ~11ms
};
```
### Recommended Layer Selection

| Use Case | Layer | Why |
|---|---|---|
| Real-time routing/classification | 6 | Lowest latency (2.56ms) |
| Balanced speed/quality | 11 | Good tradeoff (4.87ms) |
| High accuracy tasks | 16 | Near-full quality (7.64ms) |
| Search/RAG | 22 | Maximum quality (11.37ms) |
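This selection policy can be encoded as a small helper. A hypothetical sketch (the `pick_layer` function and `LAYER_LATENCY_MS` table are illustrative, using the per-layer latencies above):

```python
# Hypothetical helper: pick the deepest layer that fits a latency budget,
# using the per-layer latencies (ms) from the table above.
LAYER_LATENCY_MS = {6: 2.56, 11: 4.87, 16: 7.64, 22: 11.37}

def pick_layer(budget_ms: float) -> int:
    """Deepest (highest-quality) layer whose latency fits the budget;
    falls back to the fastest layer if none fits."""
    viable = [layer for layer, ms in LAYER_LATENCY_MS.items() if ms <= budget_ms]
    return max(viable) if viable else min(LAYER_LATENCY_MS)

print(pick_layer(5.0))   # 11: deepest layer under a 5 ms budget
print(pick_layer(20.0))  # 22: full model fits comfortably
```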
### Why Separate ONNX Models?

Unlike PyTorch, where `output_hidden_states=True` allows selecting a layer at runtime, ONNX graphs are static DAGs: every node executes regardless of which output you read. Separate model files, each ending at the desired depth, are therefore required for a true early-exit speedup.