jina-embeddings-v4
An 8-bit quantized MLX port of jina-embeddings-v4 for Apple Silicon: 2.57× faster inference and 48% less memory than the original PyTorch model.
| | PyTorch (bf16) | MLX (8-bit) |
|---|---|---|
| Text throughput | 11.4 samples/sec | 30.3 samples/sec |
| Image throughput | 9.1 samples/sec | 22.3 samples/sec |
| VRAM | 8.1 GB | 4.2 GB |
| Precision (cosine vs original) | - | 0.998 |
Benchmarked on M3 Ultra, single sample, sequence length 128.
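Throughput numbers like these can be reproduced with a simple timing loop; a minimal sketch with a placeholder `encode` callable (the function name and iteration counts are illustrative, not part of this repo):

```python
import time

def samples_per_sec(encode, n_iters=20, warmup=3):
    """Time repeated single-sample calls to encode() and report samples/sec."""
    for _ in range(warmup):  # warmup iterations excluded from the measurement
        encode()
    start = time.perf_counter()
    for _ in range(n_iters):
        encode()
    return n_iters / (time.perf_counter() - start)
```

With MLX, the timed callable should include `mx.eval(...)` on the result, since MLX evaluates lazily and the compute otherwise would not be measured.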
Install the dependencies (the examples below also use `transformers`, `huggingface_hub`, `pillow`, and `requests`):

```bash
pip install mlx mlx-lm numpy transformers huggingface_hub pillow requests
```
```python
import mlx.core as mx
from huggingface_hub import snapshot_download
from transformers import AutoProcessor

# Download and load
model_dir = snapshot_download("jinaai/jina-embeddings-v4-mlx-8bit")
import sys; sys.path.insert(0, model_dir)
from load_model import load_mlx_model

model = load_mlx_model(model_dir)

# Tokenizer (uses the original v4 processor)
processor = AutoProcessor.from_pretrained("jinaai/jina-embeddings-v4")

# --- Text embedding ---
texts = ["What is machine learning?", "Machine learning is a branch of AI."]
inputs = processor(
    text=["<|im_start|>user\n" + t + "<|im_end|>" for t in texts],
    return_tensors="np",
    padding=True,
    truncation=True,
    max_length=512,
)
text_emb = model.encode_text(
    input_ids=mx.array(inputs["input_ids"]),
    attention_mask=mx.array(inputs["attention_mask"]),
    task="retrieval",
)
mx.eval(text_emb)  # force MLX's lazy evaluation
print(f"Text embeddings: {text_emb.shape}")  # (2, 2048)
```
```python
# --- Image embedding ---
from PIL import Image
import requests

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

img_inputs = processor(
    text=["<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe the image.<|im_end|>"],
    images=[image],
    return_tensors="np",
    padding=True,
)
image_emb = model.encode_image(
    input_ids=mx.array(img_inputs["input_ids"]),
    # flatten patches to (num_patches, patch_dim)
    pixel_values=mx.array(img_inputs["pixel_values"].reshape(-1, img_inputs["pixel_values"].shape[-1])),
    image_grid_thw=[tuple(r) for r in img_inputs["image_grid_thw"]],
    attention_mask=mx.array(img_inputs["attention_mask"]),
    task="retrieval",
)
mx.eval(image_emb)
print(f"Image embedding: {image_emb.shape}")  # (1, 2048)
```
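Text and image embeddings land in the same 2048-dimensional space, so cross-modal retrieval reduces to cosine similarity between them. A minimal numpy sketch (MLX arrays convert with `np.array(...)`; the helper name is illustrative):

```python
import numpy as np

def cosine_matrix(a, b):
    """Pairwise cosine similarities: (m, d) x (n, d) -> (m, n)."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# e.g. scores = cosine_matrix(np.array(text_emb), np.array(image_emb))
# best = scores.argmax(axis=-1)  # best-matching image per text query
```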
Repository contents:

- `model.py` - pure MLX model (vision encoder, text encoder, M-RoPE, compiled forward passes)
- `load_model.py` - weight loading with automatic LoRA fusion, 8-bit quantization, QKV/MLP fusion, and `mx.compile`
- `config.json` - model configuration
- `model-*.safetensors` - pre-quantized weights (3.88 GB)

Optimizations applied automatically during loading:

- `mx.compile` on the text and vision forward passes
- `mx.fast.scaled_dot_product_attention` and `mx.fast.rope`

Cosine similarity vs the original PyTorch outputs:
Threshold: 0.990. All optimizations validated against frozen reference embeddings.
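The validation step can be sketched as a per-row cosine check against frozen reference embeddings (the helper name and variable names here are illustrative, not this repo's API):

```python
import numpy as np

def passes_threshold(emb, ref, threshold=0.990):
    """True if every row of emb has cosine similarity >= threshold with ref."""
    num = np.sum(emb * ref, axis=-1)
    denom = np.linalg.norm(emb, axis=-1) * np.linalg.norm(ref, axis=-1)
    sims = num / denom
    return float(sims.min()) >= threshold
```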
The fused LoRA adapter covers a single task (retrieval only). Other tasks (text-matching, code) require re-loading with a different `task` parameter.

If you use this model, please cite:

```bibtex
@misc{günther2025jinaembeddingsv4,
  title={jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval},
  author={Michael Günther and Saba Sturua and Mohammad Kalim Akram and Isabelle Mohr and Andrei Ungureanu and Sedigheh Eslami and Scott Martens and Bo Wang and Nan Wang and Han Xiao},
  year={2025},
  eprint={2506.18902},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2506.18902},
}
```
Base model: [jinaai/jina-embeddings-v4](https://huggingface.co/jinaai/jina-embeddings-v4)