jina-embeddings-v4
An 8-bit quantized MLX port of jina-embeddings-v4 for Apple Silicon: 2.57× faster inference and 48% less memory than the original PyTorch model.
| | PyTorch (bf16) | MLX (8-bit) |
|---|---|---|
| Text throughput | 11.4 samples/sec | 30.3 samples/sec |
| Image throughput | 9.1 samples/sec | 22.3 samples/sec |
| VRAM | 8.1 GB | 4.2 GB |
| Precision (cosine vs original) | - | 0.998 |
Benchmarked on M3 Ultra, single sample, sequence length 128.
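Throughput numbers like these can be reproduced with a simple timing loop; a minimal sketch with a placeholder `encode` callable (the function name and iteration counts are illustrative, not part of this repo):

```python
import time

def samples_per_sec(encode, n_iters=20, warmup=3):
    """Time repeated single-sample calls to encode() and report samples/sec."""
    for _ in range(warmup):  # warmup iterations excluded from the measurement
        encode()
    start = time.perf_counter()
    for _ in range(n_iters):
        encode()
    return n_iters / (time.perf_counter() - start)
```

With MLX, the timed callable should include `mx.eval(...)` on the result, since MLX evaluates lazily and the compute otherwise would not be measured.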
Install the dependencies (the examples below also use `transformers`, `huggingface_hub`, `pillow`, and `requests`):

```bash
pip install mlx mlx-lm numpy transformers huggingface_hub pillow requests
```
```python
import mlx.core as mx
from huggingface_hub import snapshot_download
from transformers import AutoProcessor

# Download and load
model_dir = snapshot_download("jinaai/jina-embeddings-v4-mlx-8bit")
import sys; sys.path.insert(0, model_dir)
from load_model import load_mlx_model

model = load_mlx_model(model_dir)

# Tokenizer (uses the original v4 processor)
processor = AutoProcessor.from_pretrained("jinaai/jina-embeddings-v4")

# --- Text embedding ---
texts = ["What is machine learning?", "Machine learning is a branch of AI."]
inputs = processor(
    text=["<|im_start|>user\n" + t + "<|im_end|>" for t in texts],
    return_tensors="np",
    padding=True,
    truncation=True,
    max_length=512,
)
text_emb = model.encode_text(
    input_ids=mx.array(inputs["input_ids"]),
    attention_mask=mx.array(inputs["attention_mask"]),
    task="retrieval",
)
mx.eval(text_emb)  # force MLX's lazy evaluation
print(f"Text embeddings: {text_emb.shape}")  # (2, 2048)
```
```python
# --- Image embedding ---
from PIL import Image
import requests

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

img_inputs = processor(
    text=["<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe the image.<|im_end|>"],
    images=[image],
    return_tensors="np",
    padding=True,
)
image_emb = model.encode_image(
    input_ids=mx.array(img_inputs["input_ids"]),
    # flatten patches to (num_patches, patch_dim)
    pixel_values=mx.array(img_inputs["pixel_values"].reshape(-1, img_inputs["pixel_values"].shape[-1])),
    image_grid_thw=[tuple(r) for r in img_inputs["image_grid_thw"]],
    attention_mask=mx.array(img_inputs["attention_mask"]),
    task="retrieval",
)
mx.eval(image_emb)
print(f"Image embedding: {image_emb.shape}")  # (1, 2048)
```
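Text and image embeddings land in the same 2048-dimensional space, so cross-modal retrieval reduces to cosine similarity between them. A minimal numpy sketch (MLX arrays convert with `np.array(...)`; the helper name is illustrative):

```python
import numpy as np

def cosine_matrix(a, b):
    """Pairwise cosine similarities: (m, d) x (n, d) -> (m, n)."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# e.g. scores = cosine_matrix(np.array(text_emb), np.array(image_emb))
# best = scores.argmax(axis=-1)  # best-matching image per text query
```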
Repository contents:

- `model.py` - pure MLX model (vision encoder, text encoder, M-RoPE, compiled forward passes)
- `load_model.py` - weight loading with automatic LoRA fusion, 8-bit quantization, QKV/MLP fusion, and `mx.compile`
- `config.json` - model configuration
- `model-*.safetensors` - pre-quantized weights (3.88 GB)

Optimizations applied automatically during loading:

- `mx.compile` on the text and vision forward passes
- `mx.fast.scaled_dot_product_attention` and `mx.fast.rope`

Cosine similarity vs the original PyTorch outputs:
Threshold: 0.990. All optimizations validated against frozen reference embeddings.
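The validation step can be sketched as a per-row cosine check against frozen reference embeddings (the helper name and variable names here are illustrative, not this repo's API):

```python
import numpy as np

def passes_threshold(emb, ref, threshold=0.990):
    """True if every row of emb has cosine similarity >= threshold with ref."""
    num = np.sum(emb * ref, axis=-1)
    denom = np.linalg.norm(emb, axis=-1) * np.linalg.norm(ref, axis=-1)
    sims = num / denom
    return float(sims.min()) >= threshold
```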
The fused LoRA adapter covers a single task (retrieval only). Other tasks (text-matching, code) require re-loading with a different `task` parameter.

If you use this model, please cite:

```bibtex
@misc{günther2025jinaembeddingsv4,
  title={jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval},
  author={Michael Günther and Saba Sturua and Mohammad Kalim Akram and Isabelle Mohr and Andrei Ungureanu and Sedigheh Eslami and Scott Martens and Bo Wang and Nan Wang and Han Xiao},
  year={2025},
  eprint={2506.18902},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2506.18902},
}
```
Base model: [jinaai/jina-embeddings-v4](https://huggingface.co/jinaai/jina-embeddings-v4)