jina-embeddings-v4-mlx-8bit

8-bit quantized MLX port of jina-embeddings-v4 for Apple Silicon.

Up to 2.57x faster inference and 48% less memory than the original PyTorch model.

                                 PyTorch (bf16)     MLX (8-bit)
Text throughput                  11.4 samples/sec   30.3 samples/sec
Image throughput                 9.1 samples/sec    22.3 samples/sec
VRAM                             8.1 GB             4.2 GB
Precision (cosine vs original)   -                  0.998

Benchmarked on M3 Ultra, single sample, sequence length 128.

Installation

pip install mlx mlx-lm numpy

Usage

import mlx.core as mx
from huggingface_hub import snapshot_download
from transformers import AutoProcessor

# Download and load
model_dir = snapshot_download("jinaai/jina-embeddings-v4-mlx-8bit")

import sys
sys.path.insert(0, model_dir)
from load_model import load_mlx_model
model = load_mlx_model(model_dir)

# Tokenizer (uses the original v4 processor)
processor = AutoProcessor.from_pretrained("jinaai/jina-embeddings-v4")

# --- Text embedding ---
texts = ["What is machine learning?", "Machine learning is a branch of AI."]
inputs = processor(
    text=["<|im_start|>user\n" + t + "<|im_end|>" for t in texts],
    return_tensors="np",
    padding=True,
    truncation=True,
    max_length=512,
)
text_emb = model.encode_text(
    input_ids=mx.array(inputs["input_ids"]),
    attention_mask=mx.array(inputs["attention_mask"]),
    task="retrieval",
)
mx.eval(text_emb)
print(f"Text embeddings: {text_emb.shape}")  # (2, 2048)

# --- Image embedding ---
from PIL import Image
import requests

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

img_inputs = processor(
    text=["<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe the image.<|im_end|>"],
    images=[image],
    return_tensors="np",
    padding=True,
)
image_emb = model.encode_image(
    input_ids=mx.array(img_inputs["input_ids"]),
    pixel_values=mx.array(img_inputs["pixel_values"].reshape(-1, img_inputs["pixel_values"].shape[-1])),
    image_grid_thw=[tuple(r) for r in img_inputs["image_grid_thw"]],
    attention_mask=mx.array(img_inputs["attention_mask"]),
    task="retrieval",
)
mx.eval(image_emb)
print(f"Image embedding: {image_emb.shape}")  # (1, 2048)
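Text and image embeddings share the same 2048-dimensional space, so retrieval reduces to cosine similarity between the two. A minimal scoring sketch, using random arrays as stand-ins for the `text_emb` and `image_emb` produced above (shapes match the printed outputs):

```python
import numpy as np

# Stand-ins for real model outputs: two text vectors and one image vector.
rng = np.random.default_rng(0)
text_emb = rng.standard_normal((2, 2048)).astype(np.float32)
image_emb = rng.standard_normal((1, 2048)).astype(np.float32)

def cosine_scores(queries: np.ndarray, docs: np.ndarray) -> np.ndarray:
    """Cosine similarity between every query/document pair."""
    q = queries / np.linalg.norm(queries, axis=-1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=-1, keepdims=True)
    return q @ d.T

scores = cosine_scores(text_emb, image_emb)  # shape (2, 1)
best = int(scores[:, 0].argmax())            # index of the best-matching text
```

With real embeddings, higher scores indicate closer semantic matches; cross-modal text-to-image retrieval ranks candidates by this score.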

What's included

  • model.py - Pure MLX model (vision encoder, text encoder, M-RoPE, compiled forward passes)
  • load_model.py - Weight loading with automatic LoRA fusion, 8-bit quantization, QKV/MLP fusion, and mx.compile
  • config.json - Model configuration
  • model-*.safetensors - Pre-quantized weights (3.88 GB)

Optimizations

Applied automatically during loading:

  1. LoRA adapter fusion (retrieval task baked in)
  2. 8-bit quantization (group_size=128)
  3. QKV projection fusion (3 matmuls to 1 per layer)
  4. Gate/up MLP fusion (2 matmuls to 1 per layer)
  5. mx.compile on text and vision forward passes
  6. mx.fast.scaled_dot_product_attention and mx.fast.rope
  7. GPU stream prefetching for pipelined text/image encoding
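Two of these steps can be sketched in a few lines, using numpy as a stand-in for mlx.core. The quantization shown is a simplified symmetric variant; MLX's actual 8-bit scheme is affine (per-group scales and biases) and differs in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Optimization 3: QKV projection fusion ---
# Concatenating the three projection weights turns three matmuls per
# layer into one; the fused output is split back into q, k, v.
d = 64
x = rng.standard_normal((8, d)).astype(np.float32)  # token activations
w_q, w_k, w_v = (rng.standard_normal((d, d)).astype(np.float32) for _ in range(3))

q, k, v = x @ w_q, x @ w_k, x @ w_v                 # unfused: 3 matmuls
w_qkv = np.concatenate([w_q, w_k, w_v], axis=1)     # (d, 3d)
q2, k2, v2 = np.split(x @ w_qkv, 3, axis=1)         # fused: 1 matmul + split

assert np.allclose(q, q2, atol=1e-5)

# --- Optimization 2: 8-bit group quantization (group_size=128) ---
# Each group of 128 weights shares one scale; values are stored in 8 bits.
w = rng.standard_normal((2, 128)).astype(np.float32)   # 2 groups of 128
scale = np.abs(w).max(axis=1, keepdims=True) / 127.0   # per-group scale
q8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dq = q8.astype(np.float32) * scale                   # dequantized weights
```

Fusion helps because one large matmul keeps the GPU better utilized than three small ones; group-wise scales keep quantization error local to each block of 128 weights.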

Precision

Cosine similarity vs PyTorch original outputs:

  • Text: 0.998
  • Image: 0.998

Acceptance threshold: 0.990. All optimizations were validated against frozen reference embeddings.
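A minimal sketch of such a validation check, with synthetic stand-ins for the frozen reference and the optimized model's embeddings (the real check compares actual model outputs):

```python
import numpy as np

THRESHOLD = 0.990  # acceptance threshold from the model card

def mean_cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Mean row-wise cosine similarity between two embedding matrices."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return float((a * b).sum(axis=-1).mean())

rng = np.random.default_rng(0)
reference = rng.standard_normal((4, 2048)).astype(np.float32)
# Optimized outputs drift slightly from the reference (quantization noise).
optimized = reference + 0.01 * rng.standard_normal((4, 2048)).astype(np.float32)

score = mean_cosine(optimized, reference)
assert score >= THRESHOLD, f"precision regression: {score:.4f} < {THRESHOLD}"
```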

Limitations

  • The LoRA task is fused at load time (currently retrieval only); other tasks (text-matching, code) require reloading the model with a different task parameter.
  • Requires Apple Silicon Mac with MLX support.

Citation

@misc{günther2025jinaembeddingsv4,
    title={jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval}, 
    author={Michael Günther and Saba Sturua and Mohammad Kalim Akram and Isabelle Mohr and Andrei Ungureanu and Sedigheh Eslami and Scott Martens and Bo Wang and Nan Wang and Han Xiao},
    year={2025},
    eprint={2506.18902},
    archivePrefix={arXiv},
    primaryClass={cs.AI},
    url={https://arxiv.org/abs/2506.18902}, 
}