# Arabic Legal Documents OCR 1.0 (VLM Finetuned)
Watch the Full 3.5-Hour Masterclass on YouTube
This model is a finetuned version of Gemma-3-4B-IT, optimized for extracting structured data from low-quality, scanned Arabic legal documents using Vision Language Model reasoning.
## Installation
Depending on your usage (Local Inference vs. Production Serving), install the required packages:
### For Transformers (Local Inference)

```bash
pip install torch transformers==4.57.6 optimum==1.26.0 accelerate==1.8.0 peft==0.17.0 json-repair pillow
```
### For vLLM (High-Performance Serving)

```bash
pip install -q transformers==4.57.6
pip install -q optimum==1.26.0
pip install -q datasets==4.4.0
pip install -q torch==2.8.0
pip install -q torchvision==0.23
pip install -q torchaudio==2.8.0
pip install -q vllm==0.15.0
pip install -q json-repair
```
## Mandatory Image Preprocessing

To achieve the best OCR results, images must be preprocessed (resized, converted to grayscale, and contrast-enhanced) before being sent to the model. Below is a utility function that covers both standard PIL usage and Base64 output (for the vLLM/OpenAI API).
```python
import base64
from io import BytesIO

from PIL import Image, ImageEnhance


def preprocess_image(image_path, max_width=1024, do_enhance=True, return_base64=False):
    image = Image.open(image_path)

    # 1. Convert to grayscale
    gray_image = image.convert('L')

    # 2. Resize, maintaining the aspect ratio
    if gray_image.width > max_width:
        ratio = max_width / float(gray_image.width)
        new_height = int(gray_image.height * ratio)
        gray_image = gray_image.resize((max_width, new_height), Image.LANCZOS)

    # 3. Enhance contrast
    if do_enhance:
        enhancer = ImageEnhance.Contrast(gray_image)
        gray_image = enhancer.enhance(1.5)

    # Return a Base64 data URI for API-based serving (vLLM / OpenAI-compatible)
    if return_base64:
        buffered = BytesIO()
        gray_image.save(buffered, format="JPEG", optimize=True, quality=95)
        img_str = base64.b64encode(buffered.getvalue()).decode('utf-8')
        return f"data:image/jpeg;base64,{img_str}"

    # Otherwise return the PIL image directly (Transformers usage)
    return gray_image
```
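
A quick sanity check of the helper; the file name below is just a placeholder:

```python
# Placeholder file name; point this at one of your own scans.
img = preprocess_image("sample_scan.jpg", max_width=1024, return_base64=False)
print(img.size, img.mode)  # mode 'L' indicates grayscale

# Optionally save the cleaned-up version to inspect it visually.
img.save("sample_scan_preprocessed.jpg")
```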
## Usage Examples
### 1. Using Transformers & json-repair
```python
import torch
import json_repair
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "bakrianoo/arabic-legal-documents-ocr-1.0"
model = Gemma3ForConditionalGeneration.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id)

# Preprocess the image first
processed_img = preprocess_image("document.jpg", return_base64=False)

messages = [
    {"role": "user", "content": [
        {"type": "image", "image": processed_img},
        {"type": "text", "text": "Extract details to JSON."},
    ]}
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)
output = model.generate(**inputs, max_new_tokens=2048)

# Decode only the newly generated tokens (skip the prompt)
raw_text = processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

# Fix and parse the JSON output
json_data = json_repair.loads(raw_text)
print(json_data)
```
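
To run the same extraction over a folder of scans, the steps above can be wrapped in a simple loop. This is a minimal sketch that reuses `model`, `processor`, and `preprocess_image` from above; the `scans/` and `extracted/` directory names are assumptions:

```python
import os
import json

input_dir = "scans"        # assumed folder of scanned pages
output_dir = "extracted"   # assumed folder for the JSON results
os.makedirs(output_dir, exist_ok=True)

for name in os.listdir(input_dir):
    if not name.lower().endswith((".jpg", ".jpeg", ".png")):
        continue

    image = preprocess_image(os.path.join(input_dir, name))
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Extract details to JSON."},
    ]}]
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(**inputs, max_new_tokens=2048)

    # Decode only the generated tokens and repair the JSON before saving
    record = json_repair.loads(
        processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
    )
    out_path = os.path.join(output_dir, os.path.splitext(name)[0] + ".json")
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False, indent=2)
```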
### 2. Using the vLLM API
**Run the vLLM server**

```bash
vllm serve "bakrianoo/arabic-legal-documents-ocr-1.0" \
    --dtype bfloat16 --gpu-memory-utilization 0.8 \
    --enable-chunked-prefill \
    --allowed-local-media-path "/workspace/"
```
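
Once the server is running, you can confirm the model is registered before sending OCR requests. A minimal check, assuming the default port 8000 and the `openai` client used below:

```python
from openai import OpenAI

# Point the OpenAI-compatible client at the local vLLM server.
client = OpenAI(api_key="any", base_url="http://localhost:8000/v1")

# The finetuned OCR model should appear among the served models.
for served_model in client.models.list().data:
    print(served_model.id)
```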
**Inference**
```python
from openai import OpenAI
import json_repair

client = OpenAI(api_key="any", base_url="http://localhost:8000/v1")

# Preprocess to a Base64 data URI
b64_image = preprocess_image("document.jpg", return_base64=True)

response = client.chat.completions.create(
    model="bakrianoo/arabic-legal-documents-ocr-1.0",
    messages=[{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": b64_image}},
        {"type": "text", "text": "Extract details to JSON."},
    ]}],
)

# Robust parsing
structured_output = json_repair.loads(response.choices[0].message.content)
```
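
`json_repair.loads` returns a best-effort object even when the model's output is truncated or malformed, so it is worth validating the result before using it downstream. A minimal sketch; the field names are illustrative, not the model's fixed schema:

```python
# Illustrative field names only; adapt them to the schema your prompt asks for.
expected_keys = {"document_type", "parties", "date"}

if not isinstance(structured_output, dict):
    raise ValueError(f"Unexpected output type: {type(structured_output)}")

missing = expected_keys - structured_output.keys()
if missing:
    print(f"Warning: missing fields: {missing}")
```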
## Full Tutorial
Watch the detailed walkthrough on YouTube to understand the training pipeline: VLM Finetuning for OCR Tasks
## Resources

- LoRA Adapter: https://huggingface.co/bakrianoo/arabic-legal-documents-ocr-1.0/tree/main/checkpoints
- Data: https://huggingface.co/bakrianoo/arabic-legal-documents-ocr-1.0/tree/main/data
- Scripts: https://huggingface.co/bakrianoo/arabic-legal-documents-ocr-1.0/tree/main/scripts