Gaperon Quality Classifier

Gaperon Quality Classifier is a multilingual document quality classifier based on XLM-V base, fine-tuned to assess the quality of web-crawled documents in French and English. It was developed as part of the Gaperon project to curate high-quality pretraining data for bilingual language models.

Model Details

  • Model Type: Text Classification (Document Quality)
  • Architecture: XLM-V base
  • Base Model: facebook/xlm-v-base
  • Languages: French, English
  • License: Apache 2.0
  • Developed by: ALMAnaCH team, Inria Paris
  • Output Labels: low, medium, high
  • F1 Score: 75.11%

Intended Use

This classifier is designed for:

  • Filtering large-scale web-crawled corpora for language model pretraining
  • Assessing document quality based on linguistic and content criteria
  • Sample weighting in pretraining data mixtures

Unlike educational-value classifiers (e.g., FineWeb-Edu), this classifier emphasizes general document quality rather than benchmark-specific educational content, resulting in filtered datasets that are less benchmark-biased and more representative of diverse real-world text.
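
As a rough illustration of the filtering use case, the sketch below keeps a document if it is labeled high, or medium with sufficient confidence. The 0.5 threshold is an arbitrary assumption for illustration, not a recommended setting; label names follow this card's output labels.

from transformers import pipeline

classifier = pipeline("text-classification", model="almanach/gaperon-quality-classifier")

def keep_document(text, min_medium_score=0.5):
    # Score one document; truncation keeps inputs within the 512-token limit.
    pred = classifier(text, truncation=True)[0]
    if pred["label"] == "high":
        return True
    return pred["label"] == "medium" and pred["score"] >= min_medium_score

corpus = ["First web-crawled document ...", "Second web-crawled document ..."]
filtered = [doc for doc in corpus if keep_document(doc)]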

Quality Criteria

The classifier was trained to evaluate documents on the following criteria:

  • Content Accuracy: Factual reliability and use of credible sources
  • Clarity: Clear explanations, well-defined terms, logical flow
  • Coherence: Overall organization and logical progression
  • Grammar and Language: Correctness and audience appropriateness
  • Depth of Information: Level of detail and comprehensiveness
  • Overall Usefulness: Relevance and practical value for a general audience

Training Data

Annotation Process

The classifier was trained on 500,000 annotated documents:

  • 250,000 documents from RedPajama-V2-French (RPv2-Fr)
  • 250,000 documents from TxT360-CC (English)

Synthetic Labeling

Document labels were generated using Llama-3.1-70B-Instruct, prompted to evaluate each document and assign a quality label (low, medium, or high) along with a short justification. Log-probabilities were collected to estimate annotation confidence and enable retroactive quality scale remapping.
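
As a sketch of how such confidences can be derived, the snippet below renormalizes the log-probabilities the annotator assigns to the three label words; the numeric values are invented for illustration.

import numpy as np

# Invented log-probabilities assigned by the annotator LLM to each candidate
# label at the "Quality score:" position.
label_logprobs = {"low": -3.2, "medium": -0.4, "high": -1.5}

labels = list(label_logprobs)
logits = np.array([label_logprobs[name] for name in labels])
probs = np.exp(logits - logits.max())
probs /= probs.sum()  # renormalize over the three labels only

hard_label = labels[int(np.argmax(probs))]
confidence = float(probs.max())

# A soft score on a 0-2 scale supports retroactive remapping of the quality
# scale, e.g. choosing a different cut-off between "medium" and "high".
soft_score = float(sum(p * rank for rank, p in enumerate(probs)))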

Prompt used to generate labels

Below is an extract from a web page. Evaluate the quality of the content based on the following factors:

1. Content Accuracy: Assess the correctness and reliability of the information presented. Consider the factual accuracy, use of credible sources (if mentioned), and absence of misinformation.
2. Clarity: Evaluate how well the information is communicated. Look for clear explanations, well-defined terms, and logical flow of ideas.
3. Coherence: Analyze the overall structure and organization of the content. Consider how well ideas are connected and if the content follows a logical progression.
4. Grammar and Language: Assess the quality of writing, including correct grammar, spelling, and punctuation. Consider the appropriateness of language for the intended audience.
5. Depth of Information: Evaluate the level of detail and thoroughness of the content. Consider whether it provides surface-level information or delves into more comprehensive explanations.
6. Overall Usefulness: Assess the practical value and relevance of the information for a general audience. Consider how applicable or helpful the content would be for someone seeking information on the topic.

Based on these factors, give an overall quality score of low, medium, or high.
Additionally, select one or more domains from the list below. Each domain listed is a single, combined category. Choose the most relevant domain(s). Domain(s) can only be chosen from the list below. Only select "Other" if none of the listed domains are applicable.
- Arts
- Business & Economics & Finance
- Culture & Cultural geography
- Daily Life & Home & Lifestyle
- Education
- Entertainment & Travel & Hobby
- Environment
- Food & Drink & Cooking
- Health & Wellness & Medicine
- Law & Justice
- Natural Science & Formal Science & Technology
- Personal Development & Human Resources & Career
- Politics & Government
- Religion & Spirituality
- Shopping & Commodity
- Society & Social Issues & Human Rights
- Sports
- Other (only if none of the above are relevant)
Additionally, identify the main topic of the extract, which can be any relevant subfield. Don't elaborate on the topic; just provide a concise classification.
Additionally, identify the document type, which can be article, blog post, forum post, or any other relevant type. Don't elaborate on the type; just provide a concise classification.

USER PROMPT:
The extract:
{DOCUMENT}

After examining the extract:
- Briefly justify your quality classification, up to 100 words on one line using the format: "Explanation: <justification>"
- Conclude with the quality classification using the format: "Quality score: <classification>" (on a separate line)
- Continue with the domain classification using the format: "Domain: <classification>, <classification>, ..." (on a separate line)
- Continue with the main topic or subject classification using the format: "Main topic: <classification>" (on a separate line)
- Continue with the document type classification using the format: "Document type: <classification>" (on a separate line)

Evaluate the content based on the quality factors outlined above.
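
Completions in this format can be parsed back into structured annotations with a few regular expressions. The sketch below shows one way to do so; the example completion is invented.

import re

# Invented example completion in the requested format.
completion = """Explanation: Clear, well-organized overview with accurate definitions.
Quality score: high
Domain: Education, Natural Science & Formal Science & Technology
Main topic: linear algebra
Document type: article"""

def parse_annotation(text):
    def field(name):
        match = re.search(rf"^{name}:\s*(.+)$", text, flags=re.MULTILINE)
        return match.group(1).strip() if match else None

    return {
        "explanation": field("Explanation"),
        "quality": field("Quality score"),
        "domains": [d.strip() for d in (field("Domain") or "").split(",") if d.strip()],
        "main_topic": field("Main topic"),
        "document_type": field("Document type"),
    }

print(parse_annotation(completion))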

Training Procedure

Training Details

  • Task: Single-task quality classification (a minimal fine-tuning sketch follows below)
  • Abandoned approach: Multitask learning (joint quality and domain prediction) was also explored but underperformed the single-task setup
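
The exact hyper-parameters are not listed on this card; the following is a minimal single-task fine-tuning sketch with transformers, assuming a dataset with "text" and integer "label" columns and illustrative hyper-parameter values.

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["low", "medium", "high"]
tokenizer = AutoTokenizer.from_pretrained("facebook/xlm-v-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "facebook/xlm-v-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={name: i for i, name in enumerate(labels)},
)

# Hypothetical annotation files with "text" and "label" (0/1/2) columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
)

args = TrainingArguments(
    output_dir="gaperon-quality-classifier",
    per_device_train_batch_size=32,  # illustrative value
    learning_rate=2e-5,              # illustrative value
    num_train_epochs=1,              # illustrative value
)
trainer = Trainer(model=model, args=args, train_dataset=dataset["train"],
                  eval_dataset=dataset["validation"], tokenizer=tokenizer)
trainer.train()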

Performance

F1 Score: 75.11%

Confusion Matrix

True \ Predicted     Low    Medium    High
Low                  922       463      77
Medium               203     5,219     623
High                  32       531   1,930

Most errors occur between adjacent labels (e.g., medium vs. high/low), while confusion between extreme categories (high vs. low) is limited.
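
Per-class precision, recall, and F1 can be recomputed from this matrix as sketched below. Note that the macro average obtained this way need not match the headline 75.11% exactly, since the averaging convention and evaluation split behind that figure are not specified here.

import numpy as np

# Rows are true labels, columns are predicted labels, in the order [low, medium, high].
cm = np.array([[922, 463, 77],
               [203, 5219, 623],
               [32, 531, 1930]])

tp = np.diag(cm)
precision = tp / cm.sum(axis=0)  # per predicted class
recall = tp / cm.sum(axis=1)     # per true class
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(["low", "medium", "high"], precision, recall, f1):
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
print(f"macro F1: {f1.mean():.3f}")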

Usage

from transformers import pipeline

classifier = pipeline("text-classification", model="almanach/gaperon-quality-classifier")
documents = ["Your document text goes here."]
# Truncate inputs to the model's 512-token maximum length.
results = classifier(documents, truncation=True)
for result in results:
    print(f"Label: {result['label']}, Score: {result['score']}")

Deploying with an AMD MIGraphX inference server is also supported for optimized performance.

Inference Server Code
import asyncio
import json
import logging
import logging.config
import os
import time
from typing import List, Optional

import migraphx as mgx
import numpy as np
import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoTokenizer

MAX_BATCH_SIZE = int(os.getenv("MAX_BATCH_SIZE", 512))
label_list = os.getenv("LABEL_LIST", "")
if not label_list:
    raise ValueError("LABEL_LIST environment variable is required")
elif "json" in label_list:
    # laoding from config file
    id2label = json.loads(label_list)["id2label"]
    # convert keys to int
    id2label = {int(k): v for k, v in id2label.items()}
    # list sorted by key
    label_list = [id2label[i] for i in sorted(id2label.keys())]
else:
    label_list = label_list.split(",")

assert len(label_list) > 0, "LABEL_LIST environment variable is required"
print(f"Label list: {label_list}")

MODEL_PATH = os.getenv("MODEL_PATH", None)
assert MODEL_PATH is not None, "MODEL_PATH environment variable is required"
TOKENIZER_PATH = os.getenv("TOKENIZER_PATH", None)
assert TOKENIZER_PATH is not None, "TOKENIZER_PATH environment variable is required"


model = mgx.load(MODEL_PATH, format="msgpack")
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH)

LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": True,
    "formatters": {
        "standard": {
            "format": "%(process)d %(asctime)s [%(levelname)s] %(name)s: %(message)s"
        },
    },
    "handlers": {
        "default": {
            "level": "INFO",
            "formatter": "standard",
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stdout",  # Default is stderr
        },
    },
    "loggers": {
        "": {  # root logger
            "level": "INFO",  # "INFO",
            "handlers": ["default"],
            "propagate": False,
        },
        "uvicorn.error": {
            "level": "DEBUG",
            "handlers": ["default"],
        },
        "uvicorn.access": {
            "level": "WARNING",
            "handlers": ["default"],
        },
    },
}

logging.config.dictConfig(LOGGING_CONFIG)

logger = logging.getLogger(__name__)
logger.info("Starting FastAPI server...")
logger.info(f"Model path: {MODEL_PATH}")
logger.info(f"Tokenizer path: {TOKENIZER_PATH}")
logger.info(f"Label list: {label_list}")
app = FastAPI()


class InputData(BaseModel):
    text: str


# Update BatchInputData model
class BatchInputData(BaseModel):
    texts: Optional[List[str]] = None
    input_ids: Optional[List[List[int]]] = None
    attention_mask: Optional[List[List[int]]] = None
    token_type_ids: Optional[List[List[int]]] = None
    is_pre_tokenized: bool = False


class LabelScore(BaseModel):
    label: str
    score: float


class BatchOutputData(BaseModel):
    results: List[List[LabelScore]]


def softmax(_outputs, axis=-1):
    maxes = np.max(_outputs, axis=axis, keepdims=True)
    shifted_exp = np.exp(_outputs - maxes)
    return shifted_exp / shifted_exp.sum(axis=axis, keepdims=True)


# Asynchronous function to tokenize the batch
async def tokenize_batch(texts):
    tokenized_batch = tokenizer(
        texts,
        truncation=True,
        padding="max_length",
        max_length=512,
        return_tensors="np",
        return_attention_mask=True,
        return_token_type_ids=True,
    )
    return {
        "input_ids": tokenized_batch["input_ids"],
        "attention_mask": tokenized_batch["attention_mask"],
        "token_type_ids": tokenized_batch["token_type_ids"],
    }


# Function to run model inference (blocking)
def run_inference(batch):
    logits = np.array(model.run(batch)).reshape(-1, len(label_list))
    return softmax(logits, axis=-1)


# Queues for tokenization and inference
tokenization_queue = asyncio.Queue()
inference_queue = asyncio.Queue()


# Consumer for inference
async def inference_consumer():
    while True:
        tokenized_batch, result_future = await inference_queue.get()
        try:
            # async with inference_semaphore:
            # Run inference on the GPU
            result = run_inference(tokenized_batch)

            result_future.set_result(result)  # Set the result for the future
        except Exception as e:
            result_future.set_exception(e)
        finally:
            inference_queue.task_done()


# Consumer for tokenization
async def tokenization_consumer():
    while True:
        texts, result_future = await tokenization_queue.get()
        try:
            # async with tokenization_semaphore:
            # Tokenize the batch asynchronously (CPU task)
            tokenized_batch = await tokenize_batch(texts)

            # Once tokenized, queue for inference (GPU task)
            await inference_queue.put((tokenized_batch, result_future))
        except Exception as e:
            result_future.set_exception(e)
        finally:
            tokenization_queue.task_done()


# Background tasks for tokenization and inference consumers
# Define semaphores for tokenization and inference
# tokenization_semaphore = asyncio.Semaphore(10)  # Limit to 5 concurrent tokenizations
# inference_semaphore = asyncio.Semaphore(5)  # Limit to 5 concurrent inferences


@app.on_event("startup")
async def startup_event():
    asyncio.create_task(tokenization_consumer())
    asyncio.create_task(inference_consumer())


@app.post("/label")
async def label_text(data: BatchInputData):
    if data.is_pre_tokenized:
        # Validate pre-tokenized inputs
        if not all([data.input_ids, data.attention_mask, data.token_type_ids]):
            raise HTTPException(
                status_code=400,
                detail="When is_pre_tokenized is True, input_ids, attention_mask, and token_type_ids are required.",
            )

        # Ensure batch sizes are consistent
        batch_size = len(data.input_ids)
        if any(
            len(lst) != batch_size for lst in [data.attention_mask, data.token_type_ids]
        ):
            raise HTTPException(
                status_code=400,
                detail="All pre-tokenized inputs (input_ids, attention_mask, token_type_ids) must have the same batch size.",
            )

        # Package the pre-tokenized inputs for inference
        tokenized_batch = {
            "input_ids": np.array(data.input_ids, dtype=np.int64),
            "attention_mask": np.array(data.attention_mask, dtype=np.int64),
            "token_type_ids": np.array(data.token_type_ids, dtype=np.int64),
        }

        # Create a future for inference
        result_future = asyncio.get_event_loop().create_future()

        # Directly add the pre-tokenized data to the inference queue
        await inference_queue.put((tokenized_batch, result_future))

    else:
        # Validate and process texts for tokenization
        if not data.texts:
            raise HTTPException(
                status_code=400,
                detail="Texts field is required when is_pre_tokenized is False.",
            )

        if len(data.texts) > MAX_BATCH_SIZE:
            raise HTTPException(
                status_code=400, detail=f"Batch size is too large (> {MAX_BATCH_SIZE})"
            )

        # Create a future for tokenization and inference
        result_future = asyncio.get_event_loop().create_future()

        # Add the texts to the tokenization queue
        await tokenization_queue.put((data.texts, result_future))

    # Wait for the future result to be set (after tokenization and/or inference completes)
    predictions = await result_future

    # Process the results into the desired format
    results = [
        [LabelScore(label=label, score=score) for label, score in zip(label_list, pred)]
        for pred in predictions
    ]
    # Sort the results by score
    results = [
        sorted(result, key=lambda x: x.score, reverse=True) for result in results
    ]

    return {"results": results}


@app.get("/health")
def health():
    # check if current SLURM job is ending soon
    slurm_job_end_time = os.getenv("SLURM_JOB_END_TIME", None)
    if slurm_job_end_time is not None:
        slurm_job_end_time = int(slurm_job_end_time)
        if slurm_job_end_time - time.time() < 300:
            return {"status": "ending"}

    return {"status": "ok"}


@app.get("/get_job_info")
def get_job_info():
    job_info = {}
    for key in os.environ:
        if key.startswith("SLURM_"):
            job_info[key] = os.getenv(key)
    return job_info


# Run with: python app.py (assumes this file is saved as app.py)
if __name__ == "__main__":
    uvicorn.run("app:app", host="0.0.0.0", port=8000, reload=True)
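
A minimal client for the /label endpoint above might look like the following, assuming the server is reachable at http://localhost:8000.

import requests

# Send a small batch of raw texts to the /label endpoint.
response = requests.post(
    "http://localhost:8000/label",
    json={"texts": ["Your document text goes here."], "is_pre_tokenized": False},
    timeout=60,
)
response.raise_for_status()

for doc_scores in response.json()["results"]:
    best = doc_scores[0]  # results are sorted by score, best label first
    print(best["label"], best["score"])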

Dockerfile for inference server:

FROM rocm/pytorch:rocm6.0_ubuntu20.04_py3.9_pytorch_2.1.1

ARG ONNXRUNTIME_REPO=https://github.com/Microsoft/onnxruntime
ARG ONNXRUNTIME_BRANCH=v1.17.3

ENV PATH /code/cmake-3.27.3-linux-x86_64/bin:${PATH}

RUN apt-get update &&\
    apt-get install -y migraphx

WORKDIR /install_dir

# Prepare onnxruntime repository & build onnxruntime
RUN git clone --single-branch --branch ${ONNXRUNTIME_BRANCH} --recursive ${ONNXRUNTIME_REPO} onnxruntime &&\
    /bin/sh onnxruntime/dockerfiles/scripts/install_common_deps.sh &&\
    cd onnxruntime  && pip install --upgrade pip &&\
    /bin/sh ./build.sh --allow_running_as_root --cmake_extra_defines ONNXRUNTIME_VERSION=`cat ./VERSION_NUMBER` --config Release --parallel \
    --skip_tests --build_wheel --use_rocm --rocm_version=${ROCM_VERSION} --rocm_home /opt/rocm --use_migraphx && \
    pip install /install_dir/onnxruntime/build/Linux/Release/dist/*.whl

RUN pip install --upgrade --upgrade-strategy eager optimum[amd]==1.22.0 fastapi[standard]

WORKDIR /workspace
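
The server loads a MIGraphX program serialized as msgpack. One way to produce such a file from an ONNX export of the classifier (for example, an export created with optimum) is sketched below; the file names and the fixed 1x512 input shapes are assumptions.

import migraphx as mgx

# Parse the ONNX export with fixed input shapes matching the server's
# padding to max_length=512 (batch size 1 here; adjust to your batching).
shapes = {
    "input_ids": [1, 512],
    "attention_mask": [1, 512],
    "token_type_ids": [1, 512],
}
prog = mgx.parse_onnx("model.onnx", map_input_dims=shapes)

# Compile for the ROCm GPU target and serialize as msgpack for mgx.load(...).
prog.compile(mgx.get_target("gpu"))
mgx.save(prog, "model.mxr", format="msgpack")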

Limitations

  • Sequence length: Documents are truncated to 512 tokens; quality assessment is based on the beginning of documents only
  • Language scope: Optimized for French and English; performance on other languages not evaluated
  • Subjectivity: Quality labels are synthetic, generated by an LLM, which may introduce biases from the teacher model

Model Card Authors

ALMAnaCH team, Inria Paris

Citation

If you use this model, please cite:

@misc{godey2025gaperonpepperedenglishfrenchgenerative,
      title={Gaperon: A Peppered English-French Generative Language Model Suite},
      author={Nathan Godey and Wissam Antoun and Rian Touchent and Rachel Bawden and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
      year={2025},
      eprint={2510.25771},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.25771},
}

Acknowledgments

This work was carried out by the ALMAnaCH team at Inria Paris over a 15-month period, supported by French public research funding and computational resources from national HPC clusters. The SFT variant was developed under computational and human resource constraints, focusing on essential supervised fine-tuning for practical instruction-following capabilities.
