Gaperon Quality Classifier
Gaperon Quality Classifier is a multilingual document quality classifier based on XLM-V base, fine-tuned to assess the quality of web-crawled documents in French and English. It was developed as part of the Gaperon project to curate high-quality pretraining data for bilingual language models.
Model Details
- Model Type: Text Classification (Document Quality)
- Architecture: XLM-V base
- Base Model: facebook/xlm-v-base
- Languages: French, English
- License: Apache 2.0
- Developed by: ALMAnaCH team, Inria Paris
- Output Labels: low, medium, high
- F1 Score: 75.11%
Intended Use
This classifier is designed for:
- Filtering large-scale web-crawled corpora for language model pretraining
- Assessing document quality based on linguistic and content criteria
- Sample weighting in pretraining data mixtures
Unlike educational-value classifiers (e.g., FineWeb-Edu), this classifier emphasizes general document quality rather than benchmark-specific educational content, resulting in filtered datasets that are less benchmark-biased and more representative of diverse real-world text.
Quality Criteria
The classifier was trained to evaluate documents on the following criteria:
| Criterion | Description |
|---|---|
| Content Accuracy | Factual reliability and use of credible sources |
| Clarity | Clear explanations, well-defined terms, logical flow |
| Coherence | Overall organization and logical progression |
| Grammar and Language | Correctness and audience appropriateness |
| Depth of Information | Level of detail and comprehensiveness |
| Overall Usefulness | Relevance and practical value for a general audience |
Training Data
Annotation Process
The classifier was trained on 500,000 annotated documents:
- 250,000 documents from RedPajama-V2-French (RPv2-Fr)
- 250,000 documents from TxT360-CC (English)
Synthetic Labeling
Document labels were generated using Llama-3.1-70B-Instruct, prompted to evaluate each document and assign a quality label (low, medium, or high) along with a short justification. Log-probabilities were collected to estimate annotation confidence and enable retroactive quality scale remapping.
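The annotation pipeline itself is not released with this card, but the use of log-probabilities can be illustrated with a minimal sketch. Assuming the teacher model's log-probability for each candidate label token (low/medium/high) was recorded per document, a confidence estimate and a remapped scalar quality score could be derived roughly as follows (all values, weights, and helper names below are illustrative, not the released code):

import numpy as np

# Hypothetical example: log-probabilities assigned by the teacher LLM to each
# candidate label for one document (values are illustrative only).
label_logprobs = {"low": -3.2, "medium": -0.4, "high": -1.6}

labels = ["low", "medium", "high"]
logps = np.array([label_logprobs[l] for l in labels])
probs = np.exp(logps - logps.max())
probs /= probs.sum()                     # normalized label distribution

confidence = probs.max()                 # how certain the annotator was
# Retroactive remapping: collapse the 3-point scale into a scalar score
# (0 = low, 0.5 = medium, 1 = high) without re-annotating any document.
scale = np.array([0.0, 0.5, 1.0])
quality_score = float(probs @ scale)

print(labels[int(probs.argmax())], round(float(confidence), 3), round(quality_score, 3))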
Prompt used to generate labels
Below is an extract from a web page. Evaluate the quality of the content based on the following factors:
1. Content Accuracy: Assess the correctness and reliability of the information presented. Consider the factual accuracy, use of credible sources (if mentioned), and absence of misinformation.
2. Clarity: Evaluate how well the information is communicated. Look for clear explanations, well-defined terms, and logical flow of ideas.
3. Coherence: Analyze the overall structure and organization of the content. Consider how well ideas are connected and if the content follows a logical progression.
4. Grammar and Language: Assess the quality of writing, including correct grammar, spelling, and punctuation. Consider the appropriateness of language for the intended audience.
5. Depth of Information: Evaluate the level of detail and thoroughness of the content. Consider whether it provides surface-level information or delves into more comprehensive explanations.
6. Overall Usefulness: Assess the practical value and relevance of the information for a general audience. Consider how applicable or helpful the content would be for someone seeking information on the topic.
Based on these factors, give an overall quality score of low, medium, or high.
Additionally, select one or more domains from the list below. Each domain listed is a single, combined category. Choose the most relevant domain(s). Domain(s) can only be chosen from the list below. Only select "Other" if none of the listed domains are applicable.
- Arts
- Business & Economics & Finance
- Culture & Cultural geography
- Daily Life & Home & Lifestyle
- Education
- Entertainment & Travel & Hobby
- Environment
- Food & Drink & Cooking
- Health & Wellness & Medicine
- Law & Justice
- Natural Science & Formal Science & Technology
- Personal Development & Human Resources & Career
- Politics & Government
- Religion & Spirituality
- Shopping & Commodity
- Society & Social Issues & Human Rights
- Sports
- Other (only if none of the above are relevant)
Additionally, identify the main topic of the extract, which can be any relevant subfield. Don't elaborate on the topic; just provide a concise classification.
Additionally, identify the document type, which can be article, blog post, forum post, or any other relevant type. Don't elaborate on the type; just provide a concise classification.
USER PROMPT:
The extract:
{DOCUMENT}
After examining the extract:
- Briefly justify your quality classification, up to 100 words on one line using the format: "Explanation: <justification>"
- Conclude with the quality classification using the format: "Quality score: <classification>" (on a separate line)
- Continue with the domain classification using the format: "Domain: <classification>, <classification>, ..." (on a separate line)
- Continue with the main topic or subject classification using the format: "Main topic: <classification>" (on a separate line)
- Continue with the document type classification using the format: "Document type: <classification>" (on a separate line)
Evaluate the content based on the quality factors outlined above.
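Because the requested response format is line-based ("Explanation:", "Quality score:", "Domain:", "Main topic:", "Document type:"), the teacher model's outputs can be turned into structured records with a small parser. The sketch below is an illustration assuming responses follow the requested format exactly; it is not the extraction code used for Gaperon.

import re

def parse_annotation(response: str) -> dict:
    """Parse one teacher-LLM response that follows the prompt's output format."""
    fields = {
        "explanation": r"^Explanation:\s*(.+)$",
        "quality": r"^Quality score:\s*(low|medium|high)",
        "domains": r"^Domain:\s*(.+)$",
        "main_topic": r"^Main topic:\s*(.+)$",
        "document_type": r"^Document type:\s*(.+)$",
    }
    record = {}
    for name, pattern in fields.items():
        match = re.search(pattern, response, flags=re.MULTILINE | re.IGNORECASE)
        record[name] = match.group(1).strip() if match else None
    if record["domains"]:
        record["domains"] = [d.strip() for d in record["domains"].split(",")]
    return record

example = (
    "Explanation: Clear, well-structured explanation with accurate details.\n"
    "Quality score: high\n"
    "Domain: Education, Natural Science & Formal Science & Technology\n"
    "Main topic: Photosynthesis\n"
    "Document type: article"
)
print(parse_annotation(example))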
Training Procedure
Training Details
- Task: Single-task quality classification
- Abandoned approach: Multitask learning (joint quality and domain prediction) was explored but underperformed single-task quality classification and was dropped
Performance
F1 Score: 75.11%
Confusion Matrix
| True \ Predicted | Low | Medium | High |
|---|---|---|---|
| Low | 922 | 463 | 77 |
| Medium | 203 | 5,219 | 623 |
| High | 32 | 531 | 1,930 |
Most errors occur between adjacent labels (e.g., medium vs. high/low), while confusion between extreme categories (high vs. low) is limited.
Usage
from transformers import pipeline

classifier = pipeline("text-classification", model="almanach/gaperon-quality-classifier")

documents = ["Your document text goes here."]
results = classifier(documents)

for result in results:
    print(f"Label: {result['label']}, Score: {result['score']}")
Deployment with a MIGraphX inference server (AMD ROCm) is also supported for optimized throughput; an example server, a Dockerfile, and a minimal client sketch are provided below.
Inference Server Code
import asyncio
import json
import logging
import logging.config
import os
import time
from typing import Dict, List, Optional

import migraphx as mgx
import numpy as np
import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoTokenizer

MAX_BATCH_SIZE = int(os.getenv("MAX_BATCH_SIZE", 512))

label_list = os.getenv("LABEL_LIST", "")
if not label_list:
    raise ValueError("LABEL_LIST environment variable is required")
elif "json" in label_list:
    # Loading the id2label mapping from a JSON config string
    id2label = json.loads(label_list)["id2label"]
    # Convert keys to int
    id2label = {int(k): v for k, v in id2label.items()}
    # List sorted by key
    label_list = [id2label[i] for i in sorted(id2label.keys())]
else:
    label_list = label_list.split(",")
assert len(label_list) > 0, "LABEL_LIST environment variable is required"
print(f"Label list: {label_list}")

MODEL_PATH = os.getenv("MODEL_PATH", None)
assert MODEL_PATH is not None, "MODEL_PATH environment variable is required"
TOKENIZER_PATH = os.getenv("TOKENIZER_PATH", None)
assert TOKENIZER_PATH is not None, "TOKENIZER_PATH environment variable is required"

model = mgx.load(MODEL_PATH, format="msgpack")
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH)

LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": True,
    "formatters": {
        "standard": {
            "format": "%(process)d %(asctime)s [%(levelname)s] %(name)s: %(message)s"
        },
    },
    "handlers": {
        "default": {
            "level": "INFO",
            "formatter": "standard",
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stdout",  # Default is stderr
        },
    },
    "loggers": {
        "": {  # root logger
            "level": "INFO",
            "handlers": ["default"],
            "propagate": False,
        },
        "uvicorn.error": {
            "level": "DEBUG",
            "handlers": ["default"],
        },
        "uvicorn.access": {
            "level": "WARNING",
            "handlers": ["default"],
        },
    },
}
logging.config.dictConfig(LOGGING_CONFIG)
logger = logging.getLogger(__name__)
logger.info("Starting FastAPI server...")
logger.info(f"Model path: {MODEL_PATH}")
logger.info(f"Tokenizer path: {TOKENIZER_PATH}")
logger.info(f"Label list: {label_list}")

app = FastAPI()


class InputData(BaseModel):
    text: str


class BatchInputData(BaseModel):
    texts: Optional[List[str]] = None
    input_ids: Optional[List[List[int]]] = None
    attention_mask: Optional[List[List[int]]] = None
    token_type_ids: Optional[List[List[int]]] = None
    is_pre_tokenized: bool = False


class LabelScore(BaseModel):
    label: str
    score: float


class BatchOutputData(BaseModel):
    results: List[List[LabelScore]]


def softmax(_outputs, axis=-1):
    maxes = np.max(_outputs, axis=axis, keepdims=True)
    shifted_exp = np.exp(_outputs - maxes)
    return shifted_exp / shifted_exp.sum(axis=axis, keepdims=True)


# Asynchronous function to tokenize the batch
async def tokenize_batch(texts):
    tokenized_batch = tokenizer(
        texts,
        truncation=True,
        padding="max_length",
        max_length=512,
        return_tensors="np",
        return_attention_mask=True,
        return_token_type_ids=True,
    )
    return {
        "input_ids": tokenized_batch["input_ids"],
        "attention_mask": tokenized_batch["attention_mask"],
        "token_type_ids": tokenized_batch["token_type_ids"],
    }


# Function to run model inference (blocking)
def run_inference(batch):
    logits = np.array(model.run(batch)).reshape(-1, len(label_list))
    return softmax(logits, axis=-1)


# Queues for tokenization and inference
tokenization_queue = asyncio.Queue()
inference_queue = asyncio.Queue()


# Consumer for inference
async def inference_consumer():
    while True:
        tokenized_batch, result_future = await inference_queue.get()
        try:
            # Run inference on the GPU
            result = run_inference(tokenized_batch)
            result_future.set_result(result)  # Set the result for the future
        except Exception as e:
            result_future.set_exception(e)
        finally:
            inference_queue.task_done()


# Consumer for tokenization
async def tokenization_consumer():
    while True:
        texts, result_future = await tokenization_queue.get()
        try:
            # Tokenize the batch asynchronously (CPU task)
            tokenized_batch = await tokenize_batch(texts)
            # Once tokenized, queue for inference (GPU task)
            await inference_queue.put((tokenized_batch, result_future))
        except Exception as e:
            result_future.set_exception(e)
        finally:
            tokenization_queue.task_done()


# Background tasks for the tokenization and inference consumers
@app.on_event("startup")
async def startup_event():
    asyncio.create_task(tokenization_consumer())
    asyncio.create_task(inference_consumer())


@app.post("/label")
async def label_text(data: BatchInputData):
    if data.is_pre_tokenized:
        # Validate pre-tokenized inputs
        if not all([data.input_ids, data.attention_mask, data.token_type_ids]):
            raise HTTPException(
                status_code=400,
                detail="When is_pre_tokenized is True, input_ids, attention_mask, and token_type_ids are required.",
            )
        # Ensure batch sizes are consistent
        batch_size = len(data.input_ids)
        if any(
            len(lst) != batch_size for lst in [data.attention_mask, data.token_type_ids]
        ):
            raise HTTPException(
                status_code=400,
                detail="All pre-tokenized inputs (input_ids, attention_mask, token_type_ids) must have the same batch size.",
            )
        # Package the pre-tokenized inputs for inference
        tokenized_batch = {
            "input_ids": np.array(data.input_ids, dtype=np.int64),
            "attention_mask": np.array(data.attention_mask, dtype=np.int64),
            "token_type_ids": np.array(data.token_type_ids, dtype=np.int64),
        }
        # Create a future for inference
        result_future = asyncio.get_event_loop().create_future()
        # Directly add the pre-tokenized data to the inference queue
        await inference_queue.put((tokenized_batch, result_future))
    else:
        # Validate and process texts for tokenization
        if not data.texts:
            raise HTTPException(
                status_code=400,
                detail="Texts field is required when is_pre_tokenized is False.",
            )
        if len(data.texts) > MAX_BATCH_SIZE:
            raise HTTPException(
                status_code=400, detail=f"Batch size is too large (> {MAX_BATCH_SIZE})"
            )
        # Create a future for tokenization and inference
        result_future = asyncio.get_event_loop().create_future()
        # Add the texts to the tokenization queue
        await tokenization_queue.put((data.texts, result_future))

    # Wait for the future result to be set (after tokenization and/or inference completes)
    predictions = await result_future

    # Process the results into the desired format
    results = [
        [LabelScore(label=label, score=score) for label, score in zip(label_list, pred)]
        for pred in predictions
    ]
    # Sort the results by score
    results = [
        sorted(result, key=lambda x: x.score, reverse=True) for result in results
    ]
    return {"results": results}


@app.get("/health")
def health():
    # Check whether the current SLURM job is ending soon (within 5 minutes)
    slurm_job_end_time = os.getenv("SLURM_JOB_END_TIME", None)
    if slurm_job_end_time is not None:
        slurm_job_end_time = int(slurm_job_end_time)
        if slurm_job_end_time - time.time() < 300:
            return {"status": "ending"}
    return {"status": "ok"}


@app.get("/get_job_info")
def get_job_info():
    job_info = {}
    for key in os.environ:
        if key.startswith("SLURM_"):
            job_info[key] = os.getenv(key)
    return job_info


# Run directly (assumes this file is saved as app.py):
if __name__ == "__main__":
    uvicorn.run("app:app", host="0.0.0.0", port=8000, reload=True)
Dockerfile for inference server:
FROM rocm/pytorch:rocm6.0_ubuntu20.04_py3.9_pytorch_2.1.1
ARG ONNXRUNTIME_REPO=https://github.com/Microsoft/onnxruntime
ARG ONNXRUNTIME_BRANCH=v1.17.3
ENV PATH /code/cmake-3.27.3-linux-x86_64/bin:${PATH}
RUN apt-get update &&\
apt-get install -y migraphx
WORKDIR /install_dir
# Prepare onnxruntime repository & build onnxruntime
RUN git clone --single-branch --branch ${ONNXRUNTIME_BRANCH} --recursive ${ONNXRUNTIME_REPO} onnxruntime &&\
/bin/sh onnxruntime/dockerfiles/scripts/install_common_deps.sh &&\
cd onnxruntime && pip install --upgrade pip &&\
/bin/sh ./build.sh --allow_running_as_root --cmake_extra_defines ONNXRUNTIME_VERSION=`cat ./VERSION_NUMBER` --config Release --parallel \
--skip_tests --build_wheel --use_rocm --rocm_version=${ROCM_VERSION} --rocm_home /opt/rocm --use_migraphx && \
pip install /install_dir/onnxruntime/build/Linux/Release/dist/*.whl
RUN pip install --upgrade --upgrade-strategy eager optimum[amd]==1.22.0 fastapi[standard]
WORKDIR /workspace
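Once the server is running (assumed here on localhost:8000), the /label endpoint defined above accepts a JSON body with a texts field (or pre-tokenized inputs) and returns per-document label/score pairs sorted by score. A minimal client sketch using requests:

import requests

# Assumes the inference server above is reachable on localhost:8000.
payload = {"texts": ["Your document text goes here.", "Another document."]}
response = requests.post("http://localhost:8000/label", json=payload, timeout=60)
response.raise_for_status()

for doc_results in response.json()["results"]:
    best = doc_results[0]  # results are sorted by score, best first
    print(f"{best['label']}: {best['score']:.3f}")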
Limitations
- Sequence length: Documents are truncated to 512 tokens, so quality is assessed from the beginning of each document only (a possible chunk-and-aggregate workaround is sketched after this list)
- Language scope: Optimized for French and English; performance on other languages not evaluated
- Subjectivity: Quality labels are synthetic, generated by an LLM, which may introduce biases from the teacher model
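If whole-document scores are needed despite the 512-token limit, one possible workaround (not part of the released pipeline) is to score overlapping token windows and aggregate the chunk-level predictions. The sketch below assumes the tokenizer is bundled in the model repository; the window size, stride, and aggregation policy are illustrative choices.

from transformers import AutoTokenizer, pipeline

classifier = pipeline("text-classification", model="almanach/gaperon-quality-classifier")
tokenizer = AutoTokenizer.from_pretrained("almanach/gaperon-quality-classifier")

def score_long_document(text: str, window: int = 510, stride: int = 384):
    """Score overlapping token windows of a long document."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    starts = list(range(0, max(len(ids) - window, 0) + 1, stride))
    if starts[-1] + window < len(ids):
        starts.append(len(ids) - window)  # make sure the tail is covered
    chunks = [tokenizer.decode(ids[s:s + window]) for s in starts]
    return classifier(chunks, truncation=True)

chunk_preds = score_long_document("Your long document text goes here. " * 200)
labels = [p["label"] for p in chunk_preds]
print(labels)  # aggregate e.g. by majority vote or by averaging scores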
Related Models
- Gaperon-1125-1.5B-SFT - 1.5B parameter bilingual LM
- Gaperon-1125-8B-SFT - 8B parameter bilingual LM
- Gaperon-1125-24B-SFT - 24B parameter bilingual LM
Model Card Authors
ALMAnaCH team, Inria Paris
Additional Resources
- 🔗 GitHub: https://github.com/NathanGodey/gapetron
- 📄 Paper: https://arxiv.org/abs/2510.25771
- 🔧 Evaluation Tools: https://gitlab.inria.fr/almanach/lm-evaluation-harness-gaperon
Citation
If you use this model, please cite:
@misc{godey2025gaperonpepperedenglishfrenchgenerative,
title={Gaperon: A Peppered English-French Generative Language Model Suite},
author={Nathan Godey and Wissam Antoun and Rian Touchent and Rachel Bawden and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
year={2025},
eprint={2510.25771},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.25771},
}
Acknowledgments
This work was carried out by the ALMAnaCH team at Inria Paris over a 15-month period, supported by French public research funding and computational resources from national HPC clusters. The SFT variants of the Gaperon models were developed under computational and human-resource constraints, focusing on essential supervised fine-tuning for practical instruction-following capabilities.