finerweb-multilabel-classifier-xlmr-4o
This model is a fine-tuned version of FacebookAI/xlm-roberta-base on the FiNERweb dataset. It was presented in the paper FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition.
It achieves the following results on the evaluation set:
- Loss: 0.3645
- Precision: 0.5930
- Recall: 0.4813
- F1 Macro: 0.5091
- Accuracy: 0.6488
Model description
The finerweb-multilabel-classifier-xlmr-4o model is a component of the FiNERweb dataset-creation pipeline, which aims to scale the teacher-student paradigm for Named Entity Recognition (NER) to 91 languages and 25 scripts. This specific model, based on the XLM-RoBERTa-base architecture, functions as a regression model (or multilabel classifier) trained to identify NER-relevant passages from text. It plays a crucial role in the pipeline by pre-filtering large corpora, enabling more efficient annotation by multilingual LLMs and the creation of systematic, reusable resources for multilingual NER.
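As a rough sketch of this pre-filtering role, the snippet below scores a small batch of passages and keeps those whose highest label score clears a threshold. The sigmoid scoring and the 0.5 cutoff are illustrative assumptions, not the pipeline's documented decision rule.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "whoisjones/finerweb-multilabel-classifier-xlmr-4o"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.eval()

passages = [
    "Kraft Foods has taken the Cadbury chocolate brand in a new direction.",
    "Viewing Single Post From: Spoilers for the Week of February 11th",
]

inputs = tokenizer(passages, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Assumption: treat sigmoid-activated logits as per-label relevance scores
# and keep passages whose best score exceeds 0.5.
scores = torch.sigmoid(logits)
keep = scores.max(dim=-1).values > 0.5
print([p for p, k in zip(passages, keep.tolist()) if k])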
Intended uses & limitations
This model is intended for identifying passages relevant to Named Entity Recognition across multiple languages, acting as an upstream component in a larger data-generation and annotation pipeline. It can facilitate the creation of synthetic supervision for student models in zero-shot transfer settings across various languages.
Limitations:
- The model's primary function is passage classification (identifying whether a passage contains NER-relevant information), not performing token-level Named Entity Recognition itself.
- While designed for multilingual contexts, the paper notes that the performance of current state-of-the-art models can drop when evaluated using target language labels instead of English ones, suggesting potential variations in efficacy across different languages and scripts.
Training and evaluation data
This model was trained on data generated by the FiNERweb dataset-creation pipeline. The pipeline starts from FineWeb-Edu, uses regression models such as this one to identify NER-relevant passages, and then annotates those passages with multilingual LLMs. The resulting FiNERweb dataset comprises approximately 225k passages with 235k distinct entity labels.
The FiNERweb datasets are available on the Hugging Face Hub (see the loading sketch after this list):
- FiNERweb
- FiNERweb-x (with translated labels)
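Both datasets can be loaded with the datasets library. The Hub repository IDs below are assumptions inferred from the dataset names; check the project's Hub organization for the canonical paths.
from datasets import load_dataset

# Hypothetical repository IDs inferred from the dataset names; verify on the Hub.
finerweb = load_dataset("whoisjones/FiNERweb")
finerweb_x = load_dataset("whoisjones/FiNERweb-x")  # variant with translated labels
print(finerweb)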
How to use
You can load and use this model with the Hugging Face transformers library to classify text passages for NER relevance.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained("whoisjones/finerweb-multilabel-classifier-xlmr-4o")
tokenizer = AutoTokenizer.from_pretrained("whoisjones/finerweb-multilabel-classifier-xlmr-4o")
good_example = """'Kraft Foods has taken the Cadbury chocolate brand in a new direction, by combining it with cheese for the first time.
The company is bringing together two of its brands and launching Philadelphia with Cadbury, a chilled chocolate spread made from Philadelphia Light and Cadbury chocolate.
Kraft believes the new product has the potential to do very well and is targeting £10m in sales in the first year.
The new cheese and chocolate spread is being launched on 1 February and will be appear in the chilled dairy aisle next to plain Philadelphia Light.
It is launching in a 160g tub and a 120g four-pack of mini tubs, both with an rsp of £1.62.
Kraft is supporting the launch with a £3.2m marketing budget in 2012 and is targeting 2,000 tonnes in volume sales – equivalent to about £10m – in the first year.
If they reached this volume of sales, the new Philadelphia with Cadbury would have the same market value as Garlic & Herb, currently the biggest-selling flavour in the Philadelphia portfolio.
Kraft already offers chocolate variants of Philadelphia in Italy and Germany, using Milka chocolate and targeting the breakfast occasion.
In Germany, Philadelphia with Milka has generated €22.2m in sales since its October 2010 launch and has a 6.6% value share of the chocolate spread market.
Kraft Foods UK marketing manager Bruce Newman said:
“The UK product would be positioned as a snack.
“The breakfast market in countries such as Germany is more developed, and our consumer research firmly identified Philadelphia with Cadbury as a snack.”'"""
bad_example = """'|Viewing Single Post From: Spoilers for the Week of February 11th| |Lil||Feb 1 2013, 09:58 AM| Don\'t care about Chloe/Taniel/Jen-Jen . Don\'t care about Sami, really, but hoping that we get some good "SAMANTHA GENE!!" Marlena Death-Stares out of it . And "newfound" feelings . Please . If only . STEFANO!! STEFANO, STEFANO, STEFANO!!!!: cheer: |Spoilers for the Week of February 11th · DAYS: News, Spoilers & Discussion|'"""
# Tokenize both passages, truncating to the model's maximum input length.
good_example_inputs = tokenizer(good_example, return_tensors="pt", truncation=True)
bad_example_inputs = tokenizer(bad_example, return_tensors="pt", truncation=True)

# Run inference without tracking gradients.
with torch.no_grad():
    good_example_outputs = model(**good_example_inputs)
    bad_example_outputs = model(**bad_example_inputs)

# Inspect the raw logits (one score per label).
print(good_example_outputs.logits)
print(bad_example_outputs.logits)
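Because the head is multilabel, a sigmoid over the logits yields per-label scores. The conversion below follows the usual convention for multilabel heads, though the exact decision rule used in the FiNERweb pipeline may differ.
# Assumption: interpret sigmoid-activated logits as independent per-label probabilities.
good_scores = torch.sigmoid(good_example_outputs.logits)
bad_scores = torch.sigmoid(bad_example_outputs.logits)
print(good_scores)
print(bad_scores)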
Training procedure
Training hyperparameters
The following hyperparameters were used during training; a reconstructed TrainingArguments sketch follows the list:
- learning_rate: 5e-05
- train_batch_size: 64
- eval_batch_size: 128
- seed: 0
- optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 20
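For reference, these settings correspond roughly to the following transformers TrainingArguments. This is a reconstruction from the values listed above, not the original training script; output_dir is a placeholder.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="finerweb-multilabel-classifier-xlmr-4o",  # placeholder
    learning_rate=5e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=128,
    seed=0,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=20,
)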
Training results
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 Macro | Accuracy |
|:-------------:|:-------:|:-----:|:---------------:|:---------:|:------:|:--------:|:--------:|
| No log | 0 | 0 | 8.1215 | 0.0002 | 0.2 | 0.0004 | 0.0010 |
| 0.2936 | 0.7812 | 1000 | 0.2877 | 0.5095 | 0.4544 | 0.4485 | 0.6647 |
| 0.2516 | 1.5625 | 2000 | 0.3197 | 0.5299 | 0.3886 | 0.4136 | 0.6293 |
| 0.1737 | 2.3438 | 3000 | 0.2922 | 0.5089 | 0.4296 | 0.4518 | 0.6516 |
| 0.0996 | 3.125 | 4000 | 0.3233 | 0.5012 | 0.4379 | 0.4585 | 0.6369 |
| 0.1051 | 3.9062 | 5000 | 0.3108 | 0.5042 | 0.4401 | 0.4609 | 0.6496 |
| 0.0601 | 4.6875 | 6000 | 0.3501 | 0.4840 | 0.4501 | 0.4614 | 0.6411 |
| 0.0588 | 5.4688 | 7000 | 0.3554 | 0.4758 | 0.4585 | 0.4658 | 0.6327 |
| 0.0385 | 6.25 | 8000 | 0.3527 | 0.4853 | 0.4518 | 0.4647 | 0.6331 |
| 0.0312 | 7.0312 | 9000 | 0.3475 | 0.4912 | 0.4580 | 0.4714 | 0.6415 |
| 0.0314 | 7.8125 | 10000 | 0.3442 | 0.4983 | 0.4557 | 0.4679 | 0.6551 |
| 0.0211 | 8.5938 | 11000 | 0.3543 | 0.4823 | 0.4740 | 0.4769 | 0.6449 |
| 0.0193 | 9.375 | 12000 | 0.3585 | 0.4816 | 0.4621 | 0.4697 | 0.6438 |
| 0.0208 | 10.1562 | 13000 | 0.3730 | 0.4825 | 0.4588 | 0.4659 | 0.6207 |
| 0.0203 | 10.9375 | 14000 | 0.3578 | 0.4903 | 0.4748 | 0.4818 | 0.6468 |
| 0.0154 | 11.7188 | 15000 | 0.3513 | 0.4994 | 0.4591 | 0.4744 | 0.6557 |
| 0.0126 | 12.5 | 16000 | 0.3623 | 0.4920 | 0.4488 | 0.4649 | 0.6460 |
| 0.0087 | 13.2812 | 17000 | 0.3632 | 0.4940 | 0.4512 | 0.4675 | 0.6449 |
| 0.0091 | 14.0625 | 18000 | 0.3518 | 0.6976 | 0.4793 | 0.5121 | 0.6533 |
| 0.009 | 14.8438 | 19000 | 0.3569 | 0.5918 | 0.4886 | 0.5128 | 0.6527 |
| 0.0056 | 15.625 | 20000 | 0.3672 | 0.5532 | 0.4882 | 0.5081 | 0.6457 |
| 0.0045 | 16.4062 | 21000 | 0.3655 | 0.5391 | 0.4870 | 0.5060 | 0.6460 |
| 0.0035 | 17.1875 | 22000 | 0.3646 | 0.4955 | 0.4634 | 0.4765 | 0.6489 |
| 0.0032 | 17.9688 | 23000 | 0.3631 | 0.6942 | 0.4841 | 0.5150 | 0.6503 |
| 0.0035 | 18.75 | 24000 | 0.3625 | 0.5986 | 0.4805 | 0.5103 | 0.6504 |
| 0.0019 | 19.5312 | 25000 | 0.3645 | 0.5930 | 0.4813 | 0.5091 | 0.6488 |
Framework versions
- Transformers 4.49.0
- Pytorch 2.6.0+cu124
- Datasets 3.3.2
- Tokenizers 0.21.1
Paper
More details about this model and the FiNERweb project can be found in the official paper: FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition.
Code
The code for the FiNERweb project, including the implementation for training and using these regression models, is available on GitHub: https://github.com/whoisjones/FiNERweb-code.
Citation
If you find our work useful, please consider citing our paper:
@misc{golde2025finerwebdatasetsartifactsscalable,
  title={FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition},
  author={Jonas Golde and Patrick Haller and Alan Akbik},
  year={2025},
  eprint={2512.13884},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.13884},
}