---
language:
- he
license: cc-by-sa-4.0
tags:
- text-classification
- profanity-detection
- toxicity
- hebrew
- bert
- alephbert
library_name: transformers
base_model: onlplab/alephbert-base
pipeline_tag: text-classification
datasets:
- custom
metrics:
- accuracy
- precision
- recall
- f1
---
# OpenCensor-H1-Mini
**OpenCensor-H1-Mini** is a lightweight version of **OpenCensor-H1** that detects profanity, toxicity, and offensive content in Hebrew text. It is fine-tuned from the `onlplab/alephbert-base` checkpoint.
## Model Details
- **Model Name:** OpenCensor-H1-Mini
- **Base Model:** `onlplab/alephbert-base`
- **Task:** Binary Classification (0 = Clean, 1 = Toxic/Profane)
- **Language:** Hebrew
- **Max Sequence Length:** 256 tokens (optimized for efficiency)
## Performance
| Metric | Score |
| :--- | :--- |
| **Accuracy** | 0.9826 |
| **F1-Score** | 0.9823 |
| **Precision** | 0.9812 |
| **Recall** | 0.9835 |
*Note: the best decision threshold is 0.17. A text is classified as toxic when its sigmoid score is at least 0.17.*
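
The card does not say how the 0.17 threshold was chosen. A common approach, shown here only as a sketch, is to sweep candidate thresholds over validation scores and keep the one with the highest F1. The toy labels and scores, and the helper names `precision_recall_f1` and `best_threshold`, are hypothetical and not part of this repository. The last lines recompute F1 from the precision and recall in the table above as the harmonic mean of the two.

```python
def precision_recall_f1(labels, scores, threshold):
    """Compute precision, recall and F1 when flagging scores >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def best_threshold(labels, scores, candidates):
    """Return the first candidate threshold with the highest F1."""
    return max(candidates, key=lambda t: precision_recall_f1(labels, scores, t)[2])


# Hypothetical toy validation set (not the model's real data)
val_labels = [0, 0, 0, 1, 1, 1]
val_scores = [0.02, 0.05, 0.30, 0.15, 0.60, 0.90]
candidates = [i / 100 for i in range(1, 100)]

best = best_threshold(val_labels, val_scores, candidates)
print(best, precision_recall_f1(val_labels, val_scores, best))

# F1 recomputed from the precision and recall reported in the table
reported_f1 = 2 * 0.9812 * 0.9835 / (0.9812 + 0.9835)
print(round(reported_f1, 4))
```

On the toy data the best threshold lands well below 0.5, which mirrors why a low threshold such as 0.17 can be the better choice when recall matters.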
### Training Graphs
| Validation F1 | Threshold Analysis |
| :---: | :---: |
| ![Validation F1](valf1perepoch.png) | ![Thresholds](thresholdsperepoch.png) |
![Final Test Metrics](testmetrics.png)
## How to Use
You can use this model directly with the Hugging Face `transformers` library.
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and model, and put the model in evaluation mode
model_id = "LikoKIko/OpenCensor-H1-Mini"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()


def predict(text):
    # Tokenize input
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=256
    )
    # Predict
    with torch.no_grad():
        logits = model(**inputs).logits
    score = torch.sigmoid(logits).item()
    return {
        "text": text,
        "score": round(score, 4),
        "is_toxic": score >= 0.17  # Threshold
    }


# Example usage
text = "אני אוהב את כולם"  # "I love everyone"
print(predict(text))
```
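Because the sigmoid is monotonic, comparing the score against 0.17 is equivalent to comparing the raw logit against `log(0.17 / 0.83)`. The sketch below illustrates this with plain Python and no model download. The helper names `logit` and `is_toxic_from_logit` and the sample logits are hypothetical, chosen only for illustration.

```python
import math

THRESHOLD = 0.17  # decision threshold stated in this model card


def sigmoid(x: float) -> float:
    """Map a raw logit to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float) -> float:
    """Inverse of the sigmoid: map a probability back to a raw logit."""
    return math.log(p / (1.0 - p))


# The same decision boundary, expressed in logit space
LOGIT_THRESHOLD = logit(THRESHOLD)


def is_toxic_from_logit(raw_logit: float) -> bool:
    """Flag a text directly from its raw logit, without computing the sigmoid."""
    return raw_logit >= LOGIT_THRESHOLD


# Hypothetical logits, for illustration only
for raw in (-3.0, -1.0, 2.0):
    print(raw, round(sigmoid(raw), 4), is_toxic_from_logit(raw))
```

Since the threshold is below 0.5, the boundary in logit space is negative, so a text with a mildly negative logit can still be flagged.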
## Training Info
The model was trained using an optimized pipeline featuring:
- **Gradient Accumulation:** Ensures stable training with larger effective batch sizes.
- **Smart Text Cleaning:** Removes noise while preserving Hebrew, English, and important symbols (`@#$%*`).
- **Dynamic Padding:** Uses efficient token lengths based on data distribution.
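
The exact cleaning rules are not published in this card. The following is a minimal sketch of what such a step could look like, assuming a whitelist of the Hebrew Unicode block, ASCII letters, digits, whitespace, and the symbols `@#$%*`. Keeping digits and dropping all other punctuation are assumptions, and the function name `clean_text` is hypothetical.

```python
import re

# Assumed whitelist: Hebrew block (U+0590 to U+05FF), ASCII letters, digits,
# whitespace, and the symbols @ # $ % * mentioned in this card.
_DISALLOWED = re.compile(r"[^\u0590-\u05FFA-Za-z0-9\s@#$%*]")
_WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Replace disallowed characters with spaces, then collapse whitespace."""
    text = _DISALLOWED.sub(" ", text)
    return _WHITESPACE.sub(" ", text).strip()


# Masking symbols survive, trailing punctuation is dropped
print(clean_text("f*** $#@% you!!"))
```

Preserving symbols such as `*` and `@` matters for this task because masked spellings of profanity often rely on them.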
## License
CC-BY-SA-4.0
## Citation
```bibtex
@misc{opencensor-h1-mini,
title = {OpenCensor-H1-Mini: Hebrew Profanity Detection Model},
author = {LikoKIko},
year = {2025},
url = {https://huggingface.co/LikoKIko/OpenCensor-H1-Mini}
}
```