---
language:
- he
license: cc-by-sa-4.0
tags:
- text-classification
- profanity-detection
- toxicity
- hebrew
- bert
- alephbert
library_name: transformers
base_model: onlplab/alephbert-base
pipeline_tag: text-classification
datasets:
- custom
metrics:
- accuracy
- precision
- recall
- f1
---

# OpenCensor-H1-Mini

**OpenCensor-H1-Mini** is a lightweight, efficient version of **OpenCensor-H1**, designed to detect profanity, toxicity, and offensive content in Hebrew text. It is fine-tuned from `onlplab/alephbert-base`.

## Model Details

- **Model Name:** OpenCensor-H1-Mini
- **Base Model:** `onlplab/alephbert-base`
- **Task:** Binary Classification (0 = Clean, 1 = Toxic/Profane)
- **Language:** Hebrew
- **Max Sequence Length:** 256 tokens (optimized for efficiency)

## Performance

| Metric | Score |
| :--- | :--- |
| **Accuracy** | 0.9826 |
| **F1-Score** | 0.9823 |
| **Precision** | 0.9812 |
| **Recall** | 0.9835 |

*Note: best decision threshold = 0.17*
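
The 0.17 threshold comes from a threshold analysis on validation data (see the graphs below). The following is a minimal sketch of how such a sweep could be reproduced; the `val_scores`/`val_labels` arrays and the use of scikit-learn are illustrative assumptions, not part of the released code.

```python
# Hedged sketch: sweep candidate thresholds over validation sigmoid scores
# and keep the one with the highest F1. `val_scores` / `val_labels` are
# hypothetical placeholders for your own validation data.
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(val_scores, val_labels):
    scores = np.asarray(val_scores)
    candidates = np.linspace(0.01, 0.99, 99)
    f1s = [f1_score(val_labels, (scores >= t).astype(int)) for t in candidates]
    i = int(np.argmax(f1s))
    return candidates[i], f1s[i]
```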

### Training Graphs

| Validation F1 | Threshold Analysis |
| :---: | :---: |
|  |  |



## How to Use

You can use this model directly with the Hugging Face `transformers` library.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the model
model_id = "LikoKIko/OpenCensor-H1-Mini"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

def predict(text):
    # Tokenize input
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=256
    )

    # Predict: sigmoid over the single logit gives a toxicity score in [0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits
        score = torch.sigmoid(logits).item()

    return {
        "text": text,
        "score": round(score, 4),
        "is_toxic": score >= 0.17  # Best threshold
    }

# Example usage
text = "אני אוהב את כולם"  # "I love everyone"
print(predict(text))
```
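
For scoring many texts at once, you can batch the inputs with the same tokenizer and model. This is a minimal sketch under the same single-logit assumption as above; the `predict_batch` helper is illustrative and not part of the model release.

```python
# Hedged sketch: batched scoring with the tokenizer/model loaded above.
# `predict_batch` is a hypothetical helper, not an official API.
def predict_batch(texts, threshold=0.17):
    inputs = tokenizer(
        texts,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=256,
    )
    with torch.no_grad():
        logits = model(**inputs).logits
        scores = torch.sigmoid(logits).squeeze(-1)  # one score per text
    return [
        {"text": t, "score": round(s, 4), "is_toxic": s >= threshold}
        for t, s in zip(texts, scores.tolist())
    ]
```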

## Training Info

The model was trained using an optimized pipeline featuring:

- **Gradient Accumulation:** Ensures stable training with larger effective batch sizes.
- **Smart Text Cleaning:** Removes noise while preserving Hebrew, English, and important symbols (`@#$%*`); a sketch of this kind of cleaning step appears after this list.
- **Dynamic Padding:** Uses efficient token lengths based on data distribution.
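
The exact training script is not published here, so the following is only a rough sketch of how these three features typically look with the `transformers` `Trainer` API. The hyperparameter values, the `clean_text` helper, and the regex character class are illustrative assumptions, not the released pipeline.

```python
# Hedged sketch of the pipeline features listed above; values are illustrative.
import re

from transformers import (
    AutoTokenizer,
    DataCollatorWithPadding,
    TrainingArguments,
)

# Smart text cleaning (assumed form): keep Hebrew, English, digits, whitespace,
# basic punctuation, and the symbols @#$%*; replace everything else with a space.
_DROP = re.compile(r"[^\u0590-\u05FFa-zA-Z0-9\s.,!?@#$%*]")

def clean_text(text: str) -> str:
    return re.sub(r"\s+", " ", _DROP.sub(" ", text)).strip()

tokenizer = AutoTokenizer.from_pretrained("onlplab/alephbert-base")

# Dynamic padding: each batch is padded only to its longest example.
collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Gradient accumulation: effective batch size = 16 x 4 = 64 (illustrative numbers).
args = TrainingArguments(
    output_dir="opencensor-h1-mini",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-5,
)

# Trainer(model=..., args=args, data_collator=collator,
#         train_dataset=..., eval_dataset=...) would tie these pieces together.
```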

## License

CC-BY-SA-4.0

## Citation

```bibtex
@misc{opencensor-h1-mini,
  title  = {OpenCensor-H1-Mini: Hebrew Profanity Detection Model},
  author = {LikoKIko},
  year   = {2025},
  url    = {https://huggingface.co/LikoKIko/OpenCensor-H1-Mini}
}
```