---
language:
- he
license: cc-by-sa-4.0
tags:
- text-classification
- profanity-detection
- toxicity
- hebrew
- bert
- alephbert
library_name: transformers
base_model: onlplab/alephbert-base
pipeline_tag: text-classification
datasets:
- custom
metrics:
- accuracy
- precision
- recall
- f1
---

# OpenCensor-H1-Mini

**OpenCensor-H1-Mini** is a lightweight, efficient version of **OpenCensor-H1**, designed to detect profanity, toxicity, and offensive content in Hebrew text. It is fine-tuned from the `onlplab/alephbert-base` checkpoint.

## Model Details

- **Model Name:** OpenCensor-H1-Mini
- **Base Model:** `onlplab/alephbert-base`
- **Task:** Binary Classification (0 = Clean, 1 = Toxic/Profane)
- **Language:** Hebrew
- **Max Sequence Length:** 256 tokens (optimized for efficiency)

## Performance

| Metric | Score |
| :--- | :--- |
| **Accuracy** | 0.9826 |
| **F1-Score** | 0.9823 |
| **Precision** | 0.9812 |
| **Recall** | 0.9835 |

*Note: Best decision threshold = 0.17 (used in the usage example below).*

### Training Graphs

| Validation F1 | Threshold Analysis |
| :---: | :---: |
| ![Validation F1](valf1perepoch.png) | ![Thresholds](thresholdsperepoch.png) |

![Final Test Metrics](testmetrics.png)

## How to Use

You can use this model directly with the Hugging Face `transformers` library.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the model and tokenizer
model_id = "LikoKIko/OpenCensor-H1-Mini"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

def predict(text):
    # Tokenize the input (the model was trained with a 256-token limit)
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=256
    )

    # Run inference without tracking gradients
    with torch.no_grad():
        logits = model(**inputs).logits
        score = torch.sigmoid(logits).item()

    return {
        "text": text,
        "score": round(score, 4),
        "is_toxic": score >= 0.17  # Best threshold (see Performance)
    }

# Example usage
text = "אני אוהב את כולם"  # "I love everyone"
print(predict(text))
```

A batch variant of this helper is sketched in the appendix at the end of this card.

## Training Info

The model was trained using an optimized pipeline featuring:

- **Gradient Accumulation:** Enables larger effective batch sizes for stable training (see the sketch in the appendix).
- **Smart Text Cleaning:** Removes noise while preserving Hebrew, English, and important symbols (`@#$%*`); an illustrative sketch appears in the appendix.
- **Dynamic Padding:** Pads batches to efficient token lengths based on the data distribution.

## License

CC-BY-SA-4.0

## Citation

```bibtex
@misc{opencensor-h1-mini,
  title  = {OpenCensor-H1-Mini: Hebrew Profanity Detection Model},
  author = {LikoKIko},
  year   = {2025},
  url    = {https://huggingface.co/LikoKIko/OpenCensor-H1-Mini}
}
```
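## Appendix: Batch Inference Sketch

The `predict` helper above scores one text at a time. The sketch below extends it to batches, assuming the model exposes a single toxicity logit, as the `sigmoid(...).item()` call in the usage example implies. The `predict_batch` name and batching logic are illustrative, not part of the released code.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "LikoKIko/OpenCensor-H1-Mini"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

def predict_batch(texts, threshold=0.17):
    # Tokenize all texts together; padding=True pads to the longest text in the batch
    inputs = tokenizer(
        texts,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=256,
    )
    with torch.no_grad():
        logits = model(**inputs).logits             # shape: (batch_size, 1), assuming a single-logit head
        scores = torch.sigmoid(logits).squeeze(-1)  # one toxicity score per text
    return [
        {"text": t, "score": round(s.item(), 4), "is_toxic": s.item() >= threshold}
        for t, s in zip(texts, scores)
    ]

# Example usage: "I love everyone", "another text"
print(predict_batch(["אני אוהב את כולם", "טקסט נוסף"]))
```

Padding to the longest text in each batch (rather than always to 256 tokens) mirrors the dynamic padding described under Training Info.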
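## Appendix: Gradient Accumulation Sketch

The gradient accumulation mentioned under Training Info follows a standard PyTorch pattern: accumulate gradients over several micro-batches, then apply one optimizer step. The toy model, data, and `accum_steps` value below are illustrative assumptions, not the released training code.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy setup so the accumulation pattern is runnable end to end (not the real training data)
model = nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
data = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64, 1)).float())
loader = DataLoader(data, batch_size=8)

accum_steps = 4  # effective batch size = 8 * 4 = 32
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accum_steps  # scale so accumulated gradients average
    loss.backward()                            # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()                       # one update per effective batch
        optimizer.zero_grad()
```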
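## Appendix: Text Cleaning Sketch

The "Smart Text Cleaning" step is described but not published with the model. Below is a minimal sketch of what such a cleaner might look like, assuming it keeps Hebrew letters (Unicode block U+0590 to U+05FF), Latin letters, digits, whitespace, basic punctuation, and the `@#$%*` symbols the card says were preserved. The `clean_text` function and the exact character set are assumptions for illustration.

```python
import re

# Keep: Hebrew block, Latin letters, digits, whitespace, basic punctuation,
# and the symbols the card says were preserved (@#$%*).
# The exact rules used in training are not published; this is an illustration.
_DROP = re.compile(r"[^\u0590-\u05FFa-zA-Z0-9\s.,!?@#$%*]+")
_SPACES = re.compile(r"\s+")

def clean_text(text: str) -> str:
    text = _DROP.sub(" ", text)            # replace disallowed characters with a space
    return _SPACES.sub(" ", text).strip()  # collapse runs of whitespace

print(clean_text("שלום!!! 😀 hello @user #tag"))  # -> "שלום!!! hello @user #tag"
```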