DeBERTa-v3-xsmall – Prose Sophistication Classifier
This model is a DeBERTa-v3-xsmall checkpoint fine-tuned to classify short- to mid-length prose chunks as "sophisticated" or "simple" for the Prose Sophistication project.
Model: deberta-v3-xsmall (fine-tuned)
Important: This model was trained with the label encoding:
- 0 = sophisticated
- 1 = simple
Make sure to map outputs accordingly when using the model in downstream code.
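Because the encoding is the reverse of the common "1 = positive class" convention, it is easy to flip the labels by accident. A minimal sketch of the mapping (the dictionary and helper names here are illustrative, not part of the model's config):

```python
# Label encoding used at training time: 0 = sophisticated, 1 = simple
ID2LABEL = {0: "sophisticated", 1: "simple"}

def decode_prediction(class_id: int) -> str:
    """Map a raw predicted class id to its string label."""
    return ID2LABEL[class_id]
```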
Intended uses
- Filtering / mining literary corpora for passages that match a "sophisticated" style.
- Human-in-the-loop curation where high precision on sophisticated passages is required.
Not intended for
- Judging authors or people. This classifier is stylistic and dataset-dependent.
- Sensitive decisions, profiling, or any application that requires full fairness guarantees.
Training data
- Dataset: data/data-snapshot-mixed-20250114.jsonl (snapshot prepared for this project)
- Total samples (approx.): 16,006
- Split: 85% train / 15% validation
Evaluation
- Validation accuracy: 97.92%
- Validation F1 (macro): 97.63%
- Real-book sample (30 chunks from 10 books): 100% detection of sophisticated passages (high confidence)
These metrics were produced during internal evaluation. See repo training_results.json for full logs and per-model results.
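Macro F1 averages the per-class F1 scores with equal weight, so a strong score here means the model is not just riding the majority class. A small pure-Python sketch of the metric for the two-class case (for illustration only; the reported numbers come from the repo's evaluation scripts):

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```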
How to use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_path = "models/deberta_v3_xsmall_new"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()

def classify(text, max_length=512, device='cpu'):
    # Keep model and inputs on the same device
    model.to(device)
    inputs = tokenizer(text, truncation=True, padding=True,
                       max_length=max_length, return_tensors='pt')
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        out = model(**inputs)
    probs = torch.softmax(out.logits, dim=1)
    # Model label encoding: 0 = sophisticated, 1 = simple
    prob_soph = probs[0, 0].item()
    prob_simple = probs[0, 1].item()
    return {
        'is_sophisticated': prob_soph > prob_simple,
        'prob_sophisticated': prob_soph,
        'prob_simple': prob_simple,
    }
```
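For the corpus-mining use case, filtering on a confidence threshold usually gives higher precision than the plain argmax decision. A sketch, where `score_fn` stands in for a function returning the sophisticated-probability (such as the `prob_sophisticated` field above) and the 0.9 cutoff is an illustrative assumption, not a tuned value:

```python
def mine_sophisticated(chunks, score_fn, threshold=0.9):
    """Keep only chunks whose sophisticated-probability clears the threshold.

    score_fn maps a text chunk to a probability in [0, 1] that the chunk
    is sophisticated. The 0.9 threshold is illustrative; tune it on your
    own domain for the precision/recall trade-off you need.
    """
    return [c for c in chunks if score_fn(c) >= threshold]
```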
Limitations and biases
- The model reflects biases present in the training snapshot; style distributions and genres strongly affect predictions.
- It was validated on only a small set of real-book chunks; please validate on your target domain before large-scale use.
Files
- model.safetensors – model weights (tracked with Git LFS)
- tokenizer.json, spm.model – tokenizer files
- config.json – model configuration
License
Specify a license before publishing to Hugging Face (e.g., Apache-2.0 or CC BY-NC).
Citation
If you use this model in research, please cite the project repository and note the dataset snapshot: data/data-snapshot-mixed-20250114.jsonl.
For questions, or to update the model card metadata, edit this file or modelcard.json in the same folder.