DeBERTa-v3-xsmall β€” Prose Sophistication Classifier

This model is DeBERTa-v3-xsmall fine-tuned to classify short- to mid-length prose chunks as "sophisticated" or "simple" for the Prose Sophistication project.

Model: deberta-v3-xsmall (fine-tuned)

Important: This model was trained with the label encoding:

  • 0 = sophisticated
  • 1 = simple

Make sure to map outputs accordingly when using the model in downstream code.
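A minimal illustration of that mapping (the dict and helper names here are illustrative, not part of the released code):

```python
# Label encoding used by this model, per the note above.
ID2LABEL = {0: "sophisticated", 1: "simple"}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}

def decode_prediction(class_id: int) -> str:
    """Map a raw argmax class id to its human-readable label."""
    return ID2LABEL[class_id]
```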

Intended uses

  • Filtering / mining literary corpora for passages that match a "sophisticated" style.
  • Human-in-the-loop curation where high precision on sophisticated passages is required.
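For the high-precision filtering use case, one simple approach is to keep only chunks whose sophisticated probability clears a high threshold (this helper and the threshold value are illustrative sketches, not shipped code):

```python
def filter_sophisticated(scored_chunks, threshold=0.9):
    """Keep chunks whose P(sophisticated) clears a high-precision threshold.

    scored_chunks: iterable of (text, prob_sophisticated) pairs,
    e.g. produced by the classify() function shown below.
    """
    return [text for text, prob in scored_chunks if prob >= threshold]
```

Raising the threshold trades recall for precision, which suits human-in-the-loop curation where false positives are costlier than misses.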

Not intended for

  • Judging authors or people. This classifier is stylistic and dataset-dependent.
  • Sensitive decisions, profiling, or any application that requires full fairness guarantees.

Training data

  • Dataset: data/data-snapshot-mixed-20250114.jsonl (snapshot prepared for this project)
  • Total samples (approx): 16,006
  • Split: 85% train / 15% validation
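An 85/15 split like the one above can be reproduced with a seeded shuffle along these lines (a sketch; the actual split script is not included in this card):

```python
import random

def train_val_split(samples, val_frac=0.15, seed=42):
    """Seeded random split into (train, val) lists."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_val = int(len(indices) * val_frac)
    val = [samples[i] for i in indices[:n_val]]
    train = [samples[i] for i in indices[n_val:]]
    return train, val
```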

Evaluation

  • Validation accuracy: 97.92%
  • Validation F1 (macro): 97.63%
  • Real-book spot check (30 chunks from 10 books): all sophisticated passages detected, with high confidence

These metrics come from internal evaluation; see training_results.json in the repository for full logs and per-model results.
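For reference, macro F1 is the unweighted mean of the per-class F1 scores. A pure-Python sketch for the binary case used here (equivalent in spirit to scikit-learn's f1_score with average='macro'):

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)
```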

How to use

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_path = "models/deberta_v3_xsmall_new"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()

def classify(text, max_length=512, device='cpu'):
    """Classify a prose chunk; returns the class decision and both probabilities."""
    inputs = tokenizer(text, truncation=True, padding=True, max_length=max_length, return_tensors='pt')
    inputs = {k: v.to(device) for k, v in inputs.items()}
    model.to(device)  # keep the model on the same device as the inputs
    with torch.no_grad():
        out = model(**inputs)
        probs = torch.softmax(out.logits, dim=-1)
        # Model encoding: 0 = sophisticated, 1 = simple
        prob_soph = probs[0, 0].item()
        prob_simple = probs[0, 1].item()
    return {
        'is_sophisticated': prob_soph > prob_simple,
        'prob_sophisticated': prob_soph,
        'prob_simple': prob_simple
    }

Limitations and biases

  • The model reflects biases present in the training snapshot; style distributions and genres strongly affect predictions.
  • It was validated on a small set of real-book chunks β€” please validate on your target domain before large-scale use.

Files

  • model.safetensors β€” model weights (tracked with Git LFS)
  • tokenizer.json, spm.model β€” tokenizer files
  • config.json β€” model configuration

License

Specify a license before publishing to Hugging Face (e.g., Apache-2.0 or CC BY-NC).

Citation

If you use this model in research, please cite the project repository and note the dataset snapshot: data/data-snapshot-mixed-20250114.jsonl.


For questions, or to update the model card metadata, edit this file or modelcard.json in the same folder.

Model details

  • Model size: 70.8M parameters
  • Tensor type: F32 (safetensors)