DeBERTa-v3-xsmall β€” Prose Sophistication Classifier

This model is DeBERTa-v3-xsmall fine-tuned to classify short- to mid-length prose chunks as "sophisticated" or "simple" for the Prose Sophistication project.

Model: deberta-v3-xsmall (fine-tuned)

Important: This model was trained with the label encoding:

  • 0 = sophisticated
  • 1 = simple

Make sure to map outputs accordingly when using the model in downstream code.
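A minimal illustration of that mapping (the dict and helper names here are illustrative, not part of the released code):

```python
# Label encoding used by this model, per the note above.
ID2LABEL = {0: "sophisticated", 1: "simple"}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}

def decode_prediction(class_id: int) -> str:
    """Map a raw argmax class id to its human-readable label."""
    return ID2LABEL[class_id]
```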

Intended uses

  • Filtering / mining literary corpora for passages that match a "sophisticated" style.
  • Human-in-the-loop curation where high precision on sophisticated passages is required.
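For the high-precision filtering use case, one simple approach is to keep only chunks whose sophisticated probability clears a high threshold (this helper and the threshold value are illustrative sketches, not shipped code):

```python
def filter_sophisticated(scored_chunks, threshold=0.9):
    """Keep chunks whose P(sophisticated) clears a high-precision threshold.

    scored_chunks: iterable of (text, prob_sophisticated) pairs,
    e.g. produced by the classify() function shown below.
    """
    return [text for text, prob in scored_chunks if prob >= threshold]
```

Raising the threshold trades recall for precision, which suits human-in-the-loop curation where false positives are costlier than misses.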

Not intended for

  • Judging authors or people. This classifier is stylistic and dataset-dependent.
  • Sensitive decisions, profiling, or any application that requires full fairness guarantees.

Training data

  • Dataset: data/data-snapshot-mixed-20250114.jsonl (snapshot prepared for this project)
  • Total samples (approx): 16,006
  • Split: 85% train / 15% validation
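An 85/15 split like the one above can be reproduced with a seeded shuffle along these lines (a sketch; the actual split script is not included in this card):

```python
import random

def train_val_split(samples, val_frac=0.15, seed=42):
    """Seeded random split into (train, val) lists."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_val = int(len(indices) * val_frac)
    val = [samples[i] for i in indices[:n_val]]
    train = [samples[i] for i in indices[n_val:]]
    return train, val
```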

Evaluation

  • Validation accuracy: 97.92%
  • Validation F1 (macro): 97.63%
  • Real-book spot check (30 chunks from 10 books): all sophisticated passages detected, with high confidence

These metrics come from internal evaluation; see training_results.json in the repository for full logs and per-model results.
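For reference, macro F1 is the unweighted mean of the per-class F1 scores. A pure-Python sketch for the binary case used here (equivalent in spirit to scikit-learn's f1_score with average='macro'):

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)
```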

How to use

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_path = "models/deberta_v3_xsmall_new"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()

def classify(text, max_length=512, device='cpu'):
    """Classify a prose chunk; returns the class decision and both probabilities."""
    inputs = tokenizer(text, truncation=True, padding=True, max_length=max_length, return_tensors='pt')
    inputs = {k: v.to(device) for k, v in inputs.items()}
    model.to(device)  # keep the model on the same device as the inputs
    with torch.no_grad():
        out = model(**inputs)
        probs = torch.softmax(out.logits, dim=-1)
        # Model encoding: 0 = sophisticated, 1 = simple
        prob_soph = probs[0, 0].item()
        prob_simple = probs[0, 1].item()
    return {
        'is_sophisticated': prob_soph > prob_simple,
        'prob_sophisticated': prob_soph,
        'prob_simple': prob_simple
    }

Limitations and biases

  • The model reflects biases present in the training snapshot; style distributions and genres strongly affect predictions.
  • It was validated on a small set of real-book chunks β€” please validate on your target domain before large-scale use.

Files

  • model.safetensors β€” model weights (tracked with Git LFS)
  • tokenizer.json, spm.model β€” tokenizer files
  • config.json β€” model configuration

License

Specify a license before publishing to Hugging Face (e.g., Apache-2.0 or CC BY-NC).

Citation

If you use this model in research, please cite the project repository and note the dataset snapshot: data/data-snapshot-mixed-20250114.jsonl.


For questions, or to update the model card metadata, edit this file or modelcard.json in the same folder.

Model details

  • Model size: 70.8M parameters
  • Tensor type: F32 (safetensors)