variant-tapt_freeze_llrd-LR_5e-05

This model is a masked language model (MLM) fine-tuned version of microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext on the Mardiyyah/TAPT_Variant_FT dataset. It was trained to improve downstream NER performance for variant entity recognition in molecular biology by adapting the tokenizer to include rare biomedical and domain-specific terms that are critical for accurate entity tagging.
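A minimal usage sketch for querying the adapted MLM head with the `transformers` library (the repository id is taken from this card's title; the example sentence is illustrative only):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_id = "Mardiyyah/variant-tapt_freeze_llrd-LR_5e-05"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# BERT-style mask token; the example sentence is illustrative, not from the training data.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("The Asp447 substitution in [MASK] abolishes kinase activity."))
```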

Evaluation results on the held-out set:

  • Loss: 1.3638
  • Accuracy: 0.7251

Model description

  • Base model: BiomedBERT (microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext)

  • Model type: Encoder-only transformer (BERT-style)

  • Vocabulary adaptation:

    • Added rare biomedical tokens and NER-specific terms to prevent fragmentation of meaningful entities (e.g., Asp447SERK1)
    • Preserved base vocabulary to retain general biomedical knowledge

  • Fine-tuning: Task-Adaptive Pretraining (TAPT) on domain-specific unlabelled molecular biology text

Intended uses & limitations

Intended uses:

  • Pretraining for NER models in molecular biology, especially variant or mutant extraction

  • Enhancing representation of rare or task-specific biomedical terms

  • Can be used as a feature extractor or base model for further downstream biomedical NLP tasks (e.g., relation extraction, entity linking); see the sketch below
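For instance, a minimal sketch of loading this checkpoint as the encoder for a token-classification (NER) head; the label scheme shown is an illustrative assumption, not part of this card:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = "Mardiyyah/variant-tapt_freeze_llrd-LR_5e-05"
labels = ["O", "B-Variant", "I-Variant"]  # assumed label scheme, for illustration only

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(
    model_id,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# `model` now wraps the TAPT-adapted encoder with a freshly initialized
# token-classification head, ready for supervised NER fine-tuning.
```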

Limitations:

  • MLM performance metrics may not directly correlate with downstream NER performance

  • Fine-tuned primarily on molecular biology datasets; may not generalize to other biomedical subdomains (e.g., clinical notes, proteomics)

  • Tokenization strategy is domain-adapted; using this model outside of molecular biology may lead to unexpected tokenization splits or embeddings

Training and evaluation data

  • Training data: Unlabelled domain-specific molecular biology text from Mardiyyah/TAPT_Variant_FT dataset

  • Evaluation data: Held-out split of the same dataset

  • Preprocessing included: cleaning, lowercasing, tokenization with the extended vocabulary, and grouping text into sequences of the model-supported maximum length (see the grouping sketch below)
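A sketch of the grouping step, assuming the standard Hugging Face MLM recipe of concatenating tokenized text and splitting it into fixed-length blocks (the 512-token block size is an assumption based on the BERT-style maximum sequence length):

```python
from itertools import chain

block_size = 512  # assumed model-supported maximum sequence length

def group_texts(examples):
    # Concatenate all tokenized sequences in the batch, then slice them into
    # block_size chunks; the trailing remainder shorter than block_size is dropped.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

# Applied after tokenization, e.g. tokenized_dataset.map(group_texts, batched=True).
```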

Training procedure

  1. Tokenizer adaptation:
  • Extended base tokenizer with task-specific and rare biomedical terms

  • Initialized the embeddings of new tokens with the average of their subtoken embeddings to preserve semantic meaning (sketched after the notes below)

  2. Task-Adaptive Pretraining (TAPT):
  • MLM objective on unlabelled domain-specific data

  • Strategy: freeze lower layers + layer-wise learning rate decay (LLRD); see the sketch below
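A minimal sketch of the freeze + LLRD setup. The number of frozen layers, the decay factor, and the `bert.embeddings` / `bert.encoder.layer` / `cls` attribute names of a BERT-style MLM model are assumptions, not values stated in this card:

```python
from torch.optim import AdamW

def build_llrd_optimizer(model, base_lr=5e-5, layer_decay=0.9, n_frozen=4):
    # Freeze the embeddings and the lowest `n_frozen` encoder layers (assumed count).
    for p in model.bert.embeddings.parameters():
        p.requires_grad = False
    for layer in model.bert.encoder.layer[:n_frozen]:
        for p in layer.parameters():
            p.requires_grad = False

    # Layer-wise learning rate decay: the top encoder layer trains at base_lr,
    # each layer below it at base_lr * layer_decay, layer_decay**2, ...
    param_groups = []
    trainable_layers = list(model.bert.encoder.layer[n_frozen:])
    for depth, layer in enumerate(reversed(trainable_layers)):
        param_groups.append({
            "params": [p for p in layer.parameters() if p.requires_grad],
            "lr": base_lr * (layer_decay ** depth),
        })
    # The MLM head trains at the full base learning rate (the tied decoder
    # weight is excluded because the embeddings are frozen).
    param_groups.append({
        "params": [p for p in model.cls.parameters() if p.requires_grad],
        "lr": base_lr,
    })
    return AdamW(param_groups, eps=1e-6, weight_decay=0.01)
```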

Notes:

  • Whole Word Masking was not used because it produced unstable loss spikes
  • The adapted tokenizer (before TAPT) can be found here. Please note: this initial extended tokenizer contains candidate tokens whose embeddings have not yet been trained. It is intended for TAPT and should not be used directly for downstream tasks, as those embeddings are uninitialized and may not carry semantic meaning. After TAPT, the final tokenizer and model embeddings are suitable for downstream variant entity recognition.
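A sketch (assumed implementation) of the embedding initialization described in step 1: each newly added token starts from the mean of the base-tokenizer subtoken embeddings it previously fragmented into. The token shown is illustrative; the real list is derived from the domain corpus:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

base_id = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"
base_tokenizer = AutoTokenizer.from_pretrained(base_id)   # kept unmodified for subtoken lookup
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForMaskedLM.from_pretrained(base_id)

new_tokens = ["asp447serk1"]  # illustrative only; the real list is domain-derived
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

embeddings = model.get_input_embeddings().weight
with torch.no_grad():
    for token in new_tokens:
        # Ids of the subtokens the base tokenizer would split this token into.
        sub_ids = base_tokenizer(token, add_special_tokens=False)["input_ids"]
        new_id = tokenizer.convert_tokens_to_ids(token)
        embeddings[new_id] = embeddings[sub_ids].mean(dim=0)
```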

Hyperparameters:

  • Learning rate: 5e-5

  • Batch size: 16

  • Epochs: 20

  • Optimizer: AdamW with ε=1e-6, weight decay=0.01

  • Warmup ratio: 6%

  • Gradient clipping: max_grad_norm=1.0

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 16
  • eval_batch_size: 16
  • seed: 3407
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 32
  • optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-06; no additional optimizer arguments
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.06
  • num_epochs: 20
  • mixed_precision_training: Native AMP
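For reference, a `TrainingArguments` sketch matching the values listed above; the output directory, evaluation strategy, and logging settings are assumptions not stated in this card:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="variant-tapt_freeze_llrd-LR_5e-05",  # assumed directory name
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,   # effective train batch size of 32
    num_train_epochs=20,
    lr_scheduler_type="linear",
    warmup_ratio=0.06,
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-6,
    max_grad_norm=1.0,               # gradient clipping, per the card
    seed=3407,
    fp16=True,                       # native AMP mixed precision
    eval_strategy="epoch",           # assumed; matches the per-epoch results table below
)
```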

Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| 1.6214        | 1.0   | 19   | 1.5767          | 0.7124   |
| 1.5605        | 2.0   | 38   | 1.5344          | 0.7151   |
| 1.5067        | 3.0   | 57   | 1.4533          | 0.7252   |
| 1.4634        | 4.0   | 76   | 1.4167          | 0.7215   |
| 1.4455        | 5.0   | 95   | 1.3822          | 0.7329   |
| 1.4099        | 6.0   | 114  | 1.3743          | 0.7282   |
| 1.3973        | 7.0   | 133  | 1.4204          | 0.7222   |
| 1.3562        | 8.0   | 152  | 1.3735          | 0.7346   |
| 1.3932        | 9.0   | 171  | 1.4068          | 0.7304   |
| 1.3475        | 10.0  | 190  | 1.3963          | 0.7315   |
| 1.3281        | 11.0  | 209  | 1.3168          | 0.7325   |
| 1.309         | 12.0  | 228  | 1.3306          | 0.7314   |
| 1.3291        | 13.0  | 247  | 1.3196          | 0.7382   |
| 1.2921        | 14.0  | 266  | 1.3361          | 0.7305   |
| 1.2992        | 15.0  | 285  | 1.2798          | 0.7351   |
| 1.2954        | 16.0  | 304  | 1.3871          | 0.7234   |
| 1.2907        | 17.0  | 323  | 1.3480          | 0.7351   |
| 1.2547        | 18.0  | 342  | 1.3165          | 0.7349   |
| 1.2802        | 19.0  | 361  | 1.3303          | 0.7313   |
| 1.2702        | 20.0  | 380  | 1.3541          | 0.7296   |

Framework versions

  • Transformers 4.48.2
  • Pytorch 2.4.1+cu121
  • Datasets 3.0.2
  • Tokenizers 0.21.0
