variant-tapt_freeze_llrd-LR_5e-05

This model is a masked language model (MLM) fine-tuned version of microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext on the Mardiyyah/TAPT_Variant_FT dataset. It was trained to improve downstream NER performance for variant entity recognition in molecular biology by adapting the tokenizer to include rare biomedical and domain-specific terms that are critical for accurate entity tagging.
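A minimal usage sketch for querying the adapted MLM head with the `transformers` library (the repository id is taken from this card's title; the example sentence is illustrative only):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_id = "Mardiyyah/variant-tapt_freeze_llrd-LR_5e-05"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# BERT-style mask token; the example sentence is illustrative, not from the training data.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("The Asp447 substitution in [MASK] abolishes kinase activity."))
```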

Evaluation results on the held-out set:

  • Loss: 1.3638
  • Accuracy: 0.7251

Model description

  • Base model: BiomedBERT (microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext)

  • Model type: Encoder-only transformer (BERT-style)

  • Vocabulary adaptation:

    • Added rare biomedical tokens and NER-specific terms to prevent fragmentation of meaningful entities (e.g., Asp447SERK1)
    • Preserved base vocabulary to retain general biomedical knowledge

  • Fine-tuning: Task-Adaptive Pretraining (TAPT) on domain-specific unlabelled molecular biology text

Intended uses & limitations

Intended uses:

  • Pretraining for NER models in molecular biology, especially variant or mutant extraction

  • Enhancing representation of rare or task-specific biomedical terms

  • Can be used as a feature extractor or base model for further downstream biomedical NLP tasks (e.g., relation extraction, entity linking); see the sketch below
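For instance, a minimal sketch of loading this checkpoint as the encoder for a token-classification (NER) head; the label scheme shown is an illustrative assumption, not part of this card:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = "Mardiyyah/variant-tapt_freeze_llrd-LR_5e-05"
labels = ["O", "B-Variant", "I-Variant"]  # assumed label scheme, for illustration only

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(
    model_id,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# `model` now wraps the TAPT-adapted encoder with a freshly initialized
# token-classification head, ready for supervised NER fine-tuning.
```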

Limitations:

  • MLM performance metrics may not directly correlate with downstream NER performance

  • Fine-tuned primarily on molecular biology datasets; may not generalize to other biomedical subdomains (e.g., clinical notes, proteomics)

  • Tokenization strategy is domain-adapted; using this model outside of molecular biology may lead to unexpected tokenization splits or embeddings

Training and evaluation data

  • Training data: Unlabelled domain-specific molecular biology text from Mardiyyah/TAPT_Variant_FT dataset

  • Evaluation data: Held-out split of the same dataset

  • Preprocessing included: cleaning, lowercasing, tokenization with the extended vocabulary, and grouping text into sequences of the model-supported maximum length (see the grouping sketch below)
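A sketch of the grouping step, assuming the standard Hugging Face MLM recipe of concatenating tokenized text and splitting it into fixed-length blocks (the 512-token block size is an assumption based on the BERT-style maximum sequence length):

```python
from itertools import chain

block_size = 512  # assumed model-supported maximum sequence length

def group_texts(examples):
    # Concatenate all tokenized sequences in the batch, then slice them into
    # block_size chunks; the trailing remainder shorter than block_size is dropped.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

# Applied after tokenization, e.g. tokenized_dataset.map(group_texts, batched=True).
```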

Training procedure

  1. Tokenizer adaptation:
  • Extended base tokenizer with task-specific and rare biomedical terms

  • Initialized the embeddings of new tokens with the average of their subtoken embeddings to preserve semantic meaning (sketched after the notes below)

  2. Task-Adaptive Pretraining (TAPT):
  • MLM objective on unlabelled domain-specific data

  • Strategy: freeze lower layers + layer-wise learning rate decay (LLRD); see the sketch below
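A minimal sketch of the freeze + LLRD setup. The number of frozen layers, the decay factor, and the `bert.embeddings` / `bert.encoder.layer` / `cls` attribute names of a BERT-style MLM model are assumptions, not values stated in this card:

```python
from torch.optim import AdamW

def build_llrd_optimizer(model, base_lr=5e-5, layer_decay=0.9, n_frozen=4):
    # Freeze the embeddings and the lowest `n_frozen` encoder layers (assumed count).
    for p in model.bert.embeddings.parameters():
        p.requires_grad = False
    for layer in model.bert.encoder.layer[:n_frozen]:
        for p in layer.parameters():
            p.requires_grad = False

    # Layer-wise learning rate decay: the top encoder layer trains at base_lr,
    # each layer below it at base_lr * layer_decay, layer_decay**2, ...
    param_groups = []
    trainable_layers = list(model.bert.encoder.layer[n_frozen:])
    for depth, layer in enumerate(reversed(trainable_layers)):
        param_groups.append({
            "params": [p for p in layer.parameters() if p.requires_grad],
            "lr": base_lr * (layer_decay ** depth),
        })
    # The MLM head trains at the full base learning rate (the tied decoder
    # weight is excluded because the embeddings are frozen).
    param_groups.append({
        "params": [p for p in model.cls.parameters() if p.requires_grad],
        "lr": base_lr,
    })
    return AdamW(param_groups, eps=1e-6, weight_decay=0.01)
```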

Notes:

  • Whole Word Masking was not used because it produced unstable loss spikes
  • The adapted tokenizer (before TAPT) can be found here. Please note: this initial extended tokenizer contains candidate tokens whose embeddings have not yet been trained. It is intended for TAPT and should not be used directly for downstream tasks, as those embeddings are uninitialized and may not carry semantic meaning. After TAPT, the final tokenizer and model embeddings are suitable for downstream variant entity recognition.
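A sketch (assumed implementation) of the embedding initialization described in step 1: each newly added token starts from the mean of the base-tokenizer subtoken embeddings it previously fragmented into. The token shown is illustrative; the real list is derived from the domain corpus:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

base_id = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"
base_tokenizer = AutoTokenizer.from_pretrained(base_id)   # kept unmodified for subtoken lookup
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForMaskedLM.from_pretrained(base_id)

new_tokens = ["asp447serk1"]  # illustrative only; the real list is domain-derived
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

embeddings = model.get_input_embeddings().weight
with torch.no_grad():
    for token in new_tokens:
        # Ids of the subtokens the base tokenizer would split this token into.
        sub_ids = base_tokenizer(token, add_special_tokens=False)["input_ids"]
        new_id = tokenizer.convert_tokens_to_ids(token)
        embeddings[new_id] = embeddings[sub_ids].mean(dim=0)
```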

Hyperparameters:

  • Learning rate: 5e-5

  • Batch size: 16

  • Epochs: 20

  • Optimizer: AdamW with ε=1e-6, weight decay=0.01

  • Warmup ratio: 6%

  • Gradient clipping: max_grad_norm=1.0

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 16
  • eval_batch_size: 16
  • seed: 3407
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 32
  • optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-06; no additional optimizer arguments
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.06
  • num_epochs: 20
  • mixed_precision_training: Native AMP
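For reference, a `TrainingArguments` sketch matching the values listed above; the output directory, evaluation strategy, and logging settings are assumptions not stated in this card:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="variant-tapt_freeze_llrd-LR_5e-05",  # assumed directory name
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,   # effective train batch size of 32
    num_train_epochs=20,
    lr_scheduler_type="linear",
    warmup_ratio=0.06,
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-6,
    max_grad_norm=1.0,               # gradient clipping, per the card
    seed=3407,
    fp16=True,                       # native AMP mixed precision
    eval_strategy="epoch",           # assumed; matches the per-epoch results table below
)
```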

Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| 1.6214        | 1.0   | 19   | 1.5767          | 0.7124   |
| 1.5605        | 2.0   | 38   | 1.5344          | 0.7151   |
| 1.5067        | 3.0   | 57   | 1.4533          | 0.7252   |
| 1.4634        | 4.0   | 76   | 1.4167          | 0.7215   |
| 1.4455        | 5.0   | 95   | 1.3822          | 0.7329   |
| 1.4099        | 6.0   | 114  | 1.3743          | 0.7282   |
| 1.3973        | 7.0   | 133  | 1.4204          | 0.7222   |
| 1.3562        | 8.0   | 152  | 1.3735          | 0.7346   |
| 1.3932        | 9.0   | 171  | 1.4068          | 0.7304   |
| 1.3475        | 10.0  | 190  | 1.3963          | 0.7315   |
| 1.3281        | 11.0  | 209  | 1.3168          | 0.7325   |
| 1.309         | 12.0  | 228  | 1.3306          | 0.7314   |
| 1.3291        | 13.0  | 247  | 1.3196          | 0.7382   |
| 1.2921        | 14.0  | 266  | 1.3361          | 0.7305   |
| 1.2992        | 15.0  | 285  | 1.2798          | 0.7351   |
| 1.2954        | 16.0  | 304  | 1.3871          | 0.7234   |
| 1.2907        | 17.0  | 323  | 1.3480          | 0.7351   |
| 1.2547        | 18.0  | 342  | 1.3165          | 0.7349   |
| 1.2802        | 19.0  | 361  | 1.3303          | 0.7313   |
| 1.2702        | 20.0  | 380  | 1.3541          | 0.7296   |

Framework versions

  • Transformers 4.48.2
  • Pytorch 2.4.1+cu121
  • Datasets 3.0.2
  • Tokenizers 0.21.0
