Pushing ONNX model to Hugging Face Hub

#2
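
This commit adds the exported ONNX weights plus the Sentence Transformers configuration files (modules.json, 1_Pooling/config.json, config_sentence_transformers.json, sentence_bert_config.json) for thomas-sounack/BioClinical-ModernBERT-base. As context for reviewers, the sketch below shows how such an export and push is typically produced with the Sentence Transformers ONNX backend. This is a minimal, hypothetical reproduction (assuming sentence-transformers >= 3.2 with the `onnx` extra; the target repository id is a placeholder) — the exact commands used for this commit are not part of the diff.

```python
# Hypothetical reproduction sketch -- the diff only records the resulting files,
# not the commands that generated them.
from sentence_transformers import SentenceTransformer

# Loading with backend="onnx" exports the transformer to onnx/model.onnx via
# Optimum / ONNX Runtime when no ONNX weights exist yet.
model = SentenceTransformer(
    "thomas-sounack/BioClinical-ModernBERT-base",
    backend="onnx",
)

# Pushing uploads the ONNX weights together with the Sentence Transformers
# config files (modules.json, 1_Pooling/config.json, ...).
model.push_to_hub("your-username/BioClinical-ModernBERT-base-onnx")  # placeholder repo id
```
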
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+ "word_embedding_dimension": 768,
+ "pooling_mode_cls_token": false,
+ "pooling_mode_mean_tokens": true,
+ "pooling_mode_max_tokens": false,
+ "pooling_mode_mean_sqrt_len_tokens": false,
+ "pooling_mode_weightedmean_tokens": false,
+ "pooling_mode_lasttoken": false,
+ "include_prompt": true
+ }
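
The pooling config above selects mean pooling over 768-dimensional token embeddings (CLS, max, weighted-mean and last-token modes are all disabled). As a rough illustration of what that reduces to — not the library's internal implementation — mean pooling averages token embeddings while masking out padding:

```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Illustrative mean pooling: average token embeddings over non-padding positions.

    token_embeddings: (batch, seq_len, 768); attention_mask: (batch, seq_len).
    """
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)                   # (batch, 768)
    counts = mask.sum(dim=1).clamp(min=1e-9)                        # avoid division by zero
    return summed / counts                                          # (batch, 768) sentence embeddings
```
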
README.md CHANGED
@@ -1,153 +1,141 @@
  ---
- license: mit
- language:
- - en
- base_model:
- - answerdotai/ModernBERT-base
- pipeline_tag: fill-mask
  tags:
- - fill-mask
- - masked-lm
- - long-context
- - modernbert
- - BioClinical-ModernBERT
- library_name: transformers
  ---

- # BioClinical ModernBERT

- *BioClinical ModernBERT is available in two sizes: [base](https://huggingface.co/thomas-sounack/BioClinical-ModernBERT-base) (150M parameters) and [large](https://huggingface.co/thomas-sounack/BioClinical-ModernBERT-large) (396M parameters). The model training checkpoints can be found [here](https://huggingface.co/thomas-sounack/BioClinical-ModernBERT-checkpoints), and our code is available in our [GitHub repository](https://github.com/lindvalllab/BioClinical-ModernBERT).*

- ## Table of Contents
- 1. [Model Summary](#model-summary)
- 2. [Usage](#usage)
- 3. [Training](#training)
- 4. [Evaluation](#evaluation)
- 5. [License](#license)
- 6. [Citation](#citation)

- ## Model Summary

- BioClinical ModernBERT is a domain-adapted encoder that builds on ModernBERT [base](https://huggingface.co/answerdotai/ModernBERT-base) and [large](https://huggingface.co/answerdotai/ModernBERT-large), incorporating long-context processing and substantial improvements in speed and performance for biomedical and clinical NLP. BioClinical ModernBERT is trained on the largest biomedical and clinical corpus to date, with over 53.5 billion tokens, and addresses a key limitation of prior clinical encoders by leveraging 20 datasets from diverse institutions, domains, and geographic regions, rather than relying on data from a single source.

- ## Usage

- You can use these models directly with the `transformers` library starting from v4.48.0:

- ```sh
- pip install -U transformers>=4.48.0
  ```

- Since BioClinical ModernBERT is a Masked Language Model (MLM), you can use the `fill-mask` pipeline or load it via `AutoModelForMaskedLM`. To use BioClinical ModernBERT for downstream tasks like classification, retrieval, or QA, fine-tune it following standard BERT fine-tuning recipes.

- **⚠️ If your GPU supports it, we recommend using BioClinical ModernBERT with Flash Attention 2 to reach the highest efficiency. To do so, install Flash Attention as follows, then use the model as normal:**

  ```bash
- pip install flash-attn
  ```

- Using `AutoModelForMaskedLM`:
-
  ```python
- from transformers import AutoTokenizer, AutoModelForMaskedLM
- model_id = "thomas-sounack/BioClinical-ModernBERT-base"
- tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForMaskedLM.from_pretrained(model_id)
- text = "Mitochondria is the powerhouse of the [MASK]."
- inputs = tokenizer(text, return_tensors="pt")
- outputs = model(**inputs)
- # To get predictions for the mask:
- masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
- predicted_token_id = outputs.logits[0, masked_index].argmax(axis=-1)
- predicted_token = tokenizer.decode(predicted_token_id)
- print("Predicted token:", predicted_token)
- # Predicted token: cell
  ```

- Using a pipeline:

- ```python
- import torch
- from transformers import pipeline
- from pprint import pprint
- pipe = pipeline(
- "fill-mask",
- model="thomas-sounack/BioClinical-ModernBERT-base",
- torch_dtype=torch.bfloat16,
- )
- input_text = "[MASK] is a disease caused by an uncontrolled division of abnormal cells in a part of the body."
- results = pipe(input_text)
- pprint(results)
- ```

- **Note:** BioClinical ModernBERT, similarly to ModernBERT, does not use token type IDs unlike some earlier BERT models. Most downstream usage is identical to standard BERT models on the Hugging Face Hub, except you can omit the `token_type_ids` parameter.
-
- ## Training
-
- ### Data
-
- BioClinical ModernBERT is trained on 50.7B tokens of biomedical text gathered from PubMed and PMC, and 2.8B tokens of clinical text from 20 datasets which are detailed in the table below.
-
- | Name | Country | Clinical Source | Clinical Context | Samples | Tokens (M) |
- |----------------------------|--------------|------------------------------------|-----------------------|-----------|------------|
- | ACI-BENCH | US | Clinical Notes | Not Reported | 207 | 0.1 |
- | ADE Corpus | Several | Clinical Notes | Not Reported | 20,896 | 0.5 |
- | Brain MRI Stroke | Korea | Radiology Reports | Neurology | 2,603 | 0.2 |
- | CheXpert Plus | US | Radiology Reports | Pulmonology | 223,460 | 60.6 |
- | CHIFIR | Australia | Pathology Reports | Hematology / Oncology | 283 | 0.1 |
- | CORAL | US | Progress Notes | Hematology / Oncology | 240 | 0.7 |
- | Eye Gaze CXR | US | Radiology Reports | Pulmonology | 892 | 0.03 |
- | Gout Chief Complaints | US | Chief Complaint | Internal Medicine | 8,429 | 0.2 |
- | ID-68 | UK | Clinical Notes | Psychology | 78 | 0.02 |
- | Inspect | US | Radiology Reports | Pulmonology | 22,259 | 2.8 |
- | MedNLI | US | Clinical Notes | Internal Medicine | 14,047 | 0.5 |
- | MedQA | US | National Medical Board Examination | Not Reported | 14,366 | 2.0 |
- | MIMIC-III | US | Clinical Notes | Internal Medicine | 2,021,411 | 1,047.7 |
- | MIMIC-IV Note | US | Clinical Notes | Internal Medicine | 2,631,243 | 1,765.7 |
- | MTSamples | Not Reported | Clinical Notes | Internal Medicine | 2,358 | 1.7 |
- | Negex | US | Discharge Summaries | Not Reported | 2,056 | 0.1 |
- | PriMock57 | UK | Simulated Patient Care | Internal Medicine | 57 | 0.01 |
- | Q-Pain | US | Clinical Vignettes | Palliative Care | 51 | 0.01 |
- | REFLACX | US | Radiology Reports | Pulmonology | 2,543 | 0.1 |
- | Simulated Resp. Interviews | Canada | Simulated Patient Care | Pulmonology | 272 | 0.6 |
-
- ### Methodology
-
- BioClinical ModernBERT base is trained in two phases. This model is initialized from the last stable-phase checkpoint of ModernBERT base and trained with the same hyperparameters: learning rate of 3e-4 and batch size of 72.
- - Phase 1: Training on 160.5B tokens from PubMed, PMC, and the 20 clinical datasets. Learning rate remains constant throughout this stage, and the masking probability is set at 30%.
- - Phase 2: Training on the 20 clinical datasets only. Masking probability is reduced to 15%. The model is trained for 3 epochs with a 1-sqrt learning rate decay.
-
- ## Evaluation
-
- | | Model | Context Length | ChemProt | Phenotype | COS | Social History | DEID |
- |-------|--------------------------------|----------------|----------|-----------|----------|----------------|----------|
- | Base | BioBERT | 512 | 89.5 | 26.6 | 94.9 | 55.8 | 74.3 |
- | | Clinical BERT | 512 | 88.3 | 25.8 | 95.0 | 55.2 | 74.2 |
- | | BioMed-RoBERTa | 512 | 89.0 | 36.8 | 94.9 | 55.2 | 81.1 |
- | | Clinical-BigBird | 4096 | 87.4 | 26.5 | 94.0 | 53.3 | 71.2 |
- | | Clinical-Longformer | 4096 | 74.2 | 46.4 | **95.2** | 56.8 | 82.3 |
- | | Clinical ModernBERT | 8192 | 86.9 | 54.9 | 93.7 | 53.8 | 44.4 |
- | | ModernBERT - base | 8192 | 89.5 | 48.4 | 94.0 | 53.1 | 78.3 |
- | | BioClinical ModernBERT - base | 8192 | 89.9 | 58.1 | 95.1 | **58.5** | 82.7 |
- | Large | ModernBERT - large | 8192 | 90.2 | 58.3 | 94.4 | 54.8 | 82.1 |
- | | BioClinical ModernBERT - large | 8192 | **90.8** | **60.8** | 95.1 | 57.1 | **83.8** |
-
- ## License
-
- We release the BioClinical ModernBERT base and large model weights and training checkpoints under the MIT license.

  ## Citation

- If you use BioClinical ModernBERT in your work, please cite our [preprint](https://arxiv.org/abs/2506.10896):

- ```
- @misc{sounack2025bioclinicalmodernbertstateoftheartlongcontext,
- title={BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP},
- author={Thomas Sounack and Joshua Davis and Brigitte Durieux and Antoine Chaffin and Tom J. Pollard and Eric Lehman and Alistair E. W. Johnson and Matthew McDermott and Tristan Naumann and Charlotta Lindvall},
- year={2025},
- eprint={2506.10896},
- archivePrefix={arXiv},
- primaryClass={cs.CL},
- url={https://arxiv.org/abs/2506.10896},
- }
- ```

  ---
  tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ base_model: thomas-sounack/BioClinical-ModernBERT-base
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
  ---

+ # SentenceTransformer based on thomas-sounack/BioClinical-ModernBERT-base

+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [thomas-sounack/BioClinical-ModernBERT-base](https://huggingface.co/thomas-sounack/BioClinical-ModernBERT-base). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

+ ## Model Details

+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [thomas-sounack/BioClinical-ModernBERT-base](https://huggingface.co/thomas-sounack/BioClinical-ModernBERT-base) <!-- at revision 8ea6951dd0f48edbea0bdd3a081c78cada0ad70c -->
+ - **Maximum Sequence Length:** 8192 tokens
+ - **Output Dimensionality:** 768 dimensions
+ - **Similarity Function:** Cosine Similarity
+ <!-- - **Training Dataset:** Unknown -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->

+ ### Model Sources

+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

+ ### Full Model Architecture

+ ```
+ SentenceTransformer(
+ (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ORTModelForFeatureExtraction
+ (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+ )
  ```

+ ## Usage

+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:

  ```bash
+ pip install -U sentence-transformers
  ```

+ Then you can load this model and run inference.
  ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("thomas-sounack/BioClinical-ModernBERT-base")
+ # Run inference
+ sentences = [
+ 'The weather is lovely today.',
+ "It's so sunny outside!",
+ 'He drove to the stadium.',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 768]
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # [3, 3]
  ```

+ <!--
+ ### Direct Usage (Transformers)

+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->

+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Framework Versions
+ - Python: 3.13.5
+ - Sentence Transformers: 4.1.0
+ - Transformers: 4.52.4
+ - PyTorch: 2.7.1
+ - Accelerate:
+ - Datasets: 3.6.0
+ - Tokenizers: 0.21.1

  ## Citation

+ ### BibTeX

+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json CHANGED
@@ -35,9 +35,11 @@
  "num_hidden_layers": 22,
  "pad_token_id": 50283,
  "position_embedding_type": "absolute",
  "sep_token_id": 50282,
- "tie_word_embeddings": true,
  "torch_dtype": "float32",
- "transformers_version": "4.48.0",
  "vocab_size": 50368
- }

  "num_hidden_layers": 22,
  "pad_token_id": 50283,
  "position_embedding_type": "absolute",
+ "repad_logits_with_grad": false,
  "sep_token_id": 50282,
+ "sparse_pred_ignore_index": -100,
+ "sparse_prediction": false,
  "torch_dtype": "float32",
+ "transformers_version": "4.52.4",
  "vocab_size": 50368
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+ "__version__": {
+ "sentence_transformers": "4.1.0",
+ "transformers": "4.52.4",
+ "pytorch": "2.7.1"
+ },
+ "prompts": {},
+ "default_prompt_name": null,
+ "similarity_fn_name": "cosine"
+ }
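
`similarity_fn_name` is set to `cosine`, so the `model.similarity(...)` call in the README computes pairwise cosine similarity between the embeddings. An illustrative equivalent (assuming torch tensors of shape (n, d) and (m, d); not the library's internal code):

```python
import torch

def cosine_similarity_matrix(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between embedding matrices a (n, d) and b (m, d)."""
    a = torch.nn.functional.normalize(a, p=2, dim=1)  # L2-normalize rows
    b = torch.nn.functional.normalize(b, p=2, dim=1)
    return a @ b.T  # (n, m) matrix of values in [-1, 1]
```
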
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+ {
+ "idx": 0,
+ "name": "0",
+ "path": "",
+ "type": "sentence_transformers.models.Transformer"
+ },
+ {
+ "idx": 1,
+ "name": "1",
+ "path": "1_Pooling",
+ "type": "sentence_transformers.models.Pooling"
+ }
+ ]
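
modules.json declares the two-stage pipeline: module 0 is the Transformer encoder loaded from the repository root (`"path": ""`), module 1 is the pooling head stored under `1_Pooling/`. A roughly equivalent manual assembly with the sentence-transformers modules API is sketched below (illustrative only, using the default PyTorch backend; this repository's Transformer module is actually loaded as an ORTModelForFeatureExtraction per the architecture printout above):

```python
from sentence_transformers import SentenceTransformer, models

# Module 0: the transformer encoder, loaded from the repository root ("path": "")
word_embedding_model = models.Transformer(
    "thomas-sounack/BioClinical-ModernBERT-base",
    max_seq_length=8192,
)

# Module 1: mean pooling, matching 1_Pooling/config.json ("path": "1_Pooling")
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),  # 768
    pooling_mode="mean",
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```
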
onnx/model.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cc358f9c17ba979b157a0486e16820d8f84c84a18c71489159d929857906be38
+ size 596472567
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+ "max_seq_length": 8192,
+ "do_lower_case": false
+ }
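
sentence_bert_config.json carries the wrapper-level settings: inputs are truncated at 8192 tokens and are not lower-cased. In sentence-transformers this surfaces as the `max_seq_length` attribute, roughly as follows (illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("thomas-sounack/BioClinical-ModernBERT-base")
print(model.max_seq_length)  # 8192 -- longer inputs are truncated at encode time

# The limit can be lowered to trade context length for speed and memory:
model.max_seq_length = 2048
```
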
special_tokens_map.json CHANGED
@@ -34,4 +34,4 @@
  "rstrip": false,
  "single_word": false
  }
- }

  "rstrip": false,
  "single_word": false
  }
+ }
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
@@ -931,6 +931,7 @@
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "model_input_names": [
  "input_ids",

  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
+ "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "model_input_names": [
  "input_ids",

@@ -939,6 +940,6 @@
  "model_max_length": 8192,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
- "tokenizer_class": "PreTrainedTokenizerFast",
  "unk_token": "[UNK]"
- }

  "model_max_length": 8192,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
+ "tokenizer_class": "PreTrainedTokenizer",
  "unk_token": "[UNK]"
+ }