Pushing ONNX model to Hugging Face Hub

#2
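
This commit adds the exported ONNX weights plus the Sentence Transformers configuration files (modules.json, 1_Pooling/config.json, config_sentence_transformers.json, sentence_bert_config.json) for thomas-sounack/BioClinical-ModernBERT-base. As context for reviewers, the sketch below shows how such an export and push is typically produced with the Sentence Transformers ONNX backend. This is a minimal, hypothetical reproduction (assuming sentence-transformers >= 3.2 with the `onnx` extra; the target repository id is a placeholder) — the exact commands used for this commit are not part of the diff.

```python
# Hypothetical reproduction sketch -- the diff only records the resulting files,
# not the commands that generated them.
from sentence_transformers import SentenceTransformer

# Loading with backend="onnx" exports the transformer to onnx/model.onnx via
# Optimum / ONNX Runtime when no ONNX weights exist yet.
model = SentenceTransformer(
    "thomas-sounack/BioClinical-ModernBERT-base",
    backend="onnx",
)

# Pushing uploads the ONNX weights together with the Sentence Transformers
# config files (modules.json, 1_Pooling/config.json, ...).
model.push_to_hub("your-username/BioClinical-ModernBERT-base-onnx")  # placeholder repo id
```
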
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+ "word_embedding_dimension": 768,
+ "pooling_mode_cls_token": false,
+ "pooling_mode_mean_tokens": true,
+ "pooling_mode_max_tokens": false,
+ "pooling_mode_mean_sqrt_len_tokens": false,
+ "pooling_mode_weightedmean_tokens": false,
+ "pooling_mode_lasttoken": false,
+ "include_prompt": true
+ }
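
The pooling config above selects mean pooling over 768-dimensional token embeddings (CLS, max, weighted-mean and last-token modes are all disabled). As a rough illustration of what that reduces to — not the library's internal implementation — mean pooling averages token embeddings while masking out padding:

```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Illustrative mean pooling: average token embeddings over non-padding positions.

    token_embeddings: (batch, seq_len, 768); attention_mask: (batch, seq_len).
    """
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)                   # (batch, 768)
    counts = mask.sum(dim=1).clamp(min=1e-9)                        # avoid division by zero
    return summed / counts                                          # (batch, 768) sentence embeddings
```
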
README.md CHANGED
@@ -1,153 +1,141 @@
  ---
- license: mit
- language:
- - en
- base_model:
- - answerdotai/ModernBERT-base
- pipeline_tag: fill-mask
  tags:
- - fill-mask
- - masked-lm
- - long-context
- - modernbert
- - BioClinical-ModernBERT
- library_name: transformers
  ---

- # BioClinical ModernBERT

- *BioClinical ModernBERT is available in two sizes: [base](https://huggingface.co/thomas-sounack/BioClinical-ModernBERT-base) (150M parameters) and [large](https://huggingface.co/thomas-sounack/BioClinical-ModernBERT-large) (396M parameters). The model training checkpoints can be found [here](https://huggingface.co/thomas-sounack/BioClinical-ModernBERT-checkpoints), and our code is available in our [GitHub repository](https://github.com/lindvalllab/BioClinical-ModernBERT).*

- ## Table of Contents
- 1. [Model Summary](#model-summary)
- 2. [Usage](#usage)
- 3. [Training](#training)
- 4. [Evaluation](#evaluation)
- 5. [License](#license)
- 6. [Citation](#citation)

- ## Model Summary

- BioClinical ModernBERT is a domain-adapted encoder that builds on ModernBERT [base](https://huggingface.co/answerdotai/ModernBERT-base) and [large](https://huggingface.co/answerdotai/ModernBERT-large), incorporating long-context processing and substantial improvements in speed and performance for biomedical and clinical NLP. BioClinical ModernBERT is trained on the largest biomedical and clinical corpus to date, with over 53.5 billion tokens, and addresses a key limitation of prior clinical encoders by leveraging 20 datasets from diverse institutions, domains, and geographic regions, rather than relying on data from a single source.

- ## Usage

- You can use these models directly with the `transformers` library starting from v4.48.0:

- ```sh
- pip install -U transformers>=4.48.0
  ```

- Since BioClinical ModernBERT is a Masked Language Model (MLM), you can use the `fill-mask` pipeline or load it via `AutoModelForMaskedLM`. To use BioClinical ModernBERT for downstream tasks like classification, retrieval, or QA, fine-tune it following standard BERT fine-tuning recipes.

- **⚠️ If your GPU supports it, we recommend using BioClinical ModernBERT with Flash Attention 2 to reach the highest efficiency. To do so, install Flash Attention as follows, then use the model as normal:**

  ```bash
- pip install flash-attn
  ```

- Using `AutoModelForMaskedLM`:
-
  ```python
- from transformers import AutoTokenizer, AutoModelForMaskedLM
- model_id = "thomas-sounack/BioClinical-ModernBERT-base"
- tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForMaskedLM.from_pretrained(model_id)
- text = "Mitochondria is the powerhouse of the [MASK]."
- inputs = tokenizer(text, return_tensors="pt")
- outputs = model(**inputs)
- # To get predictions for the mask:
- masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
- predicted_token_id = outputs.logits[0, masked_index].argmax(axis=-1)
- predicted_token = tokenizer.decode(predicted_token_id)
- print("Predicted token:", predicted_token)
- # Predicted token: cell
  ```

- Using a pipeline:

- ```python
- import torch
- from transformers import pipeline
- from pprint import pprint
- pipe = pipeline(
- "fill-mask",
- model="thomas-sounack/BioClinical-ModernBERT-base",
- torch_dtype=torch.bfloat16,
- )
- input_text = "[MASK] is a disease caused by an uncontrolled division of abnormal cells in a part of the body."
- results = pipe(input_text)
- pprint(results)
- ```

- **Note:** BioClinical ModernBERT, similarly to ModernBERT, does not use token type IDs unlike some earlier BERT models. Most downstream usage is identical to standard BERT models on the Hugging Face Hub, except you can omit the `token_type_ids` parameter.
-
- ## Training
-
- ### Data
-
- BioClinical ModernBERT is trained on 50.7B tokens of biomedical text gathered from PubMed and PMC, and 2.8B tokens of clinical text from 20 datasets which are detailed in the table below.
-
- | Name | Country | Clinical Source | Clinical Context | Samples | Tokens (M) |
- |----------------------------|--------------|------------------------------------|-----------------------|-----------|------------|
- | ACI-BENCH | US | Clinical Notes | Not Reported | 207 | 0.1 |
- | ADE Corpus | Several | Clinical Notes | Not Reported | 20,896 | 0.5 |
- | Brain MRI Stroke | Korea | Radiology Reports | Neurology | 2,603 | 0.2 |
- | CheXpert Plus | US | Radiology Reports | Pulmonology | 223,460 | 60.6 |
- | CHIFIR | Australia | Pathology Reports | Hematology / Oncology | 283 | 0.1 |
- | CORAL | US | Progress Notes | Hematology / Oncology | 240 | 0.7 |
- | Eye Gaze CXR | US | Radiology Reports | Pulmonology | 892 | 0.03 |
- | Gout Chief Complaints | US | Chief Complaint | Internal Medicine | 8,429 | 0.2 |
- | ID-68 | UK | Clinical Notes | Psychology | 78 | 0.02 |
- | Inspect | US | Radiology Reports | Pulmonology | 22,259 | 2.8 |
- | MedNLI | US | Clinical Notes | Internal Medicine | 14,047 | 0.5 |
- | MedQA | US | National Medical Board Examination | Not Reported | 14,366 | 2.0 |
- | MIMIC-III | US | Clinical Notes | Internal Medicine | 2,021,411 | 1,047.7 |
- | MIMIC-IV Note | US | Clinical Notes | Internal Medicine | 2,631,243 | 1,765.7 |
- | MTSamples | Not Reported | Clinical Notes | Internal Medicine | 2,358 | 1.7 |
- | Negex | US | Discharge Summaries | Not Reported | 2,056 | 0.1 |
- | PriMock57 | UK | Simulated Patient Care | Internal Medicine | 57 | 0.01 |
- | Q-Pain | US | Clinical Vignettes | Palliative Care | 51 | 0.01 |
- | REFLACX | US | Radiology Reports | Pulmonology | 2,543 | 0.1 |
- | Simulated Resp. Interviews | Canada | Simulated Patient Care | Pulmonology | 272 | 0.6 |
-
- ### Methodology
-
- BioClinical ModernBERT base is trained in two phases. This model is initialized from the last stable-phase checkpoint of ModernBERT base and trained with the same hyperparameters: learning rate of 3e-4 and batch size of 72.
- - Phase 1: Training on 160.5B tokens from PubMed, PMC, and the 20 clinical datasets. Learning rate remains constant throughout this stage, and the masking probability is set at 30%.
- - Phase 2: Training on the 20 clinical datasets only. Masking probability is reduced to 15%. The model is trained for 3 epochs with a 1-sqrt learning rate decay.
-
- ## Evaluation
-
- | | Model | Context Length | ChemProt | Phenotype | COS | Social History | DEID |
- |-------|--------------------------------|----------------|----------|-----------|----------|----------------|----------|
- | Base | BioBERT | 512 | 89.5 | 26.6 | 94.9 | 55.8 | 74.3 |
- | | Clinical BERT | 512 | 88.3 | 25.8 | 95.0 | 55.2 | 74.2 |
- | | BioMed-RoBERTa | 512 | 89.0 | 36.8 | 94.9 | 55.2 | 81.1 |
- | | Clinical-BigBird | 4096 | 87.4 | 26.5 | 94.0 | 53.3 | 71.2 |
- | | Clinical-Longformer | 4096 | 74.2 | 46.4 | **95.2** | 56.8 | 82.3 |
- | | Clinical ModernBERT | 8192 | 86.9 | 54.9 | 93.7 | 53.8 | 44.4 |
- | | ModernBERT - base | 8192 | 89.5 | 48.4 | 94.0 | 53.1 | 78.3 |
- | | BioClinical ModernBERT - base | 8192 | 89.9 | 58.1 | 95.1 | **58.5** | 82.7 |
- | Large | ModernBERT - large | 8192 | 90.2 | 58.3 | 94.4 | 54.8 | 82.1 |
- | | BioClinical ModernBERT - large | 8192 | **90.8** | **60.8** | 95.1 | 57.1 | **83.8** |
-
- ## License
-
- We release the BioClinical ModernBERT base and large model weights and training checkpoints under the MIT license.

  ## Citation

- If you use BioClinical ModernBERT in your work, please cite our [preprint](https://arxiv.org/abs/2506.10896):

- ```
- @misc{sounack2025bioclinicalmodernbertstateoftheartlongcontext,
- title={BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP},
- author={Thomas Sounack and Joshua Davis and Brigitte Durieux and Antoine Chaffin and Tom J. Pollard and Eric Lehman and Alistair E. W. Johnson and Matthew McDermott and Tristan Naumann and Charlotta Lindvall},
- year={2025},
- eprint={2506.10896},
- archivePrefix={arXiv},
- primaryClass={cs.CL},
- url={https://arxiv.org/abs/2506.10896},
- }
- ```

  ---
  tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ base_model: thomas-sounack/BioClinical-ModernBERT-base
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
  ---

+ # SentenceTransformer based on thomas-sounack/BioClinical-ModernBERT-base

+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [thomas-sounack/BioClinical-ModernBERT-base](https://huggingface.co/thomas-sounack/BioClinical-ModernBERT-base). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

+ ## Model Details

+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [thomas-sounack/BioClinical-ModernBERT-base](https://huggingface.co/thomas-sounack/BioClinical-ModernBERT-base) <!-- at revision 8ea6951dd0f48edbea0bdd3a081c78cada0ad70c -->
+ - **Maximum Sequence Length:** 8192 tokens
+ - **Output Dimensionality:** 768 dimensions
+ - **Similarity Function:** Cosine Similarity
+ <!-- - **Training Dataset:** Unknown -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->

+ ### Model Sources

+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

+ ### Full Model Architecture

+ ```
+ SentenceTransformer(
+ (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ORTModelForFeatureExtraction
+ (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+ )
  ```

+ ## Usage

+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:

  ```bash
+ pip install -U sentence-transformers
  ```

+ Then you can load this model and run inference.
  ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("thomas-sounack/BioClinical-ModernBERT-base")
+ # Run inference
+ sentences = [
+ 'The weather is lovely today.',
+ "It's so sunny outside!",
+ 'He drove to the stadium.',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 768]
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # [3, 3]
  ```

+ <!--
+ ### Direct Usage (Transformers)

+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->

+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Framework Versions
+ - Python: 3.13.5
+ - Sentence Transformers: 4.1.0
+ - Transformers: 4.52.4
+ - PyTorch: 2.7.1
+ - Accelerate:
+ - Datasets: 3.6.0
+ - Tokenizers: 0.21.1

  ## Citation

+ ### BibTeX

+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json CHANGED
@@ -35,9 +35,11 @@
  "num_hidden_layers": 22,
  "pad_token_id": 50283,
  "position_embedding_type": "absolute",
  "sep_token_id": 50282,
- "tie_word_embeddings": true,
  "torch_dtype": "float32",
- "transformers_version": "4.48.0",
  "vocab_size": 50368
- }

  "num_hidden_layers": 22,
  "pad_token_id": 50283,
  "position_embedding_type": "absolute",
+ "repad_logits_with_grad": false,
  "sep_token_id": 50282,
+ "sparse_pred_ignore_index": -100,
+ "sparse_prediction": false,
  "torch_dtype": "float32",
+ "transformers_version": "4.52.4",
  "vocab_size": 50368
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+ "__version__": {
+ "sentence_transformers": "4.1.0",
+ "transformers": "4.52.4",
+ "pytorch": "2.7.1"
+ },
+ "prompts": {},
+ "default_prompt_name": null,
+ "similarity_fn_name": "cosine"
+ }
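
`similarity_fn_name` is set to `cosine`, so the `model.similarity(...)` call in the README computes pairwise cosine similarity between the embeddings. An illustrative equivalent (assuming torch tensors of shape (n, d) and (m, d); not the library's internal code):

```python
import torch

def cosine_similarity_matrix(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between embedding matrices a (n, d) and b (m, d)."""
    a = torch.nn.functional.normalize(a, p=2, dim=1)  # L2-normalize rows
    b = torch.nn.functional.normalize(b, p=2, dim=1)
    return a @ b.T  # (n, m) matrix of values in [-1, 1]
```
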
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+ {
+ "idx": 0,
+ "name": "0",
+ "path": "",
+ "type": "sentence_transformers.models.Transformer"
+ },
+ {
+ "idx": 1,
+ "name": "1",
+ "path": "1_Pooling",
+ "type": "sentence_transformers.models.Pooling"
+ }
+ ]
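
modules.json declares the two-stage pipeline: module 0 is the Transformer encoder loaded from the repository root (`"path": ""`), module 1 is the pooling head stored under `1_Pooling/`. A roughly equivalent manual assembly with the sentence-transformers modules API is sketched below (illustrative only, using the default PyTorch backend; this repository's Transformer module is actually loaded as an ORTModelForFeatureExtraction per the architecture printout above):

```python
from sentence_transformers import SentenceTransformer, models

# Module 0: the transformer encoder, loaded from the repository root ("path": "")
word_embedding_model = models.Transformer(
    "thomas-sounack/BioClinical-ModernBERT-base",
    max_seq_length=8192,
)

# Module 1: mean pooling, matching 1_Pooling/config.json ("path": "1_Pooling")
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),  # 768
    pooling_mode="mean",
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```
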
onnx/model.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cc358f9c17ba979b157a0486e16820d8f84c84a18c71489159d929857906be38
+ size 596472567
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+ "max_seq_length": 8192,
+ "do_lower_case": false
+ }
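
sentence_bert_config.json carries the wrapper-level settings: inputs are truncated at 8192 tokens and are not lower-cased. In sentence-transformers this surfaces as the `max_seq_length` attribute, roughly as follows (illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("thomas-sounack/BioClinical-ModernBERT-base")
print(model.max_seq_length)  # 8192 -- longer inputs are truncated at encode time

# The limit can be lowered to trade context length for speed and memory:
model.max_seq_length = 2048
```
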
special_tokens_map.json CHANGED
@@ -34,4 +34,4 @@
  "rstrip": false,
  "single_word": false
  }
- }

  "rstrip": false,
  "single_word": false
  }
+ }
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
@@ -931,6 +931,7 @@
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "model_input_names": [
  "input_ids",

  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
+ "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "model_input_names": [
  "input_ids",

@@ -939,6 +940,6 @@
  "model_max_length": 8192,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
- "tokenizer_class": "PreTrainedTokenizerFast",
  "unk_token": "[UNK]"
- }

  "model_max_length": 8192,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
+ "tokenizer_class": "PreTrainedTokenizer",
  "unk_token": "[UNK]"
+ }