Update README.md
README.md (CHANGED)
````diff
@@ -31,13 +31,9 @@ inference: false
 
 ## Model Summary
 
-**TabiBERT** is a modernized encoder-only Transformer model (BERT-style) based on the [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) architecture.
-
-
-- A large-scale **Turkish corpus** covering literature, news, social media, Wikipedia, and academic
-texts.
-- **English text**, ** code with English commentary**, and **math problems in English** — together making up about **13% non-Turkish** tokens.
-
+**TabiBERT** is a modernized encoder-only Transformer model (BERT-style) based on the [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) architecture.
+TabiBERT is pre-trained on **1 trillion tokens** of a diverse dataset spanning Turkish, English, code, and math, with a native context length of up to 8,192 tokens.
+
 TabiBERT inherits ModernBERT’s architectural improvements, such as:
 
 - **Rotary Positional Embeddings (RoPE)** for long-context support.
````
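The new summary line states a native context length of up to 8,192 tokens, carried by RoPE rather than learned absolute positions. A minimal sketch for checking that claim against the published checkpoint, assuming the repo exposes a ModernBERT-style config (the `max_position_embeddings` field name is the usual one for this architecture, not something this diff confirms):

```py
from transformers import AutoConfig, AutoTokenizer

model_id = "boun-tabilab/TabiBERT"

# Assumption: TabiBERT ships a ModernBERT-style config, so the usual
# field names apply; nothing in this README guarantees them.
config = AutoConfig.from_pretrained(model_id)
print(config.max_position_embeddings)  # expected 8192 per the summary

# The tokenizer's advertised limit should agree with the config.
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(tokenizer.model_max_length)
```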
````diff
@@ -75,13 +71,17 @@ pip install flash-attn
 Example usage with `AutoModelForMaskedLM`:
 ```py
 from transformers import AutoTokenizer, AutoModelForMaskedLM
+import torch
 
 model_id = "boun-tabilab/TabiBERT"
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 model = AutoModelForMaskedLM.from_pretrained(model_id)
 
+device = "cuda" if torch.cuda.is_available() else "cpu"
+model = model.to(device)
+
 text = "[MASK] Sistemi'ndeki en büyük gezegen Jüpiter'dir."
-inputs = tokenizer(text, return_tensors="pt")
+inputs = tokenizer(text, return_tensors="pt").to(device)
 outputs = model(**inputs)
 
 masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
````
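The example sentence means "The largest planet in the [MASK] System is Jupiter." The hunk stops at the `masked_index` lookup, so the decoding step falls outside the diff context. For reference, a typical continuation that turns that index into a predicted token, reusing the variables defined in the hunk above (`outputs`, `masked_index`, `tokenizer`); this is a sketch, not necessarily the README's own follow-up code:

```py
# Take the highest-scoring vocabulary id at the masked position.
predicted_id = int(outputs.logits[0, masked_index].argmax())
print(tokenizer.decode([predicted_id]))
# The natural fill for this sentence is "Güneş" ("Solar"), though the
# actual prediction depends on the model.
```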
````diff
@@ -96,10 +96,19 @@ from transformers import pipeline
 
 pipe = pipeline("fill-mask", model="boun-tabilab/TabiBERT")
 
-print(pipe("[MASK], Türkiye Cumhuriyeti'nin
+print(pipe("[MASK], Türkiye Cumhuriyeti'nin başkentidir."))
 
 ```
 
+## Pre-training Data
+
+TabiBERT has been **pre-trained on 86 billion tokens** of diverse data, primarily:
+
+- A large-scale **Turkish corpus** covering literature, news, social media, Wikipedia, and academic
+texts.
+- **English text**, **code with English commentary**, and **math problems in English**, which together make up about **13% non-Turkish** tokens.
+
+
 ## Evaluation
 
 Evaluations are in progress.
````
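The corrected `print` line completes the prompt to "[MASK], Türkiye Cumhuriyeti'nin başkentidir." ("[MASK] is the capital of the Republic of Türkiye"). As a usage note, the fill-mask pipeline returns a list of candidate completions, so the top suggestions can be inspected like this (a sketch relying on the standard `transformers` fill-mask output keys):

```py
from transformers import pipeline

pipe = pipeline("fill-mask", model="boun-tabilab/TabiBERT")

# Each candidate is a dict with "score", "token", "token_str",
# and "sequence" keys in the standard fill-mask output format.
for candidate in pipe("[MASK], Türkiye Cumhuriyeti'nin başkentidir.")[:3]:
    print(f'{candidate["token_str"]!r}  score={candidate["score"]:.3f}')
```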