ebrarkiziloglu committed (verified)
Commit 7fde2b6 · 1 Parent(s): da55ad4

Update README.md

Files changed (1):
  1. README.md (+18 -9)

README.md CHANGED
@@ -31,13 +31,9 @@ inference: false

  ## Model Summary

- **TabiBERT** is a modernized encoder-only Transformer model (BERT-style) based on the [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) architecture.
- It has been **pre-trained on 86 billion tokens** of diverse data, primarily:
-
- - A large-scale **Turkish corpus** covering literature, news, social media, Wikipedia, and academic
- texts.
- - **English text**, **code with English commentary**, and **math problems in English** — together making up about **13% non-Turkish** tokens.
-
+ **TabiBERT** is a modernized encoder-only Transformer model (BERT-style) based on the [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) architecture.
+ TabiBERT is pre-trained on **1 trillion tokens** of a diverse dataset spanning Turkish, English, code, and math, with a native context length of up to 8,192 tokens.
+
  TabiBERT inherits ModernBERT’s architectural improvements, such as:

  - **Rotary Positional Embeddings (RoPE)** for long-context support.
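The updated summary claims a native context length of 8,192 tokens. A minimal sanity check of that claim, assuming TabiBERT publishes a standard Hugging Face `config.json` with ModernBERT-style field names (an assumption, not something this diff confirms):

```py
from transformers import AutoConfig

# Assumption: the checkpoint exposes ModernBERT-style config fields.
config = AutoConfig.from_pretrained("boun-tabilab/TabiBERT")
print(config.max_position_embeddings)  # expected: 8192, per the summary above
```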
@@ -75,13 +71,17 @@ pip install flash-attn
  Example usage with `AutoModelForMaskedLM`:
  ```py
  from transformers import AutoTokenizer, AutoModelForMaskedLM
+ import torch

  model_id = "boun-tabilab/TabiBERT"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForMaskedLM.from_pretrained(model_id)

+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model = model.to(device)
+
  text = "[MASK] Sistemi'ndeki en büyük gezegen Jüpiter'dir."
- inputs = tokenizer(text, return_tensors="pt")
+ inputs = tokenizer(text, return_tensors="pt").to(device)
  outputs = model(**inputs)

  masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
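The hunk ends at the masked-index lookup; the rest of the README's example sits outside the diff context. A minimal sketch of how the continuation could decode the top prediction, reusing `outputs`, `masked_index`, and `tokenizer` from the snippet above (a hedged completion, not part of this commit):

```py
# Hedged continuation: assumes `outputs`, `masked_index`, and `tokenizer`
# from the example above are in scope.
predicted_id = outputs.logits[0, masked_index].argmax(dim=-1).item()
print(tokenizer.decode([predicted_id]))  # e.g. "Güneş" if the model fills the mask as expected
```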
@@ -96,10 +96,19 @@ from transformers import pipeline

  pipe = pipeline("fill-mask", model="boun-tabilab/TabiBERT")

- print(pipe("[MASK], Türkiye Cumhuriyeti'nin ilk başkentidir."))
+ print(pipe("[MASK], Türkiye Cumhuriyeti'nin başkentidir."))

  ```

+ ## Pre-training Data
+
+ TabiBERT has been **pre-trained on 86 billion tokens** of diverse data, primarily:
+
+ - A large-scale **Turkish corpus** covering literature, news, social media, Wikipedia, and academic
+ texts.
+ - **English text**, **code with English commentary**, and **math problems in English** — together making up about **13% non-Turkish** tokens.
+
+
  ## Evaluation

  Evaluations are in progress.
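As a usage note on the fill-mask example in the hunk above: the pipeline returns a list of candidate fills, each a dict with `score`, `token`, `token_str`, and `sequence`. A minimal sketch of keeping only the top candidate (`top_k` is a standard `transformers` fill-mask argument; the expected answer shown is a guess, not a verified model output):

```py
from transformers import pipeline

pipe = pipeline("fill-mask", model="boun-tabilab/TabiBERT")
# Each candidate is a dict with "score", "token", "token_str", "sequence".
top = pipe("[MASK], Türkiye Cumhuriyeti'nin başkentidir.", top_k=1)[0]
print(top["token_str"], round(top["score"], 3))  # e.g. "Ankara" with its probability
```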
 