docs: add new section with tokenizer stats
README.md CHANGED

@@ -16,25 +16,17 @@ Following the original nanochat tokenizer training process, we trained the token
 python -m scripts.tok_train --max_chars=2000000000
 ```
 
-
-
-
-
-
-
-
-
-
-
-
-2025-10-19 22:25:31,239 - rustbpe - INFO - Starting merge loop
-...
-2025-10-19 22:25:50,556 - rustbpe - INFO - Progress: 100% (65271/65271 merges) - Last merge: (2065, 489) -> 65526 (frequency: 355)
-2025-10-19 22:25:50,556 - rustbpe - INFO - Finished training: 65271 merges completed
-Training time: 45.47s
-Saved tokenizer encoding to /home/stefan/.cache/nanochat/tokenizer/tokenizer.pkl
-Saved token_bytes to /home/stefan/.cache/nanochat/tokenizer/token_bytes.pt
-```
+## Stats
+
+- max_chars: 2,000,000,000
+- doc_cap: 10,000
+- vocab_size: 65,536
+- train_time: 117.8557
+- num_special_tokens: 9
+- token_bytes_min: 1
+- token_bytes_max: 66
+- token_bytes_mean: 7.5642
+- token_bytes_std: 3.6434
 
 ## Evaluation
 
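Note on internal consistency: the merge count in the removed log lines up with the new stats. With a vocab_size of 65,536, subtracting the 256 base byte tokens and the 9 special tokens leaves 65,536 - 256 - 9 = 65,271 learned BPE merges, exactly the 65271/65271 reported at 100% progress, and the last merge's new token id, 65526, equals 256 + 65,271 - 1, which suggests the 9 special tokens occupy the final ids.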
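The token_bytes_* figures in the new Stats section can be sanity-checked against the saved artifact. Below is a minimal sketch, assuming token_bytes.pt stores a 1-D tensor with the byte length of each token in the vocabulary; that layout is an assumption, not confirmed here, and the path comes from the log above.

```python
# Sketch: recompute the token_bytes_* stats from the saved artifact.
# Assumes token_bytes.pt holds a 1-D tensor with the byte length of every
# token in the 65,536-entry vocabulary (an assumed layout, not confirmed).
import os
import torch

path = os.path.expanduser("~/.cache/nanochat/tokenizer/token_bytes.pt")
token_bytes = torch.load(path).float()

print(f"token_bytes_min:  {token_bytes.min().item():.0f}")   # expect 1
print(f"token_bytes_max:  {token_bytes.max().item():.0f}")   # expect 66
print(f"token_bytes_mean: {token_bytes.mean().item():.4f}")  # expect ~7.5642
print(f"token_bytes_std:  {token_bytes.std().item():.4f}")   # expect ~3.6434
```

If the layout assumption holds, the printed values should reproduce the Stats bullets; note that torch's std() uses the unbiased estimator, whose effect is negligible at this vocabulary size.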