stefan-it committed on
Commit 23a4e19 · verified · 1 Parent(s): f9189df

docs: add new section with tokenizer stats

Files changed (1)
  1. README.md +11 -19
README.md CHANGED
@@ -16,25 +16,17 @@ Following the original nanochat tokenizer training process, we trained the token
 python -m scripts.tok_train --max_chars=2000000000
 ```
 
-Here's an excerpt from the training log:
-
-```
-max_chars: 2,000,000,000
-doc_cap: 10,000
-vocab_size: 65,536
-2025-10-19 22:25:06,054 - rustbpe - INFO - Processing sequences from iterator (buffer_size: 8192)
-2025-10-19 22:25:29,319 - rustbpe - INFO - Processed 359722 sequences total, 4328904 unique
-2025-10-19 22:25:29,450 - rustbpe - INFO - Starting BPE training: 65271 merges to compute
-2025-10-19 22:25:29,450 - rustbpe - INFO - Computing initial pair counts from 4328904 unique sequences
-2025-10-19 22:25:31,238 - rustbpe - INFO - Building heap with 17271 unique pairs
-2025-10-19 22:25:31,239 - rustbpe - INFO - Starting merge loop
-...
-2025-10-19 22:25:50,556 - rustbpe - INFO - Progress: 100% (65271/65271 merges) - Last merge: (2065, 489) -> 65526 (frequency: 355)
-2025-10-19 22:25:50,556 - rustbpe - INFO - Finished training: 65271 merges completed
-Training time: 45.47s
-Saved tokenizer encoding to /home/stefan/.cache/nanochat/tokenizer/tokenizer.pkl
-Saved token_bytes to /home/stefan/.cache/nanochat/tokenizer/token_bytes.pt
-```
+## Stats
+
+- max_chars: 2,000,000,000
+- doc_cap: 10,000
+- vocab_size: 65,536
+- train_time: 117.8557
+- num_special_tokens: 9
+- token_bytes_min: 1
+- token_bytes_max: 66
+- token_bytes_mean: 7.5642
+- token_bytes_std: 3.6434
 
 ## Evaluation
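
The `token_bytes_*` entries added in the Stats section are ordinary summary statistics over the byte length of each entry in the tokenizer's vocabulary. A minimal sketch of how such numbers could be recomputed, using a toy stand-in vocabulary (the `vocab` table and all names here are hypothetical for illustration, not nanochat's actual API):

```python
import statistics

# Toy stand-in for a tokenizer's id -> byte-string vocabulary table.
# A real BPE vocabulary (e.g. vocab_size 65,536) would be loaded from the
# trained tokenizer artifacts instead.
vocab = {
    0: b"a",                    # single byte
    1: b"the",                  # short merged token
    2: b" tokenizer",           # longer merge with leading space
    3: b"\xf0\x9f\x99\x82",     # multi-byte UTF-8 sequence (emoji)
}

# Byte length of each token's raw byte string.
lengths = [len(token_bytes) for token_bytes in vocab.values()]

stats = {
    "token_bytes_min": min(lengths),
    "token_bytes_max": max(lengths),
    "token_bytes_mean": statistics.mean(lengths),
    "token_bytes_std": statistics.stdev(lengths),
}
print(stats)
```

A high mean byte length generally indicates more compression per token on the training distribution, while the max reflects the longest merge the BPE procedure produced.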