stefan-it committed on
Commit 23a4e19 · verified · 1 Parent(s): f9189df

docs: add new section with tokenizer stats

Files changed (1)
  1. README.md +11 -19
README.md CHANGED
@@ -16,25 +16,17 @@ Following the original nanochat tokenizer training process, we trained the token
 python -m scripts.tok_train --max_chars=2000000000
 ```
 
-Here's an excerpt from the training log:
-
-```
-max_chars: 2,000,000,000
-doc_cap: 10,000
-vocab_size: 65,536
-2025-10-19 22:25:06,054 - rustbpe - INFO - Processing sequences from iterator (buffer_size: 8192)
-2025-10-19 22:25:29,319 - rustbpe - INFO - Processed 359722 sequences total, 4328904 unique
-2025-10-19 22:25:29,450 - rustbpe - INFO - Starting BPE training: 65271 merges to compute
-2025-10-19 22:25:29,450 - rustbpe - INFO - Computing initial pair counts from 4328904 unique sequences
-2025-10-19 22:25:31,238 - rustbpe - INFO - Building heap with 17271 unique pairs
-2025-10-19 22:25:31,239 - rustbpe - INFO - Starting merge loop
-...
-2025-10-19 22:25:50,556 - rustbpe - INFO - Progress: 100% (65271/65271 merges) - Last merge: (2065, 489) -> 65526 (frequency: 355)
-2025-10-19 22:25:50,556 - rustbpe - INFO - Finished training: 65271 merges completed
-Training time: 45.47s
-Saved tokenizer encoding to /home/stefan/.cache/nanochat/tokenizer/tokenizer.pkl
-Saved token_bytes to /home/stefan/.cache/nanochat/tokenizer/token_bytes.pt
-```
+## Stats
+
+- max_chars: 2,000,000,000
+- doc_cap: 10,000
+- vocab_size: 65,536
+- train_time: 117.8557
+- num_special_tokens: 9
+- token_bytes_min: 1
+- token_bytes_max: 66
+- token_bytes_mean: 7.5642
+- token_bytes_std: 3.6434
 
 ## Evaluation
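
The `token_bytes_*` entries added in the Stats section are ordinary summary statistics over the byte length of each entry in the tokenizer's vocabulary. A minimal sketch of how such numbers could be recomputed, using a toy stand-in vocabulary (the `vocab` table and all names here are hypothetical for illustration, not nanochat's actual API):

```python
import statistics

# Toy stand-in for a tokenizer's id -> byte-string vocabulary table.
# A real BPE vocabulary (e.g. vocab_size 65,536) would be loaded from the
# trained tokenizer artifacts instead.
vocab = {
    0: b"a",                    # single byte
    1: b"the",                  # short merged token
    2: b" tokenizer",           # longer merge with leading space
    3: b"\xf0\x9f\x99\x82",     # multi-byte UTF-8 sequence (emoji)
}

# Byte length of each token's raw byte string.
lengths = [len(token_bytes) for token_bytes in vocab.values()]

stats = {
    "token_bytes_min": min(lengths),
    "token_bytes_max": max(lengths),
    "token_bytes_mean": statistics.mean(lengths),
    "token_bytes_std": statistics.stdev(lengths),
}
print(stats)
```

A high mean byte length generally indicates more compression per token on the training distribution, while the max reflects the longest merge the BPE procedure produced.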