# nanochat-d26 (Base Pretrained)
A 1.68B-parameter GPT-style base language model trained from scratch with nanochat. This is the pretrained base model; for the chat-finetuned version, see carlosaguayo/nanochat-d26-sft.
## Model Details
| Property | Value |
|---|---|
| Parameters | 1,681,790,292 |
| Layers | 26 |
| Embedding dim | 1664 |
| Attention heads | 13 (Q) / 13 (KV) |
| Head dim | 128 |
| Sequence length | 2048 |
| Vocab size | 32,768 |
| Window pattern | SSSL (sliding + global) |
| Precision | BFloat16 |
| Checkpoint step | 7,226 |
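The SSSL window pattern in the table can be sketched as a simple mapping from layer index to attention type, cycling `S` (sliding window) and `L` (long/global) across layers. This is an illustrative sketch of the repeating pattern only; nanochat's exact per-layer assignment may differ.

```python
def attention_kind(layer_idx: int, pattern: str = "SSSL") -> str:
    """Map a layer index to 'sliding' or 'global' by cycling the pattern.

    'S' = sliding-window attention, 'L' = long/global attention.
    """
    return "sliding" if pattern[layer_idx % len(pattern)] == "S" else "global"

# Under this sketch, every 4th layer (indices 3, 7, 11, ...) attends globally.
kinds = [attention_kind(i) for i in range(26)]
```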
## Training
| Property | Value |
|---|---|
| Dataset | FineWeb-EDU |
| Training tokens | 7,577,010,176 (~7.6B) |
| Token:param ratio | 8.25:1 |
| Hardware | 8x NVIDIA H100 80GB HBM3 |
| Training time | 163 minutes |
| FP8 training | Yes (tensorwise) |
| MFU | 60.44% |
| Optimizer | Muon + AdamW |
| Total FLOPs | 4.69e19 |
| Final val bpb | 0.7465 |
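The reported MFU is consistent with the totals above. A back-of-the-envelope check, assuming the H100 SXM's ~989 TFLOPS dense BF16 peak as the reference (the accounting nanochat uses during FP8 training may differ):

```python
total_flops = 4.69e19         # total training FLOPs from the table
seconds = 163 * 60            # 163 minutes of training
gpus = 8
peak_flops_per_gpu = 989e12   # assumed H100 SXM dense BF16 peak

# MFU = achieved FLOPs / (wall time * aggregate peak throughput)
mfu = total_flops / (seconds * gpus * peak_flops_per_gpu)
print(f"{mfu:.2%}")           # within ~0.2 points of the reported 60.44%
```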
## Hyperparameters
- Matrix LR: 0.02, Embedding LR: 0.3, Unembedding LR: 0.004, Scalar LR: 0.5
- Adam betas: (0.8, 0.95), Weight decay: 0.2
- Batch size: 1,048,576 tokens
- Warmdown: 50% of training, no warmup
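The schedule above (no warmup, warmdown over the final 50% of training) can be sketched as a learning-rate multiplier. This is a hedged sketch assuming a linear decay to zero; nanochat's actual scheduler may differ in details such as a nonzero final multiplier.

```python
def lr_multiplier(step: int, total_steps: int, warmdown_frac: float = 0.5) -> float:
    """LR multiplier: constant at 1.0 with no warmup, then linear decay
    to 0.0 over the final `warmdown_frac` of training (assumed shape)."""
    warmdown_start = int(total_steps * (1 - warmdown_frac))
    if step < warmdown_start:
        return 1.0
    return max(0.0, 1.0 - (step - warmdown_start) / (total_steps - warmdown_start))
```

Each parameter group's base LR (matrix, embedding, unembedding, scalar) would be scaled by this multiplier at every step.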
## Evaluation (Base CORE)
| Task | Score |
|---|---|
| CORE metric | 0.2576 |
| HellaSwag (zero-shot) | 0.3556 |
| ARC-Easy | 0.5735 |
| ARC-Challenge | 0.1650 |
| LAMBADA | 0.4283 |
| BigBench QA Wikidata | 0.5229 |
| Winograd | 0.4139 |
| PIQA | 0.4124 |
| SQuAD | 0.3672 |
| CoQA | 0.2607 |
| Jeopardy | 0.1937 |
| COPA | 0.2800 |
## Sample Completions
> The capital of France is Paris, and it is the largest city in the country.

> The chemical symbol of gold is Au. It is a soft, malleable, ductile, and precious...

> The planets of the solar system are: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune
## Tokenizer
This model uses a custom 32,768-token BPE tokenizer trained on FineWeb-EDU. The tokenizer files are included in the tokenizer/ directory. Important: this tokenizer differs from the 65k-vocab tokenizer used by karpathy/nanochat-d32; you must use the tokenizer bundled with this model.
## Usage
```bash
# Download the checkpoint and tokenizer
huggingface-cli download carlosaguayo/nanochat-d26 --local-dir /tmp/nanochat-d26

# Install files into the nanochat cache
mkdir -p ~/.cache/nanochat/base_checkpoints/d26
cp /tmp/nanochat-d26/model_007226.pt ~/.cache/nanochat/base_checkpoints/d26/
cp /tmp/nanochat-d26/meta_007226.json ~/.cache/nanochat/base_checkpoints/d26/
cp /tmp/nanochat-d26/tokenizer/* ~/.cache/nanochat/tokenizer/
```
```python
import torch
from nanochat.checkpoint_manager import load_model
from nanochat.engine import Engine

device = "cuda" if torch.cuda.is_available() else "cpu"
model, tokenizer, meta = load_model("base", device, phase="eval", model_tag="d26")
engine = Engine(model, tokenizer)

# Prepend the BOS token and stream a completion token by token
bos = tokenizer.get_bos_token_id()
tokens = [bos] + tokenizer.encode("The capital of France is")
for token_column, _ in engine.generate(tokens, num_samples=1, max_tokens=30, temperature=0.8, top_k=50):
    print(tokenizer.decode([token_column[0]]), end="", flush=True)
```
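The `temperature` and `top_k` arguments behave as in standard sampling. A minimal standalone sketch of temperature-scaled top-k sampling, for illustration only (this is not nanochat's engine code):

```python
import math
import random

def sample_top_k(logits: list[float], k: int, temperature: float) -> int:
    """Keep the k highest logits, apply temperature, sample from softmax."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)                             # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(top, weights=probs, k=1)[0]
```

Lower temperatures sharpen the distribution toward the top logit; `k=1` reduces to greedy decoding.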
## Architecture
Nanochat GPT with:
- Rotary positional embeddings (RoPE)
- QK normalization (RMSNorm)
- ReLU-squared activation
- Untied input/output embeddings
- Logit softcapping
- Sliding window attention (SSSL pattern: 3 sliding + 1 global)
- Flash Attention 3 on Hopper+, PyTorch SDPA fallback elsewhere
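Two of the less common components above, logit softcapping and the ReLU-squared activation, can be sketched in isolation. The cap value here is an illustrative assumption, and nanochat's exact implementation may differ:

```python
import math

def softcap(x: float, cap: float = 15.0) -> float:
    """Smoothly bound a logit to (-cap, cap): cap * tanh(x / cap).
    The cap value is an assumed example, not nanochat's actual constant."""
    return cap * math.tanh(x / cap)

def relu_squared(x: float) -> float:
    """ReLU-squared MLP activation: max(0, x) ** 2."""
    return max(0.0, x) ** 2
```

Softcapping keeps extreme logits bounded without a hard clip, while ReLU-squared is a smooth, cheap alternative to GELU.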
## License
MIT