# nanochat-d26 (Base Pretrained)
A 1.68B-parameter GPT-style base language model trained from scratch with nanochat. This is the pretrained base model; for the chat-finetuned version, see carlosaguayo/nanochat-d26-sft.
## Model Details
| Property | Value |
|---|---|
| Parameters | 1,681,790,292 |
| Layers | 26 |
| Embedding dim | 1664 |
| Attention heads | 13 (Q) / 13 (KV) |
| Head dim | 128 |
| Sequence length | 2048 |
| Vocab size | 32,768 |
| Window pattern | SSSL (sliding + global) |
| Precision | BFloat16 |
| Checkpoint step | 7,226 |
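The SSSL window pattern in the table can be sketched as a simple mapping from layer index to attention type, cycling `S` (sliding window) and `L` (long/global) across layers. This is an illustrative sketch of the repeating pattern only; nanochat's exact per-layer assignment may differ.

```python
def attention_kind(layer_idx: int, pattern: str = "SSSL") -> str:
    """Map a layer index to 'sliding' or 'global' by cycling the pattern.

    'S' = sliding-window attention, 'L' = long/global attention.
    """
    return "sliding" if pattern[layer_idx % len(pattern)] == "S" else "global"

# Under this sketch, every 4th layer (indices 3, 7, 11, ...) attends globally.
kinds = [attention_kind(i) for i in range(26)]
```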
## Training
| Property | Value |
|---|---|
| Dataset | FineWeb-EDU |
| Training tokens | 7,577,010,176 (~7.6B) |
| Token:param ratio | 8.25:1 |
| Hardware | 8x NVIDIA H100 80GB HBM3 |
| Training time | 163 minutes |
| FP8 training | Yes (tensorwise) |
| MFU | 60.44% |
| Optimizer | Muon + AdamW |
| Total FLOPs | 4.69e19 |
| Final val bpb | 0.7465 |
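The reported MFU is consistent with the totals above. A back-of-the-envelope check, assuming the H100 SXM's ~989 TFLOPS dense BF16 peak as the reference (the accounting nanochat uses during FP8 training may differ):

```python
total_flops = 4.69e19         # total training FLOPs from the table
seconds = 163 * 60            # 163 minutes of training
gpus = 8
peak_flops_per_gpu = 989e12   # assumed H100 SXM dense BF16 peak

# MFU = achieved FLOPs / (wall time * aggregate peak throughput)
mfu = total_flops / (seconds * gpus * peak_flops_per_gpu)
print(f"{mfu:.2%}")           # within ~0.2 points of the reported 60.44%
```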
## Hyperparameters
- Matrix LR: 0.02, Embedding LR: 0.3, Unembedding LR: 0.004, Scalar LR: 0.5
- Adam betas: (0.8, 0.95), Weight decay: 0.2
- Batch size: 1,048,576 tokens
- Warmdown: 50% of training, no warmup
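The schedule above (no warmup, warmdown over the final 50% of training) can be sketched as a learning-rate multiplier. This is a hedged sketch assuming a linear decay to zero; nanochat's actual scheduler may differ in details such as a nonzero final multiplier.

```python
def lr_multiplier(step: int, total_steps: int, warmdown_frac: float = 0.5) -> float:
    """LR multiplier: constant at 1.0 with no warmup, then linear decay
    to 0.0 over the final `warmdown_frac` of training (assumed shape)."""
    warmdown_start = int(total_steps * (1 - warmdown_frac))
    if step < warmdown_start:
        return 1.0
    return max(0.0, 1.0 - (step - warmdown_start) / (total_steps - warmdown_start))
```

Each parameter group's base LR (matrix, embedding, unembedding, scalar) would be scaled by this multiplier at every step.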
## Evaluation (Base CORE)
| Task | Score |
|---|---|
| CORE metric | 0.2576 |
| HellaSwag (zero-shot) | 0.3556 |
| ARC-Easy | 0.5735 |
| ARC-Challenge | 0.1650 |
| LAMBADA | 0.4283 |
| BigBench QA Wikidata | 0.5229 |
| Winograd | 0.4139 |
| PIQA | 0.4124 |
| SQuAD | 0.3672 |
| CoQA | 0.2607 |
| Jeopardy | 0.1937 |
| COPA | 0.2800 |
## Sample Completions
> The capital of France is Paris, and it is the largest city in the country.

> The chemical symbol of gold is Au. It is a soft, malleable, ductile, and precious...

> The planets of the solar system are: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune
## Tokenizer
This model uses a custom 32,768-token BPE tokenizer trained on FineWeb-EDU. The tokenizer files are included in the tokenizer/ directory. Important: this tokenizer differs from the 65k-vocab tokenizer used by karpathy/nanochat-d32; you must use the tokenizer bundled with this model.
## Usage
```bash
# Download the checkpoint and tokenizer
huggingface-cli download carlosaguayo/nanochat-d26 --local-dir /tmp/nanochat-d26

# Install files into the nanochat cache
mkdir -p ~/.cache/nanochat/base_checkpoints/d26
cp /tmp/nanochat-d26/model_007226.pt ~/.cache/nanochat/base_checkpoints/d26/
cp /tmp/nanochat-d26/meta_007226.json ~/.cache/nanochat/base_checkpoints/d26/
cp /tmp/nanochat-d26/tokenizer/* ~/.cache/nanochat/tokenizer/
```
```python
import torch
from nanochat.checkpoint_manager import load_model
from nanochat.engine import Engine

device = "cuda" if torch.cuda.is_available() else "cpu"
model, tokenizer, meta = load_model("base", device, phase="eval", model_tag="d26")
engine = Engine(model, tokenizer)

# Prepend the BOS token and stream a completion token by token
bos = tokenizer.get_bos_token_id()
tokens = [bos] + tokenizer.encode("The capital of France is")
for token_column, _ in engine.generate(tokens, num_samples=1, max_tokens=30, temperature=0.8, top_k=50):
    print(tokenizer.decode([token_column[0]]), end="", flush=True)
```
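The `temperature` and `top_k` arguments behave as in standard sampling. A minimal standalone sketch of temperature-scaled top-k sampling, for illustration only (this is not nanochat's engine code):

```python
import math
import random

def sample_top_k(logits: list[float], k: int, temperature: float) -> int:
    """Keep the k highest logits, apply temperature, sample from softmax."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)                             # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(top, weights=probs, k=1)[0]
```

Lower temperatures sharpen the distribution toward the top logit; `k=1` reduces to greedy decoding.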
## Architecture
Nanochat GPT with:
- Rotary positional embeddings (RoPE)
- QK normalization (RMSNorm)
- ReLU-squared activation
- Untied input/output embeddings
- Logit softcapping
- Sliding window attention (SSSL pattern: 3 sliding + 1 global)
- Flash Attention 3 on Hopper+, PyTorch SDPA fallback elsewhere
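Two of the less common components above, logit softcapping and the ReLU-squared activation, can be sketched in isolation. The cap value here is an illustrative assumption, and nanochat's exact implementation may differ:

```python
import math

def softcap(x: float, cap: float = 15.0) -> float:
    """Smoothly bound a logit to (-cap, cap): cap * tanh(x / cap).
    The cap value is an assumed example, not nanochat's actual constant."""
    return cap * math.tanh(x / cap)

def relu_squared(x: float) -> float:
    """ReLU-squared MLP activation: max(0, x) ** 2."""
    return max(0.0, x) ** 2
```

Softcapping keeps extreme logits bounded without a hard clip, while ReLU-squared is a smooth, cheap alternative to GELU.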
## License
MIT