nanochat-d26 (Base Pretrained)

A 1.68B parameter GPT-style base language model trained from scratch using nanochat. This is the pretrained base model; for the chat-finetuned version, see carlosaguayo/nanochat-d26-sft.

Model Details

| Property | Value |
|---|---|
| Parameters | 1,681,790,292 |
| Layers | 26 |
| Embedding dim | 1664 |
| Attention heads | 13 (Q) / 13 (KV) |
| Head dim | 128 |
| Sequence length | 2048 |
| Vocab size | 32,768 |
| Window pattern | SSSL (sliding + global) |
| Precision | BFloat16 |
| Checkpoint step | 7,226 |

Training

| Property | Value |
|---|---|
| Dataset | FineWeb-EDU |
| Training tokens | 7,577,010,176 (~7.6B) |
| Token:param ratio | 8.25:1 |
| Hardware | 8x NVIDIA H100 80GB HBM3 |
| Training time | 163 minutes |
| FP8 training | Yes (tensorwise) |
| MFU | 60.44% |
| Optimizer | Muon + AdamW |
| Total FLOPs | 4.69e19 |
| Final val bpb | 0.7465 |
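The reported MFU can be roughly reproduced from the other numbers in the table. This is a back-of-the-envelope sketch, assuming the H100 BF16 dense peak of 989 TFLOP/s per GPU; nanochat's exact accounting may differ, and the training time here is rounded to whole minutes:

```python
# Sanity-check the reported MFU from the table's own numbers.
total_flops = 4.69e19
train_seconds = 163 * 60
num_gpus = 8
peak_flops_per_gpu = 989e12  # assumed H100 BF16 dense peak

achieved_per_gpu = total_flops / train_seconds / num_gpus
mfu = achieved_per_gpu / peak_flops_per_gpu
print(f"{mfu:.2%}")  # close to the reported 60.44%
```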

Hyperparameters

  • Matrix LR: 0.02, Embedding LR: 0.3, Unembedding LR: 0.004, Scalar LR: 0.5
  • Adam betas: (0.8, 0.95), Weight decay: 0.2
  • Batch size: 1,048,576 tokens
  • Warmdown: 50% of training, no warmup
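The warmdown schedule above can be sketched as a learning-rate multiplier: constant for the first half of training, then linearly decayed over the second half. The decay-to-zero final fraction is an assumption for illustration; nanochat may decay to a nonzero floor:

```python
def lr_multiplier(step, num_steps, warmdown_frac=0.5, final_frac=0.0):
    """Constant LR, then linear warmdown over the last warmdown_frac of steps.

    No warmup, matching the hyperparameters above. final_frac=0.0 is an
    assumed end value, not one read from the training config.
    """
    warmdown_start = int(num_steps * (1 - warmdown_frac))
    if step < warmdown_start:
        return 1.0
    progress = (step - warmdown_start) / (num_steps - warmdown_start)
    return 1.0 + progress * (final_frac - 1.0)
```

For this run (7,226 steps), the multiplier stays at 1.0 through step 3,612 and then ramps linearly down to the final fraction at step 7,226.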

Evaluation (Base CORE)

| Task | Score |
|---|---|
| CORE metric | 0.2576 |
| HellaSwag (zero-shot) | 0.3556 |
| ARC-Easy | 0.5735 |
| ARC-Challenge | 0.1650 |
| LAMBADA | 0.4283 |
| BigBench QA Wikidata | 0.5229 |
| Winograd | 0.4139 |
| PIQA | 0.4124 |
| SQuAD | 0.3672 |
| CoQA | 0.2607 |
| Jeopardy | 0.1937 |
| COPA | 0.2800 |

Sample Completions

The capital of France is Paris, and it is the largest city in the country.

The chemical symbol of gold is Au. It is a soft, malleable, ductile, and precious...

The planets of the solar system are: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune

Tokenizer

This model uses a custom 32,768-token BPE tokenizer trained on FineWeb-EDU. The tokenizer files are included in the tokenizer/ directory. Important: This tokenizer is different from the 65k-vocab tokenizer used by karpathy/nanochat-d32; you must use the tokenizer bundled with this model.
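As a rough illustration of how a BPE tokenizer like this applies its learned merges at encode time, here is a toy sketch. The real tokenizer operates on bytes with ranked merge priorities, not this exact character-level loop:

```python
def bpe_encode(text, merges):
    """Toy BPE encoder: apply each learned merge (left, right) in priority order.

    Illustrative only; the actual nanochat tokenizer is byte-level and its
    merge table is learned from FineWeb-EDU.
    """
    tokens = list(text)
    for left, right in merges:
        i, out = 0, []
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                out.append(left + right)  # merge the adjacent pair
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens
```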

Usage

```shell
# Download
huggingface-cli download carlosaguayo/nanochat-d26 --local-dir /tmp/nanochat-d26

# Install files
mkdir -p ~/.cache/nanochat/base_checkpoints/d26
cp /tmp/nanochat-d26/model_007226.pt ~/.cache/nanochat/base_checkpoints/d26/
cp /tmp/nanochat-d26/meta_007226.json ~/.cache/nanochat/base_checkpoints/d26/
cp /tmp/nanochat-d26/tokenizer/* ~/.cache/nanochat/tokenizer/
```

Then load the model and sample from it:

```python
from nanochat.checkpoint_manager import load_model
from nanochat.engine import Engine

device = "cuda"  # or "cpu"
model, tokenizer, meta = load_model("base", device, phase="eval", model_tag="d26")
engine = Engine(model, tokenizer)

bos = tokenizer.get_bos_token_id()
tokens = [bos] + tokenizer.encode("The capital of France is")
for token_column, _ in engine.generate(tokens, num_samples=1, max_tokens=30, temperature=0.8, top_k=50):
    print(tokenizer.decode([token_column[0]]), end="", flush=True)
```

Architecture

Nanochat GPT with:

  • Rotary positional embeddings (RoPE)
  • QK normalization (RMSNorm)
  • ReLU-squared activation
  • Untied input/output embeddings
  • Logit softcapping
  • Sliding window attention (SSSL pattern: 3 sliding + 1 global)
  • Flash Attention 3 on Hopper+, PyTorch SDPA fallback elsewhere
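Two of these components are simple enough to sketch. The cap value of 15 and the exact layer-to-window assignment below are assumptions for illustration, not values read from this checkpoint:

```python
import math

def softcap(logit, cap=15.0):
    # Logit softcapping: smoothly bounds logits to (-cap, cap) while leaving
    # small values nearly unchanged. cap=15.0 is an assumed value.
    return cap * math.tanh(logit / cap)

def window_kind(layer_idx):
    # SSSL pattern: in each group of four layers, three use sliding-window
    # attention and the fourth attends globally (assumed layer ordering).
    return "global" if layer_idx % 4 == 3 else "sliding"
```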

License

MIT
