
Google Cloud Storage Upload Guide

After training your nanochat model on Lambda Labs, use this guide to upload all weights and artifacts to Google Cloud Storage.

Quick Start

# After training completes, SSH to your Lambda instance
ssh ubuntu@<INSTANCE_IP>

# Navigate to project directory
cd ~/nanochatAquaRat

# Run upload script
bash scripts/upload_to_gcs.sh --bucket gs://your-bucket-name

The script will:

  1. Check/install gcloud CLI if needed
  2. Verify authentication and bucket access
  3. Show what will be uploaded and ask for confirmation
  4. Upload all artifacts with progress
  5. Ask if you want to terminate the Lambda instance
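
Under the hood, the upload step boils down to a recursive gsutil rsync from the local cache into the bucket. A rough sketch of the core command follows; the run name is illustrative and the actual script adds prompts, logging, and flag handling around it:

# Roughly what step 4 runs; ~/.cache/nanochat is the default artifact location
gsutil -m rsync -r \
  ~/.cache/nanochat/ \
  gs://your-bucket-name/runs/aquarat-20251023-143022/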

Prerequisites

1. Create a GCS Bucket

# From your local machine
gcloud storage buckets create gs://your-bucket-name \
  --location=us-central1 \
  --uniform-bucket-level-access

Or create via console: https://console.cloud.google.com/storage/create-bucket
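
To confirm the bucket exists and is visible to your account before moving on:

# Prints the bucket URL if it exists and you have access
gsutil ls -b gs://your-bucket-name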

2. Set Up Authentication

Option A: Service Account (Recommended for Automation)

On your local machine:

# Create service account
gcloud iam service-accounts create nanochat-uploader \
  --display-name="Nanochat Model Uploader"

# Grant storage permissions
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="serviceAccount:nanochat-uploader@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/storage.objectCreator"

# Create and download key
gcloud iam service-accounts keys create ~/nanochat-key.json \
  --iam-account=nanochat-uploader@YOUR_PROJECT_ID.iam.gserviceaccount.com

Copy key to Lambda instance:

scp ~/nanochat-key.json ubuntu@<INSTANCE_IP>:~/

On Lambda instance:

gcloud auth activate-service-account --key-file=~/nanochat-key.json
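
It is worth sanity-checking that the service account is now the active credential and can actually write to the bucket. The object name below is arbitrary; note that roles/storage.objectCreator allows creating objects but not listing or deleting them, so clean up the test object from the console or a more privileged account:

# Confirm which account is active
gcloud auth list

# Stream a tiny test object into the bucket
echo "upload test" | gsutil cp - gs://your-bucket-name/upload-test.txt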

Option B: User Account (Simpler for Manual Use)

On Lambda instance:

gcloud auth login
# Follow the prompts in your browser

Usage

Basic Upload

bash scripts/upload_to_gcs.sh --bucket gs://my-models

Custom Run Name

bash scripts/upload_to_gcs.sh \
  --bucket gs://my-models \
  --run-name depth20-experiment1

Exclude Large Dataset Files

bash scripts/upload_to_gcs.sh \
  --bucket gs://my-models \
  --exclude-data

Dry Run (Preview Only)

bash scripts/upload_to_gcs.sh \
  --bucket gs://my-models \
  --dry-run

Auto-Terminate After Upload

bash scripts/upload_to_gcs.sh \
  --bucket gs://my-models \
  --auto-terminate

What Gets Uploaded

From ~/.cache/nanochat/:

Directory | Contents | Typical Size
--- | --- | ---
checkpoints/ | Model weights (.pt, .pkl files) | 500MB - 2GB
report/ | Training reports and markdown summaries | 1-10MB
tokenizer/ | BPE tokenizer files | 10-50MB
eval_bundle/ | Evaluation datasets | 50-200MB
aqua/ | AQuA-RAT dataset (optional) | 100-500MB
mechanistic_interpretability/ | DeepMind interp tools | 10-100MB

Total: Typically 1-5 GB per training run
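
To see what is actually on disk before uploading (assuming the default cache location above):

# Per-directory sizes of the artifacts the script will upload
du -sh ~/.cache/nanochat/*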

Upload Structure

Files are organized in GCS as:

gs://your-bucket/
└── runs/
    ├── aquarat-20251023-143022/
    │   ├── checkpoints/
    │   │   ├── base_final.pt
    │   │   ├── mid_final.pt
    │   │   ├── sft_final.pt
    │   │   └── rl_final.pt
    │   ├── report/
    │   │   └── report.md
    │   ├── tokenizer/
    │   └── ...
    └── depth20-experiment1/
        └── ...

Download Weights Later

Download Entire Run

gsutil -m rsync -r \
  gs://your-bucket/runs/aquarat-20251023-143022/ \
  ./local_checkpoints/

Download Just Checkpoints

gsutil -m cp -r \
  gs://your-bucket/runs/aquarat-20251023-143022/checkpoints/ \
  ./checkpoints/

Download Single File

gsutil cp \
  gs://your-bucket/runs/aquarat-20251023-143022/checkpoints/rl_final.pt \
  ./rl_final.pt
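
If you do not remember the exact run name, you can pull the most recent one. This shortcut assumes run names sort chronologically, which holds for the default timestamped names but not necessarily for custom --run-name values:

# Grab the lexicographically last run prefix and sync it locally
LATEST=$(gsutil ls gs://your-bucket/runs/ | sort | tail -n 1)
gsutil -m rsync -r "$LATEST" ./local_checkpoints/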

Cost Considerations

Storage Costs

  • Standard storage: ~$0.02/GB/month
  • Nearline storage (30+ days): ~$0.01/GB/month
  • Coldline storage (90+ days): ~$0.004/GB/month

Example: a 2GB model stored for one month in Standard storage costs about 2 × $0.02 = $0.04.
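
To estimate the cost of one of your own runs, measure its size and apply the Standard-class rate (the rates above are approximate and vary by region; the run name is illustrative):

# Total bytes under a run prefix, converted to GB and a rough monthly cost
gsutil du -s gs://your-bucket/runs/aquarat-20251023-143022/ \
  | awk '{printf "%.2f GB, ~$%.2f/month at Standard rates\n", $1/1e9, $1/1e9*0.02}'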

Network Egress

  • Upload (ingress): Free
  • Download to same region: Free
  • Download to internet: ~$0.12/GB

Tip: Keep your GCS bucket in the same region as your compute for free transfers.
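
You can check a bucket's location before uploading and compare it against your compute region:

# Shows "Location constraint" and "Location type" among other bucket metadata
gsutil ls -L -b gs://your-bucket | grep -i location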

Lifecycle Management

Auto-delete or move to cheaper storage after 90 days:

cat > lifecycle.json << EOF
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "Delete"},
        "condition": {"age": 90}
      }
    ]
  }
}
EOF

gsutil lifecycle set lifecycle.json gs://your-bucket
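
The heading above also mentions moving data to cheaper storage instead of deleting it; a variant that transitions objects to Coldline after 90 days looks like this (the age and storage class are illustrative):

cat > lifecycle.json << EOF
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
        "condition": {"age": 90}
      }
    ]
  }
}
EOF

gsutil lifecycle set lifecycle.json gs://your-bucket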

Troubleshooting

"gcloud: command not found"

The script auto-installs gcloud on Linux. If it fails:

curl https://sdk.cloud.google.com | bash
exec -l $SHELL

"Permission denied" Error

Check your service account has roles/storage.objectCreator:

gcloud projects get-iam-policy YOUR_PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:nanochat-uploader*"

Upload Interrupted

The script uses gsutil rsync, so re-running will resume:

bash scripts/upload_to_gcs.sh --bucket gs://your-bucket
# Will skip already-uploaded files

Verify Upload

# List all files in the run
gsutil ls -r gs://your-bucket/runs/your-run-name/

# Check specific checkpoints
gsutil ls gs://your-bucket/runs/your-run-name/checkpoints/
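
For a stronger check than listing names, compare the MD5 hash of a local file against its uploaded copy; both commands report the hash in the same base64 encoding (the filename and local path here are illustrative):

# MD5 of the local checkpoint
gsutil hash -m ~/.cache/nanochat/checkpoints/base_final.pt

# MD5 stored by GCS for the uploaded object
gsutil ls -L gs://your-bucket/runs/your-run-name/checkpoints/base_final.pt | grep "md5"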

Integration with Lambda Launcher

You can add GCS credentials to the automated launcher:

# In scripts/launch_lambda_training.py
# Add to the cloud-init user-data:

write_files:
  - path: /home/ubuntu/.config/gcloud/application_default_credentials.json
    content: |
      {your service account key JSON}

Or pass as environment variable:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json

python scripts/launch_lambda_training.py \
  --inject-env GOOGLE_APPLICATION_CREDENTIALS \
  ...

Best Practices

  1. Name runs descriptively: Use --run-name depth20-lr1e4-batch32
  2. Exclude data when iterating: Use --exclude-data to save bandwidth
  3. Dry run first: Always use --dry-run to preview
  4. Service accounts for automation: Easier than user auth
  5. Regional buckets: Match Lambda instance region when possible
  6. Lifecycle policies: Auto-archive old models
  7. Download to Lambda: If re-training, download previous checkpoints to Lambda first (see the example after this list)
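
For item 7, restoring a previous run's checkpoints onto a fresh Lambda instance looks like this (assuming the default cache location; the run name is illustrative):

# On the Lambda instance, pull old checkpoints back into the cache before re-training
gsutil -m rsync -r \
  gs://your-bucket/runs/previous-run-name/checkpoints/ \
  ~/.cache/nanochat/checkpoints/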

Security Notes

  • Service account keys are sensitive; treat them like passwords
  • Use least-privilege IAM roles (don't grant roles/owner)
  • Rotate service account keys regularly
  • Consider Workload Identity if using GKE
  • Don't commit keys to git (add them to .gitignore; see the snippet below)
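
A minimal .gitignore addition covering the key file created earlier (the patterns are illustrative; adjust to your key's filename):

# .gitignore: keep service account keys out of version control
nanochat-key.json
*-key.json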

Quick Reference:

# Upload
bash scripts/upload_to_gcs.sh --bucket gs://my-bucket

# Download
gsutil -m cp -r gs://my-bucket/runs/NAME/checkpoints/ ./

# List runs
gsutil ls gs://my-bucket/runs/

# Delete old run
gsutil -m rm -r gs://my-bucket/runs/old-run/