# Dy-SViT: Dynamic Spiking Vision Transformer
Dy-SViT is a pure Spiking Vision Transformer architecture that achieves 73.23% accuracy on CIFAR-10 using Single-Step Inference (T=1).
This model overcomes the "Vanishing Gradient" problem inherent in single-step SNNs by introducing Dynamic Surrogate Gradients. Instead of using a fixed surrogate slope (e.g., k=2.0) for backpropagation, Dy-SViT employs a lightweight hyper-network to meta-learn the optimal slope per-token during training.
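The idea of a per-token surrogate slope can be illustrated with a sigmoid surrogate, a common choice for spiking backpropagation. This is a minimal NumPy sketch, not the repository's implementation: the hyper-network that predicts the slope is elided and replaced by an example vector, and the clamping range follows the bounded slope [1.0, 5.0] described in the architecture section below.

```python
import numpy as np

def surrogate_grad(v, k):
    """Derivative of a sigmoid surrogate for the Heaviside spike function.

    v: membrane potential minus threshold; k: surrogate slope (scalar or
    per-token array). Larger k gives a sharper, more localized gradient.
    """
    s = 1.0 / (1.0 + np.exp(-k * v))
    return k * s * (1.0 - s)

v = np.array([-2.0, 0.0, 2.0])

# Static baseline: one fixed slope for every token
static = surrogate_grad(v, 2.0)

# Dynamic: a (hypothetical) hyper-network would emit one slope per token;
# here the slopes are hard-coded and clamped to the bounded range [1.0, 5.0]
k_dyn = np.clip(np.array([0.5, 3.0, 6.0]), 1.0, 5.0)
dynamic = surrogate_grad(v, k_dyn)
```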
## Research Context
State-of-the-art Spiking Neural Networks (SNNs) typically require multi-step accumulation (T >= 4) or heavy Convolutional tokenizers to converge. Dy-SViT demonstrates that a pure Transformer backbone can converge in a single time step if the surrogate gradient slope is learned dynamically.
## Key Contributions
- Single-Step Inference (T=1): Maximizes throughput and minimizes latency.
- Pure Architecture: Uses standard Patch Embeddings, avoiding hybrid CNN backbones.
- Dynamic Precision: Automatically sharpens gradients for distinct features and relaxes them for noise.
## Model Architecture
- Backbone: Vision Transformer (Depth=4, Dim=256).
- Neuron: `SpikingNeuron` with Bounded Dynamic Slope [1.0, 5.0].
- Normalization: Pre-Norm (`BatchNorm1d` before attention).
- Stability: Residual Scaling (0.5) + NaN Guards.
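The residual-scaling and NaN-guard mechanics can be sketched as follows. This is a simplified, hypothetical form for illustration; the actual guard in `model.py` may differ in where and how it clamps non-finite values.

```python
import numpy as np

def guarded_residual(x, branch_out, scale=0.5):
    """Residual add with 0.5 scaling, followed by a NaN guard that
    zeroes any non-finite activations before the next block."""
    out = x + scale * branch_out
    return np.where(np.isfinite(out), out, 0.0)

x = np.array([1.0, 1.0])
branch = np.array([2.0, np.nan])  # a diverging branch output
out = guarded_residual(x, branch)  # the NaN entry is zeroed
```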
## Training Configuration
The model was trained using the following stability protocol to prevent signal explosion:
- Optimizer: AdamW
- Scheduler: Cosine Annealing with Linear Warmup (5 epochs).
- Augmentation: RandomCrop, HorizontalFlip, Mixup (Alpha=0.2).
- Mixup Cooldown: Disabled for the final 15 epochs.
- Gradient Clipping: 0.5.
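The warmup-plus-cosine schedule above can be expressed directly. The total epoch count and base learning rate below are illustrative assumptions, not values reported for the training run; only the 5-epoch linear warmup comes from the protocol.

```python
import math

def lr_at(epoch, total_epochs=100, warmup_epochs=5, base_lr=1e-3):
    """Linear warmup for the first warmup_epochs, then cosine annealing to 0."""
    if epoch < warmup_epochs:
        # Linearly ramp from base_lr / warmup_epochs up to base_lr
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```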
## Performance Results (CIFAR-10)
| Model Variant | Time Steps (T) | Stability Protocol | Accuracy |
|---|---|---|---|
| Static Baseline | 1 | None | 10.02% (Collapse) |
| Dy-SViT (Ours) | 1 | Pre-Norm + NaN Guard | 73.23% |
Figure 1: Learned Precision Maps. The network dynamically sharpens its surrogate gradients (Yellow) on semantically relevant features (edges, objects) while relaxing precision (Purple) on backgrounds.
## Usage
To use this model, load the architecture from `model.py` and the pretrained weights from this repository:
```python
import torch
from huggingface_hub import hf_hub_download

from model import DySViT

# Initialize the architecture
model = DySViT(num_classes=10, dim=256, depth=4)

# Download and load the pretrained weights
weights_path = hf_hub_download(repo_id="philipp-zettl/Dy-SViT-CIFAR10", filename="pytorch_model.bin")
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()

# Inputs should be normalized with mean=(0.4914, 0.4822, 0.4465), std=(0.2023, 0.1994, 0.2010)
input_tensor = torch.randn(1, 3, 32, 32)
logits = model(input_tensor)
```
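When feeding real CIFAR-10 images rather than random tensors, the normalization noted above can be applied explicitly. A small sketch using the stated channel statistics (NumPy here for illustration; `torchvision.transforms.Normalize` with the same values is the usual route):

```python
import numpy as np

# CIFAR-10 channel statistics from the usage note, shaped for broadcasting
MEAN = np.array([0.4914, 0.4822, 0.4465]).reshape(3, 1, 1)
STD = np.array([0.2023, 0.1994, 0.2010]).reshape(3, 1, 1)

def normalize(img):
    """Standardize a float image in [0, 1] of shape (3, 32, 32) per channel."""
    return (img - MEAN) / STD
```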
## Citation
```bibtex
@misc{dysvit2026,
  title={Dynamic Precision: One-Shot Spiking Transformers via Input-Dependent Surrogate Gradients},
  author={Philipp Zettl},
  year={2026},
  publisher={Hugging Face}
}
```