Dy-SViT: Dynamic Spiking Vision Transformer

Dy-SViT is a pure Spiking Vision Transformer architecture that achieves 73.23% accuracy on CIFAR-10 using Single-Step Inference (T=1).

This model overcomes the vanishing-gradient problem inherent in single-step SNNs by introducing Dynamic Surrogate Gradients. Instead of using a fixed surrogate slope (e.g., k=2.0) for backpropagation, Dy-SViT employs a lightweight hyper-network to meta-learn the optimal slope per token during training.
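The idea can be sketched in PyTorch as a custom autograd function whose backward pass takes a per-token slope rather than a fixed constant. This is an illustrative sketch, not the repository's implementation: the class names `DynamicSurrogateSpike` and `SlopePredictor`, the sigmoid-derivative surrogate, and the single-linear-layer hyper-network are all assumptions.

```python
# Illustrative sketch of a dynamic surrogate gradient (NOT the authors'
# exact implementation): the backward pass uses a per-token slope k
# supplied at runtime instead of a fixed constant.
import torch


class DynamicSurrogateSpike(torch.autograd.Function):
    @staticmethod
    def forward(ctx, membrane, k):
        # Forward: hard threshold at 0 (Heaviside step -> binary spikes).
        ctx.save_for_backward(membrane, k)
        return (membrane > 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        membrane, k = ctx.saved_tensors
        # Backward: sigmoid-derivative surrogate, scaled by per-token slope k.
        sig = torch.sigmoid(k * membrane)
        grad = grad_output * k * sig * (1.0 - sig)
        return grad, None  # no gradient w.r.t. k in this simplified sketch


class SlopePredictor(torch.nn.Module):
    """Hypothetical hyper-network mapping token features to a bounded slope."""

    def __init__(self, dim, k_min=1.0, k_max=5.0):
        super().__init__()
        self.proj = torch.nn.Linear(dim, 1)
        self.k_min, self.k_max = k_min, k_max

    def forward(self, x):
        # Squash the projection into [k_min, k_max] with a sigmoid.
        return self.k_min + (self.k_max - self.k_min) * torch.sigmoid(self.proj(x))
```

The [1.0, 5.0] bound mirrors the slope range listed under Model Architecture below; the hyper-network itself could be any lightweight module.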

Research Context

State-of-the-art Spiking Neural Networks (SNNs) typically require multi-step accumulation (T >= 4) or heavy convolutional tokenizers to converge. Dy-SViT demonstrates that a pure Transformer backbone can converge in a single time step if the surrogate gradient slope is learned dynamically.

Key Contributions

  1. Single-Step Inference (T=1): Maximizes throughput and minimizes latency.
  2. Pure Architecture: Uses standard Patch Embeddings, avoiding hybrid CNN backbones.
  3. Dynamic Precision: Automatically sharpens gradients for distinct features and relaxes them for noise.

Model Architecture

  • Backbone: Vision Transformer (Depth=4, Dim=256).
  • Neuron: SpikingNeuron with Bounded Dynamic Slope [1.0, 5.0].
  • Normalization: Pre-Norm (BatchNorm1d before attention).
  • Stability: Residual Scaling (0.5) + NaN Guards.
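The stability components above can be combined in a single block roughly as follows. This is a minimal sketch under assumed shapes and names (`StabilizedBlock` is hypothetical, and `nn.MultiheadAttention` stands in for the model's spiking attention):

```python
# Sketch of the stability recipe: Pre-Norm via BatchNorm1d, residual
# scaling of 0.5, and a NaN guard that replaces non-finite activations.
import torch
import torch.nn as nn


class StabilizedBlock(nn.Module):
    def __init__(self, dim=256, heads=8, scale=0.5):
        super().__init__()
        self.norm = nn.BatchNorm1d(dim)   # applied BEFORE attention (Pre-Norm)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scale = scale                # residual scaling factor

    def forward(self, x):                 # x: (batch, tokens, dim)
        # BatchNorm1d expects (batch, dim, tokens), so transpose around it.
        h = self.norm(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.attn(h, h, h, need_weights=False)
        h = torch.nan_to_num(h)           # NaN guard: NaN/Inf -> finite values
        return x + self.scale * h         # scaled residual connection
```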

Training Configuration

The model was trained using the following stability protocol to prevent signal explosion:

  • Optimizer: AdamW
  • Scheduler: Cosine Annealing with Linear Warmup (5 epochs).
  • Augmentation: RandomCrop, HorizontalFlip, Mixup (Alpha=0.2).
  • Mixup Cooldown: Disabled for the final 15 epochs.
  • Gradient Clipping: 0.5.
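The optimizer, scheduler, and clipping pieces of this protocol can be wired up as below. The learning rate, weight decay, and total epoch count are illustrative assumptions; only the 5-epoch warmup and 0.5 clipping norm come from the list above.

```python
# Minimal sketch of the training protocol: AdamW, linear warmup into
# cosine annealing, and gradient-norm clipping at 0.5.
import torch

model = torch.nn.Linear(10, 10)  # stand-in for DySViT
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)

warmup_epochs, total_epochs = 5, 100  # total_epochs is an assumption
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        torch.optim.lr_scheduler.LinearLR(
            optimizer, start_factor=0.1, total_iters=warmup_epochs),
        torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer, T_max=total_epochs - warmup_epochs),
    ],
    milestones=[warmup_epochs],
)

# Inside the training loop, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
optimizer.step()
scheduler.step()
```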

Performance Results (CIFAR-10)

| Model Variant   | Time Steps (T) | Stability Protocol   | Accuracy          |
|-----------------|----------------|----------------------|-------------------|
| Static Baseline | 1              | None                 | 10.02% (collapse) |
| Dy-SViT (ours)  | 1              | Pre-Norm + NaN Guard | 73.23%            |

Figure 1: Learned Precision Maps. The network dynamically sharpens its surrogate gradients (yellow) on semantically relevant features (edges, objects) while relaxing precision (purple) on backgrounds.

Usage

To use this model, load the architecture from model.py and the weights from this repository.

import torch
from model import DySViT
from huggingface_hub import hf_hub_download

# Initialize Architecture
model = DySViT(num_classes=10, dim=256, depth=4)

# Load Weights
weights_path = hf_hub_download(repo_id="philipp-zettl/Dy-SViT-CIFAR10", filename="pytorch_model.bin")
model.load_state_dict(torch.load(weights_path, map_location="cpu"))

model.eval()
# Input should be normalized: mean=(0.4914, 0.4822, 0.4465), std=(0.2023, 0.1994, 0.2010)
input_tensor = torch.randn(1, 3, 32, 32)
with torch.no_grad():  # inference only; skip autograd bookkeeping
    logits = model(input_tensor)
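For real images, the normalization noted in the snippet can be applied by hand as below (the `preprocess` helper is illustrative; the constants are the standard CIFAR-10 channel statistics quoted above):

```python
# Normalize a CIFAR-10 image tensor with the stats noted in the usage
# snippet; `image` stands in for a (3, 32, 32) tensor scaled to [0, 1].
import torch

CIFAR10_MEAN = torch.tensor([0.4914, 0.4822, 0.4465]).view(3, 1, 1)
CIFAR10_STD = torch.tensor([0.2023, 0.1994, 0.2010]).view(3, 1, 1)


def preprocess(image: torch.Tensor) -> torch.Tensor:
    """Normalize per channel and add a batch dimension."""
    return ((image - CIFAR10_MEAN) / CIFAR10_STD).unsqueeze(0)


# Predicted class index from the logits:
# pred = model(preprocess(image)).argmax(dim=-1)
```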

Citation

@misc{dysvit2026,
  title={Dynamic Precision: One-Shot Spiking Transformers via Input-Dependent Surrogate Gradients},
  author={Philipp Zettl},
  year={2026},
  publisher={Hugging Face}
}
