ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling
ShotStream is a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation. It achieves sub-second latency and 16 FPS on a single NVIDIA GPU by reformulating the task as next-shot generation conditioned on historical context.
Project Page | Paper | Code
Introduction
Multi-shot video generation is crucial for long narrative storytelling. ShotStream allows users to dynamically instruct ongoing narratives via streaming prompts. It preserves visual coherence through a dual-cache memory mechanism and mitigates error accumulation using a two-stage self-forcing distillation strategy (Distribution Matching Distillation).
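The next-shot formulation and the dual-cache memory can be illustrated with a toy sketch. This is a minimal conceptual stand-in, not the paper's implementation: all class and function names (`DualCache`, `generate_shot`), the cache window sizes, and the frame representation are assumptions made for illustration; a real model would denoise video latents conditioned on the cached context.

```python
from collections import deque

class DualCache:
    """Toy stand-in for a dual-cache memory (names illustrative, not the
    paper's API): a short-term frame cache for intra-shot continuity and a
    longer-term shot cache for cross-shot visual coherence."""
    def __init__(self, frame_window=8, shot_window=4):
        self.frame_cache = deque(maxlen=frame_window)  # most recent frames
        self.shot_cache = deque(maxlen=shot_window)    # summaries of past shots

    def context(self):
        # Historical context that conditions the next shot.
        return list(self.shot_cache) + list(self.frame_cache)

def generate_shot(prompt, cache, num_frames=4):
    """Placeholder 'model': each frame merely records what conditioned it."""
    shot = []
    for t in range(num_frames):
        frame = {"prompt": prompt, "t": t, "ctx_len": len(cache.context())}
        cache.frame_cache.append(frame)
        shot.append(frame)
    cache.shot_cache.append({"prompt": prompt, "frames": num_frames})
    return shot

# Streaming prompts: each new instruction extends the ongoing narrative,
# conditioned on everything cached so far.
cache = DualCache()
story = []
for prompt in ["a knight rides at dawn", "cut to: the castle gate opens"]:
    story.extend(generate_shot(prompt, cache))
```

The point of the sketch is the interaction pattern: prompts arrive as a stream, each shot is generated autoregressively, and the two caches carry forward both fine-grained (frame) and coarse (shot) history.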
Usage
Training and inference code, as well as the models, are all released. For the full implementation and training details, please refer to the official GitHub repository.
1. Environment Setup
```shell
git clone https://github.com/KlingAIResearch/ShotStream.git
cd ShotStream
# Set up the environment using the provided script
bash tools/setup/env.sh
```
2. Download Checkpoints
```shell
# Download the checkpoints of Wan-T2V-1.3B and ShotStream
bash tools/setup/download_ckpt.sh
```
3. Run Inference
To perform autoregressive 4-step long multi-shot video generation:
```shell
bash tools/inference/causal_fewsteps.sh
```
Citation
If you find our work helpful, please cite our paper:
```bibtex
@article{luo2026shotstream,
  title={ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling},
  author={Luo, Yawen and Shi, Xiaoyu and Zhuang, Junhao and Chen, Yutian and Liu, Quande and Wang, Xintao and Wan, Pengfei and Xue, Tianfan},
  journal={arXiv preprint arXiv:2603.25746},
  year={2026}
}
```
Base Model
ShotStream (KlingTeam/ShotStream) is built on Wan-AI/Wan2.1-T2V-1.3B.