---
license: apple-amlr
language:
- en
tags:
- normalizing-flows
- generative-models
- art
- autoregressive-models
---
# STARFlow: Scalable Transformer Auto-Regressive Flow

<div align="center">
  <img src="starflow_logo.png" alt="STARFlow Logo" width="300">
</div>

<div align="center">

[![arXiv](https://img.shields.io/badge/arXiv-2506.06276-b31b1b.svg)](https://arxiv.org/abs/2506.06276)
[![arXiv](https://img.shields.io/badge/arXiv-2511.20462-b31b1b.svg)](https://arxiv.org/abs/2511.20462)
[![NeurIPS](https://img.shields.io/badge/NeurIPS-2025%20Spotlight-blue.svg)](https://neurips.cc/Conferences/2025)

</div>

This is the official open-source release of **STARFlow** and **STARFlow-V**, transformer autoregressive flow models for high-quality image and video generation.

## 📖 Overview

**STARFlow** introduces a transformer autoregressive flow architecture that combines the expressiveness of autoregressive models with the efficiency of normalizing flows. The release covers two models, one for text-to-image generation (STARFlow) and one for text-to-video generation (STARFlow-V):

- **[STARFlow](https://arxiv.org/abs/2506.06276)**: Scaling Latent Normalizing Flows for High-resolution Image Synthesis (NeurIPS 2025 Spotlight)
- **[STARFlow-V](https://arxiv.org/abs/2511.20462)**: End-to-End Video Generative Modeling with Normalizing Flows (arXiv preprint)

🎬 **[View Video Results Gallery](https://starflow-v.github.io)** - See examples of generated videos and comparisons

## 🚀 Quick Start

### Environment Setup

```bash
# Clone the repository
git clone https://github.com/apple/ml-starflow
cd ml-starflow

# Set up conda environment (recommended)
bash scripts/setup_conda.sh

# Or install dependencies manually
pip install -r requirements.txt
```

### Model Checkpoints

**Important**: You'll need to download the pretrained model checkpoints and place them in the `ckpts/` directory. For example:

- `ckpts/starflow_3B_t2i_256x256.pth` - For text-to-image generation
- `ckpts/starflow-v_7B_t2v_caus_480p_v3.pth` - For text-to-video generation
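
Before launching a sampling run, it can help to confirm that the checkpoint files are where the scripts expect them. The following is a minimal sketch, not part of the repository: the helper name `missing_checkpoints` is hypothetical, and the file names are simply the two examples listed above.

```python
from pathlib import Path

# The two example checkpoint names from this README; adjust to the files you downloaded.
EXPECTED_CHECKPOINTS = [
    "starflow_3B_t2i_256x256.pth",
    "starflow-v_7B_t2v_caus_480p_v3.pth",
]


def missing_checkpoints(ckpt_dir="ckpts", names=EXPECTED_CHECKPOINTS):
    """Return the expected checkpoint file names that are not present in ckpt_dir."""
    root = Path(ckpt_dir)
    return [name for name in names if not (root / name).is_file()]
```

Calling `missing_checkpoints()` from the repository root returns an empty list when both files are in `ckpts/`, and otherwise the names that still need to be downloaded.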


### Text-to-Image Generation

Generate high-quality images from text prompts:

```bash
# Basic image generation (256x256)
bash scripts/test_sample_image.sh "a film still of a cat playing piano"

# Custom prompt and settings
torchrun --standalone --nproc_per_node 1 sample.py \
    --model_config_path "configs/starflow_3B_t2i_256x256.yaml" \
    --checkpoint_path "ckpts/starflow_3B_t2i_256x256.pth" \
    --caption "your custom prompt here" \
    --sample_batch_size 8 \
    --cfg 3.6 \
    --aspect_ratio "1:1" \
    --seed 999
```

### Text-to-Video Generation

Generate videos from text descriptions:

```bash
# Basic video generation (480p, ~5 seconds)
bash scripts/test_sample_video.sh "a corgi dog looks at the camera"

# Text-and-image-to-video (TI2V) generation from a custom input image
bash scripts/test_sample_video.sh "a cat playing piano" "/path/to/input/image.jpg"

# Longer video generation (specify target length in frames)
bash scripts/test_sample_video.sh "a corgi dog looks at the camera" "none" 241  # ~15 seconds at 16fps
bash scripts/test_sample_video.sh "a corgi dog looks at the camera" "none" 481  # ~30 seconds at 16fps

# Advanced video generation
torchrun --standalone --nproc_per_node 8 sample.py \
    --model_config_path "configs/starflow-v_7B_t2v_caus_480p.yaml" \
    --checkpoint_path "ckpts/starflow-v_7B_t2v_caus_480p_v3.pth" \
    --caption "your video prompt here" \
    --sample_batch_size 1 \
    --cfg 3.5 \
    --aspect_ratio "16:9" \
    --out_fps 16 \
    --jacobi 1 --jacobi_th 0.001 \
    --target_length 161  # Customize video length
```

## 🛠️ Training

### Image Training

Train your own STARFlow model for text-to-image generation:

```bash
# Quick training test
bash scripts/test_train_image.sh 10 16

# Full training with custom parameters
torchrun --standalone --nproc_per_node 8 train.py \
    --model_config_path "configs/starflow_3B_t2i_256x256.yaml" \
    --epochs 100 \
    --batch_size 1024 \
    --wandb_name "my_starflow_training"
```

### Video Training

Train STARFlow-V for text-to-video generation:

```bash
# Quick training test
bash scripts/test_train_video.sh 10 8

# Resume training from checkpoint
torchrun --standalone --nproc_per_node 8 train.py \
    --model_config_path "configs/starflow-v_7B_t2v_caus_480p.yaml" \
    --resume_path "ckpts/starflow-v_7B_t2v_caus_480p_v3.pth" \
    --epochs 100 \
    --batch_size 192
```

## 🔧 Utilities

### Video Processing

Extract individual frames from multi-video grids:

```bash
# Extract frames from a video containing multiple video grids
python scripts/extract_image_from_video.py --input_video path/to/video.mp4 --output_dir output/

# Extract images with custom settings
python scripts/extract_images.py input_file.mp4
```

## 📁 Model Architecture

### STARFlow (3B Parameters - Text-to-Image)
- **Resolution**: 256×256
- **Architecture**: 6-block deep-shallow architecture
- **Text Encoder**: T5-XL
- **VAE**: SD-VAE
- **Features**: RoPE positional encoding, mixed precision training

### STARFlow-V (7B Parameters - Text-to-Video)
- **Resolution**: Up to 640×480 (480p)
- **Temporal**: 81 frames (~5 seconds at 16 FPS)
- **Architecture**: 6-block deep-shallow architecture (full sequence)
- **Text Encoder**: T5-XL
- **VAE**: WAN2.2-VAE
- **Features**: Causal attention, autoregressive generation, variable length support

## 🔧 Key Features

- **Autoregressive Flow Architecture**: Novel combination of autoregressive models and normalizing flows
- **High-Quality Generation**: FID scores and visual quality competitive with state-of-the-art diffusion models
- **Flexible Resolution**: Support for various aspect ratios and resolutions
- **Efficient Training**: FSDP support for large-scale distributed training
- **Fast Sampling**: Block-wise Jacobi iteration for accelerated inference
- **Text Conditioning**: Advanced text-to-image/video capabilities
- **Video Generation**: Temporal consistency and smooth motion
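
To give some intuition for the Jacobi-based sampling mentioned above, here is a toy sketch of Jacobi iteration applied to inverting an affine autoregressive flow. It is an illustration of the general technique only, not STARFlow's implementation: the conditioner functions `mu` and `sigma` are made-up stand-ins for a transformer, the example runs on a short list of scalars, and it iterates over the whole sequence rather than block by block.

```python
import math


def mu(prefix):
    # Toy shift: depends only on earlier elements (the autoregressive constraint).
    return 0.5 * sum(prefix) / len(prefix) if prefix else 0.0


def sigma(prefix):
    # Toy positive scale, also a function of earlier elements only.
    return 1.0 + 0.1 * math.tanh(sum(prefix)) if prefix else 1.0


def invert_sequential(z):
    """Reference inverse: one element at a time, n dependent steps."""
    x = []
    for zi in z:
        x.append(mu(x) + sigma(x) * zi)
    return x


def invert_jacobi(z, threshold=1e-3, max_iters=None):
    """Update every position at once from the previous guess until it stops changing."""
    n = len(z)
    max_iters = n if max_iters is None else max_iters
    x = [0.0] * n  # initial guess
    for it in range(1, max_iters + 1):
        # Each position is computed independently, so this step is parallelizable.
        new_x = [mu(x[:i]) + sigma(x[:i]) * z[i] for i in range(n)]
        delta = max(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if delta < threshold:
            return x, it
    return x, max_iters
```

Because position `i` depends only on earlier positions, the first `k` positions are exact after `k` iterations, so the loop never needs more than `n` passes and often stops sooner. The threshold plays a role analogous to `--jacobi_th`: a looser value stops earlier at the cost of a less exact inverse.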

## 📊 Configuration

### Key Parameters

#### Image Generation (`starflow_3B_t2i_256x256.yaml`)
- `img_size: 256` - Output image resolution
- `txt_size: 128` - Text sequence length
- `channels: 3072` - Model hidden dimension
- `cfg: 3.6` - Classifier-free guidance scale
- `noise_std: 0.3` - Flow noise standard deviation

#### Video Generation (`starflow-v_7B_t2v_caus_480p.yaml`)
- `img_size: 640` - Video frame resolution
- `vid_size: '81:16'` - Temporal dimensions (frames:downsampling)
- `fps_cond: 1` - FPS conditioning enabled
- `temporal_causal: 1` - Causal temporal attention

### Sampling Options
- `--cfg` - Classifier-free guidance scale (higher = more prompt adherence)
- `--jacobi` - Enable Jacobi iteration for faster sampling
- `--jacobi_th` - Jacobi convergence threshold
- `--jacobi_block_size` - Block size for Jacobi iteration
- `--aspect_ratio` - Output aspect ratio ("1:1", "16:9", "4:3", etc.)
- `--seed` - Random seed for reproducible generation
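
When sweeping over prompts or guidance scales, it can be convenient to assemble the sampling command programmatically. The sketch below is a hypothetical helper (`build_sample_command` is not part of the repository) that uses only flags shown in this README; it builds the argument list and leaves execution to the caller.

```python
import shlex


def build_sample_command(caption, config_path, checkpoint_path, *, nproc=1,
                         cfg=3.6, aspect_ratio="1:1", seed=999,
                         jacobi=False, jacobi_th=0.001, target_length=None):
    """Assemble a torchrun invocation of sample.py as a list of arguments."""
    cmd = [
        "torchrun", "--standalone", "--nproc_per_node", str(nproc), "sample.py",
        "--model_config_path", config_path,
        "--checkpoint_path", checkpoint_path,
        "--caption", caption,
        "--cfg", str(cfg),
        "--aspect_ratio", aspect_ratio,
        "--seed", str(seed),
    ]
    if jacobi:
        # Jacobi iteration plus its convergence threshold.
        cmd += ["--jacobi", "1", "--jacobi_th", str(jacobi_th)]
    if target_length is not None:
        # Video length in frames (video models only).
        cmd += ["--target_length", str(target_length)]
    return cmd


command = build_sample_command(
    "a film still of a cat playing piano",
    "configs/starflow_3B_t2i_256x256.yaml",
    "ckpts/starflow_3B_t2i_256x256.pth",
    jacobi=True,
)
print(shlex.join(command))  # shell-safe string for logging or copy-paste
```

The returned list can be passed directly to `subprocess.run(command, check=True)`; keeping it as a list avoids quoting problems with captions that contain spaces or quotes.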

## 📚 Project Structure

```
├── train.py               # Main training script
├── sample.py              # Sampling and inference
├── transformer_flow.py    # Core model implementation
├── dataset.py             # Dataset loading and preprocessing
├── finetune_decoder.py    # Decoder fine-tuning script
├── utils/                 # Utility modules
│   ├── common.py         # Core utility functions
│   ├── model_setup.py    # Model configuration and setup
│   ├── training.py       # Training utilities and metrics
│   └── inference.py      # Evaluation and metrics
├── configs/              # Model configuration files
│   ├── starflow_3B_t2i_256x256.yaml
│   └── starflow-v_7B_t2v_caus_480p.yaml
├── scripts/                 # Example training and sampling scripts
│   ├── test_sample_image.sh
│   ├── test_sample_video.sh
│   ├── test_train_image.sh
│   ├── test_train_video.sh
│   ├── setup_conda.sh
│   ├── extract_images.py
│   └── extract_image_from_video.py
└── misc/                  # Additional utilities
    ├── pe.py             # Positional encodings
    ├── lpips.py          # LPIPS loss
    └── wan_vae2.py       # Video VAE implementation
```

## 💡 Tips

### Image Generation
1. Use guidance scales between 2.0 and 5.0 for balanced quality and diversity
2. Experiment with different aspect ratios for your use case
3. Enable Jacobi iteration (`--jacobi 1`) for faster sampling
4. Use higher resolution models for detailed outputs
5. The default script uses optimized settings: `--jacobi_th 0.001` and `--jacobi_block_size 16` 

### Video Generation
1. Start with shorter sequences (81 frames) and gradually increase length (161, 241, 481+ frames)
2. Use input images (`--input_image`) for more controlled generation
3. Adjust FPS settings based on content type (8-24 FPS)
4. Consider temporal consistency when crafting prompts
5. The default script uses `--jacobi_block_size 64`.
6. **Longer videos**: Use `--target_length` to generate videos beyond the training length (requires `--jacobi 1`)
7. **Frame reference**: 81 frames ≈ 5s, 161 frames ≈ 10s, 241 frames ≈ 15s, 481 frames ≈ 30s (at 16fps)
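
The frame counts listed above all follow the pattern `seconds × fps + 1`. The helpers below are a small sketch of that arithmetic; the function names are hypothetical, and the pattern is inferred from the listed values, so whether `--target_length` accepts other lengths is an assumption to verify against the sampling script.

```python
def target_length_for(seconds, fps=16):
    """Frame count for a clip of the given duration, following the N * fps + 1 pattern."""
    return int(round(seconds * fps)) + 1


def duration_seconds(num_frames, fps=16):
    """Approximate clip duration in seconds for a given frame count."""
    return (num_frames - 1) / fps


for secs in (5, 10, 15, 30):
    print(secs, "s ->", target_length_for(secs), "frames")
```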

### Training
1. Use FSDP for efficient large model training
2. Start with smaller batch sizes and scale up
3. Monitor loss curves and adjust learning rates accordingly
4. Use gradient checkpointing to reduce memory usage
5. The test scripts include `--dry_run 1` for validation

## 🔗 Citation

If you use STARFlow in your research, please cite:

```bibtex
@article{gu2025starflow,
  title={STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis},
  author={Gu, Jiatao and Chen, Tianrong and Berthelot, David and Zheng, Huangjie and Wang, Yuyang and Zhang, Ruixiang and Dinh, Laurent and Bautista, Miguel Angel and Susskind, Josh and Zhai, Shuangfei},
  journal={NeurIPS},
  year={2025}
}
```

## 📄 License

Please review the repository [LICENSE](LICENSE) before using the provided code, and [LICENSE_MODEL](LICENSE_MODEL) before using the released models.

## 🤝 Contributing

We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.