Improve model card: update paper link, add usage, project overview and tags
#2
by nielsr (HF Staff) · opened

README.md CHANGED
---
library_name: transformers
license: mit
pipeline_tag: any-to-any
tags:
- diffusion-model
- multimodal
- text-to-image
- text-generation
- image-captioning
- generalist-llm
language: en
---

# MMaDA-8B-Base

<div align="center">
<br>
<img src="https://github.com/Gen-Verse/MMaDA/raw/main/assets/title.png" width="166">
<h3>Multimodal Large Diffusion Language Models (NeurIPS 2025)</h3></div>

We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. MMaDA is distinguished by three key innovations:

1. MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components.
2. MMaDA introduces a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities.
3. MMaDA adopts a unified policy-gradient-based RL algorithm, which we call UniGRPO, tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements.

[Paper](https://huggingface.co/papers/2505.15809) | [Code](https://github.com/Gen-Verse/MMaDA) | [Project Page / Demo](https://huggingface.co/spaces/Gen-Verse/MMaDA)

<div align="center" style="width: 600px; margin: auto;">
<img src="https://github.com/Gen-Verse/MMaDA/raw/main/assets/showcase0.8.gif" alt="MMaDA decoding demo" width="550" />
<p style="font-style: italic; font-size: 14px; color: #555; margin-top: 6px;">
MMaDA's decoding demo. This video showcases how a diffusion foundation model generates text and images.<br>
The "Text Generation" part uses a semi-autoregressive sampling method, while the "Multimodal Generation" part adopts non-autoregressive diffusion denoising.
</p>
</div>

## 📰 Latest Updates

* **[2025-09-09]** We open-source a comprehensive RL framework for diffusion language models, [dLLM-RL](https://github.com/Gen-Verse/dLLM-RL), which also supports post-training our MMaDA model.
* **[2025-06-02]** We open-source our **MMaDA-8B-MixCoT** at [Hugging Face](https://huggingface.co/Gen-Verse/MMaDA-8B-MixCoT).
* **[2025-05-24]** We add support for MPS inference, tested on the M4.
* **[2025-05-22]** We release the inference and training code of MMaDA for text generation, multimodal generation, and image generation.
* **[2025-05-22]** We open-source our **MMaDA-8B-Base** at [Hugging Face](https://huggingface.co/Gen-Verse/MMaDA-8B-Base). **MMaDA-8B-MixCoT** and **MMaDA-8B-Max** will be released in the near future.
* **[2025-05-22]** We release our [research paper](https://huggingface.co/papers/2505.15809) and [demo](https://huggingface.co/spaces/Gen-Verse/MMaDA) for the first unified multimodal diffusion model: MMaDA.

## 🧬 MMaDA Series Overview

MMaDA includes a series of checkpoints reflecting different training stages:
1. **[MMaDA-8B-Base](https://huggingface.co/Gen-Verse/MMaDA-8B-Base)**: After pretraining and instruction tuning. Capable of basic text generation, image generation, and image captioning, with **thinking abilities**.
2. **[MMaDA-8B-MixCoT](https://huggingface.co/Gen-Verse/MMaDA-8B-MixCoT)**: After mixed long chain-of-thought (CoT) fine-tuning. Capable of **complex** textual, multimodal, and image-generation reasoning.
3. **MMaDA-8B-Max (coming soon)**: After UniGRPO reinforcement learning. Excels at complex reasoning and high-quality visual generation.

<div align="center">
<img src="https://github.com/Gen-Verse/MMaDA/raw/main/assets/example_compare.png" width="800">
<p><i>Overview of MMaDA's capabilities.</i></p>
</div>

|

First, set up the environment by cloning the [official GitHub repository](https://github.com/Gen-Verse/MMaDA) and installing the required packages:
```bash
git clone https://github.com/Gen-Verse/MMaDA.git
cd MMaDA
pip install -r requirements.txt
```
Then, you can launch a local Gradio demo:
```bash
python app.py
```
Or try it online via our [Huggingface Demo](https://huggingface.co/spaces/Gen-Verse/MMaDA).

|

For batch-level inference, we provide inference scripts in the [official GitHub repository](https://github.com/Gen-Verse/MMaDA).

Before running the multimodal or text-to-image generation examples, you may need to log in to your Weights & Biases (wandb) account:
```bash
wandb login
```
|

For text generation, we follow LLaDA's configuration and generation script. Simply run:
```bash
python generate.py
```
|
| 87 |
+
|
| 88 |
+
Inference demo for MultiModal Generation, and you can view the results on wandb:
|
| 89 |
+
```python
|
| 90 |
+
from inference_solver import FlexARInferenceSolver
|
| 91 |
+
from PIL import Image
|
| 92 |
+
|
| 93 |
+
inference_solver = FlexARInferenceSolver(
|
| 94 |
+
model_path="Alpha-VLLM/Lumina-mGPT-7B-512", # Replace with "Gen-Verse/MMaDA-8B-Base" for this model
|
| 95 |
+
precision="bf16",
|
| 96 |
+
target_size=512,
|
| 97 |
+
)
|
| 98 |
+
|
| 99 |
+
# "<|image|>" symbol will be replaced with sequence of image tokens before fed to LLM
|
| 100 |
+
q1 = "Describe the image in detail. <|image|>"
|
| 101 |
+
|
| 102 |
+
images = [Image.open("path/to/your/image.png")] # Replace with your image path
|
| 103 |
+
qas = [[q1, None]]
|
| 104 |
+
|
| 105 |
+
# `len(images)` should be equal to the number of appearance of "<|image|>" in qas
|
| 106 |
+
generated = inference_solver.generate(
|
| 107 |
+
images=images,
|
| 108 |
+
qas=qas,
|
| 109 |
+
max_gen_len=8192,
|
| 110 |
+
temperature=1.0,
|
| 111 |
+
logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
|
| 112 |
+
)
|
| 113 |
+
|
| 114 |
+
a1 = generated[0]
|
| 115 |
+
print(f"Generated text response: {a1}")
|
| 116 |
+
# generated[1], namely the list of newly generated images, should typically be empty in this case.
|
| 117 |
+
```
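
Since the demos report their results to wandb, a minimal, illustrative way to log the outputs of the snippet above yourself is shown below; the project name is an arbitrary example, not a fixed convention of the repository:
```python
# Illustrative wandb logging for the caption generated above.
import wandb

wandb.init(project="mmada-inference")
wandb.log({"caption": a1, "input_image": wandb.Image(images[0])})
wandb.finish()
```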
### 3. Text-to-Image Generation

Below is an inference demo for text-to-image generation; you can view the results on wandb:
```python
from inference_solver import FlexARInferenceSolver
from PIL import Image

inference_solver = FlexARInferenceSolver(
    model_path="Alpha-VLLM/Lumina-mGPT-7B-768",  # Replace with "Gen-Verse/MMaDA-8B-Base" for this model
    precision="bf16",
    target_size=768,
)

q1 = "Generate an image of 768x768 according to the following prompt:\n" \
     "Image of a dog playing in water, and a waterfall is in the background."

# generated: tuple of (generated response, list of generated images)
generated = inference_solver.generate(
    images=[],
    qas=[[q1, None]],
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)

a1, new_image = generated[0], generated[1][0]
new_image.show()  # Display the generated image
# new_image is a PIL Image object representing the generated image.
# print(f"Generated text response: {a1}")
```
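
As a small follow-up (the filename is an arbitrary example), the returned PIL image can also be saved to disk instead of only being displayed:
```python
# Persist the generated image; any path and format supported by Pillow works.
new_image.save("dog_waterfall.png")
```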

## Citation

```bibtex
@article{yang2025mmada,
  title={MMaDA: Multimodal Large Diffusion Language Models},
  author={Yang, Ling and Tian, Ye and Li, Bowen and Zhang, Xinchen and Shen, Ke and Tong, Yunhai and Wang, Mengdi},
  journal={arXiv preprint arXiv:2505.15809},
  year={2025}
}
```