Improve model card: update paper link, add usage, project overview and tags

#2
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +141 -8
README.md CHANGED
@@ -1,27 +1,160 @@
---
- license: mit
library_name: transformers
pipeline_tag: any-to-any
---

# MMaDA-8B-Base

We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. MMaDA is distinguished by three key innovations:

- 1. MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components.
- 2. MMaDA introduces a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities.
- 3. MMaDA adopts a unified policy-gradient-based RL algorithm, which we call UniGRPO, tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements.

- [Paper](https://arxiv.org/abs/2505.15809) | [Code](https://github.com/Gen-Verse/MMaDA) | [Demo](https://huggingface.co/spaces/Gen-Verse/MMaDA)

- # Citation

```
@article{yang2025mmada,
  title={MMaDA: Multimodal Large Diffusion Language Models},
  author={Yang, Ling and Tian, Ye and Li, Bowen and Zhang, Xinchen and Shen, Ke and Tong, Yunhai and Wang, Mengdi},
  journal={arXiv preprint arXiv:2505.15809},
  year={2025}
}
- ```
-
 
---
library_name: transformers
+ license: mit
pipeline_tag: any-to-any
+ tags:
+ - diffusion-model
+ - multimodal
+ - text-to-image
+ - text-generation
+ - image-captioning
+ - generalist-llm
+ language: en
---

# MMaDA-8B-Base

+ <div align="center">
+ <br>
+ <img src="https://github.com/Gen-Verse/MMaDA/raw/main/assets/title.png" width="166">
+ <h3>Multimodal Large Diffusion Language Models (NeurIPS 2025)</h3></div>
+
We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. MMaDA is distinguished by three key innovations:

+ 1. MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components.
+ 2. MMaDA introduces a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities.
+ 3. MMaDA adopts a unified policy-gradient-based RL algorithm, which we call UniGRPO, tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements.
+
+ [Paper](https://huggingface.co/papers/2505.15809) | [Code](https://github.com/Gen-Verse/MMaDA) | [Project Page / Demo](https://huggingface.co/spaces/Gen-Verse/MMaDA)
+
+ <div align="center" style="width: 600px; margin: auto;">
+ <img src="https://github.com/Gen-Verse/MMaDA/raw/main/assets/showcase0.8.gif" alt="MMaDA decoding demo" width="550" />
+ <p style="font-style: italic; font-size: 14px; color: #555; margin-top: 6px;">
+ MMaDA's decoding demo. This video showcases how a diffusion foundation model generates text and images.<br>
+ The "Text Generation" part uses a semi-autoregressive sampling method, while the "Multimodal Generation" part adopts non-autoregressive diffusion denoising.
+ </p>
+ </div>
+
+ ## 📰 Latest Updates
+ * **[2025-09-09]** We open source a comprehensive RL framework for diffusion language models, [dLLM-RL](https://github.com/Gen-Verse/dLLM-RL), which also supports post-training our MMaDA model.
+ * **[2025-06-02]** We open source our **MMaDA-8B-MixCoT** on [Hugging Face](https://huggingface.co/Gen-Verse/MMaDA-8B-MixCoT).
+ * **[2025-05-24]** We add support for MPS inference, tested on M4.
+ * **[2025-05-22]** We release the inference and training code of MMaDA for text generation, multimodal generation and image generation.
+ * **[2025-05-22]** We open source our **MMaDA-8B-Base** on [Hugging Face](https://huggingface.co/Gen-Verse/MMaDA-8B-Base). **MMaDA-8B-MixCoT** and **MMaDA-8B-Max** will be released in the near future.
+ * **[2025-05-22]** We release our [research paper](https://huggingface.co/papers/2505.15809) and [demo](https://huggingface.co/spaces/Gen-Verse/MMaDA) for the first unified multimodal diffusion model: MMaDA.
+
+ ## 🧬 MMaDA Series Overview
+
+ MMaDA includes a series of checkpoints reflecting different training stages:
+ 1. **[MMaDA-8B-Base](https://huggingface.co/Gen-Verse/MMaDA-8B-Base)**: After pretraining and instruction tuning. Capable of basic text generation, image generation and image captioning, with **thinking abilities**.
+ 2. **[MMaDA-8B-MixCoT](https://huggingface.co/Gen-Verse/MMaDA-8B-MixCoT)**: After mixed long chain-of-thought (CoT) fine-tuning. Capable of **complex** textual, multimodal and image-generation reasoning.
+ 3. **MMaDA-8B-Max (coming soon)**: After UniGRPO reinforcement learning. Excels at complex reasoning and high-quality visual generation. It will be released in the near future.
+
+ <div align="center">
+ <img src="https://github.com/Gen-Verse/MMaDA/raw/main/assets/example_compare.png" width="800">
+ <p><i>Overview of MMaDA's capabilities.</i></p>
+ </div>

+ ## ⚙️ Quick Start

+ First, clone the [official GitHub repository](https://github.com/Gen-Verse/MMaDA) and install the required packages:
+ ```bash
+ git clone https://github.com/Gen-Verse/MMaDA.git
+ cd MMaDA
+ pip install -r requirements.txt
+ ```
+ Then, you can launch a local Gradio demo:
+ ```bash
+ python app.py
+ ```
+ Or try it online via our [Hugging Face demo](https://huggingface.co/spaces/Gen-Verse/MMaDA).
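+
+ Since the checkpoint is published under `library_name: transformers`, you can also pull the weights and tokenizer straight from the Hub. The snippet below is only a minimal sketch, not the project's official loading path: it assumes the Hub repository exposes MMaDA's custom modeling code for the Auto classes via `trust_remote_code`.
+ ```python
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ # Minimal sketch (assumption: the Hub repo registers MMaDA's custom diffusion
+ # architecture for the Auto classes via trust_remote_code).
+ model = AutoModel.from_pretrained(
+     "Gen-Verse/MMaDA-8B-Base",
+     trust_remote_code=True,
+     torch_dtype=torch.bfloat16,
+ )
+ tokenizer = AutoTokenizer.from_pretrained("Gen-Verse/MMaDA-8B-Base", trust_remote_code=True)
+ print(model.config)
+ ```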
+
+ ## 🚀 Inference
+
+ For batch-level inference, we provide inference scripts in the [official GitHub repository](https://github.com/Gen-Verse/MMaDA).
+
+ Before running the multimodal or text-to-image generation examples, you may need to log in to your Weights & Biases (wandb) account:
+ ```bash
+ wandb login
```
+
+ ### 1. Text Generation
+
+ For text generation, we follow LLaDA's configuration and generation script. Simply run:
+ ```bash
+ python generate.py
+ ```
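+
+ To give a feel for what the script does, here is a schematic sketch of the semi-autoregressive masked-denoising sampling described above: the response starts fully masked and is filled in block by block, committing the most confident predictions at each denoising step. All names below are hypothetical stand-ins around a toy random model, not the repository's actual API; see `generate.py` in the GitHub repo for the real implementation.
+ ```python
+ import torch
+
+ VOCAB_SIZE = 32       # toy vocabulary size (stand-in)
+ MASK_ID = VOCAB_SIZE  # hypothetical id of the [MASK] token
+
+ def toy_logits(num_tokens: int) -> torch.Tensor:
+     """Stand-in for the diffusion LM: random logits for each masked position."""
+     return torch.randn(num_tokens, VOCAB_SIZE)
+
+ def sample_semi_autoregressive(prompt: torch.Tensor, gen_len: int = 16,
+                                block_len: int = 8, steps_per_block: int = 4) -> torch.Tensor:
+     # Blocks are generated left to right (the "semi-autoregressive" part);
+     # tokens inside a block are predicted in parallel over a few denoising steps.
+     seq = torch.cat([prompt, torch.full((gen_len,), MASK_ID)])
+     for start in range(len(prompt), len(seq), block_len):
+         end = min(start + block_len, len(seq))
+         for _ in range(steps_per_block):
+             masked = (seq[start:end] == MASK_ID).nonzero(as_tuple=True)[0] + start
+             if masked.numel() == 0:
+                 break
+             conf, pred = toy_logits(masked.numel()).softmax(-1).max(-1)
+             # Commit only the most confident predictions this step; the rest stay masked.
+             keep = conf.topk(max(1, masked.numel() // 2)).indices
+             seq[masked[keep]] = pred[keep]
+     return seq
+
+ print(sample_semi_autoregressive(torch.tensor([3, 14, 15])))
+ ```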
+
+ ### 2. Multimodal Generation
+
+ Run the inference demo for multimodal generation; results can be viewed on wandb:
+ ```python
+ from inference_solver import FlexARInferenceSolver
+ from PIL import Image
+
+ inference_solver = FlexARInferenceSolver(
+     model_path="Alpha-VLLM/Lumina-mGPT-7B-512",  # Replace with "Gen-Verse/MMaDA-8B-Base" for this model
+     precision="bf16",
+     target_size=512,
+ )
+
+ # The "<|image|>" symbol will be replaced with a sequence of image tokens before being fed to the LLM.
+ q1 = "Describe the image in detail. <|image|>"
+
+ images = [Image.open("path/to/your/image.png")]  # Replace with your image path
+ qas = [[q1, None]]
+
+ # `len(images)` should equal the number of occurrences of "<|image|>" in qas.
+ generated = inference_solver.generate(
+     images=images,
+     qas=qas,
+     max_gen_len=8192,
+     temperature=1.0,
+     logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
+ )
+
+ a1 = generated[0]
+ print(f"Generated text response: {a1}")
+ # generated[1], namely the list of newly generated images, should typically be empty in this case.
+ ```
+
+ ### 3. Text-to-Image Generation
+
+ Run the inference demo for text-to-image generation; results can be viewed on wandb:
+ ```python
+ from inference_solver import FlexARInferenceSolver
+ from PIL import Image
+
+ inference_solver = FlexARInferenceSolver(
+     model_path="Alpha-VLLM/Lumina-mGPT-7B-768",  # Replace with "Gen-Verse/MMaDA-8B-Base" for this model
+     precision="bf16",
+     target_size=768,
+ )
+
+ q1 = f"Generate an image of 768x768 according to the following prompt:\n" \
+      f"Image of a dog playing water, and a waterfall is in the background."
+
+ # generated: tuple of (generated response, list of generated images)
+ generated = inference_solver.generate(
+     images=[],
+     qas=[[q1, None]],
+     max_gen_len=8192,
+     temperature=1.0,
+     logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
+ )
+
+ a1, new_image = generated[0], generated[1][0]
+ new_image.show()  # Display the generated image
+ # new_image is a PIL Image object representing the generated image.
+ # print(f"Generated text response: {a1}")
+ ```
+
+ ## Citation
+
+ ```bibtex
@article{yang2025mmada,
  title={MMaDA: Multimodal Large Diffusion Language Models},
  author={Yang, Ling and Tian, Ye and Li, Bowen and Zhang, Xinchen and Shen, Ke and Tong, Yunhai and Wang, Mengdi},
  journal={arXiv preprint arXiv:2505.15809},
  year={2025}
}
+ ```