Improve model card: update paper link, add usage, project overview and tags
#2
by nielsr (HF Staff) · opened

README.md CHANGED
---
library_name: transformers
license: mit
pipeline_tag: any-to-any
tags:
- diffusion-model
- multimodal
- text-to-image
- text-generation
- image-captioning
- generalist-llm
language: en
---

# MMaDA-8B-Base

<div align="center">
<br>
<img src="https://github.com/Gen-Verse/MMaDA/raw/main/assets/title.png" width="166">
<h3>Multimodal Large Diffusion Language Models (NeurIPS 2025)</h3></div>

We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. MMaDA is distinguished by three key innovations:

1. MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components.
2. MMaDA introduces a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities.
3. MMaDA adopts a unified policy-gradient-based RL algorithm, which we call UniGRPO, tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements.

[Paper](https://huggingface.co/papers/2505.15809) | [Code](https://github.com/Gen-Verse/MMaDA) | [Project Page / Demo](https://huggingface.co/spaces/Gen-Verse/MMaDA)

<div align="center" style="width: 600px; margin: auto;">
<img src="https://github.com/Gen-Verse/MMaDA/raw/main/assets/showcase0.8.gif" alt="MMaDA decoding demo" width="550" />
<p style="font-style: italic; font-size: 14px; color: #555; margin-top: 6px;">
MMaDA's decoding demo. This video showcases how a diffusion foundation model generates text and images.<br>
The "Text Generation" part uses a semi-autoregressive sampling method, while the "Multimodal Generation" part adopts non-autoregressive diffusion denoising.
</p>
</div>

## 📰 Latest Updates

* **[2025-09-09]** We open-source a comprehensive RL framework for diffusion language models, [dLLM-RL](https://github.com/Gen-Verse/dLLM-RL), which also supports post-training our MMaDA model.
* **[2025-06-02]** We open-source our **MMaDA-8B-MixCoT** at [Hugging Face](https://huggingface.co/Gen-Verse/MMaDA-8B-MixCoT).
* **[2025-05-24]** We add support for MPS inference, tested on the M4.
* **[2025-05-22]** We release the inference and training code of MMaDA for text generation, multimodal generation, and image generation.
* **[2025-05-22]** We open-source our **MMaDA-8B-Base** at [Hugging Face](https://huggingface.co/Gen-Verse/MMaDA-8B-Base). **MMaDA-8B-MixCoT** and **MMaDA-8B-Max** will be released in the near future.
* **[2025-05-22]** We release our [research paper](https://huggingface.co/papers/2505.15809) and [demo](https://huggingface.co/spaces/Gen-Verse/MMaDA) for the first unified multimodal diffusion model: MMaDA.

## 🧬 MMaDA Series Overview

MMaDA includes a series of checkpoints reflecting different training stages:
1. **[MMaDA-8B-Base](https://huggingface.co/Gen-Verse/MMaDA-8B-Base)**: After pretraining and instruction tuning. Capable of basic text generation, image generation, and image captioning, with **thinking abilities**.
2. **[MMaDA-8B-MixCoT](https://huggingface.co/Gen-Verse/MMaDA-8B-MixCoT)**: After mixed long chain-of-thought (CoT) fine-tuning. Capable of **complex** textual, multimodal, and image-generation reasoning.
3. **MMaDA-8B-Max (coming soon)**: After UniGRPO reinforcement learning. Excels at complex reasoning and high-quality visual generation.

<div align="center">
<img src="https://github.com/Gen-Verse/MMaDA/raw/main/assets/example_compare.png" width="800">
<p><i>Overview of MMaDA's capabilities.</i></p>
</div>

|

First, set up the environment by cloning the [official GitHub repository](https://github.com/Gen-Verse/MMaDA) and installing the required packages:
```bash
git clone https://github.com/Gen-Verse/MMaDA.git
cd MMaDA
pip install -r requirements.txt
```
Then, you can launch a local Gradio demo:
```bash
python app.py
```
Or try it online via our [Huggingface Demo](https://huggingface.co/spaces/Gen-Verse/MMaDA).

|

For batch-level inference, we provide inference scripts in the [official GitHub repository](https://github.com/Gen-Verse/MMaDA).

Before running the multimodal or text-to-image generation examples, you may need to log in to your Weights & Biases (wandb) account:
```bash
wandb login
```
|

For text generation, we follow LLaDA's configuration and generation script. Simply run:
```bash
python generate.py
```
|
| 87 |
+
|
| 88 |
+
Inference demo for MultiModal Generation, and you can view the results on wandb:
|
| 89 |
+
```python
|
| 90 |
+
from inference_solver import FlexARInferenceSolver
|
| 91 |
+
from PIL import Image
|
| 92 |
+
|
| 93 |
+
inference_solver = FlexARInferenceSolver(
|
| 94 |
+
model_path="Alpha-VLLM/Lumina-mGPT-7B-512", # Replace with "Gen-Verse/MMaDA-8B-Base" for this model
|
| 95 |
+
precision="bf16",
|
| 96 |
+
target_size=512,
|
| 97 |
+
)
|
| 98 |
+
|
| 99 |
+
# "<|image|>" symbol will be replaced with sequence of image tokens before fed to LLM
|
| 100 |
+
q1 = "Describe the image in detail. <|image|>"
|
| 101 |
+
|
| 102 |
+
images = [Image.open("path/to/your/image.png")] # Replace with your image path
|
| 103 |
+
qas = [[q1, None]]
|
| 104 |
+
|
| 105 |
+
# `len(images)` should be equal to the number of appearance of "<|image|>" in qas
|
| 106 |
+
generated = inference_solver.generate(
|
| 107 |
+
images=images,
|
| 108 |
+
qas=qas,
|
| 109 |
+
max_gen_len=8192,
|
| 110 |
+
temperature=1.0,
|
| 111 |
+
logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
|
| 112 |
+
)
|
| 113 |
+
|
| 114 |
+
a1 = generated[0]
|
| 115 |
+
print(f"Generated text response: {a1}")
|
| 116 |
+
# generated[1], namely the list of newly generated images, should typically be empty in this case.
|
| 117 |
+
```
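
Since the demos report their results to wandb, a minimal, illustrative way to log the outputs of the snippet above yourself is shown below; the project name is an arbitrary example, not a fixed convention of the repository:
```python
# Illustrative wandb logging for the caption generated above.
import wandb

wandb.init(project="mmada-inference")
wandb.log({"caption": a1, "input_image": wandb.Image(images[0])})
wandb.finish()
```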
### 3. Text-to-Image Generation

Below is an inference demo for text-to-image generation; you can view the results on wandb:
```python
from inference_solver import FlexARInferenceSolver
from PIL import Image

inference_solver = FlexARInferenceSolver(
    model_path="Alpha-VLLM/Lumina-mGPT-7B-768",  # Replace with "Gen-Verse/MMaDA-8B-Base" for this model
    precision="bf16",
    target_size=768,
)

q1 = "Generate an image of 768x768 according to the following prompt:\n" \
     "Image of a dog playing in water, and a waterfall is in the background."

# generated: tuple of (generated response, list of generated images)
generated = inference_solver.generate(
    images=[],
    qas=[[q1, None]],
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)

a1, new_image = generated[0], generated[1][0]
new_image.show()  # Display the generated image
# new_image is a PIL Image object representing the generated image.
# print(f"Generated text response: {a1}")
```
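
As a small follow-up (the filename is an arbitrary example), the returned PIL image can also be saved to disk instead of only being displayed:
```python
# Persist the generated image; any path and format supported by Pillow works.
new_image.save("dog_waterfall.png")
```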

## Citation

```bibtex
@article{yang2025mmada,
  title={MMaDA: Multimodal Large Diffusion Language Models},
  author={Yang, Ling and Tian, Ye and Li, Bowen and Zhang, Xinchen and Shen, Ke and Tong, Yunhai and Wang, Mengdi},
  journal={arXiv preprint arXiv:2505.15809},
  year={2025}
}
```