Introduction
We introduce LongCat-Image, a pioneering open-source, bilingual (Chinese-English) foundation model for image generation. It is designed to address core challenges prevalent in current leading models: multilingual text rendering, photorealism, deployment efficiency, and developer accessibility.
Key Features
- Exceptional Efficiency and Performance: With only 6B parameters, LongCat-Image surpasses numerous open-source models several times its size across multiple benchmarks, demonstrating the potential of efficient model design.
- Powerful Chinese Text Rendering: LongCat-Image renders common Chinese characters with higher accuracy and stability than existing SOTA open-source models and achieves industry-leading coverage of the Chinese dictionary.
- Remarkable Photorealism: Through an innovative data strategy and training framework, LongCat-Image achieves a high degree of photorealism in generated images.
Showcase
Quick Start
Installation
Clone the repo:
git clone --single-branch --branch main https://github.com/meituan-longcat/LongCat-Image
cd LongCat-Image
Install dependencies:
# create conda environment
conda create -n longcat-image python=3.10
conda activate longcat-image
# install other requirements
pip install -r requirements.txt
python setup.py develop
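Download the model weights into ./weights/LongCat-Image before running inference. The minimal sketch below uses huggingface_hub; the repository id meituan-longcat/LongCat-Image is an assumption and should be replaced with the actual location of the released weights.
from huggingface_hub import snapshot_download

# Assumed repository id; point this at wherever the weights are actually hosted.
snapshot_download(
    repo_id='meituan-longcat/LongCat-Image',
    local_dir='./weights/LongCat-Image',  # matches checkpoint_dir in the example below
)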
Run Text-to-Image Generation
Leveraging a stronger LLM for prompt refinement can further enhance image generation quality. Please refer to inference_t2i.py for detailed usage instructions.
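For illustration only, a hedged sketch of such a refinement step is shown below. It uses the transformers text-generation pipeline with a placeholder chat model (Qwen/Qwen2.5-7B-Instruct) and a placeholder system instruction, neither of which is part of LongCat-Image; inference_t2i.py documents the officially supported workflow.
# Illustrative sketch: expand a short idea into a detailed image prompt with an
# external chat LLM. The model name and instruction below are placeholders.
from transformers import pipeline

rewriter = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder LLM; swap in any strong chat model
    torch_dtype="auto",
    device_map="auto",
)
messages = [
    {"role": "system", "content": "Rewrite the user's idea as a detailed, photorealistic image-generation prompt."},
    {"role": "user", "content": "A young woman in front of a brick wall at sunset"},
]
refined_prompt = rewriter(messages, max_new_tokens=256)[0]["generated_text"][-1]["content"]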
Special Handling for Text Rendering
For both Text-to-Image and Image Editing tasks involving text generation, you must enclose the target text within quotation marks (""). Reason: the tokenizer applies character-level encoding specifically to content found inside quotes, so omitting the explicit quotation marks significantly degrades text rendering quality.
import torch
from transformers import AutoProcessor
from longcat_image.models import LongCatImageTransformer2DModel
from longcat_image.pipelines import LongCatImagePipeline

device = torch.device('cuda')
checkpoint_dir = './weights/LongCat-Image'

text_processor = AutoProcessor.from_pretrained(checkpoint_dir, subfolder='tokenizer')
transformer = LongCatImageTransformer2DModel.from_pretrained(
    checkpoint_dir,
    subfolder='transformer',
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
).to(device)

pipe = LongCatImagePipeline.from_pretrained(
    checkpoint_dir,
    transformer=transformer,
    text_processor=text_processor,
)

# pipe.to(device, torch.bfloat16)  # Uncomment on high-VRAM devices for faster inference
pipe.enable_model_cpu_offload()  # Offload to CPU to save VRAM (requires ~17 GB); slower but prevents OOM
# Prompt (English gloss): a young Asian woman in a black knit top with a white
# necklace, hands resting on her knees, serene expression; a rough brick wall in
# the background with warm afternoon sunlight, medium-distance shot, soft light
# on her face, simple composition conveying elegance and composure.
prompt = '一个年轻的亚裔女性，身穿黑色针织衫，搭配白色项链。她的双手放在膝盖上，表情恬静。背景是一堵粗糙的砖墙，午后的阳光温柔地洒在她身上，营造出一种安静而温馨的氛围。镜头采用中距离视角，突出她的神态和服饰的细节。光线柔和地打在她的脸上，强调她的五官和饰品的质感，增加画面的层次感与亲和力。整个画面构图简洁，砖墙的纹理与阳光的光影效果相得益彰，突显出人物的优雅与从容。'
image = pipe(
prompt,
height=768,
width=1344,
guidance_scale=4.5,
num_inference_steps=50,
num_images_per_prompt=1,
generator=torch.Generator("cpu").manual_seed(43),
enable_cfg_renorm=True,
enable_prompt_rewrite=True # Reusing the text encoder as a built-in prompt rewriter
).images[0]
image.save('./t2i_example.png')
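The loaded pipeline can be reused across calls. As a minimal sketch (variable names illustrative), the same prompt can be sampled with several seeds:
# Reuse the already-loaded pipeline to sample the same prompt with different seeds.
for seed in (1, 2, 3):
    img = pipe(
        prompt,
        height=768,
        width=1344,
        guidance_scale=4.5,
        num_inference_steps=50,
        generator=torch.Generator("cpu").manual_seed(seed),
        enable_cfg_renorm=True,
        enable_prompt_rewrite=True,
    ).images[0]
    img.save(f'./t2i_seed_{seed}.png')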