LongCat-Image is a 6-billion-parameter text-to-image foundation model developed by Meituan. It is an open-source bilingual (Chinese and English) model that prioritizes inference efficiency and high-fidelity text rendering. The model aims to bridge the gap between heavyweight proprietary models and efficient open-weight alternatives, offering photorealistic output and precise instruction following at significantly lower VRAM cost than many larger competitors.
The model architecture is built on a hybrid Multimodal Diffusion Transformer (MM-DiT), similar to the Flux series. It processes image and text data through separate attention paths in the early layers before merging them, which allows for tighter control over the generation process based on the text prompt. This design helps the model achieve studio-grade visual quality, including complex lighting and accurate material textures, while maintaining a compact 6B parameter footprint.
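The dual-stream idea can be illustrated with a minimal sketch: each modality keeps its own projection weights, but attention runs over the concatenated sequence so the two streams exchange information. This is purely illustrative; the shapes, function names, and single-head setup below are assumptions, not the model's actual implementation.

```python
import numpy as np

def joint_attention(q, k, v):
    # scaled dot-product attention over the concatenated sequence
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def dual_stream_block(img_tokens, txt_tokens, img_proj, txt_proj):
    # Each modality has its own Q/K/V projections ("separate paths"),
    # but attention spans the concatenated sequence, so image tokens
    # can attend to text tokens and vice versa ("merging").
    qi, ki, vi = (img_tokens @ W for W in img_proj)
    qt, kt, vt = (txt_tokens @ W for W in txt_proj)
    q = np.concatenate([qi, qt])
    k = np.concatenate([ki, kt])
    v = np.concatenate([vi, vt])
    out = joint_attention(q, k, v)
    n = img_tokens.shape[0]
    return out[:n], out[n:]  # split back into the two streams

rng = np.random.default_rng(0)
d = 16
img = rng.standard_normal((4, d))   # 4 image patch tokens
txt = rng.standard_normal((3, d))   # 3 text tokens
img_proj = [rng.standard_normal((d, d)) for _ in range(3)]
txt_proj = [rng.standard_normal((d, d)) for _ in range(3)]
img_out, txt_out = dual_stream_block(img, txt, img_proj, txt_proj)
print(img_out.shape, txt_out.shape)  # (4, 16) (3, 16)
```

The per-modality projections give each stream its own parameters while the shared attention step is what conditions image generation tightly on the prompt.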
A key capability of LongCat-Image is its specialized character-level encoding strategy, optimized for rendering Chinese and English text. Unlike many diffusion models that struggle with spelling, LongCat-Image uses a hybrid approach that encodes text character-by-character when specific triggers appear in the prompt. It covers the full set of 8,105 standard Chinese characters, ensuring stable, legible output for e-commerce and marketing assets.
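The hybrid encoding described above can be sketched as a toy tokenizer: spans inside quotes are expanded character-by-character, while the rest of the prompt is kept at the word level. The function name and splitting logic are illustrative assumptions, not the model's real tokenizer.

```python
import re

def hybrid_tokenize(prompt):
    """Illustrative sketch: words outside quotes stay whole, while text
    inside single or double quotes is expanded character-by-character,
    mimicking the character-level encoding path described above."""
    tokens = []
    # split the prompt into quoted spans and everything else
    for part in re.split(r'("[^"]*"|\'[^\']*\')', prompt):
        if len(part) >= 2 and part[0] in '"\'' and part[-1] == part[0]:
            # quoted span: one token per character (works equally for
            # Chinese characters and Latin letters)
            tokens.extend(part[1:-1])
        else:
            tokens.extend(part.split())
    return tokens

print(hybrid_tokenize('A neon sign reading "Open" above the door'))
# ['A', 'neon', 'sign', 'reading', 'O', 'p', 'e', 'n', 'above', 'the', 'door']
```

Character-level tokens remove the ambiguity of subword spelling, which is why quoted text renders far more reliably than unquoted text.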
To achieve optimal results for text generation, users must enclose the target text in single or double quotation marks (e.g., "Open"). This formatting triggers the specialized encoding mechanism; omitting the quotes can noticeably degrade text rendering. The model is often paired with LongCat-Image-Edit, a variant optimized for natural-language image manipulation that preserves the structural integrity of the original subjects.
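In practice, the quoting convention is easy to enforce with a small prompt-building helper. The function below is a hypothetical convenience wrapper, not part of any LongCat-Image API; it simply guarantees that the render target is quoted before the prompt is sent to the model.

```python
def format_text_prompt(scene, text, quote='"'):
    # Hypothetical helper: wrap the render target in quotation marks
    # so the model's character-level encoding path is triggered.
    if not (text.startswith(quote) and text.endswith(quote)):
        text = f'{quote}{text}{quote}'
    return f'{scene}, with the text {text}'

print(format_text_prompt('A storefront at dusk', 'Open'))
# A storefront at dusk, with the text "Open"
```

Centralizing the quoting in one place avoids silently degraded text rendering when prompts are assembled programmatically.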