Vidu Q2 is a multimodal generative model developed by ShengShu Technology in collaboration with Tsinghua University. Building on the foundation of the original Vidu system, the Q2 series serves as a unified engine for both high-fidelity image and video generation. While the platform initially gained recognition for its video capabilities, the Q2 update, launched on December 1, 2025, introduced a comprehensive image stack that includes text-to-image, enhanced reference-to-image, and integrated image editing functions.
The model is built upon the Universal Vision Transformer (U-ViT) architecture, a diffusion-transformer hybrid that allows for precise control over visual details and motion. Vidu Q2 is specifically engineered for professional-grade production, supporting native resolutions up to 4K for static images. It excels in "micro-acting," a feature that enables the generation of nuanced facial expressions—such as blinks and subtle lip movements—that maintain character identity and emotional realism across different outputs.
One of the primary strengths of Vidu Q2 is its focus on visual consistency. The reference-to-image capability allows creators to upload multiple images (up to seven in some workflows) to "lock" specific subjects, characters, or artistic styles. This ensures that key visual elements remain stable across different scenes or when transitioning from a static image to a video sequence. The model also features a specialized understanding of specific aesthetics, including high-performance rendering for anime, traditional Chinese ink painting, and complex cinematic lighting.
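To make the reference-to-image workflow concrete, the sketch below shows how a client might assemble such a request while enforcing the seven-image cap described above. The field names and the helper function are hypothetical illustrations, not Vidu's actual API schema:

```python
# Hypothetical sketch of a reference-to-image request builder.
# Field names and this helper are illustrative, not Vidu's official API.

MAX_REFERENCE_IMAGES = 7  # upper bound cited for some workflows

def build_reference_request(prompt: str, reference_paths: list[str]) -> dict:
    """Pair a text prompt with up to seven 'locked' reference images."""
    if not reference_paths:
        raise ValueError("At least one reference image is required")
    if len(reference_paths) > MAX_REFERENCE_IMAGES:
        raise ValueError(f"At most {MAX_REFERENCE_IMAGES} references allowed")
    return {"prompt": prompt, "references": list(reference_paths)}

# Example: lock a character's identity with two reference views.
req = build_reference_request(
    "The same character walking through a rainy market, ink-painting style",
    ["character_front.png", "character_side.png"],
)
```

Validating the cap client-side avoids a round trip for requests the service would reject anyway.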
For effective generation, users can input descriptive text prompts of up to 1,500 characters, specifying details such as camera angles (dolly zooms, pans, or tracking shots), lighting conditions, and spatial relationships. The model supports standard aspect ratios including 16:9, 9:16, and 1:1. Because images and video share the same underlying visual DNA, assets created in the image generator can be used as direct references for video generation with minimal loss of fidelity or subject drift.
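The input constraints above (a 1,500-character prompt limit and a fixed set of aspect ratios) can be checked before submission. The following is a minimal sketch assuming a JSON-style payload; the field names are hypothetical, not the official API:

```python
# Hypothetical client-side validation of Vidu Q2 prompt constraints.
# The payload shape is illustrative, not the official API schema.

MAX_PROMPT_CHARS = 1500                   # documented prompt length limit
ALLOWED_RATIOS = {"16:9", "9:16", "1:1"}  # supported aspect ratios

def build_generation_payload(prompt: str, aspect_ratio: str = "16:9") -> dict:
    """Validate inputs and assemble a text-to-image request payload."""
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"Prompt exceeds {MAX_PROMPT_CHARS} characters")
    if aspect_ratio not in ALLOWED_RATIOS:
        raise ValueError(f"Unsupported aspect ratio: {aspect_ratio}")
    return {"prompt": prompt, "aspect_ratio": aspect_ratio}

# Example: a camera-direction prompt of the kind the model accepts.
payload = build_generation_payload(
    "Slow dolly zoom toward a lantern-lit courtyard, soft rim lighting",
    aspect_ratio="16:9",
)
```

Because image and video generation share the same prompt conventions, the same validation applies when the generated image is reused as a video reference.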