Vivago 2.0, developed by HiDream.ai, is an integrated AI creative suite built around an image generation foundation model. It is designed to handle creative workflows ranging from high-fidelity text-to-image synthesis to video and audio-visual tasks. The model is released under the MIT license, which permits use in both personal and commercial projects.
The underlying architecture of Vivago 2.0 is a 17-billion-parameter sparse Diffusion Transformer (DiT). It uses a dual-stream structure that first processes image and text tokens independently, then passes them to a single-stream DiT stage where the two modalities interact. A key technical feature is the dynamic Mixture-of-Experts (MoE) design in the feed-forward networks: a router sends each token to a subset of specialized expert modules rather than through every one, which trades computational cost against generation quality. The system uses a LLaMA 3.1 8B backbone as its primary text encoder for semantic understanding of the prompt.
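The Mixture-of-Experts routing described above can be illustrated with a minimal sketch. This is not the model's actual implementation: the gate matrix, the top-2 choice, the toy experts, and the names route_token and softmax are all assumptions chosen for illustration, and real systems operate on batched tensors rather than Python lists.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of gate scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(
    token: Vector,
    gate_weights: Sequence[Sequence[float]],  # one row per expert (hypothetical gate)
    experts: Sequence[Expert],
    top_k: int = 2,  # assumed value, not taken from the source
) -> Tuple[Vector, List[int]]:
    """Send one token through its top_k experts and blend their outputs."""
    if not 1 <= top_k <= len(experts):
        raise ValueError("top_k must be between 1 and the number of experts")

    # Gate score for each expert is a dot product with the token.
    scores = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(scores)

    # Keep only the top_k experts, so the others cost no compute.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)

    # Weighted sum of the selected experts' outputs.
    out = [0.0] * len(token)
    for i in chosen:
        weight = probs[i] / norm
        expert_out = experts[i](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, chosen


if __name__ == "__main__":
    # Four toy experts that simply scale the token by 1, 2, 3, and 4.
    toy_experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
    toy_gate = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [2.0, 0.0]]
    output, used = route_token([1.0, 0.0], toy_gate, toy_experts)
    print(used, output)
```

Only the selected experts are evaluated for a given token, which is where the compute savings of a sparse design come from: total parameter count can grow while per-token cost stays roughly fixed.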
Capabilities and Variants
The model is available in three distinct variants tailored to different hardware and quality requirements:
- Full: The highest-fidelity version, optimized for 50 diffusion steps.
- Dev: A guidance-distilled version that balances speed and quality, optimized for 28 steps.
- Fast: A rapid-inference version that delivers results in approximately 14 to 16 steps.
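The step counts listed above can be captured in a small lookup table, sketched here as an illustration of how a caller might choose a variant from a step budget. The Variant class, the VARIANTS table, and pick_variant are hypothetical names, not part of any published API, and the Fast entry assumes the upper end of the stated 14 to 16 range.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Variant:
    name: str
    steps: int  # diffusion steps the variant is optimized for


# Ordered from highest fidelity to fastest inference.
VARIANTS: Tuple[Variant, ...] = (
    Variant("full", 50),
    Variant("dev", 28),
    Variant("fast", 16),  # assumption: upper end of the 14 to 16 range
)


def pick_variant(max_steps: int) -> Variant:
    """Return the highest-fidelity variant whose step count fits the budget."""
    for variant in VARIANTS:
        if variant.steps <= max_steps:
            return variant
    raise ValueError(f"no variant fits a budget of {max_steps} steps")


if __name__ == "__main__":
    print(pick_variant(30).name)
```

Because the table is ordered by fidelity, the first match is always the best quality available within the budget.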
Vivago 2.0 demonstrates high prompt adherence and compositional accuracy, particularly in complex scenes involving multiple interacting objects. Beyond static imagery, the platform supports image-to-video transformation, which adds motion and audio to still visuals, and an AI podcast generator that produces lip-synced videos from a portrait and a voice recording. For best results, users can use the "Prompt Bot," a built-in optimizer that adds technical modifiers to natural-language descriptions so that they yield more detailed outputs.