Mochi 1 by Genmo: Benchmarks, Rankings & Model Details

Mochi 1 is an open-weights video generation model developed by Genmo that produces high-fidelity video from text prompts. It emphasizes motion realism and close prompt adherence, and aims to simulate complex physical effects such as fluid dynamics, fur, and human movement. At the time of its release, it was noted as the largest openly released video generation model.

The model uses a 10-billion-parameter Asymmetric Diffusion Transformer (AsymmDiT) architecture. This design gives the text and visual token streams unequal capacity: the visual stream has significantly more parameters than the text stream, which prioritizes visual reasoning. Prompts are encoded with a single T5-XXL language model, and a custom causal VAE, known as AsymmVAE, compresses video into a latent space for efficient processing.
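To make the asymmetry concrete, the sketch below estimates how the weight count of a transformer block scales with its hidden width. It is a rough illustration, not Mochi 1's actual layer layout: the function `block_params`, the widths 3072 and 1536, and the 4x MLP expansion are assumptions chosen for the example, and biases, normalization layers, and any cross-stream projections are ignored.

```python
def block_params(width: int, mlp_ratio: int = 4) -> int:
    """Approximate weight count of one standard transformer block.

    Counts only the large matrices; biases and norms are ignored.
    """
    attention = 4 * width * width        # Q, K, V and output projections
    mlp = 2 * mlp_ratio * width * width  # up-projection and down-projection
    return attention + mlp


# Hypothetical widths for illustration only (assumed, not from the source).
VISUAL_WIDTH = 3072
TEXT_WIDTH = 1536

visual = block_params(VISUAL_WIDTH)
text = block_params(TEXT_WIDTH)
ratio = visual / text

print(f"visual: {visual:,}  text: {text:,}  ratio: {ratio:.1f}x")
```

Because the dominant matrices grow with the square of the width, doubling the visual stream's width relative to the text stream would roughly quadruple its per-block parameter count under these assumptions.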

Released under the Apache 2.0 license, Mochi 1 generates 480p video at 30 frames per second, with clips up to 5.4 seconds long. Although it is optimized for photorealistic output, the model is intended as a base for further community development and fine-tuning.
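The output figures above translate into a frame budget with simple arithmetic, sketched below. The frame rate and duration come from the text; the 848-pixel width and the VAE compression factors (6x temporal, 8x spatial) are assumptions used only to illustrate how a latent grid shrinks, and `latent_grid` is a hypothetical helper that uses plain ceiling division rather than the model's exact framing rule.

```python
import math


def frame_count(fps: float, seconds: float) -> int:
    """Number of frames in a clip of the given length."""
    return round(fps * seconds)


def latent_grid(frames: int, height: int, width: int,
                t_factor: int, s_factor: int) -> tuple[int, int, int]:
    """Latent (time, height, width) after compression, using ceiling division."""
    return (
        math.ceil(frames / t_factor),
        math.ceil(height / s_factor),
        math.ceil(width / s_factor),
    )


frames = frame_count(fps=30, seconds=5.4)

# Assumed values: 848 px width, 6x temporal and 8x spatial compression.
grid = latent_grid(frames, height=480, width=848, t_factor=6, s_factor=8)

print(f"{frames} frames -> latent grid {grid}")
```

Under these assumed factors, the diffusion transformer operates on a grid with a few hundred times fewer positions than the raw pixel volume, which is why the latent compression matters for efficiency.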

Rankings & Comparison