Google's Veo 3 Preview is a high-fidelity video generation model developed by Google DeepMind. It represents the third generation of the Veo series, introducing the capability to generate native synchronized audio—including ambient noise, sound effects, and character dialogue—directly alongside the visual content. The model is designed to simulate real-world physics, lighting, and human motion with high accuracy, producing cinematic clips up to eight seconds in length at 24 frames per second.
Architecture and Technical Details
The model is built on a latent diffusion transformer (DiT) architecture. Unlike traditional models that treat video and audio as separate streams, Veo 3 operates in a unified spatiotemporal (3D) latent space, allowing it to process visual spacetime patches and temporal audio information simultaneously. This joint denoising process enables tight alignment between visual events and their corresponding sounds, such as lip movements matching dialogue or the sound of a physical impact landing on the correct frame.
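To make the idea of a unified spatiotemporal token space concrete, here is a minimal sketch of how video frames and an audio waveform could be split into patches (tokens) that a joint DiT would denoise together. The patch sizes, resolutions, and functions below are invented for illustration and are not Veo 3's actual implementation:

```python
import numpy as np

def patchify_video(video, t=4, p=16):
    """Split a (T, H, W, C) video into non-overlapping spacetime patches."""
    T, H, W, C = video.shape
    patches = video.reshape(T // t, t, H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)
    return patches.reshape(-1, t * p * p * C)  # one token per spacetime patch

def patchify_audio(audio, samples_per_token=1600):
    """Split a mono waveform into temporal patches (tokens)."""
    n = len(audio) // samples_per_token
    return audio[: n * samples_per_token].reshape(n, samples_per_token)

# 8 seconds at 24 fps with a toy 64x64 resolution; 16 kHz mono audio.
video = np.zeros((192, 64, 64, 3), dtype=np.float32)
audio = np.zeros(8 * 16000, dtype=np.float32)

video_tokens = patchify_video(video)  # shape (768, 3072)
audio_tokens = patchify_audio(audio)  # shape (80, 1600)
# A joint model would denoise the concatenated token sequence, so visual
# and audio tokens covering the same moment in time can attend to each
# other — the mechanism that keeps sound and picture aligned.
```

Because both modalities become tokens indexed by time, alignment (e.g., a door slam coinciding with the door closing) can emerge from attention within one sequence rather than from a separate synchronization step.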
Key Capabilities
Veo 3 supports both text-to-video and image-to-video workflows, allowing users to guide generation with detailed prompts or reference images. The model offers significant improvements in prompt adherence compared to its predecessors, and can interpret complex narrative instructions and cinematic styles. At launch, it supported 720p and 1080p resolutions in both landscape and portrait aspect ratios, with later iterations expanding support to 4K output.
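As a rough sketch of how these generation parameters (prompt, optional reference image, resolution, aspect ratio, duration) might be bundled into a request, the schema below is hypothetical — the field names and allowed values are illustrative and do not reflect Veo 3's real API surface:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical request schema; values mirror the capabilities described
# above (720p/1080p/4K, landscape and portrait, clips up to 8 seconds).
ALLOWED_RESOLUTIONS = {"720p", "1080p", "4k"}
ALLOWED_ASPECTS = {"16:9", "9:16"}  # landscape and portrait

@dataclass
class VideoRequest:
    prompt: str
    reference_image: Optional[bytes] = None  # image-to-video when provided
    resolution: str = "720p"
    aspect_ratio: str = "16:9"
    duration_s: int = 8

    def validate(self) -> None:
        if not self.prompt:
            raise ValueError("a text prompt is required")
        if self.resolution not in ALLOWED_RESOLUTIONS:
            raise ValueError(f"unsupported resolution: {self.resolution}")
        if self.aspect_ratio not in ALLOWED_ASPECTS:
            raise ValueError(f"unsupported aspect ratio: {self.aspect_ratio}")
        if not 1 <= self.duration_s <= 8:
            raise ValueError("duration must be 1-8 seconds")

req = VideoRequest(prompt="A rainy street at dusk, cinematic tracking shot")
req.validate()  # raises ValueError on an invalid combination
```

Validating parameters client-side like this is a common pattern for generation APIs, since it surfaces unsupported resolution or duration combinations before a costly generation call is made.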