SkyReels V4 is a unified multimodal foundation model for joint video and audio generation, developed by Skywork AI (a subsidiary of Kunlun Wanwei). It is designed to natively synthesize temporally synchronized audio and video within a single architectural framework, addressing the common industry challenge of fragmented visual and acoustic pipelines. The model produces cinematic-quality output at 1080p resolution and 32 FPS, supporting durations of up to 15 seconds.
The architecture is built on a dual-stream Multimodal Diffusion Transformer (MMDiT). This design employs symmetric twin backbones, one dedicated to video synthesis and the other to audio generation, both sharing a Multimodal Large Language Model (MLLM) text encoder. To maintain precise alignment, the model uses 3D Rotary Positional Embeddings (3D-RoPE) and bidirectional audio-video cross-attention layers, ensuring that audio events such as footsteps or speech stay synchronized with on-screen motion, down to lip movements, at a microsecond level.
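The bidirectional cross-attention idea can be sketched in a few lines. This is a minimal, single-head illustration in NumPy, not SkyReels V4's actual implementation: it omits learned projection matrices, multi-head splitting, and the 3D-RoPE applied to queries and keys, and the function names (`cross_attention`, `bidirectional_av_attention`) are invented for this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d_k):
    # Scaled dot-product attention: tokens from one stream attend
    # to the tokens of the other stream.
    scores = queries @ keys_values.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ keys_values

def bidirectional_av_attention(video_tokens, audio_tokens):
    # Video tokens attend to audio tokens and vice versa, so each
    # stream's hidden states are conditioned on the other's timeline.
    d = video_tokens.shape[-1]
    video_out = video_tokens + cross_attention(video_tokens, audio_tokens, d)
    audio_out = audio_tokens + cross_attention(audio_tokens, video_tokens, d)
    return video_out, audio_out

# Toy shapes: 8 video tokens and 16 audio tokens, hidden size 32.
rng = np.random.default_rng(0)
v = rng.standard_normal((8, 32))
a = rng.standard_normal((16, 32))
v2, a2 = bidirectional_av_attention(v, a)
```

The residual form (`tokens + attention(...)`) mirrors how cross-attention layers are typically interleaved with each backbone's self-attention blocks; each stream keeps its own representation while exchanging information with the other.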
Beyond basic generation, SkyReels V4 unifies video editing, inpainting, and restoration through a channel-concatenation interface. It accepts a wide range of multimodal inputs, including text prompts, reference images, video clips, masks, and audio references. Creators can also use specialized features like Grid Reference, which maintains character and style consistency across a narrative by referencing up to nine storyboard keyframes to guide the temporal rhythm and visual continuity.
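The channel-concatenation interface amounts to stacking conditioning signals alongside the noisy latent before it enters the backbone. The sketch below assumes latents shaped `(channels, frames, height, width)` and a helper name (`concat_condition_channels`) invented for illustration; the real model's channel counts and encodings are not public in this form.

```python
import numpy as np

def concat_condition_channels(latent, mask, reference):
    # Channel-concatenation conditioning: the noisy video latent, a
    # binary mask marking regions to edit or inpaint, and an encoded
    # reference clip are stacked along the channel axis. One backbone
    # can then serve generation, editing, and restoration alike; the
    # tasks differ only in what the extra channels contain.
    return np.concatenate([latent, mask, reference], axis=0)

latent = np.zeros((16, 4, 8, 8))     # noisy video latent (toy size)
mask = np.ones((1, 4, 8, 8))         # edit/inpaint mask
reference = np.zeros((16, 4, 8, 8))  # encoded reference clip
x = concat_condition_channels(latent, mask, reference)
```

For pure text-to-video generation, the mask and reference channels can simply be zeroed out, which is what makes the interface a unification rather than a set of separate task heads.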
To optimize the generation of high-resolution sequences, SkyReels V4 uses an efficiency strategy that jointly generates low-resolution full sequences alongside high-resolution keyframes. These are then refined using dedicated super-resolution and frame interpolation models. This progressive approach, combined with a six-stage training curriculum, allows the model to scale from simple static objects to complex, multi-shot cinematic plots while maintaining motion stability and physical plausibility.
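The refinement stage described above can be illustrated with crude stand-ins: nearest-neighbour upsampling in place of the super-resolution model and linear blending in place of the learned frame interpolator. Everything here (function names, sizes, the guidance step) is a toy assumption, not the actual pipeline.

```python
import numpy as np

def nearest_upsample(frames, scale):
    # Stand-in for the dedicated super-resolution model:
    # nearest-neighbour upsampling along height and width.
    return frames.repeat(scale, axis=1).repeat(scale, axis=2)

def linear_interpolate(keyframes, factor):
    # Stand-in for the frame-interpolation model: blend each pair of
    # consecutive keyframes into `factor` intermediate frames.
    out = []
    for a, b in zip(keyframes[:-1], keyframes[1:]):
        for t in np.linspace(0.0, 1.0, factor, endpoint=False):
            out.append((1 - t) * a + t * b)
    out.append(keyframes[-1])
    return np.stack(out)

# Jointly produced outputs (toy sizes): a low-resolution full
# sequence and a handful of high-resolution keyframes.
low_res_full = np.zeros((32, 16, 16))   # 32 frames at 16x16
hi_res_keys = np.zeros((5, 64, 64))     # 5 keyframes at 64x64

# The low-res sequence, upsampled, would guide motion in the real
# system; here it is just brought to the target resolution.
guidance = nearest_upsample(low_res_full, 4)

# Interpolate the keyframes up to a full high-resolution sequence.
final = linear_interpolate(hi_res_keys, factor=8)
```

The economics are the point: the expensive backbone only ever denoises a short low-resolution sequence plus a few high-resolution frames, and cheaper specialized models fill in the remaining spatial and temporal detail.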