Instrumental
Rank: #14
Parameters: 300M – 3.3B

MusicGen is a single-stage auto-regressive Transformer model developed by Meta's Fundamental AI Research (FAIR) team for high-quality music generation. It generates consistent music samples from text descriptions or melodic prompts, and it simplifies earlier multi-stage pipelines by predicting all of the audio token streams in a single pass, using an efficient interleaving of the codebooks rather than a cascade of separate models.
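The single-pass trick can be illustrated with the "delay" interleaving pattern described in the MusicGen paper: codebook *k* is shifted right by *k* steps, so at every generation step the model emits one token per codebook while each codebook still conditions on the earlier codebooks of the same frame. The sketch below is illustrative, not Meta's implementation; the padding value and function name are assumptions.

```python
PAD = -1  # placeholder for positions where a codebook has no token yet (assumed)

def apply_delay_pattern(codebooks):
    """Shift codebook k right by k steps so a single autoregressive model
    can emit all K token streams in one pass (sketch of the 'delay'
    interleaving from the MusicGen paper, not the reference code)."""
    K, T = len(codebooks), len(codebooks[0])
    out = [[PAD] * (T + K - 1) for _ in range(K)]
    for k, stream in enumerate(codebooks):
        for t, tok in enumerate(stream):
            out[k][t + k] = tok
    return out

# Two codebooks, three frames each: stream 1 lags stream 0 by one step.
print(apply_delay_pattern([[1, 2, 3], [4, 5, 6]]))
# [[1, 2, 3, -1], [-1, 4, 5, 6]]
```

Because the shifted sequence is only K−1 steps longer than the original, generation cost stays close to that of a single token stream, which is the efficiency argument for the single-stage design.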

The architecture relies on the EnCodec neural audio codec for tokenization: waveforms are compressed into a few parallel streams of discrete tokens (codebooks), which the Transformer can model efficiently. The model was trained on approximately 20,000 hours of licensed music, comprising an internal set of 10,000 high-quality tracks plus instrumental music licensed from the ShutterStock and Pond5 stock-media libraries.
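A quick back-of-envelope calculation shows why tokenizing with EnCodec matters. The figures below (32 kHz audio, 50 token frames per second, 4 codebooks) are the ones reported for the MusicGen release; treat them as assumptions and verify against the checkpoint you actually use.

```python
# Assumed EnCodec configuration for MusicGen (verify per checkpoint):
SAMPLE_RATE = 32_000   # Hz, audio sampling rate
FRAME_RATE = 50        # token frames per second of audio
NUM_CODEBOOKS = 4      # parallel discrete token streams per frame

def token_count(seconds: float) -> int:
    """Total discrete tokens needed to represent `seconds` of audio."""
    return int(seconds * FRAME_RATE * NUM_CODEBOOKS)

def samples_per_token(seconds: float) -> float:
    """Raw audio samples represented by each token, i.e. the compression."""
    return (seconds * SAMPLE_RATE) / token_count(seconds)

print(token_count(10))       # 2000 tokens for a 10-second clip
print(samples_per_token(10)) # 160.0 raw samples per token
```

So a 10-second clip becomes a sequence of about 2,000 tokens instead of 320,000 raw samples, which is what makes autoregressive Transformer modeling of music tractable.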

MusicGen is available in multiple sizes, including 300M, 1.5B, and 3.3B parameter versions. Beyond basic text-to-music generation, a dedicated melody variant supports melody conditioning: a reference audio file guides the melodic structure of the generated output while the text prompt controls style and instrumentation.
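The published sizes map onto public Hugging Face Hub checkpoints roughly as sketched below. The repo ids are the real published ones; the helper function itself is a hypothetical convenience, not part of any official API.

```python
# Public MusicGen checkpoints on the Hugging Face Hub (repo ids are real;
# this mapping and helper are an illustrative convenience, not an official API).
CHECKPOINTS = {
    "small":  ("facebook/musicgen-small",  "300M", False),
    "medium": ("facebook/musicgen-medium", "1.5B", False),
    "large":  ("facebook/musicgen-large",  "3.3B", False),
    "melody": ("facebook/musicgen-melody", "1.5B", True),  # accepts a melody prompt
}

def pick_checkpoint(size: str, need_melody: bool = False) -> str:
    """Return a repo id for the requested size, enforcing melody support."""
    repo, _params, supports_melody = CHECKPOINTS[size]
    if need_melody and not supports_melody:
        raise ValueError(f"{repo} does not accept a melody prompt; use 'melody'")
    return repo

print(pick_checkpoint("large"))                     # facebook/musicgen-large
print(pick_checkpoint("melody", need_melody=True))  # facebook/musicgen-melody
```

In practice these ids are loaded through the `audiocraft` library or Hugging Face `transformers`; the trade-off is the usual one of sample quality and prompt adherence against memory and generation speed.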

Rankings & Comparison