MusicLM generates high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". It accepts a text prompt, a melody, or an existing track, which it can alter or continue, and it can create seamlessly loopable music from any set of these inputs. Building on AudioLM, MusicLM casts music generation as a hierarchical sequence-to-sequence modeling task and generates audio at a sampling rate of up to 48 kHz. Below we compare the latest version of MusicLM with the originally published version on the same prompts. Recent improvements include classifier-free guidance, improved acoustic tokens, a new backbone architecture designed specifically to operate on these acoustic tokens, and the application of SoundStorm for efficient high-fidelity audio generation.
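This page does not say how classifier-free guidance is applied inside MusicLM. As a rough illustration of the general technique only, the sketch below combines the next-token logits from a text-conditioned pass and an unconditioned pass. The function names, the guidance scale, and the toy logit values are all assumptions made for this example, not details of MusicLM.

```python
import math

def cfg_logits(cond, uncond, scale):
    """Generic classifier-free guidance on logits (illustrative, not MusicLM's code).

    scale = 0 returns the unconditional logits, scale = 1 returns the
    conditional logits, and scale > 1 pushes further toward the condition.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits over a hypothetical 3-token acoustic vocabulary.
cond_logits = [2.0, 0.5, -1.0]    # model pass with the text prompt
uncond_logits = [1.0, 0.5, 0.0]   # model pass with the prompt dropped
guided = cfg_logits(cond_logits, uncond_logits, scale=3.0)
probs = softmax(guided)
```

With a scale above 1, tokens that the text prompt makes more likely gain probability relative to the unconditional prediction, which is the usual motivation for guidance in generative models.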

MusicLM Research Team¹

Andrea Agostinelli, Zalán Borsos, Antoine Caillon, Geoffrey Cideron, Timo Denk, Chris Donahue, Michael Dooley, Jesse Engel, Christian Frank, Sertan Girgin, Qingqing Huang, Aren Jansen, Matej Kastelic, Yunpeng Li, Brian McWilliams, Adam Roberts, Matt Sharifi, Ondrej Skopek, Marco Tagliasacchi, Alex Tudor, Mauro Verzetti, Damien Vincent, Neil Zeghidour and Mauricio Zuluaga

Many thanks to all our collaborators across Alphabet.

1. Includes current and past members in alphabetical order.

Our favorites

Caption MusicFX

Audio Generation From Rich Captions

Caption Paper² MusicFX³
2. Original paper: MusicLM: Generating Music From Text
3. New version: MusicFX in Google Labs


Caption MusicFX⁴
4. Audio is repeated three times to demonstrate a loop.

Painting Caption Conditioning

Painting Paper MusicFX