MusicLM generates high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". As input, it accepts a text prompt, a melody, or an existing track, which it can alter or continue. It can also create seamlessly loopable music from any set of these inputs. Building on AudioLM, MusicLM casts music generation as a hierarchical sequence-to-sequence modeling task and generates music at a sampling rate of up to 48 kHz. Below we compare the latest version of MusicLM to the originally published version on the same prompts. Recent improvements include classifier-free guidance, improved acoustic tokens, a new backbone architecture designed specifically to operate on those acoustic tokens, and the application of SoundStorm to achieve efficient high-fidelity audio generation.
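
For readers unfamiliar with classifier-free guidance, the sketch below shows the generic formulation as it is commonly applied to token logits: the model is evaluated once with the conditioning (for example, the text prompt) and once without it, and the two predictions are extrapolated by a guidance scale. This is an illustration under stated assumptions, not MusicLM's actual implementation; the function name, argument names, and the choice to operate on plain Python lists are hypothetical.

```python
def apply_classifier_free_guidance(cond_logits, uncond_logits, guidance_scale):
    """Combine conditional and unconditional logits for one decoding step.

    Generic classifier-free guidance (assumed form, not taken from MusicLM):
        guided = uncond + guidance_scale * (cond - uncond)

    guidance_scale = 0 ignores the conditioning, 1 reproduces the conditional
    prediction, and values above 1 push the output further toward it.
    """
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("conditional and unconditional logits must have the same length")
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(cond_logits, uncond_logits)
    ]


# Hypothetical logits over a three-token vocabulary.
cond = [2.0, 0.0, -1.0]    # model output given the text prompt
uncond = [0.5, 0.5, 0.5]   # model output with the prompt dropped
guided = apply_classifier_free_guidance(cond, uncond, guidance_scale=3.0)
```

With a scale above 1, tokens that the conditioning favors gain probability mass and tokens it disfavors lose it, which typically trades some diversity for closer adherence to the prompt.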

