Generative ability grounded in creative prompts

Prompts (each paired with audio generated by the waveform model):
A funky hip hop song played with Scottish bagpipe.
Acoustic guitars are playing heavy metal riffs.
A flute plays a salsa song.
Standard jazz song sung in acapella.
Cartoon theme song sung by cats.

More Generation Examples

More generation examples from prompts sampled from the MusicCaps evaluation set.

Prompts (each paired with audio generated by the waveform model and the spectrogram model):
A female vocalist sings this upbeat Latin pop. The song has an upbeat rhythm with a dance groove. The drumming is lively, the percussion instruments add layers and density to the music, the bass line is simple and steady, the keyboard accompaniment adds a nice melody.
The song fits the hip-hop/pop genre from the early 2000's.
This is a psychedelic rock music piece. It could also be playing in the background at a hippie coffee shop.
This audio contains acoustic drums playing a groove with a lot of cymbal hits.
The sampled drums go for a usual hip-hop beat, nothing standing out and together with the sub-woofer bass drive the pulse of the music.

Diversity of generation examples

Prompts (each paired with four generated audio examples):
Moody jazz music with a basic formation (drums, contra bass, piano and trumpet) with a melancholy trumpet solo.
carousel horse
happy funky pop music
A song about the beauty in Hawaii

Broader Impact

We believe our work has the potential to grow into a useful tool for artists and content creators, one that can further enrich their creative pursuits. To live up to this promise, more work is needed with musicians and other stakeholders to develop these models into a meaningful co-creation tool.

We acknowledge the limitations of the proposed model. In particular, large generative models learn to imitate patterns and biases inherent in their training sets, and in our case, the model can propagate the potential biases built into the text and music corpora used to train it. Such biases can be hard to detect, as they manifest in often subtle, unpredictable ways that are not fully captured by our current evaluation benchmarks. Demeaning or other harmful language may be generated in model outputs, due to learned associations or by chance.

Beyond this, we recognize that musical genres are complex and key musical attributes are contextual and change over time. Training data reflect a limited corpus of musical samples and genres, given uneven recording and digitization of samples from global musical cultures. How music is categorized and labeled can essentialize genres, and these labels may be constructed and applied without the participation of communities. We caution readers not to presume that each sample can generalize to an entire musical genre or that one label can capture the diversity of musical genres produced within a region (e.g., "Latin music" contains a broad range of cultures and styles). Moreover, musical samples may sound "authentic" to those outside these communities, as nuances in musical traditions require trained ears or cultural knowledge to recognize. In generating vocals, there may be possible caricatures, "mock accents," parodies, or other demeaning linguistic harms (e.g., "mock Black singing" in a request for "soulful vocals" or "mock Spanish" in a Latin music request) that arise in text prompts requesting cultural or religious musical genres, or genres that emerged as part of the political struggles of certain communities (e.g., Black American music, Nueva canción, Chicano folk, Brazilian Tropicalismo, Sufi Qawwali).

As with any other technology, the results of our research can be misused or abused. We acknowledge the risk of potential misappropriation when the created content exactly matches examples in training data. Duplication checks are a built-in part of our current pipeline for producing and releasing examples, and will continue to be for any future work.

Efforts to identify and address potential safety issues are important components of improving these generative models. Until there is a clearer understanding of the limitations and risks, we do not intend to release the model.

Authors

Qingqing Huang*,   Daniel S. Park*,   Tao Wang,   Timo I. Denk,   Andy Ly,   Nanxin Chen,   Zhengdong Zhang,   Zhishuai Zhang,   Jiahui Yu,   Christian Frank,   Jesse Engel,   Quoc V. Le,   William Chan,   Wei Han.

Core contributor; *Equal contribution.

Acknowledgements

We are grateful to Aren Jansen for building MuLan, an indispensable component of this project. We thank Austin Tarango, Fernando Diaz, Kathy Meier-Hellstern, Molly FitzMorris, and Renee Shelby for helping us incorporate important responsible AI practices into this project. We acknowledge support from Blake Cunningham and Cara Adams for advising us throughout the project and assisting us with the publication process. We appreciate valuable feedback and support from Alex Ku, Andrea Agostinelli, Ankur Bapna, Chen Liang, Ed Chi, Ekin Dogus Cubuk, Erica Moreira, Esteban Real, Heiga Zen, Jaehoon Lee, James Qin, Nathan Park, Stephen Kelly, Thang Luong, Weizhe Hua, Ye Jia, Yifeng Lu, Yonghui Wu, Yu Zhang, and Yuma Koizumi. Special thanks to the authors of MusicLM for helpful discussions and cooperation, and especially for sharing their evaluation set and manuscript before publication.