Brain2Music: Reconstructing Music from Human Brain Activity

Timo I. Denk*¹ Yu Takagi*²,³
Takuya Matsuyama² Andrea Agostinelli¹ Tomoya Nakai⁴
Christian Frank¹ Shinji Nishimoto²,³

¹Google ²Osaka University, Japan ³NICT, Japan ⁴Araya Inc., Japan

Abstract

The process of reconstructing experiences from human brain activity offers a unique lens into how the brain interprets and represents the world. In this paper, we introduce a method for reconstructing music from brain activity, captured using functional magnetic resonance imaging (fMRI). Our approach uses either music retrieval or the MusicLM music generation model conditioned on embeddings derived from fMRI data. The generated music resembles the musical stimuli that human subjects experienced, with respect to semantic properties like genre, instrumentation, and mood. We investigate the relationship between different components of MusicLM and brain activity through a voxel-wise encoding modeling analysis. Furthermore, we discuss which brain regions represent information derived from purely textual descriptions of music stimuli.

Please find the full paper on arXiv

*Equal contribution; correspondence to timodenk@google.com and takagi.yuu.fbs@osaka-u.ac.jp

Music Reconstruction with MusicLM (Highlights)

This section contains three manually selected highlights (the best out of 10). The left-most column contains the stimulus, i.e., the music that our human test subjects were exposed to while their brain activity was recorded. The following three columns contain three samples from MusicLM which aim to reconstruct the original music.

Stimulus (GTZAN music)	Reconstructions (generated by MusicLM)

An overview of our Brain2Music pipeline: High-dimensional fMRI responses are condensed into the semantic, 128-dimensional music embedding space of MuLan (Huang et al., 2022). Subsequently, the music generation model, MusicLM (Agostinelli et al., 2023), is conditioned to generate the music reconstruction, resembling the original stimulus. As an alternative we consider retrieving music from a large database, instead of generating it.
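The first stage of the pipeline, mapping fMRI responses to a music embedding, can be sketched as a regularized linear regression. This page does not say how the responses are condensed, so ridge regression is an assumption here, as are the helper names (`fit_ridge`, `predict`) and the toy dimensions (2 voxels and a 3-dimensional embedding instead of whole-brain data and 128 dimensions). Conditioning MusicLM on the predicted embedding is not shown.

```python
from typing import List

Matrix = List[List[float]]


def transpose(a: Matrix) -> Matrix:
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product a @ b."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a: Matrix, b: Matrix) -> Matrix:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        # Move the row with the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_ridge(x: Matrix, y: Matrix, alpha: float) -> Matrix:
    """Weights W minimising ||x W - y||^2 + alpha * ||W||^2 (assumed model).

    x: one row per stimulus, one column per voxel.
    y: one row per stimulus, one column per embedding dimension.
    """
    xt = transpose(x)
    gram = matmul(xt, x)
    for i in range(len(gram)):
        gram[i][i] += alpha  # the ridge penalty keeps the system well-posed
    return solve(gram, matmul(xt, y))


def predict(x: Matrix, w: Matrix) -> Matrix:
    """Predicted embedding for each row of fMRI responses."""
    return matmul(x, w)
```

Calling `fit_ridge` on training scans and their embeddings, then `predict` on held-out scans, yields one embedding per scan. With real data the number of voxels far exceeds the number of stimuli, so a numerical library and a tuned `alpha` would be more practical than this dependency-free version.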

Comparison of Retrieval and Generation

Below we compare retrieval from FMA with generation using MusicLM (three samples). The results are random samples from test subject 1. There is one row for each of the 10 GTZAN genres.

Stimulus	Retrieval	Generation with MusicLM
GTZAN music	From FMA	Gen #1	Gen #2	Gen #3
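The retrieval alternative can be sketched as a nearest-neighbor search: the embedding predicted from fMRI is compared against precomputed embeddings of database clips, and the closest clips are returned. Cosine similarity is an assumption, since this page does not name the similarity measure, and the function names and clip identifiers below are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(
    query: Sequence[float],
    database: Dict[str, Sequence[float]],
    top_k: int = 1,
) -> List[str]:
    """Return the names of the top_k clips most similar to the query embedding."""
    ranked = sorted(
        database,
        key=lambda name: cosine_similarity(query, database[name]),
        reverse=True,
    )
    return ranked[:top_k]
```

Unlike generation, retrieval can only return clips that already exist in the database, so its quality depends on how well the database covers the stimuli.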

Comparison Across Subjects

Below we compare retrieval from FMA with generation using MusicLM across all five subjects for which fMRI data has been collected. The results are random samples. There is one row for each of the 10 GTZAN genres.

Stimulus	Retrieval (FMA)					Generation (MusicLM)
Stimulus	Subject 1	Subject 2	Subject 3	Subject 4	Subject 5	Subject 1	Subject 2	Subject 3	Subject 4	Subject 5

Encoding: Whole-brain Voxel-wise Modeling

By constructing a brain encoding model, we find that two components of MusicLM (MuLan and w2v-BERT) have some degree of correspondence with human brain activity in the auditory cortex.

We also find that the brain regions representing information derived from text and music overlap.
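One common way to quantify the correspondence reported by an encoding analysis is to predict each voxel's response from model features and then score every voxel separately. The sketch below covers only the scoring step and assumes Pearson correlation between predicted and measured responses on held-out stimuli; the metric and the function names are assumptions, not details stated on this page.

```python
import math
from typing import List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences; 0.0 if one is constant."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def voxelwise_scores(
    predicted: Sequence[Sequence[float]],
    measured: Sequence[Sequence[float]],
) -> List[float]:
    """One correlation per voxel.

    Both inputs have one row per held-out stimulus and one column per voxel.
    """
    predicted_columns = zip(*predicted)
    measured_columns = zip(*measured)
    return [pearson(p, m) for p, m in zip(predicted_columns, measured_columns)]
```

Voxels with high scores are those whose activity the model features predict well; mapping the scores back onto the brain shows where the correspondence is located.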

GTZAN Music Captions

We release a music caption dataset for the subset of GTZAN clips for which there are fMRI scans. Below are ten examples from the dataset.

Clip name	GTZAN music (15s slice)	Human-written text caption
blues.00017		It is lazy blues with a laid-back tempo and relaxed atmosphere. The band structure is simple, with the background rhythm punctuated by bass and guitar cutting. The impressive phrasing of the lead guitar gives the piece a nostalgic impression.
classical.00008		Several violins play the melody. The melody is simple and almost unison, but it moves between minor and major keys and changes expression from one to the other.
country.00012		This is a classic country song. You can hear clear singing and crisp acoustic guitar cutting. The wood bass provides a solid groove with a two-beat rhythm. This is country music at its best. Ideal for nature scenes and homely atmospheres.
disco.00004		This music piece has a disco sound. Vocals and chorus create extended harmonies. The synthesiser creates catchy melodies, while the drumming beats rhythmically. Effective tambourine sounds accentuate the rhythms and add further dynamism. This music is perfect for dance parties, club floors and other scenes of dancing and fun.
hiphop.00014		This is a rap-rock piece with a lot of energy. The distorted guitars are impressive and provide an energetic sound. The bass is an eight beat, creating a dynamic groove. The drums provide the backbone of the rhythm section with their powerful hi-hats. The vocal and chorus interaction conveys tension and passion and draws the audience in.
jazz.00040		This is medium-tempo old jazz with female vocals. The band is a small band similar to a Dixie Jazz formation, including clarinet, trumpet and trombone. The vocal harmonies are supported by a piano and brass ensemble on a four beat with drums and bass.
metal.00026		This is a metal instrumental piece with technical guitar solos and distortion effects. The heavy, powerful bass creates a sense of speed, and the snare, bass and guitar create a sense of unity in unison at the end. It is full of over-the-top playing techniques and intense energy.
pop.00032		Passionate pops piece with clear sound and female vocals. The synth accompaniment spreads out pleasantly and the tight bass grooves along. The beat-oriented drums drive the rhythm, creating a strong and lively feeling. Can be used as background music in cafés and lounges to create a relaxed atmosphere.
reggae.00013		This reggae piece combines smooth, melodic vocals with a clear, high-pitched chorus. The bass is swingy and supports the rhythm, while whistles and samplers of life sounds can be heard. It is perfect for relaxing situations, such as reading in a laid-back café or strolling around town.
rock.00032		This rock piece is characterised by its extended vocals. The guitar plays scenically, while the bass enhances the melody with rhythmic fills. The drums add dynamic rhythms to the whole piece. This music is ideal for scenes with a sense of expansiveness and freedom, such as mountainous terrain with spectacular natural scenery or driving scenes on the open road.

FAQ

General

Could you describe your paper briefly?

In this paper, we explore the relationship between human brain activity recorded while subjects listen to music and MusicLM, Google's music generation model that can create music from a text description. As part of this study, we observe that:

When a human and MusicLM are exposed to the same music, the internal representations of MusicLM are correlated with brain activity in certain regions.
When we use data from such regions as an input for MusicLM, we can predict and reconstruct the kind of music that the human subject was exposed to.

Are there studies of brain decoding and encoding for music, or more generally for sound processing? What is new about this study?

Previous studies have examined human brain activity while participants listened to music and identified representations of musical features in the brain. With recently published music generation models such as MusicLM, it is now possible to study whether human brain activity can be converted into music.

Methods

What is MusicLM?

MusicLM is a language model trained on music. It can generate music from a text description, or from other inputs such as a hummed melody accompanied by text describing the desired style. It is based on AudioLM, a generic framework that uses language models to generate high-fidelity audio. You can try out MusicLM in Google Labs.

What brain data did you use?

We used functional magnetic resonance imaging (fMRI). This technique detects changes associated with blood flow and relies on the fact that cerebral blood flow and neuronal activation are coupled. That is, it looks for indicators of changed blood flow and uses those to measure brain activity in the regions of interest.

Limitations

Three main factors currently limit the quality of the music reconstructed from human brain signals:

The information contained in the fMRI data, which is temporally and spatially very sparse (the observed regions are 2×2×2 mm³ in size, many orders of magnitude larger than human neurons).
The information contained in the music embeddings from which we reconstruct the music (we used MuLan, in which ten seconds of music are represented by just 128 numbers).
The limitations of our music generation system. When we studied MusicLM, we saw that it has room to improve both in how closely it adheres to the text prompt and in the fidelity of the produced audio.

Taken together, these factors mean that the reconstructed music is not always close to what the subjects actually heard during the brain scan.

Could the model be transferred to a novel subject?

Because the brain’s anatomy differs from one person to the next, it is not possible to directly apply a model created for one individual to another. Several methods have been proposed to compensate for these differences, and it would be possible to use such methods to transfer models across subjects with a certain degree of accuracy.

What are the ethical/privacy issues that stem from this study?

Note that the decoding technology described in this paper is unlikely to become practical in the near future. In particular, reading fMRI signals from the brain requires a volunteer to spend many hours in a large fMRI scanner. The technology as described here therefore has no immediate privacy implications. Nevertheless, as in this study, any such analysis must only be performed with the informed consent of the studied individuals.

Outlook

What are the future directions?

Note that in this work, we perform music reconstruction from fMRI signals that were recorded while human subjects were listening to music stimuli through headphones. More sophisticated technology to examine the human brain will further advance the research in this area and the model’s capability to generate music matching the heard stimulus. An exciting next step is to attempt the reconstruction of music or musical attributes from a person’s imagination.

What are the potential applications?

Given the amount of work required to obtain fMRI signals from the brain, we do not have direct applications in mind. This work has been motivated by a fundamental research question: Does the MusicLM music generation model contain components that are mirrored in the human brain, and if so, which ones? And if such connections exist, what would music inspired by signals from the human brain sound like?