Brain2Music: Reconstructing Music from Human Brain Activity

Timo I. Denk∗1 Yu Takagi∗2,3
Takuya Matsuyama2 Andrea Agostinelli1 Tomoya Nakai4
Christian Frank1 Shinji Nishimoto2,3

1Google 2Osaka University, Japan 3NICT, Japan 4Araya Inc., Japan

Abstract  The process of reconstructing experiences from human brain activity offers a unique lens into how the brain interprets and represents the world. In this paper, we introduce a method for reconstructing music from brain activity, captured using functional magnetic resonance imaging (fMRI). Our approach uses either music retrieval or the MusicLM music generation model conditioned on embeddings derived from fMRI data. The generated music resembles the musical stimuli that human subjects experienced, with respect to semantic properties like genre, instrumentation, and mood. We investigate the relationship between different components of MusicLM and brain activity through a voxel-wise encoding modeling analysis. Furthermore, we discuss which brain regions represent information derived from purely textual descriptions of music stimuli.

Please find the full paper on arXiv

*Equal contribution; correspondence to timodenk@google.com and takagi.yuu.fbs@osaka-u.ac.jp

Music Reconstruction with MusicLM (Highlights)

This section contains three manually selected highlights (the best out of 10). The left-most column contains the stimulus, i.e., the music that our human test subjects were exposed to while their brain activity was recorded. The following three columns contain three samples from MusicLM, each aiming to reconstruct the original music.
An overview of our Brain2Music pipeline: High-dimensional fMRI responses are condensed into the semantic, 128-dimensional music embedding space of MuLan (Huang et al., 2022). Subsequently, the music generation model MusicLM (Agostinelli et al., 2023) is conditioned on this embedding to generate a music reconstruction that resembles the original stimulus. As an alternative, we consider retrieving music from a large database instead of generating it.
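
To make the decoding step concrete, the sketch below maps fMRI responses into the MuLan embedding space with L2-regularized linear regression. The file names, array shapes, and the use of scikit-learn's RidgeCV are illustrative assumptions rather than the exact setup of the paper.

# A minimal sketch of the decoding step, assuming fMRI responses and MuLan
# embeddings are available as NumPy arrays (file names and shapes are illustrative).
import numpy as np
from sklearn.linear_model import RidgeCV
X_train = np.load("fmri_responses_train.npy")    # (n_clips, n_voxels)
Y_train = np.load("mulan_embeddings_train.npy")  # (n_clips, 128)
X_test = np.load("fmri_responses_test.npy")      # (n_test_clips, n_voxels)
# Fit one regularized linear map from voxel space to the 128-dimensional MuLan
# space, choosing the regularization strength by cross-validation.
decoder = RidgeCV(alphas=np.logspace(0, 5, 11))
decoder.fit(X_train, Y_train)
# The predicted embeddings can condition MusicLM or drive retrieval from a database.
Y_pred = decoder.predict(X_test)                 # (n_test_clips, 128)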

Comparison of Retrieval and Generation

Below we compare retrieval from FMA with generation using MusicLM (three samples). The results are random samples from test subject 1. There is one row for each of the 10 GTZAN genres.
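
As a rough illustration of the retrieval alternative, the sketch below selects the FMA clip whose precomputed MuLan embedding is closest to the embedding predicted from fMRI. The file names and the choice of cosine similarity are assumptions made for this example.

# Hypothetical nearest-neighbor retrieval in MuLan embedding space.
import numpy as np
fma_embeddings = np.load("fma_mulan_embeddings.npy")  # (n_fma_clips, 128)
y_pred = np.load("predicted_embedding.npy")           # (128,) embedding decoded from fMRI
def cosine_similarity(query, database):
    # Similarity between one query vector and every row of a database matrix.
    query = query / np.linalg.norm(query)
    database = database / np.linalg.norm(database, axis=1, keepdims=True)
    return database @ query
scores = cosine_similarity(y_pred, fma_embeddings)
best_clip_index = int(np.argmax(scores))  # index of the retrieved FMA clip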

Comparison Across Subjects

Below we compare retrieval from FMA with generation using MusicLM across all five subjects for whom fMRI data was collected. The results are random samples. There is one row for each of the 10 GTZAN genres.

Encoding: Whole-brain Voxel-wise Modeling


By constructing a brain encoding model, we find that two components of MusicLM (MuLan and w2v-BERT) have some degree of correspondence with human brain activity in the auditory cortex.
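
Under simplified assumptions, a voxel-wise encoding analysis of this kind can be sketched as follows: each voxel's response is predicted from a MusicLM-internal embedding (e.g., MuLan or w2v-BERT features) with ridge regression, and voxels are scored by the correlation between predicted and measured activity on held-out data. File names, shapes, and the fixed regularization strength are illustrative.

import numpy as np
from sklearn.linear_model import Ridge
features_train = np.load("mulan_features_train.npy")  # (n_clips, n_features)
features_test = np.load("mulan_features_test.npy")    # (n_test_clips, n_features)
voxels_train = np.load("fmri_train.npy")              # (n_clips, n_voxels)
voxels_test = np.load("fmri_test.npy")                # (n_test_clips, n_voxels)
# One linear model per voxel, fit jointly with a shared regularization strength.
encoder = Ridge(alpha=100.0)
encoder.fit(features_train, voxels_train)
pred = encoder.predict(features_test)                 # (n_test_clips, n_voxels)
# Per-voxel Pearson correlation between predicted and measured activity; voxels
# with high correlation (e.g., in the auditory cortex) are captured by the model.
pred_z = (pred - pred.mean(0)) / pred.std(0)
meas_z = (voxels_test - voxels_test.mean(0)) / voxels_test.std(0)
voxel_correlation = (pred_z * meas_z).mean(0)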

We also find that the brain regions representing information derived from text and music overlap.

GTZAN Music Captions

We release a music caption dataset for the subset of GTZAN clips for which there are fMRI scans. Below are ten examples from the dataset.

FAQ

General

Could you describe your paper briefly?

In this paper, we explore the relationship between the brain activity observed while human subjects listen to music and Google's MusicLM, a music generation model that can create music from a text description. As part of this study, we observe that

  • When a human and MusicLM are exposed to the same music, the internal representations of MusicLM are correlated with brain activity in certain regions.
  • When we use data from such regions as input for MusicLM, we can predict and reconstruct the kind of music the human subject was exposed to.

Are there studies of brain decoding and encoding for music, or more generally for sound processing? What is new about this study?

Previous studies have examined human brain activity while participants listened to music and discovered representations of musical features in the brain. What is new is that, with recently published music generation models such as MusicLM, it is now possible to study whether human brain activity can be converted into music with such models.

Methods

What is MusicLM?

MusicLM is a language model that was trained on music. It can create music given a text describing the desired music, or other inputs such as a hummed melody together with a text describing the desired style. It is based on a generic framework called AudioLM, which uses language models to generate audio at high fidelity. You can try out MusicLM in Google Labs.

What brain data did you use?

We used functional magnetic resonance imaging (fMRI). This technique detects changes associated with blood flow and relies on the fact that cerebral blood flow and neuronal activation are coupled. That is, it looks for indicators of changed blood flow and uses them to measure brain activity in the regions of interest.

Limitations

There are three main factors currently limiting the quality of music reconstructed from human brain signals:

  • the information contained in the fMRI data, which is temporally and spatially very coarse (each observed voxel is 2×2×2 mm³ in size, many orders of magnitude larger than a single neuron);
  • the information contained in the music embeddings from which we reconstruct the music (we used MuLan, in which ten seconds of music are represented by just 128 numbers; see the short calculation below);
  • the limitations of our music generation system: when we studied MusicLM, we saw that it has room to improve both in how well it adheres to the text prompt and in the fidelity of the produced audio.

Taken together, these factors mean that the reconstructed music is not always close to what the subjects actually heard when the brain scan was done.
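
For the second factor, a back-of-the-envelope calculation shows how aggressive the compression into a MuLan embedding is (the 16 kHz mono sample rate is an assumption chosen for this illustration):

sample_rate_hz = 16_000                          # assumed mono sample rate
clip_seconds = 10
waveform_values = sample_rate_hz * clip_seconds  # 160,000 samples of raw audio
embedding_values = 128                           # one MuLan embedding per clip
compression_factor = waveform_values / embedding_values
print(f"{compression_factor:.0f}x fewer numbers than the raw waveform")  # 1250x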

Could the model be transferred to a novel subject?

Because the brain’s anatomy differs from one person to the next, it is not possible to directly apply a model created for one individual to another. Several methods have been proposed to compensate for these differences, and it would be possible to use such methods to transfer models across subjects with a certain degree of accuracy.
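
As one hedged illustration of what such a transfer could look like, the sketch below uses a simple form of functional alignment: assuming both subjects were scanned while hearing a shared set of stimuli, a regularized linear map from the new subject's voxel space into the reference subject's voxel space is learned and prepended to the existing decoder. This is a generic example of the idea, not the specific method evaluated in the paper.

import numpy as np
from sklearn.linear_model import Ridge
ref_responses = np.load("subject1_shared_stimuli.npy")  # (n_shared_clips, n_voxels_ref)
new_responses = np.load("subject2_shared_stimuli.npy")  # (n_shared_clips, n_voxels_new)
# Learn a regularized linear map: new subject's voxel space -> reference subject's.
alignment = Ridge(alpha=1000.0)
alignment.fit(new_responses, ref_responses)
# Later scans of the new subject can be mapped into the reference space and fed
# to the decoder that was trained on the reference subject.
new_scan = np.load("subject2_new_scan.npy")             # (n_clips, n_voxels_new)
aligned_scan = alignment.predict(new_scan)              # (n_clips, n_voxels_ref)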

What are the ethical/privacy issues that stem from this study?

Note that the decoding technology described in this paper is unlikely to become practical in the near future. In particular, reading fMRI signals from the brain requires a volunteer to spend many hours in a large fMRI scanner. While there are no immediate privacy implications of the technology as described here, any such analysis must, as in this study, only be performed with the informed consent of the studied individuals.

Outlook

What are the future directions?

Note that in this work, we perform music reconstruction from fMRI signals recorded while human subjects listened to music stimuli through headphones. More sophisticated technology for examining the human brain will further advance research in this area and the model's capability to generate music matching the heard stimulus. An exciting next step is to attempt the reconstruction of music or musical attributes from a person's imagination.

What are the potential applications?

Given the amount of work required to obtain fMRI signals from the brain, we do not have direct applications in mind. This work has been motivated by a fundamental research question: Does the MusicLM music generation model contain components that are mirrored in the human brain, and if so, which are those? And if such connections exist, what would music sound like that is inspired by signals from the human brain?