SoundStorm: Efficient Parallel Audio Generation

[paper]

Zalán Borsos, Matt Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour, Marco Tagliasacchi

Google Research

Abstract. We present SoundStorm, a model for efficient, non-autoregressive audio generation. SoundStorm receives as input the semantic tokens of AudioLM, and relies on bidirectional attention and confidence-based parallel decoding to generate the tokens of a neural audio codec. Compared to the autoregressive generation approach of AudioLM, our model produces audio of the same quality and with higher consistency in voice and acoustic conditions, while being two orders of magnitude faster. SoundStorm generates 30 seconds of audio in 0.5 seconds on a TPU-v4. We demonstrate the ability of our model to scale audio generation to longer sequences by synthesizing high-quality, natural dialogue segments, given a transcript annotated with speaker turns and a short prompt with the speakers' voices.
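To make the decoding scheme mentioned in the abstract concrete, below is a minimal, illustrative sketch of confidence-based parallel decoding for a single RVQ level, in the spirit of MaskGIT-style masked token modeling. The model interface (`predict_logits`), the cosine schedule, and the number of steps are assumptions for illustration, not the released implementation.

```python
import numpy as np

MASK = -1  # sentinel id for positions that still need to be generated


def parallel_decode_level(predict_logits, seq_len, vocab_size, num_steps=8, rng=None):
    """Iteratively fill a fully masked token sequence for one RVQ level.

    predict_logits: callable taking the current (partially masked) token array and
        returning logits of shape (seq_len, vocab_size); it stands in for a forward
        pass of a bidirectional-attention model.
    """
    rng = rng or np.random.default_rng(0)
    tokens = np.full(seq_len, MASK, dtype=np.int64)

    for step in range(num_steps):
        masked = tokens == MASK
        num_masked = int(masked.sum())
        if num_masked == 0:
            break

        logits = predict_logits(tokens)                        # (seq_len, vocab_size)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)

        # Sample a candidate token for every position; only masked ones can be committed.
        candidates = np.array([rng.choice(vocab_size, p=p) for p in probs])
        confidence = probs[np.arange(seq_len), candidates]
        confidence = np.where(masked, confidence, -np.inf)     # committed tokens stay fixed

        # Schedule: commit progressively more of the remaining tokens per step,
        # keeping all of them in the final step.
        frac = np.cos(np.pi / 2 * (1.0 - (step + 1) / num_steps))
        num_commit = min(num_masked, max(1, int(round(frac * num_masked))))

        # Commit the highest-confidence candidates among the masked positions.
        commit_idx = np.argsort(-confidence)[:num_commit]
        tokens[commit_idx] = candidates[commit_idx]

    return tokens
```

In the full model, decoding proceeds RVQ level by level, from coarse to fine; the sketch above treats a single level in isolation.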

Dialogue Synthesis

SoundStorm, coupled with the text-to-semantic modeling stage of SPEAR-TTS (Kharitonov et al., 2023), can synthesize high-quality, natural dialogues, allowing one to control the spoken content (via transcripts), the speaker voices (via short voice prompts), and the speaker turns (via transcript annotations). When synthesizing dialogue segments of 30 seconds, we measured a runtime of 2 seconds on a single TPU-v4. The following text and speakers were not seen during training.

[Audio sample table: Text | Voice Prompt | Synthesized Dialogue]
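As a rough illustration of the pipeline described above, the sketch below wires the three stages together: a text-to-semantic model (the SPEAR-TTS stage), SoundStorm as the semantic-to-codec-token generator, and the neural codec decoder. All class and function names, as well as the transcript annotation format, are placeholders for illustration rather than a released API.

```python
from typing import Protocol

import numpy as np


class TextToSemantic(Protocol):
    """Text-to-semantic stage, as in SPEAR-TTS."""
    def generate(self, transcript: str) -> np.ndarray: ...


class AcousticGenerator(Protocol):
    """Semantic tokens plus voice prompt -> neural codec tokens (SoundStorm's role)."""
    def generate(self, semantic_tokens: np.ndarray, voice_prompt: np.ndarray) -> np.ndarray: ...


class CodecDecoder(Protocol):
    """Neural codec decoder: codec tokens -> waveform."""
    def decode(self, codec_tokens: np.ndarray) -> np.ndarray: ...


def synthesize_dialogue(transcript: str,
                        voice_prompt: np.ndarray,
                        text_to_semantic: TextToSemantic,
                        soundstorm: AcousticGenerator,
                        codec: CodecDecoder) -> np.ndarray:
    """Annotated transcript + short voice prompt -> dialogue waveform."""
    semantic_tokens = text_to_semantic.generate(transcript)
    codec_tokens = soundstorm.generate(semantic_tokens, voice_prompt)
    return codec.decode(codec_tokens)


# Illustrative transcript format with speaker-turn annotations; the actual scheme may differ.
example_transcript = (
    "[Speaker A] Did you catch the game last night? "
    "[Speaker B] I did, what a finish!"
)
```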

Unprompted and Prompted Generation

We demonstrate SoundStorm's ability to generate audio conditioned on the semantic tokens of AudioLM (Borsos et al., 2022), with and without 3-second voice prompts. SoundStorm samples different speakers in the unprompted case and maintains the speaker's voice with high consistency in the prompted case, while generating audio two orders of magnitude faster than AudioLM's acoustic generator. The original samples are from LibriSpeech test-clean.

[Audio sample table: Original | Unprompted | Prompted]
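To make the difference between the two cases concrete, here is a hedged sketch of how the initial decoding canvas could be set up: with a prompt, the prompt's codec tokens are written in and kept fixed, so decoding only fills the continuation; without one, every position starts masked and the model is free to sample a speaker. The frame rate, helper name, and vocabulary size are assumptions for illustration.

```python
import numpy as np

MASK = -1               # sentinel id for "still to be generated"
FRAMES_PER_SECOND = 50  # assumed codec frame rate


def init_canvas(total_seconds: float,
                prompt_tokens: np.ndarray | None = None) -> np.ndarray:
    """Build the initial per-level token canvas for parallel decoding."""
    total_frames = int(total_seconds * FRAMES_PER_SECOND)
    canvas = np.full(total_frames, MASK, dtype=np.int64)
    if prompt_tokens is not None:
        # Prompted case: prefix the prompt's codec tokens; they are never re-masked,
        # which is what keeps the speaker's voice consistent in the continuation.
        canvas[: len(prompt_tokens)] = prompt_tokens
    return canvas


# Unprompted: everything is masked, so the model samples speaker and acoustic conditions.
unprompted = init_canvas(total_seconds=10.0)

# Prompted: a 3-second prompt pins down the voice; only the rest is generated.
prompt = np.random.randint(0, 1024, size=3 * FRAMES_PER_SECOND)  # stand-in for real prompt tokens
prompted = init_canvas(total_seconds=10.0, prompt_tokens=prompt)
```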

Baselines

In the prompted case, SoundStorm's generations have higher acoustic consistency and preserve the voice of the prompt speaker better than AudioLM's. Compared to RVQ level-wise greedy decoding with the same model, SoundStorm also produces higher-quality audio.

[Audio sample table: Original | AudioLM | Greedy | SoundStorm]
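For contrast with the confidence-based decoder sketched earlier, a minimal version of the RVQ level-wise greedy baseline mentioned above could look as follows: one forward pass per level, committing the argmax token at every position at once, with no iterative refinement. `predict_logits` is the same assumed interface as in the earlier sketch.

```python
import numpy as np

MASK = -1  # same sentinel as in the parallel-decoding sketch


def greedy_decode_level(predict_logits, seq_len: int) -> np.ndarray:
    """One forward pass per RVQ level: argmax everywhere, no refinement."""
    tokens = np.full(seq_len, MASK, dtype=np.int64)
    logits = predict_logits(tokens)   # (seq_len, vocab_size)
    return logits.argmax(-1)
```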

Broader impact

SoundStorm is a model for high-quality, efficient generation of neural audio codec-derived representations of audio. In this work, we use it as a replacement for the acoustic generation pipeline of AudioLM and SPEAR-TTS. We acknowledge that the audio samples produced by the model may reflect biases present in the training data, for instance in terms of represented accents and voice characteristics. In our generated samples, we demonstrate that we can reliably control speaker characteristics via prompting; however, a more thorough analysis of the training data and its limitations is an area for future work, in line with our responsible AI principles. In turn, the ability to mimic a voice can have numerous malicious applications, including bypassing biometric identification and impersonation. Thus, it is crucial to put safeguards against potential misuse in place: to this end, we have verified that, after replacing the acoustic generator with SoundStorm, the generated audio remains detectable by a dedicated classifier (98.5% detection accuracy using the same classifier as Borsos et al. (2022)). Hence, as a component of a larger system, we believe that SoundStorm is unlikely to introduce risks beyond those discussed by Borsos et al. (2022) and Kharitonov et al. (2023). At the same time, we hope that relaxing the memory and computational requirements of AudioLM will make research in the domain of audio generation accessible to a wider community. In the future, we plan to explore other approaches for detecting synthesized speech, e.g., audio watermarking, so that any potential product usage of this technology strictly follows our responsible AI principles.