Audio samples from "SEANet: A Multi-modal Speech Enhancement Network"

Paper: arXiv

Authors: Marco Tagliasacchi, Yunpeng Li, Karolis Misiunas, Dominik Roblek

Abstract: We explore the possibility of leveraging accelerometer data to perform speech enhancement in very noisy conditions. Although it is possible to only partially reconstruct the user's speech from the accelerometer, the latter provides a strong conditioning signal that is not influenced by noise sources in the environment. Based on this observation, we feed a multi-modal input to SEANet (Sound EnhAncement Network), a wave-to-wave fully convolutional model, which adopts a combination of feature losses and adversarial losses to reconstruct an enhanced version of the user's speech. We trained our model with data collected by sensors mounted on an earbud and synthetically corrupted by adding different kinds of noise sources to the audio signal. Our experimental results demonstrate that it is possible to achieve very high quality results, even in the case of interfering speech at the same level of loudness. A sample of the output produced by our model is available at
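The abstract describes synthetically corrupting clean audio by adding noise, with interfering speech "at the same level of loudness," i.e. mixing at 0 dB signal-to-noise ratio. A minimal sketch of such a mixing step (a hypothetical helper based on standard RMS scaling, not the authors' data pipeline):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db=0.0):
    """Add `noise` to `speech` after scaling it to the target SNR.

    snr_db = 0 corresponds to interfering audio at the same loudness
    as the speech. Illustrative helper, not from the paper's code.
    """
    # RMS level of each signal.
    rms_speech = np.sqrt(np.mean(speech ** 2))
    rms_noise = np.sqrt(np.mean(noise ** 2))
    # Gain that places the noise snr_db below the speech level.
    gain = (rms_speech / rms_noise) * 10.0 ** (-snr_db / 20.0)
    return speech + gain * noise

# Example: corrupt a clean tone with white noise at 0 dB SNR.
sr = 16000
clean = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
noise = np.random.default_rng(0).standard_normal(sr)
mixed = mix_at_snr(clean, noise, snr_db=0.0)
```

At 0 dB the scaled noise has the same RMS as the clean signal, which is the hardest condition reported in the abstract.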

About: The audio examples on this page have been randomly selected from the evaluation datasets. Each example contains three rows: (1) the clean sample; (2) the mixed sample, where the audio channel has been mixed with another speech sample or a noise sample; and (3) our result for denoising the audio channel. The five columns contain: (a) key information; (b) the embedded audio sample; (c) the audio waveform, where blue is the clean audio and orange is the mixed or denoised sample; (d) the mel log spectrogram of the audio; (e) the mel log spectrogram of the accelerometer signal. Note that the accelerometer signal is the same for the clean and mixed samples.
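Columns (d) and (e) show mel log spectrograms. As a rough illustration of how such a representation is computed (a numpy-only sketch; the frame length, hop, and mel parameters here are illustrative defaults, not the settings used for these figures):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(x, sr=16000, n_fft=512, hop=128, n_mels=64):
    """Log-compressed mel spectrogram of a 1-D signal (illustrative)."""
    # Frame the signal and apply a Hann window.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # Power spectrogram via the real FFT: (n_frames, n_fft // 2 + 1).
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular mel filterbank spanning 0 .. sr/2.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for b in range(lo, mid):
            if mid > lo:
                fb[i, b] = (b - lo) / (mid - lo)
        for b in range(mid, hi):
            if hi > mid:
                fb[i, b] = (hi - b) / (hi - mid)
    # Log compression with a small floor for numerical stability.
    return np.log(power @ fb.T + 1e-6)
```

Plotting the resulting matrix (time on one axis, mel band on the other) yields images like those in columns (d) and (e); in practice a library routine such as `librosa.feature.melspectrogram` is typically used instead of hand-rolling the filterbank.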

Mixed speech examples

(a) Key information (b) Embedded audio sample (c) Audio waveform (d) Mel log spectrogram for audio (e) Mel log spectrogram for accelerometer

[Interactive table: one row each for the clean sample, the noisy (mixed) sample, and the denoised sample; the embedded audio players and plots are not reproduced here.]