Audio samples for "SpeechPainter: Text-conditioned Speech Inpainting"

Zalán Borsos, Matt Sharifi, Marco Tagliasacchi

Capabilities of SpeechPainter

SpeechPainter receives as input the transcript and an utterance containing a gap of at most one second. It must learn to identify the part of the transcript corresponding to the utterance (dark gray) and to the gap (bold), and fill in the gap with the correct content while maintaining speaker identity, prosody and recording environment conditions. The following samples from unseen speakers are randomly selected from LibriTTS (test-clean and test-other splits) and VCTK.
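The input described above can be illustrated with a small sketch. It is not the authors' preprocessing: it assumes the gap is simulated by zeroing waveform samples, that gap start and gap length are given in seconds, and that the audio is sampled at 16 kHz. The function name `mask_gap` and its parameters are hypothetical; only the one-second limit comes from the text.

```python
from typing import List, Sequence, Tuple

MAX_GAP_SECONDS = 1.0  # from the text: the gap is at most one second


def mask_gap(
    samples: Sequence[float],
    sample_rate: int,
    gap_start: float,
    gap_length: float,
) -> Tuple[List[float], Tuple[int, int]]:
    """Zero out a gap in a waveform (assumed masking scheme, not the paper's).

    gap_start and gap_length are assumed to be in seconds.
    Returns the masked copy and the (start, end) sample indices of the gap.
    """
    if not 0.0 < gap_length <= MAX_GAP_SECONDS:
        raise ValueError("gap length must be in (0, 1] seconds")
    start = round(gap_start * sample_rate)
    if start < 0 or start >= len(samples):
        raise ValueError("gap start lies outside the utterance")
    # Clip the gap end so it never runs past the end of the utterance.
    end = min(round((gap_start + gap_length) * sample_rate), len(samples))
    masked = list(samples)
    masked[start:end] = [0.0] * (end - start)
    return masked, (start, end)


# Usage: a 3-second dummy utterance with a gap starting at 0.89 s, lasting 0.84 s.
utterance = [1.0] * (3 * 16000)
masked, (gap_begin, gap_end) = mask_gap(utterance, 16000, 0.89, 0.84)
```

The masked waveform, together with the full transcript, corresponds to the kind of input pair the paragraph describes; the model's task is then to regenerate the zeroed region.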

Each sample below lists what the model can maintain for the inpainted gap, followed by the input sample (transcript, gap start, gap length):
speaker id and prosody
 |George Robertson desc|ribed the plan |as an outrageous and disg|raceful decision.
0.89, 0.84
speaker id and prosody
 |I am very sorry that Alastair |Campbell has taken |this decision.|
1.40, 0.84
background noise
 His recovery was destined to be almost as sudden |as his disappearance, and was due| directly |to the tra|mp Alex had brought to Sunnyside.
1.83, 0.75