More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech

Michael Hassid
Michelle Tadmor Ramanovich
Brendan Shillingford
Miaosen Wang
Ye Jia
Tal Remez
Google Research
DeepMind
[Animated overview: given the transcript "it's not an entitlement, it's a demand" and the corresponding video frames, VDTTS generates matching speech.]

In this paper we present VDTTS, a Visually-Driven Text-To-Speech model. Motivated by dubbing, VDTTS takes video frames as an additional input alongside text and generates speech that matches the video signal. We demonstrate how this allows VDTTS, unlike plain TTS models, to generate speech that not only exhibits prosodic variation such as natural pauses and pitch, but is also synchronized to the input video.

Experimentally, we show our model produces well-synchronized outputs, approaching the video-speech synchronization quality of the ground-truth, on several challenging benchmarks including "in-the-wild" content from VoxCeleb2.

Paper

[arXiv]

Model

[Architecture diagram: the input text, input video frames, and speaker ID are processed by a text encoder, a video encoder, and a speaker encoder, respectively; a multi-source attention module combines their outputs, and a spectrogram decoder followed by a vocoder produces the generated speech.]
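
To make the diagram concrete, here is a minimal PyTorch sketch of that data flow. The layer choices, dimensions, and names (VDTTSSketch, video_proj, and so on) are illustrative assumptions rather than the published implementation; the point is only how the three encoded streams are fused and decoded.

    import torch
    import torch.nn as nn


    class VDTTSSketch(nn.Module):
        """Illustrative data flow only; every layer choice here is an assumption."""

        def __init__(self, vocab_size=256, num_speakers=4, dim=256, n_mels=80):
            super().__init__()
            # Text encoder: character embeddings followed by a bidirectional LSTM.
            self.char_emb = nn.Embedding(vocab_size, dim)
            self.text_rnn = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
            # Video encoder: pre-extracted per-frame features projected to the model width
            # (a real system would run a visual backbone over the speaker's face crops).
            self.video_proj = nn.Linear(512, dim)
            # Speaker encoder: here simply a lookup table over speaker IDs.
            self.speaker_emb = nn.Embedding(num_speakers, dim)
            # Multi-source attention: decoder queries attend to the text stream and the
            # video stream separately, and the two context vectors are summed.
            self.attn_text = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
            self.attn_video = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
            # Spectrogram decoder followed by a projection to mel bins; a neural vocoder
            # (not shown) would turn the predicted mel-spectrogram into a waveform.
            self.decoder = nn.LSTM(dim, dim, batch_first=True)
            self.to_mel = nn.Linear(dim, n_mels)

        def forward(self, text_ids, video_feats, speaker_id, num_decoder_steps):
            text_h, _ = self.text_rnn(self.char_emb(text_ids))      # (B, T_text, dim)
            video_h = self.video_proj(video_feats)                   # (B, T_video, dim)
            spk = self.speaker_emb(speaker_id).unsqueeze(1)          # (B, 1, dim)
            queries = spk.expand(-1, num_decoder_steps, -1)          # speaker-conditioned queries
            ctx = self.attn_text(queries, text_h, text_h)[0] + \
                  self.attn_video(queries, video_h, video_h)[0]
            dec_out, _ = self.decoder(ctx)
            return self.to_mel(dec_out)                              # (B, num_decoder_steps, n_mels)


    # Toy usage: one clip of 20 characters and 30 video frames, speaker 0, 50 output frames.
    model = VDTTSSketch()
    mel = model(torch.randint(0, 256, (1, 20)), torch.randn(1, 30, 512),
                torch.tensor([0]), num_decoder_steps=50)
    print(mel.shape)  # torch.Size([1, 50, 80])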

Model comparison samples

This section and the following section show samples from the VDTTS model and several baselines.

The first video column, Ground-truth video, shows the original video clip. The next column, VDTTS, plays the same video with speech predicted from both the video frames and the text. The last two columns illustrate the lack of synchronization in models that use only the video frames (VDTTS video-only) or only the text (TTS text-only).

Click on a video to play. On small screens, scroll sideways to see the whole table.

Columns: Transcript · Speaker ID · Ground-truth video · VDTTS · VDTTS video-only (no text input) · TTS text-only (no video input)

Speaker 0: "of space for people to make their own judgments and to come to their own"
Speaker 1: "absolutely love dancing i have no dance experience whatsoever but as that"
Speaker 2: "making money and so some guy went and bought that girl"
Speaker 3: "on a over christmas and we had a guide because its really treasure"

Speaker swap samples

These samples show what happens when the model is given a speaker embedding that differs from the original speaker's.

The highlighted diagonal represents the videos for which the identity represented by the speaker embedding matches the speaker in the video frames.

Shown are randomly chosen examples demonstrating that the recovered timing and prosody are disentangled from the speaker's voice. In each row, audio is synthesized from the video and text of one input example and the speaker voice of another, illustrating that the recovered prosody is the product of the video signal rather than of the voice input. The model is not intended to be used to switch speaker voices, genders, etc.
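
As a hypothetical illustration of how such a grid is assembled, the snippet below reuses the VDTTSSketch model (and the torch import) from the Model section sketch above: each row fixes one example's text and video while the speaker embedding is swapped across the columns.

    # Rows reuse one example's text and video; columns swap in each of the four
    # speaker embeddings. The inputs here are random stand-ins for real data.
    examples = [(torch.randint(0, 256, (1, 20)), torch.randn(1, 30, 512))
                for _ in range(4)]
    grid = [[model(text_ids, video_feats, torch.tensor([spk]), num_decoder_steps=50)
             for spk in range(4)]                    # columns: speaker 0..3 embeddings
            for text_ids, video_feats in examples]   # rows: video and text of examples 0..3
    # grid[i][i] -- the highlighted diagonal -- pairs each video with its own speaker's voice.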

On small screens, scroll sideways to see the whole table.

Columns: Transcript · Speaker ID · Ground-truth video · VDTTS using speaker 0's embedding · speaker 1's embedding · speaker 2's embedding · speaker 3's embedding

Speaker 0: "which is the mc donalds could really become a giant company if they were a real estate company"
Speaker 1: "history together an emotional history together and i know the facts"
Speaker 2: "there i did what was i going to do move to spain move my whole life over"
Speaker 3: "trade support systems on the other hand its not an entitlement its a demand to end slavery"