Authors: Ye Jia, Michelle Tadmor Ramanovich, Tal Remez, Roi Pomerantz.
Abstract: We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end. Translatotron 2 consists of a speech encoder, a phoneme decoder, a mel-spectrogram synthesizer, and an attention module that connects all the previous three components. Experimental results suggest that Translatotron 2 outperforms the original Translatotron by a large margin in terms of translation quality and predicted speech naturalness, and drastically improves the robustness of the predicted speech by mitigating over-generation, such as babbling or long pause. We also propose a new method for retaining the source speaker's voice in the translated speech. The trained model is restricted to retain the source speaker's voice, and unlike the original Translatotron, it is not able to generate speech in a different speaker's voice, making the model more robust for production deployment, by mitigating potential misuse for creating spoofing audio artifacts. When the new method is used together with a simple concatenation-based data augmentation, the trained Translatotron 2 model is able to retain each speaker's voice for input with speaker turns.
These audio samples were randomly sampled from the evaluation set in Table 2 and Table 3 in the paper. The S2ST models were trained on the Conversational Spanish-to-English dataset. Each S2ST model has one variant for outputting in a canonical female speaker's voice, and another variant for retaining the source speaker's voice in the translated speech.
Reference audios were synthesized with a TTS model with crosslingual voice transfer capacity (see Section 4.1 in the paper). Transcripts for the sources and references are ground truth from the corpus; transcripts for the model predictions were transcribed by an ASR model used for evaluation (see Section 5.1 in the paper).
In some cases (e.g. group 2), the TTS-synthesized references (and training targets) fail to retain the source speakers' voices. As a result, the trained S2ST models also make similar mistakes. See also the samples in the next section.
|Ground truth||Translatotron 2||Translatotron|
|Source (Spanish)||Reference (English)||Canonical voice||Voice retaining||Canonical voice||Voice retaining|
These audio samples were randomly sampled from the evaluation sets in Table 4, corresponding to Sections 4.2 and 5.4.1 in the paper.
The source audios are concatenations of randomly sampled pairs of human recordings; the reference audios are concatenations of the corresponding TTS-synthesized reference audios. The model predictions are the direct outputs from the models on the concatenated source input, without further post-processing. The transcripts for the source and reference are concatenations of the ground truth from the corpus (each segment enclosed in a pair of quotation marks); the transcripts for the model predictions were produced by an ASR model used for evaluation.
These samples show that when the concatenation augmentation (concat aug) is used during training, both Translatotron 2 and Translatotron are able to retain each speaker's voice on inputs with speaker turns. In contrast, when the concatenation augmentation is not used, the predicted audio is typically in only one of the input speakers' voices, and the models sometimes have trouble translating the entire input (e.g. groups 3 and 4). In either case, the prediction from Translatotron 2 is significantly more natural, more fluent, and more complete than that from Translatotron.
Interestingly, in group 5, although the TTS-synthesized reference gets the first speaker's voice wrong (incorrect gender), Translatotron 2 (w/ concat aug) predicts voices that are more similar to the source (correct gender).
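As a rough illustration of how the concatenated sources, references, and transcripts described in this section could be assembled, here is a minimal sketch. It assumes waveforms are plain lists of float samples at a shared sampling rate; the function names (`concat_pair`, `quote_transcripts`, `sample_concat_example`) are hypothetical and not taken from the paper, and details such as padding or silence between segments are not specified on this page and are left out.

```python
import random
from typing import List, Sequence, Tuple

Waveform = List[float]  # assumption: mono audio as a list of float samples


def concat_pair(src_a: Sequence[float], src_b: Sequence[float],
                tgt_a: Sequence[float], tgt_b: Sequence[float]
                ) -> Tuple[Waveform, Waveform]:
    """Concatenate two (source, target) pairs, keeping the same segment order."""
    source = list(src_a) + list(src_b)
    target = list(tgt_a) + list(tgt_b)
    return source, target


def quote_transcripts(text_a: str, text_b: str) -> str:
    """Join two transcripts, each segment in its own pair of quotation marks."""
    return f'"{text_a}" "{text_b}"'


def sample_concat_example(sources: Sequence[Sequence[float]],
                          targets: Sequence[Sequence[float]],
                          rng: random.Random) -> Tuple[Waveform, Waveform]:
    """Pick two distinct utterances at random and concatenate them.

    sources[i] and targets[i] are assumed to be the source recording and
    its reference audio for the same utterance.
    """
    i, j = rng.sample(range(len(sources)), 2)
    return concat_pair(sources[i], sources[j], targets[i], targets[j])
```

Because the source and target segments are joined in the same order, the speaker turn in the source lines up with the corresponding voice change in the target, which is the property the samples in this section are meant to demonstrate.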
|Source (Spanish)||Reference (English)||Translatotron 2 (w/ concat aug)||Translatotron 2||Translatotron (w/ concat aug)||Translatotron|
These audio samples were randomly sampled from the evaluation set in Table 5, corresponding to Section 5.5 in the paper. The S2ST models were trained on the CoVoST 2 dataset, and are able to translate French, German, Spanish and Catalan speech into English speech in a canonical voice.
Reference audios were synthesized with a TTS model. Transcripts for the sources and references are ground truth from the corpus; transcripts for the model predictions were transcribed by an ASR model used for evaluation.
|Source||Reference (English)||Translatotron 2||Translatotron 1|