Paper: arXiv
Authors: Ye Jia, Michelle Tadmor Ramanovich, Tal Remez, Roi Pomerantz.
Abstract: We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end. Translatotron 2 consists of a speech encoder, a linguistic decoder, an acoustic synthesizer, and a single attention module that connects them. Experimental results on three datasets consistently show that Translatotron 2 outperforms the original Translatotron by a large margin on both translation quality (up to +15.5 BLEU) and speech generation quality, and approaches that of cascade systems. In addition, we propose a simple method for preserving each speaker's voice from the source speech in the translated speech in a different language. Unlike existing approaches, the proposed method is able to preserve each speaker's voice across speaker turns without requiring speaker segmentation. Furthermore, compared to existing approaches, it better preserves speakers' privacy and mitigates potential misuse of voice cloning for creating spoofed audio artifacts.
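To make the component layout in the abstract concrete, here is a minimal PyTorch sketch of how a speech encoder, a linguistic decoder, an acoustic synthesizer, and a single shared attention module could fit together. All layer types, sizes, and the dummy non-autoregressive decoding are illustrative assumptions, not the paper's actual architecture or configuration; the structural point is only that one attention context serves both the linguistic and the acoustic outputs.

```python
# Illustrative sketch only: layer choices and sizes are placeholder
# assumptions, not the configuration used in the paper.
import torch
import torch.nn as nn


class Translatotron2Sketch(nn.Module):
    def __init__(self, n_mels=80, d_model=256, n_phonemes=100):
        super().__init__()
        # Speech encoder: source spectrogram -> hidden representation.
        self.encoder = nn.LSTM(n_mels, d_model, batch_first=True)
        # The single attention module shared by decoder and synthesizer.
        self.attention = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        # Linguistic decoder: predicts target phonemes (autoregressive in
        # the real model; simplified here).
        self.decoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.phoneme_head = nn.Linear(d_model, n_phonemes)
        # Acoustic synthesizer: decoder states + attention context ->
        # target spectrogram frames.
        self.synthesizer = nn.LSTM(2 * d_model, d_model, batch_first=True)
        self.mel_head = nn.Linear(d_model, n_mels)

    def forward(self, src_mels, tgt_len):
        enc, _ = self.encoder(src_mels)             # (B, T_src, D)
        # Dummy queries; the real decoder consumes its own previous outputs.
        queries = enc.new_zeros(enc.size(0), tgt_len, enc.size(2))
        dec, _ = self.decoder(queries)
        # One attention context, reused by both output branches.
        context, _ = self.attention(dec, enc, enc)
        phonemes = self.phoneme_head(dec)           # linguistic output
        syn, _ = self.synthesizer(torch.cat([dec, context], dim=-1))
        mels = self.mel_head(syn)                   # acoustic output
        return phonemes, mels
```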
These audio samples were randomly sampled from the evaluations corresponding to Table 2 and Table 3 in the paper. The S2ST models were trained on the Conversational Spanish-to-English dataset. For both Translatotron 2 and Translatotron, we include one variant that outputs in a canonical female speaker's voice, and another variant that preserves the source speaker's voice in the translated speech.
Reference audios were synthesized using a TTS model with zero-shot cross-lingual voice transfer capability (see Section 4.1 in the paper). Transcripts for the source and reference are ground truth from the dataset; transcripts for the model predictions were produced by the ASR model used for evaluation (see Section 5.1 in the paper).
In some cases (e.g. group 2), the TTS-synthesized references (and training targets) fail to transfer the source speakers' voices. As a result, the trained S2ST models also make similar mistakes. See also the samples in the next section.
| Source (Spanish) | Reference (English) | Translatotron 2 (canonical voice) | Translatotron 2 (voice preserved) | Translatotron (canonical voice) | Translatotron (voice preserved) | Cascade ST → TTS (canonical voice) |
| --- | --- | --- | --- | --- | --- | --- |
These audio samples were randomly sampled from the evaluation in Table 4, corresponding to Section 4.2 and Section 5.4.1 in the paper.
The source audios are the concatenation of two randomly sampled human recordings; the reference audios are the concatenation of the corresponding TTS-synthesized reference audios. The model predictions are the direct outputs from the models on the concatenated source input, without extra pre- or post-processing. The transcripts for the source and reference are the concatenation of the ground truth from the dataset (each segment enclosed in quotation marks); the transcripts for the model predictions were produced by the ASR model used for evaluation.
These samples show that when ConcatAug is used during training, both Translatotron 2 and Translatotron are able to preserve each speaker's voice on inputs with speaker turns; in contrast, when ConcatAug is not used, the predicted audio is typically in a single input speaker's voice, and the models sometimes have trouble translating the entire input (e.g. groups 3 and 4). In either case, the predictions from Translatotron 2 are significantly more natural, more fluent, and more complete than those from Translatotron. A minimal sketch of how ConcatAug could be implemented is given below.
Interestingly, in group 5, even though the TTS-synthesized reference gets the first speaker's voice wrong (incorrect gender), Translatotron 2 (w/ ConcatAug) predicts voices that are more similar to the source (correct gender).
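For readers curious how ConcatAug could work in practice, below is a minimal sketch of the idea described above: randomly pair training examples and concatenate their source audio, target audio, and target transcripts, so that the model sees inputs containing speaker turns. The function and field names are hypothetical, not from the paper's actual data pipeline.

```python
# Hypothetical sketch of ConcatAug; names and the batch-level sampling
# strategy are assumptions for illustration.
import random
import numpy as np


def concat_aug(batch, prob=0.5):
    """With probability `prob`, concatenate an example with another one
    randomly sampled from the same batch (possibly itself)."""
    out = []
    for ex in batch:
        if random.random() < prob:
            other = random.choice(batch)
            out.append({
                # Concatenated source speech now contains a speaker turn.
                "source_audio": np.concatenate([ex["source_audio"], other["source_audio"]]),
                # Targets are concatenated the same way to stay aligned.
                "target_audio": np.concatenate([ex["target_audio"], other["target_audio"]]),
                "target_text": ex["target_text"] + " " + other["target_text"],
            })
        else:
            out.append(ex)
    return out
```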
| Source (Spanish) | Reference (English) | Translatotron 2 (w/ ConcatAug) | Translatotron 2 | Translatotron (w/ ConcatAug) | Translatotron |
| --- | --- | --- | --- | --- | --- |
These audio samples were randomly sampled from the evaluation in Table 5, corresponding to Section 5.5 in the paper. The S2ST models were trained on the dataset derived from CoVoST 2, and are able to translate French, German, Spanish and Catalan speech into English speech in a canonical voice.
Reference audios were synthesized with a TTS model. Transcripts for the sources and references are ground truth from the dataset; transcripts for the model predictions were produced by the ASR model used for evaluation.
| Source | Reference (English) | Translatotron 2 | Translatotron 1 |
| --- | --- | --- | --- |