Translatotron 3: Speech to Speech Translation with Monolingual Data

Eliya Nachmani
Alon Levkovitch
Chulayuth Asawaroengchai
Yifan Ding
Heiga Zen
Michelle Tadmor Ramanovich
Google Research
Google DeepMind

This paper presents Translatotron 3, a novel approach to unsupervised direct speech-to-speech translation from monolingual speech-text datasets that combines a masked autoencoder, unsupervised embedding mapping, and back-translation. Experimental results on speech-to-speech translation between Spanish and English show that Translatotron 3 outperforms a baseline cascade system, with an improvement of 18.14 BLEU points on the synthesized Unpaired-Conversational dataset. In contrast to supervised approaches, which require real paired data or specialized modeling to replicate para-/non-linguistic information such as pauses, speaking rates, and speaker identity, Translatotron 3 is able to retain this information.

Model

The two training phases in the proposed approach. (1) Phase 1 uses the reconstruction loss via the auto-encoding path. (2) Phase 2 employs the reconstruction loss via back-translation.
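
The two phases described above can be illustrated with a toy sketch. This is not the Translatotron 3 implementation: the linear `encode`/`decode` functions, the shared-encoder layout, the frame-masking rate, the mean-squared-error loss, and all names and sizes are assumptions made only for illustration. The text here does not say whether Phase 2 keeps the auto-encoding term, so this sketch uses the back-translation term alone in Phase 2.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT, HID = 8, 4  # hypothetical feature and latent sizes

# Hypothetical toy parameters: one shared encoder, one decoder per language.
params = {
    "enc": rng.normal(size=(FEAT, HID)) * 0.1,
    "dec": {
        "es": rng.normal(size=(HID, FEAT)) * 0.1,
        "en": rng.normal(size=(HID, FEAT)) * 0.1,
    },
}

def encode(x):
    """Map (frames, FEAT) features to a (frames, HID) latent."""
    return x @ params["enc"]

def decode(z, lang):
    """Map a latent back to features with the decoder for `lang`."""
    return z @ params["dec"][lang]

def mask_frames(x, rng, rate=0.25):
    """Zero out a random subset of frames (a stand-in for input masking)."""
    keep = rng.random(x.shape[0]) >= rate
    return x * keep[:, None]

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def autoencoding_loss(x, lang, rng):
    """Phase 1: reconstruct x in its own language from a masked input."""
    return mse(decode(encode(mask_frames(x, rng)), lang), x)

def back_translation_loss(x, src, tgt):
    """Phase 2: translate src -> tgt, translate back, compare with x."""
    pseudo_tgt = decode(encode(x), tgt)
    round_trip = decode(encode(pseudo_tgt), src)
    return mse(round_trip, x)

def training_loss(x, src, tgt, phase, rng):
    """Select the reconstruction loss for the given training phase."""
    if phase == 1:
        return autoencoding_loss(x, src, rng)
    if phase == 2:
        return back_translation_loss(x, src, tgt)
    raise ValueError(f"unknown phase: {phase}")
```

In this sketch neither loss needs a paired translation of `x`: both compare a reconstruction against the monolingual input itself, which is the property that makes the training unsupervised.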

Model Comparison Samples

  • Spanish-to-English (on Conversational dataset)
  • English-to-Spanish (on Conversational dataset)
  • Spanish-to-English (on CommonVoice11 Synthesized dataset)
  • English-to-Spanish (on CommonVoice11 Synthesized dataset)
  • Spanish-to-English (on CommonVoice11 dataset)

    This section and the following sections show samples from the Translatotron 3 model trained unsupervised, without any parallel data.

    The first audio column, Source, plays the original audio clip with the original speaker. The next column, Reference, plays a synthesized audio translation of the source with a random speaker. The last column, Predicted, plays the Translatotron 3 model prediction: the translated content, spoken in the voice of the source speaker. The speaker is preserved without speaker modeling or supervision.

    Spanish-to-English (on Conversational dataset)

    Source (Spanish) | Reference (English) | Predicted (English), Source speaker
    Creación de nuevos escenarios legales. | Creation of new legal scenarios. | Creation of new legal scenario.
    Sí, creo que puedo hacer eso. | Yeah! Yeah, I think I can do that. | Yeah i think i can do this.
    Ver toda la discografía de Eliseo Parra. | See the whole discography of Eliseo Parra. | See the whole discography of Arteropoly.
    Sellos relacionados con Richard Sjöberg. | Labels related to Richard Sjöberg. | Labels related to Richard Carlyin.
    UTR con un par de excepciones. | UTR with a couple of exceptions. | MS with a couple of exceptions.

    English-to-Spanish (on Conversational dataset)

    Source (English) | Reference (Spanish) | Predicted (Spanish), Source speaker
    So, why are you doing this? | Entonces, ¿por qué estás haciendo esto? | Bien por qué estás haciendo esto?
    Yeah, I know, but I need to learn. | Sí, lo sé, pero necesito aprender. | Sí lo sé pero necesito aprendeo.
    It is a great weight, but also it is a necessity. | Es un gran peso, pero también es una necesidad. | Es un gran peso pero también es una gran consecuencia.
    Check Availability at Residence Casamalfi. | compruebe la disponibilidad de Residence Casamalfi. | Compruebe la disponibilidad de residence stat.
    I do not care what he says. | No me importa lo que diga. | No me importa lo que le diga.

    Spanish-to-English (on CommonVoice11 Synthesized dataset)

    Source (Spanish) | Reference (English) | Predicted (English), Source speaker
    Esto es una familia. y en una familia. | This is a family and in a family. | This is a family, and, in a family.
    Participó en la Royal Rumble, pero fue eliminado por R-Truth. | He participated in the Royal Rumble, but was eliminated by R-Truth. | He participated in the royal rumble but was eliminated by airtrue.
    En Diciembre, el grupo está evidentemente presente en el tradicional evento Dis Inferno. | In December the group is evidently present in the traditional event Des Infn. | In December, the group is evidently present in the traditional event Dis Inferno.
    Los dos ganadores disputaron la final. | The two winners disputed the final. | The two winners disputed the final.
    Líricamente, el álbum es muy político. | Lyrically the album is very political. | Lyrically, the album is very political.

    English-to-Spanish (on CommonVoice11 Synthesized dataset)

    Source (English) | Reference (Spanish) | Predicted (Spanish), Source speaker
    Her mother, Angela, is a public servant, and her father, Tony, is a psychologist. | Su madre, Angela, es una servidora pública y su padre, Tony, es un psicólogo. | Su madre Angela es una poca sirviente y su padre Tony es un psicólogo
    The organization has worked in Honduras, Colombia, Venezuela, Uganda and the United States. | La organización ha trabajado en Honduras, Colombia, Venezuela, Uganda y Estados Unidos. | La organización ha trabajado en Honduras Colombia Venezuela Uganda y Estados Unidos.
    Practically all songs have been written by Michelle. | Prácticamente todas las canciones han sido escritas por Michelle. | Prácticamente todas las canciones han sido escritas por Miche.
    He attended Wellesley College, where he studied physics and astronomy. | Asistió al Wellesley College, donde estudió física y astronomía. | Asistió al Wesley college donde estudió física y astronomía.
    Three days in four chapters, in four stories of four friends. | Tres días en cuatro capítulos en cuatro historias de cuatro amigos. | Tres días en cuatro capítulos en cuatro historias de cuatro amigos.

    Spanish-to-English (on CommonVoice11 dataset)

    Source (Spanish) | Reference (English) | Predicted (English), Source speaker
    Trabajó con orquestas en Rusia, y con músicos en Europa y los Estados Unidos. | He worked with orchestras in Russia, and with musicians in Europe and the United States. | Traveled with orchestrs in Russia and with musicians in Europe and the United States.
    Con ella muchas ciudades y colonias asumieron el rango de "municipium". | With it, many cities and colonies assumed the rank of “municipium”. | With him many cities and colonies assumed the rank of minicippia.
    Hay tres generaciones por año en el sur de Texas. | There are three generations per year in the south of Texas. | I three generations per year in the south of Texas.
    Fue enterrado en Madison, Wisconsin. | He was buried in Madison, Wisconsin. | He was buried in Madison Wisconsin.
    Algunos autores clasifican esa información como falsa. | Some authors classify that information as false. | All authors classify that information as fall.

    Translatotron 3 Hyperparameter table: