Cong Han¹*, Kevin Wilson², Scott Wisdom², John R. Hershey²
¹Columbia University
²Google Research
*Work performed during internship at Google.
A key challenge in machine learning is to generalize from training data to a test domain of interest. This work generalizes the recently proposed mixture invariant training (MixIT) algorithm to perform unsupervised learning in the multi-channel setting. We use MixIT to train a model on far-field microphone array recordings of overlapping reverberant and noisy speech from the AMI Corpus. The models are trained on both supervised and unsupervised training data, and are tested on real AMI recordings containing overlapping speech. To objectively evaluate our models, we also use a synthetic multi-channel AMI test set. Holding network architectures constant, we find that a fine-tuned semi-supervised model yields the largest improvement in SI-SNR and in human listening ratings across synthetic and real datasets, outperforming supervised models trained on well-matched synthetic data. Our results demonstrate that unsupervised learning through MixIT enables model adaptation on both single- and multi-channel real-world speech recordings.
We evaluated our models on both synthetic and real mixtures from the AMI Corpus, training single- and multi-channel models on various combinations of datasets with supervised permutation invariant training (PIT) and unsupervised MixIT.
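To make the unsupervised objective concrete, the following is a minimal NumPy sketch of a MixIT-style loss for single-channel signals. It assumes two reference mixtures, uses negative SI-SNR as the per-mixture loss (an assumption made here for illustration; the training loss used in the paper may differ), and searches all assignments of estimated sources to mixtures by brute force. The function names `si_snr` and `mixit_loss` are hypothetical and are not taken from the authors' code.

```python
import itertools
import numpy as np


def si_snr(reference, estimate, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))


def mixit_loss(mixtures, est_sources):
    """MixIT-style loss (illustrative sketch, not the authors' implementation).

    mixtures:    array of shape (2, T), the two reference mixtures.
    est_sources: array of shape (M, T), sources separated from their sum.
    Returns the best loss and the source-to-mixture assignment achieving it.
    """
    num_mix, num_src = mixtures.shape[0], est_sources.shape[0]
    best_loss, best_assign = np.inf, None
    # Every estimated source is assigned to exactly one reference mixture.
    for assign in itertools.product(range(num_mix), repeat=num_src):
        mixing = np.zeros((num_mix, num_src))
        mixing[list(assign), np.arange(num_src)] = 1.0
        remixed = mixing @ est_sources  # shape (2, T)
        # Assumed per-mixture loss: negative SI-SNR, summed over mixtures.
        loss = -sum(si_snr(mixtures[i], remixed[i]) for i in range(num_mix))
        if loss < best_loss:
            best_loss, best_assign = loss, assign
    return best_loss, best_assign


# Toy usage: four random "sources", two per mixture, perfectly separated.
rng = np.random.default_rng(0)
sources = rng.standard_normal((4, 1000))
mixtures = np.stack([sources[0] + sources[1], sources[2] + sources[3]])
loss, assign = mixit_loss(mixtures, sources)
```

In this toy case the estimated sources are the true ones, so the search should recover the assignment that rebuilds each mixture from its own two sources. The brute-force search grows as 2^M, which is manageable only for a small number of output sources.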
Audio examples. Columns are Example 1, Example 2, and Example 3, each split into Source 1 and Source 2. Rows:

Noisy Mixture
1 Microphone
    Sup. Synth AMI
    Unsup. AMI
    YFCC100M Warm Start
    Sup. Synth AMI, Unsup. AMI
    Sup. Synth AMI, Unsup. AMI, YFCC100M Warm Start
4 Microphones
    Sup. Synth AMI
    Unsup. AMI
    Sup. Synth AMI, Unsup. AMI
    Sup. Synth AMI, Unsup. AMI, YFCC100M Warm Start
Reference Audio
    Headset Filtered to Distant Mic
    Headset