BASNet

Binaural Angular Separation Network

[paper] [poster]

Yang Yang, George Sung, Shao-Fu Shih, Hakan Erdogan, Chehung Lee, Matthias Grundmann

Google

Abstract. We propose a neural network model that can separate target speech sources from interfering sources at different angular regions using two microphones. The model is trained with simulated room impulse responses (RIRs) using omni-directional microphones without needing to collect real RIRs. By relying on specific angular regions and multiple room simulations, the model utilizes consistent time difference of arrival (TDOA) cues, or what we call delay contrast, to separate target and interference sources while remaining robust in various reverberation environments. We demonstrate the model is not only generalizable to a commercially available device with a slightly different microphone geometry, but also outperforms our previous work which uses one additional microphone on the same device. The model runs in real-time on-device and is suitable for low-latency streaming applications such as telephony and video conferencing.
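To give a sense of the size of the TDOA cue mentioned above, here is a minimal sketch of the inter-microphone delay for a two-microphone pair. It is not part of BASNet. It assumes a far-field, free-field source, a speed of sound of 343 m/s, microphones on a horizontal line, and an angle convention where 0 degrees is straight ahead (broadside) and 90/270 degrees lie along the microphone axis. The function names are hypothetical, and the 7 cm spacing is the one reported under Data Collection.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value (dry air near 20 degrees C)


def tdoa_seconds(angle_deg, spacing_m, c=SPEED_OF_SOUND_M_S):
    """Far-field delay between two microphones, in seconds.

    Assumed convention: 0 degrees is broadside (in front of the device),
    so the delay is spacing * sin(angle) / c. The sign only indicates
    which microphone the wavefront reaches first.
    """
    return spacing_m * math.sin(math.radians(angle_deg)) / c


def tdoa_samples(angle_deg, spacing_m, sample_rate_hz, c=SPEED_OF_SOUND_M_S):
    """Same delay expressed in (fractional) samples at a given sample rate."""
    return tdoa_seconds(angle_deg, spacing_m, c) * sample_rate_hz


if __name__ == "__main__":
    spacing = 0.07  # 7 cm microphone spacing
    for angle in (0, 90, 270):
        for rate in (16_000, 48_000):
            delay = tdoa_samples(angle, spacing, rate)
            print(f"{angle:3d} deg @ {rate} Hz: {delay:+.2f} samples")
```

Under these assumptions, a source at 0 degrees produces no delay, while a source at 90 or 270 degrees produces the maximum delay of roughly 204 microseconds, which is about 9.8 samples at 48 kHz and about 3.3 samples at 16 kHz.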

Data Collection

To showcase the effectiveness of BASNet in spatial separation, we use an off-the-shelf Pixel tablet to record ETSI speech played back from three directions: 0 degrees, directly in front of the device, and 90 and 270 degrees, to the left and right of the device. The recordings are captured by the two built-in microphones around the front-facing camera, which are spaced 7 cm apart. The 0-degree recording is then mixed with the 90- or 270-degree speech at various offsets to create overlapped or non-overlapped speech at 0 dB SNR (third column in the table below). The two-channel mixtures are then passed through a 48kHz-BASNet model to extract only the 0-degree audio.
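The mixing step can be sketched as follows. This is an illustrative sketch, not the script used to produce the samples. It assumes each recording is a list of (left, right) sample pairs, that SNR is computed from the mean power over both channels of each whole recording, and that the offset is a non-negative number of frames. The names `power` and `mix_at_snr` are hypothetical.

```python
import math


def power(stereo):
    """Mean power over all samples of both channels of a stereo recording."""
    total = sum(left * left + right * right for left, right in stereo)
    return total / (2 * len(stereo))


def mix_at_snr(target, interference, offset, snr_db=0.0):
    """Add `interference` to `target`, starting `offset` frames in.

    The interference is scaled by a single gain so that the ratio of
    target power to scaled interference power equals `snr_db`.
    Requires a non-silent interference recording and offset >= 0.
    Returns the mixed frames and the gain that was applied.
    """
    gain = math.sqrt(power(target) / (power(interference) * 10 ** (snr_db / 10)))
    length = max(len(target), offset + len(interference))
    mixed = [[0.0, 0.0] for _ in range(length)]
    for i, (left, right) in enumerate(target):
        mixed[i][0] += left
        mixed[i][1] += right
    for i, (left, right) in enumerate(interference):
        mixed[offset + i][0] += gain * left
        mixed[offset + i][1] += gain * right
    return [tuple(frame) for frame in mixed], gain


if __name__ == "__main__":
    target = [(1.0, 1.0)] * 4        # stand-in for the 0-degree recording
    interference = [(2.0, 2.0)] * 4  # stand-in for a 90- or 270-degree recording

    overlapped, gain = mix_at_snr(target, interference, offset=0)
    non_overlapped, _ = mix_at_snr(target, interference, offset=len(target))
    print(gain, len(overlapped), len(non_overlapped))
```

Applying the same gain to both channels keeps the inter-channel delay and level differences of the interference recording intact. An offset of zero gives fully overlapped speech, and an offset at least as long as the target gives non-overlapped speech.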

Samples

Note that we used a version of BASNet trained with 48kHz audio end-to-end, whereas the paper describes the 16kHz version of BASNet.

Angle of interference speech | Mixing of target and interference speech | Mixed stereo audio | 48kHz-BASNet output audio
N/A | no interference speech | [audio] | [audio]
90 degrees | overlapped, target and interference of opposite gender | [audio] | [audio]
90 degrees | overlapped, target and interference of the same gender | [audio] | [audio]
90 degrees | non-overlapped | [audio] | [audio]
270 degrees | overlapped, target and interference of opposite gender | [audio] | [audio]
270 degrees | overlapped, target and interference of the same gender | [audio] | [audio]
270 degrees | non-overlapped | [audio] | [audio]