We present a frontend for improving robustness of automatic speech recognition (ASR) that jointly implements three modules within a single model: acoustic echo cancellation, speech enhancement, and speech separation. This is achieved by using a contextual enhancement neural network that can optionally make use of different types of side inputs: (1) a reference signal of the playback audio, which is necessary for echo cancellation; (2) a noise context, which is useful for speech enhancement; and (3) an embedding vector representing the voice characteristics of the target speaker of interest, which is not only critical in speech separation, but also helpful for echo cancellation and speech enhancement. Evaluations show that the joint model performs almost as well as the task-specific models, and that it significantly reduces the word error rate in noisy conditions, in particular in low signal-to-noise ratio conditions, even when using a single model for all three tasks.

Robustness of automatic speech recognition (ASR) systems has significantly improved over the years with the advent of neural-network-based end-to-end models, large-scale training data, and better data augmentation strategies. Nevertheless, various factors like device echo (in the case of smart speakers), harsher background noise, and competing speech still significantly deteriorate performance. While it is possible to train separate ASR models that specifically address these conditions, for practical reasons it is harder to maintain multiple task-specific ASR models and switch between them on-the-fly based on the use case. Furthermore, with large-scale multi-domain and multi-lingual modeling gaining more research interest, the training data for ASR models often covers varied use cases (acoustic and linguistic), like voice search and video captioning, making it more challenging to simultaneously address harsher noise conditions. As a result, it is often convenient to train and maintain separate frontend feature processing models that handle adverse conditions, without combining them with the backend ASR model.

The three classes of interference mentioned above have been addressed in the literature, typically in isolation, using separate modeling strategies. Speech separation has received a lot of attention in the recent literature, using techniques like deep clustering, permutation invariant training, and speaker embeddings. When using speaker embeddings, the target speaker of interest is assumed to be known a priori. Techniques developed for speaker separation have also been applied to remove non-speech noise, with modifications to the training data. AEC has also been studied in isolation, or together with background noise. While most approaches focus on improving the quality of the enhanced speech, some have also focused on improving ASR, as we do in this work. It is well known that improving speech quality does not always improve ASR performance, since the distortions introduced by non-linear processing can adversely affect ASR. One way to mitigate this is to jointly train the enhancement frontend together with the backend ASR model.

In the presented work, we address all three interference types jointly. A joint model is interesting for practical reasons, since it is hard to know ahead of time which interference type to address, especially in a streaming recognition setting. The joint model is constructed as a contextual frontend processing model, wherein the contextual information is assumed to be optional. In the case of AEC, a reference signal and a speaker embedding of the target speaker are assumed to be available. In the case of speech enhancement and separation, a noise context, i.e., a few seconds of audio before the target utterance to be recognized, and the target speaker embedding are assumed to be available. Noise context carries useful information about the acoustic context and has been shown to be useful in prior work. The reference signal and the noise context, when not available, are assumed to be an uninformative silence signal. A single model then processes these contextual signals to produce enhanced features that are passed to the ASR system.

The primary encoder consists of N modified conformer blocks. The inputs to the primary encoder are LFBE features computed from the noisy signal and the AEC reference signal; these features are stacked together before being passed to the encoder. The encoder also receives the d-vector of the target speaker as additional input. Before each conformer block, the d-vector is combined with the block's inputs using feature-wise linear modulation (FiLM).
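To make the input handling concrete, the following is a minimal sketch, not the paper's implementation: the function name `prepare_encoder_inputs`, the all-zero stand-in for "uninformative silence", and the context length are all assumptions for illustration. It shows how optional side inputs could be defaulted and how the noisy and reference LFBE features could be stacked along the feature axis before entering the encoder.

```python
import numpy as np

def prepare_encoder_inputs(noisy_lfbe, reference_lfbe=None,
                           noise_context_lfbe=None, context_frames=100):
    """Assemble encoder inputs from LFBE features of shape [frames, mel_bins].

    Missing optional context signals (AEC reference, noise context) are
    replaced by an uninformative silence signal -- represented here, as an
    assumption, by all-zero LFBE frames -- so a single model can run with
    or without the side inputs.
    """
    num_frames, num_bins = noisy_lfbe.shape
    if reference_lfbe is None:
        reference_lfbe = np.zeros((num_frames, num_bins), dtype=noisy_lfbe.dtype)
    if noise_context_lfbe is None:
        noise_context_lfbe = np.zeros((context_frames, num_bins), dtype=noisy_lfbe.dtype)
    # Stack noisy-signal and reference features along the feature dimension.
    stacked = np.concatenate([noisy_lfbe, reference_lfbe], axis=-1)
    return stacked, noise_context_lfbe
```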
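The FiLM combination itself is simple: the d-vector is projected to a per-feature scale and shift that modulate the block input. Below is a minimal PyTorch sketch of standard FiLM conditioning; the module name, dimensions, and exact parameterization are illustrative assumptions and may differ from the paper's modified conformer block.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation of block inputs by a speaker d-vector.

    Computes FiLM(x, d) = scale(d) * x + shift(d), where scale and shift
    are learned linear projections of d, broadcast over the time axis.
    """

    def __init__(self, dvector_dim: int, feature_dim: int):
        super().__init__()
        self.scale = nn.Linear(dvector_dim, feature_dim)
        self.shift = nn.Linear(dvector_dim, feature_dim)

    def forward(self, x: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
        # x: [batch, time, feature_dim], d: [batch, dvector_dim]
        gamma = self.scale(d).unsqueeze(1)  # [batch, 1, feature_dim]
        beta = self.shift(d).unsqueeze(1)   # [batch, 1, feature_dim]
        return gamma * x + beta

# Usage: modulate the input of each conformer block with the d-vector.
film = FiLM(dvector_dim=256, feature_dim=512)  # illustrative sizes
x = torch.randn(2, 100, 512)  # block inputs: [batch, frames, features]
d = torch.randn(2, 256)       # target-speaker d-vector
y = film(x, d)                # same shape as x
```

Applying this modulation before every conformer block lets the speaker identity influence the encoder at each layer, rather than only at the input.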