Title: Self-Supervised and Supervised Pre-training to Improve Automatic Speech Recognition Performance
Speaker: Arun Kumar A (IITM)
Details: Mon, 22 May, 2023, 4:00 PM @ MR - I (SSB 233)
Abstract: The need for large amounts of labelled speech data and substantial computational resources hampers the building of effective Automatic Speech Recognition (ASR) systems, especially in low-resource scenarios. This work investigates pre-training strategies for improving the performance of end-to-end ASR systems, with a specific focus on low-resource settings. First, we propose a novel self-supervised learning (SSL) method that jointly pre-trains an encoder and a decoder using only speech data. Most conventional SSL models for speech pre-train only an encoder, yet encoder-decoder networks have shown superior performance in the supervised setting; our method aims to exploit the potential of these networks already during pre-training. We analyze the proposed method and compare it with a baseline model in various scenarios. Second, we build a large multilingual ASR model covering nine Indian languages with varying amounts of labelled speech data, which serves as a supervised pre-training model. We compare this model with Whisper, with monolingual models, and with publicly available SSL models, and analyze its effectiveness in both zero-shot and fine-tuning scenarios. Our findings show that the proposed SSL method consistently outperforms the baseline model by a significant margin, and that supervised pre-training on languages from the same family works better than large models built with diverse languages.
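For context on the zero-shot comparison with Whisper mentioned in the abstract, a minimal sketch of zero-shot transcription with a publicly available Whisper checkpoint is given below; it is not part of the speaker's work, and the checkpoint name, audio file, and language code are illustrative assumptions.

```python
# Minimal sketch: zero-shot ASR with a pre-trained Whisper checkpoint,
# i.e. transcription without any task-specific fine-tuning.
# Checkpoint, audio path, and language are assumptions for illustration.
import torch
import soundfile as sf
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Load a mono recording, assumed to already be sampled at 16 kHz.
speech, sampling_rate = sf.read("example_hi.wav")

# Convert the waveform into the log-Mel input features Whisper expects.
inputs = processor(speech, sampling_rate=sampling_rate, return_tensors="pt")

# Ask the decoder for Hindi transcription; other languages use their own codes.
prompt_ids = processor.get_decoder_prompt_ids(language="hindi", task="transcribe")

with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features,
                                   forced_decoder_ids=prompt_ids)

print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```

A fine-tuning comparison would instead continue training such a pre-trained model on the labelled low-resource data before decoding.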