Title: Towards General-Purpose Audio Representation Learning
Speaker: Ashish Seth (IITM)
Details: Mon, 22 May 2023, 3:00 PM @ MR - I (SSB 233)
Abstract: Recently, unsupervised representation learning, including self-supervised and semi-supervised learning, has been very successful in various domains such as text, vision, and speech. These methods have performed well in sequence-to-sequence learning paradigms, including Automatic Speech Recognition (ASR). Still, they have yet to be extensively evaluated on other speech and non-speech classification tasks, such as speaker identification and musical instrument recognition. While supervised learning frameworks have been extensively studied for such problems, they have major shortcomings, such as subpar performance when paired training data is scarce and poor generalization across multiple downstream tasks. To address these shortcomings, we develop self-supervised representation learning frameworks that do not require paired data and generalize well across various downstream tasks; we call this a general-purpose audio representation learning framework. In this work, we introduce two new general-purpose audio representation learning frameworks: DeLoRes (Decorrelating latent spaces for Low Resource audio representation learning) and SLICER (Symmetrical Learning of Instance and Cluster-level Efficient Representations). For DeLoRes, we learn representations invariant to distortions in the input space; two variants of the framework, DeLoRes-S and DeLoRes-M, are proposed. For SLICER, we combine the best of the clustering and contrastive learning paradigms by developing cluster- and instance-level contrastive learning objectives. Finally, to evaluate such frameworks, we introduce a standard evaluation benchmark called LAPE (Low-resource Audio Processing and Evaluation), which spans 11 diverse downstream tasks containing a balanced set of speech and non-speech tasks. The performance of the proposed methods is compared with baselines, and absolute improvements are observed across major downstream tasks.
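To give a flavor of the "decorrelating latent spaces" idea mentioned above, the following is a minimal NumPy sketch of a Barlow Twins-style decorrelation loss on two augmented views of the same audio clip. This is an illustrative example only, not the exact DeLoRes objective; the function name, the weighting factor `lam`, and the embedding shapes are assumptions for the sketch.

```python
import numpy as np

def decorrelation_loss(z1, z2, lam=5e-3):
    """Illustrative decorrelation-style loss (Barlow Twins flavor).

    z1, z2: (batch, dim) embeddings of two distorted views of the
    same audio. The loss pulls the diagonal of their cross-correlation
    matrix toward 1 (views agree per feature) and pushes off-diagonal
    entries toward 0 (features are decorrelated). Not the actual
    DeLoRes objective -- a sketch of the general technique.
    """
    # Standardize each feature over the batch dimension.
    z1 = (z1 - z1.mean(axis=0)) / (z1.std(axis=0) + 1e-8)
    z2 = (z2 - z2.mean(axis=0)) / (z2.std(axis=0) + 1e-8)
    n = z1.shape[0]
    c = z1.T @ z2 / n  # (dim, dim) cross-correlation matrix
    on_diag = np.sum((np.diag(c) - 1.0) ** 2)          # invariance term
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)  # redundancy term
    return on_diag + lam * off_diag

# Usage: identical views should score lower than mismatched ones.
rng = np.random.default_rng(0)
z = rng.normal(size=(64, 16))
print(decorrelation_loss(z, z) < decorrelation_loss(z, z[::-1]))
```

In practice such a loss would be applied to encoder outputs of two random distortions (e.g., cropping or masking) of the same clip, which is what makes the learned representation invariant to those distortions without any paired labels.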