Title | : | Unsupervised acoustic unit discovery and keyword spotting using syllables |
Speaker | : | Karthik Pandia D S (IITM) |
Details | : | Fri, 27 Apr, 2018 4:00 PM @ A M Turing Hall |
Abstract | : | A speech signal encodes a wide range of information, including speaker identity, the emotion of the speaker, language, and content. The two major applications of content recognition are automatic speech recognition (ASR) and keyword spotting. An ASR system recognizes the entire content of the audio, whereas a keyword spotter recognizes only the words of interest. This work investigates the use of syllables, instead of phonemes, as the core units for keyword spotting. A supervised keyword spotter trains its models on transcribed speech. Unsupervised keyword spotting can be based on either template matching or acoustic unit discovery (AUD). Template matching techniques involve feature extraction and segmental search, while AUD-based techniques focus on unsupervised sub-unit discovery and modelling. Using phoneme posteriors for unsupervised keyword spotting has become the norm. To address the mismatch between the query and the search audio, adaptation of the phoneme posterior features to the target language, speaker, or data is proposed. Two approaches are proposed to segment speech into syllable-like units: one uses the short-time energy of the signal, and the other uses vowel posteriors. The syllable boundaries in the search file are used to hypothesize the keyword. Further, the syllable-like segments obtained from untranscribed audio are used for AUD. To discover the acoustic units, a top-down clustering approach, a graphical clustering approach, and hidden Markov models (HMMs) are used in tandem. The proposed keyword spotting approaches are evaluated on the standard TIMIT and NPTEL datasets. Results show that posterior adaptation and segmentation improve both keyword detection and keyword search time. The modelling approach is indeed capable of discovering syllable-like units from the training audio, which can be used for unsupervised speech recognition. |
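
As a rough illustration of the energy-based segmentation idea mentioned in the abstract, the sketch below places candidate syllable boundaries at local minima of a smoothed short-time energy contour. This is only an assumption about the simplest form such a segmenter could take; the abstract does not describe the speaker's actual algorithm, and the function names, frame length, hop size, and smoothing width are all hypothetical.

```python
import math


def short_time_energy(signal, frame_len, hop):
    """Sum of squared samples in each frame (frames step by `hop` samples)."""
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energies.append(sum(x * x for x in frame))
    return energies


def smooth(values, width):
    """Centered moving average; windows are truncated at the edges."""
    half = width // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def syllable_boundaries(signal, frame_len=200, hop=100, smooth_width=5):
    """Return sample indices (frame centers) of local energy minima.

    Assumption: a dip in smoothed short-time energy marks a boundary
    between two syllable-like units. Real systems typically refine this.
    """
    energy = smooth(short_time_energy(signal, frame_len, hop), smooth_width)
    boundaries = []
    for i in range(1, len(energy) - 1):
        if energy[i] < energy[i - 1] and energy[i] <= energy[i + 1]:
            boundaries.append(i * hop + frame_len // 2)
    return boundaries


# Synthetic stand-in for speech: four energy bumps with dips in between.
demo = [math.sin(math.pi * n / 1000) for n in range(4000)]
print(syllable_boundaries(demo))  # → [1000, 2000, 3000]
```

The synthetic input has energy dips at samples 1000, 2000, and 3000, which is where the sketch places its boundaries. On real speech, the frame size and smoothing width would need tuning, and plain minima picking tends to over-segment noisy energy contours.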