Title | : | Two-pass Information Bottleneck based Framework for Speaker Diarization |
Speaker | : | Nauman Abdul Razzak Dawalatabad (IITM) |
Details | : | Wed, 10 Jan, 2018 11:15 AM @ A M Turing Hall |
Abstract | : | Speaker diarization involves segmenting an audio recording of a conversation among multiple speakers into speaker-homogeneous segments and associating speaker identities with those segments. It is an important pre-processing stage in conversational speech recognition systems and in keyword spotting systems for conversational speech. The task is challenging because the durations of speaker turns in a conversation vary significantly, and because the number of speakers participating in the conversation is not known. Unsupervised learning based approaches to speaker diarization have been studied extensively. These approaches mainly involve an initial segmentation of the audio recording into fixed-duration segments, followed by bottom-up clustering of the segments. The hidden Markov model/Gaussian mixture model (HMM/GMM) based approach is a parametric approach to clustering. It is computationally intensive because the parameters of the model must be re-estimated after every clustering iteration. The information bottleneck (IB) based approach to clustering the segments is non-parametric and is therefore faster than the HMM/GMM approach. We propose a two-pass information bottleneck based framework for speaker diarization. In this framework, the first pass derives cluster labels for the segments using the IB based approach. After the first pass, a multi-layer feedforward neural network (MLFFNN) is trained with the spectral feature vector of a frame of speech as the input. The cluster label obtained in the first pass is used to specify the desired output of the MLFFNN. Once the MLFFNN is trained, the outputs of the nodes in its final hidden layer are used to obtain a new feature vector representation of each frame. 
Principal component analysis is then carried out on the new feature vector representation to obtain a discriminative and uncorrelated feature vector representation of each frame. The original spectral feature vectors and the discriminative feature vectors of the frames are combined in the space of posteriors for GMM components. The combined posterior representation is used in the second pass to perform IB based clustering again and obtain a better segmentation. The effectiveness of the proposed framework in improving speaker diarization performance is demonstrated on the standard NIST Rich Transcription meeting datasets. |
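
As a rough illustration of what combining two feature streams "in the space of posteriors for GMM components" could look like, the sketch below takes a weighted average of two posterior distributions per segment and renormalizes the result. The abstract does not give the combination rule. The weighted average, the function name `combine_posteriors`, the `weight` parameter, and the assumption that both streams' posteriors are indexed over a common set of components are illustrative assumptions, not the speaker's stated method.

```python
def combine_posteriors(post_spectral, post_discriminative, weight=0.5):
    """Combine two posterior streams by a weighted average (assumed rule).

    post_spectral, post_discriminative: one list per segment, each a
        posterior distribution over the same set of GMM components.
    weight: hypothetical weight on the spectral stream, in [0, 1];
        the discriminative stream receives (1 - weight).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(post_spectral) != len(post_discriminative):
        raise ValueError("both streams must cover the same segments")

    combined = []
    for p, q in zip(post_spectral, post_discriminative):
        if len(p) != len(q):
            raise ValueError("both streams must share the same components")
        # Convex combination of the two posterior vectors.
        row = [weight * a + (1.0 - weight) * b for a, b in zip(p, q)]
        # Renormalize to guard against rounding drift in the inputs.
        total = sum(row)
        combined.append([v / total for v in row])
    return combined


if __name__ == "__main__":
    # Toy posteriors for two segments over two components (made-up values).
    spectral = [[0.8, 0.2], [0.3, 0.7]]
    discriminative = [[0.6, 0.4], [0.1, 0.9]]
    for row in combine_posteriors(spectral, discriminative, weight=0.5):
        print([round(v, 3) for v in row])
```

In a two-pass setup like the one described, the combined posteriors would then serve as the input representation for the second round of IB clustering. How the weight is chosen is not specified in the abstract.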