Title | : | Hybrid Speech Synthesis Systems for Indian Languages |
Speaker | : | Sudhanshu Srivastava (IITM) |
Details | : | Wed, 13 Mar, 2024 12:00 PM @ Google Meet |
Abstract | : | There are primarily two paradigms in classical methods for text-to-speech (TTS) synthesis: Unit Selection Synthesis (USS) and Hidden-Markov-model (HMM)-based TTS (HTS). USS is a concatenative approach that retains the natural timbre of the recorded units. HTS is a parametric approach to speech synthesis. HTS systems offer flexibility in speaking styles, a small footprint, and fast training and synthesis, all while being computationally less demanding. However, USS may produce abrupt joins, and HTS speech sounds muffled owing to the statistical averaging of parameters. The state-of-the-art systems are neural-network (NN)-based end-to-end (E2E) speech synthesizers. While the quality of E2E systems is very natural, they do not scale well to low-resource scenarios and suffer from poor-quality conversational speech synthesis.
The objective is to propose hybrid TTS systems that combine the benefits of USS and HTS with the E2E framework, yielding systems with a small footprint that preserve naturalness. We first present a technique for combining classical USS with E2E systems to create a hybrid model for Indian languages. This hybrid system guides the USS system using the E2E approach.
Additionally, we propose a method that combines HTS with the NN-based Waveglow vocoder, incorporating histogram equalization (HEQ) to improve performance in low-resource settings. This hybrid approach bridges the two paradigms by applying HEQ across HTS-generated audio and the original recorded waveforms from the training data. Furthermore, we propose another hybrid approach that combines HMM-based feature generation with the NN-based HiFi-GAN vocoder to enhance HTS synthesis quality. In this setup, HTS is trained on mel-spectrograms instead of traditional mel-generalized cepstral coefficients (MGCs), and the mel-spectrogram generated for the input text is used to reconstruct speech through the HiFi-GAN vocoder. This system achieves speech quality comparable to that of E2E systems, without artifacts or skips, while offering fast, GPU-independent inference. These results are promising, as they pave the way to good-quality TTS systems that need less data than E2E systems.
Web Conference Link | : | meet.google.com/sog-mfsc-dfu |
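The histogram-equalization idea mentioned in the abstract can be illustrated as quantile matching between two feature distributions. The sketch below is a minimal, hypothetical illustration (per-dimension quantile mapping with NumPy); the exact HEQ procedure used in the work may differ.

```python
import numpy as np

def histogram_equalize(gen, ref, n_quantiles=100):
    """Map each feature dimension of `gen` (e.g. HTS-generated features,
    shape: frames x dims) onto the distribution of `ref` (e.g. features
    of natural recordings) via quantile matching -- a simple form of
    histogram equalization. Illustrative sketch only."""
    qs = np.linspace(0.0, 1.0, n_quantiles)
    out = np.empty_like(gen, dtype=float)
    for d in range(gen.shape[1]):
        gen_q = np.quantile(gen[:, d], qs)   # source quantile values
        ref_q = np.quantile(ref[:, d], qs)   # target quantile values
        # rank of each generated value under the source CDF
        ranks = np.interp(gen[:, d], gen_q, qs)
        # push ranks through the reference inverse CDF
        out[:, d] = np.interp(ranks, qs, ref_q)
    return out
```

After this mapping, the generated features follow the reference distribution much more closely, which is the intuition behind using HEQ to reduce the statistical-averaging mismatch between HTS output and natural speech.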