Title | : | End-to-End Speech Synthesis Systems for Indian Languages |
Speaker | : | Ms. Anusha Prakash (IITM) |
Details | : | Fri, 5 Jul, 2024 11:00 AM @ Google Meet |
Abstract | : | A text-to-speech (TTS) synthesiser is an important speech technology which generates speech corresponding to a given text. With the emergence of neural-network-based end-to-end (E2E) approaches, training TTS systems has become easier when a large amount of data is available for a language. However, developing E2E speech synthesisers for Indian languages presents unique challenges due to the scarcity of high-quality training data and multiple grapheme representations across different languages.
The primary objective of this thesis is to train high-quality E2E TTS systems for Indian languages. We first develop a unified representation to handle the issue of different scripts across languages. Next, multilingual TTS systems are trained based on language families, which are then extended to accommodate new languages having limited or no data, but belonging to the same language family. The aim is to exploit the phonotactic similarities across languages belonging to the same family. The analyses show that language family-wise training of Indic systems is the way forward for the Indian subcontinent, where a large number of languages are spoken.
Sentences in Indian languages are generally longer than those in English. Indian languages are also considered to be phrase-based, wherein semantically complete phrases are concatenated to make up sentences. We explore an inter-pausal unit (IPU) based approach in the E2E framework, focusing on synthesising conversational-style text. The IPU-based approach requires fewer computational resources and produces prosodically richer synthesis compared to conventional sentence-based systems. In the last part of the thesis, we highlight the importance of accurate phone alignments for TTS system building using a signal-processing-directed alignment approach.
Our research underscores the significance of seamlessly integrating linguistic knowledge and signal-processing techniques within the E2E framework. Our findings show that the proposed models outperform state-of-the-art TTS systems available for Indian languages. Most importantly, the methodologies proposed in this thesis are highly adaptable, being largely independent of specific architectural constraints. As such, they can be readily applied to other emerging E2E architectures.
Meeting Link: https://meet.google.com/cgu-hiuj-njm |