Title | : | Enhancing the Quality of TTS Systems for Conversational Speech Synthesis |
Speaker | : | Ishika Gupta (IITM) |
Details | : | Mon, 24 Jul, 2023 4:00 PM @ MR - I (SSB 233) |
Abstract | : | Current state-of-the-art text-to-speech (TTS) systems trained on read speech have fewer problems with repeated or skipped words and can produce natural-sounding speech. However, end-to-end (E2E) systems still have difficulty producing conversational speech, especially when generating out-of-domain (OOD) terms, and lack appropriate prosody. This paper proposes a novel data augmentation approach for building intelligible and expressive speech synthesis systems using the FastSpeech2 (FS2) architecture. Two studies are performed. Conversational-style phrases and short interrogative sentences are synthesized using a baseline FS2 system and a hidden Markov model-based speech synthesis (HTS) system, both trained on 8.5 hours of read speech in Hindi. This yields two synthetic datasets, DS1 (from FS2) and DH1 (from HTS). Using DS1 and DH1, we train FS2 models S1 and H1, respectively. While S1 sounds natural, H1 is more intelligible on OOD words. These systems are then further adapted with as little as 11 minutes of original, prosodically rich story data from the same speaker to produce systems S2 and H2, respectively. We evaluate three FS2-based models: the baseline FS2 and the proposed models S2 and H2. Subjective evaluation shows that S2 and H2 significantly outperform the baseline FS2 system, with average mean opinion scores (MOS) of 4.16 and 4.42, respectively. Further, H2 is better than S2 in terms of both MOS and intelligibility. We also perform objective evaluations and analyze the synthesized speech in terms of prosodic attributes to support our claims. |