Title: Building and Evaluating Large-Scale, Broad-Coverage Multilingual Translation Models for Indian Languages
Speaker: Pranjal Agadh Chitale (IITM)
Details: Thu, 6 Jun 2024, 2:00 PM @ SSB-233
Abstract: India's linguistic diversity, encompassing languages from four major families spoken by over a billion people, presents unique challenges and opportunities for Machine Translation (MT). This work addresses the gap in high-quality, accessible MT systems for India's 22 constitutionally recognized languages. Previously, there was no comprehensive parallel training data, no robust India-centric benchmarks, and no translation models supporting all of these languages. The first part of this study focuses on creating and open-sourcing high-quality training datasets, diverse benchmarks, and robust multilingual translation models. The first major contribution is the BPCC (Bharat Parallel Corpora Collection), the largest publicly available collection of parallel corpora for Indic languages, consisting of 230 million bitext pairs, including 126 million newly added pairs and 644K professionally translated sentence pairs. We further release the first multi-domain, India-centric, n-way parallel benchmark covering all 22 languages, spanning both prose and conversational content. The IndicTrans2 models, which support all 22 languages, outperform all existing models, including commercial systems, across multiple benchmarks in both English-centric and non-English-centric directions. To encourage widespread adoption, we also release lightweight distilled variants of these models that offer performance competitive with the larger models.

The second part of this work explores the use of Large Language Models (LLMs) for MT via in-context learning (ICL). It provides the first comprehensive analysis of ICL for MT, revealing that ICL is driven primarily by examples rather than instructions. The study examines factors influencing performance, such as the quality and quantity of demonstrations, their spatial proximity, and the originality of source versus target content. It also explores challenging scenarios such as indirectness and misalignment, finding that the quality of the target distribution is crucial and that certain perturbations can even enhance performance. Notably, ICL does not require examples from the same task; related tasks that share the target distribution are sufficient.
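For readers unfamiliar with ICL for MT, the sketch below illustrates how such prompts are typically assembled: a handful of source-target demonstrations followed by the test source sentence, which the LLM completes with a translation. The function name build_icl_prompt, the "Language: sentence" delimiter format, and the example sentence pairs are illustrative assumptions, not the exact template evaluated in the talk.

    # Minimal sketch of few-shot ICL prompt construction for MT (illustrative only).
    def build_icl_prompt(demos, src_sentence, src_lang="English", tgt_lang="Hindi"):
        """Concatenate k translation demonstrations, then the test source sentence."""
        lines = []
        for src, tgt in demos:
            lines.append(f"{src_lang}: {src}")
            lines.append(f"{tgt_lang}: {tgt}")
            lines.append("")  # blank line separating demonstrations
        # The prompt ends with an empty target slot for the LLM to complete.
        lines.append(f"{src_lang}: {src_sentence}")
        lines.append(f"{tgt_lang}:")
        return "\n".join(lines)

    if __name__ == "__main__":
        demos = [
            ("How are you?", "आप कैसे हैं?"),
            ("The weather is nice today.", "आज मौसम अच्छा है।"),
        ]
        print(build_icl_prompt(demos, "Where is the railway station?"))

Note that no task instruction appears anywhere in the prompt; as the abstract indicates, the demonstrations themselves carry most of the signal the model needs.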