Title | : | Developing Input Tools and Romanized Tools for Indian Languages |
Speaker | : | Yash Hareshkumar Madhani (IITM) |
Details | : | Tue, 27 Jun, 2023 9:00 AM @ Google Meet |
Abstract | : | We introduce resources for the transliteration and language identification (LID) tasks to advance research on Indian languages. These resources are especially important in the Indian language context because of the use of multiple scripts and the widespread use of romanized input. However, few training and evaluation sets are publicly available for these tasks, which makes it difficult to develop and evaluate robust models. To address this gap, we present Aksharantar (which means transliteration in Sanskrit), the largest publicly available transliteration dataset for Indian languages, created by mining monolingual and parallel corpora and by collecting data from human annotators. The dataset contains 26 million transliteration pairs for 21 Indic languages from 3 language families using 12 scripts. Aksharantar is 21 times larger than existing datasets. It is the first publicly available dataset for 7 languages and 1 language family. Using the training set, we trained IndicXlit, a multilingual transliteration model that improves accuracy by 15% on the Dakshina test set over the best-reported results and establishes strong baselines on the Aksharantar test set introduced along with this work. Our analysis reveals directions to focus on for further improvement of Indic transliteration models. In the second work, we present Bhasha-Abhijnaanam, a language identification test set that spans all 22 Indic languages listed in the Indian constitution, in both native-script and romanized text. We designed this test set to address gaps in previous benchmarks that did not cover all these languages, and the romanized test set increases the coverage from 11 languages to 20 languages. We also trained IndicLID, a language identifier for all the above-mentioned languages in both native and romanized scripts. For native-script text, it has better language coverage than existing LIDs and is competitive with or better than other LIDs. 
Additionally, IndicLID is the first LID for romanized text in Indian languages. Two major challenges for romanized text LID are the lack of training data and low LID performance when languages are similar, and we provide simple and effective solutions to both problems. In general, there has been limited work on romanized text in any language, and our findings are relevant to other languages that need romanized language identification. All the datasets, models, mining scripts, and guidelines are publicly available under open-source licenses. We hope that the availability of these large-scale, open resources will spur innovation in Indic language transliteration and language identification and in their downstream applications. Google Meet Link: https://meet.google.com/yei-pegr-hyb |