Title | : | Anwesha: A Prototype for Search in Bangla |
Speaker | : | Arup Das (IITM) |
Details | : | Tue, 11 Apr, 2023 3:30 PM @ MR - I (SSB 233) |
Abstract | : | Bengali, or Bangla, is a low-resource, highly agglutinative language. Designing effective search and retrieval systems in Bangla is quite challenging. However, these systems can help more people access information without any language barrier. We present our explorations toward building "অন্বেষা" / Anbēṣā (EN: Anwesha), a prototype for a search engine in Bangla. Since no gold standard dataset is available to evaluate the effectiveness of Bangla Information Retrieval Systems, we have created a gold standard dataset containing 194 queries over 2182 documents from diverse domains. We analyze the system's performance on queries of varying difficulty levels depicting various search scenarios. To the best of our knowledge, Anwesha is the first such initiative in Bangla that facilitates retrieval of semantically related documents by using diverse knowledge sources like IndoWordNet, statistical co-occurrences (by way of Latent Semantic Analysis (LSA)) and external knowledge sources like Wikipedia (by way of Explicit Semantic Analysis (ESA)). We present methods to enhance the user's understanding of the search results by highlighting keywords within the top retrieved documents that LSA or ESA consider semantically related to the query. We also overcome the limitations of existing spell-check and lemmatization approaches in Bangla and integrate them into Anwesha. A Word Sense Disambiguation approach, the Adapted Lesk Algorithm, is used to identify the appropriate sense of the words in the query. IndoWordNet is used to expand the query, provided the user settings allow it. We utilize Named Entities to improve the retrieval effectiveness of keyword-based search algorithms. We suggest a reasonable starting model for leveraging implicit preference feedback based on the user search behaviour to enhance the results retrieved by the Explicit Semantic Analysis (ESA) approach. 
We use contextual sentence embeddings obtained via Language-agnostic BERT Sentence Embedding (LaBSE) to rerank the candidate documents retrieved by the various search algorithms, scoring each document by the sentences in it that are most relevant to the query. We present our empirical findings across these directions, critically analyze the results and provide insights into the effectiveness of different techniques. We demonstrate promising improvements over the existing Bangla search systems Anwesan, Pipilika and Sandhan. We hope our work will inspire researchers working in the information retrieval domain in other low-resource, highly inflected Indian regional languages similar to Bangla, such as Assamese, Maithili, Oriya, Hindi, and Manipuri, to facilitate effective semantic search in their respective languages. |
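
As a rough illustration of the reranking step mentioned in the abstract, the sketch below scores each candidate document by its single most query-similar sentence and sorts the candidates by that score. It is not the speaker's implementation: the abstract does not say how the best sentences are combined, so taking the maximum cosine similarity is an assumption, and the function names (`cosine`, `rerank_by_best_sentence`) and the three-dimensional toy vectors standing in for LaBSE sentence embeddings are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank_by_best_sentence(
    query_vec: Vector, docs: Dict[str, List[Vector]]
) -> List[Tuple[str, float]]:
    """Rank documents by the similarity of their best-matching sentence.

    `docs` maps a document id to the embeddings of its sentences.
    Assumption: a document's score is the maximum query-sentence cosine.
    """
    scored = []
    for doc_id, sentence_vecs in docs.items():
        best = max((cosine(query_vec, s) for s in sentence_vecs), default=0.0)
        scored.append((doc_id, best))
    # Highest-scoring document first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real sentence embeddings (hypothetical data).
query = [1.0, 0.0, 0.0]
candidates = {
    "doc_a": [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0]],
    "doc_b": [[0.5, 0.5, 0.0]],
    "doc_c": [[0.0, 0.0, 1.0]],
}
ranking = rerank_by_best_sentence(query, candidates)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

In a real pipeline the toy vectors would be replaced by embeddings of the query and of each sentence in the documents returned by the first-stage search algorithms.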