Title | : | Studying mixtures of microbial sequences through the lens of extended topic models |
Speaker | : | Brintha V P (IITM) |
Details | : | Mon, 6 Mar, 2023 3:30 PM @ CS25 |
Abstract | : | Microbes, including many bacterial and viral species, are abundant all around us and inside us. Recent advances in genome sequencing and associated computational methods have made it possible to probe the diversity and composition of the microbiota in a sample of interest. But challenges remain in handling the noisy, high-dimensional nature of microbial sequencing data. In our work, we focus on developing bioinformatics methods that address these challenges in the context of two specific applications: (i) detecting mixed infection in tuberculosis (TB) samples using extended topic models, and (ii) studying the interaction between two respiratory diseases, TB and COVID-19, in terms of their microbial compositions. This talk will elaborate on the computational approaches we have proposed for these applications, especially the first, and highlight the relevant results. Mixed infection, which occurs when two or more strains of the TB-causing bacterium are present in an individual at the same time, hinders the successful treatment of TB. While several studies exist on multi-drug-resistant TB, only a few have predicted the proportions and mutational profiles of the different strains present in a mixed-infection TB sample, and those rely on a reference database built from known strains. A main challenge, then, is to identify de novo strains that are not present in the reference database. First, we present a probabilistic generative model called Demixer to determine the proportions of mixed-infection strains in whole genome sequencing (WGS) data, using a hybrid approach that combines the advantages of reference-based and reference-free approaches. Our model extends latent Dirichlet allocation (LDA), a latent variable modeling technique widely used in text mining to determine the proportions of different topics in a document, to model mutations, and uses the mutations of known strains as seeds to help detect novel strains. 
Our proposed method precisely detected the identities and proportions of the mixed strains present in synthetic and real-world benchmark datasets. We demonstrate the generalizability of our model by applying it to both TB and SARS-CoV-2 WGS samples. Finally, we propose a new study to quantify the microbial composition of the respiratory tract in TB patients with and without COVID-19 using metagenomic sequencing (MGS) data, and discuss research challenges and opportunities in the analysis of such disease-disease MGS data. |