Title | : | Beyond associations: Causal Relation Extraction of various biomedical entities |
Speaker | : | Nency Bansal (IITM) |
Details | : | Wed, 18 Sep, 2024 11:00 AM @ MR-1 (SSB 233, First |
Abstract | : | Information on causal relationships is essential to many sciences, including biomedical science, where knowing whether a gene-disease relation is causal rather than merely associative can lead to better treatments. It can also foster research on machine learning that uses causal side information. Despite much work on Relation Extraction (RE), automatically extracting causal relations from large text corpora remains less explored. The few existing Causal RE (CRE) studies are limited to extracting causality within a sentence or for a particular disease, mainly because a diverse benchmark dataset is lacking. Here, we carefully curate a new CRE Dataset (CRED) of 3553 causal and non-causal gene-disease pairs, spanning 284 diseases and 500 genes, that occur within or across sentences of 267 published abstracts. CRED is assembled in two phases to reduce class imbalance, and its inter-annotator agreement is 89%. To assess CRED’s utility in classifying causal vs. non-causal pairs, we compared multiple classifiers and found that SVM performed best (F1 score 0.70). CRED outperformed a state-of-the-art RE dataset in terms of both classifier performance and model interpretability (i.e., whether the model focuses importance/attention on words with causal connotations in abstracts). We also applied our model to real-world data, namely all the abstracts on Parkinson's disease (PD), and observed that genes predicted to be causal in at least 50 papers were already linked to PD in books. Next, to bridge the gap between experimental findings and published literature on gene-gene causal relations, we extract literature data from the EVEX database and experimental data from CRISPR experiments, and merge them to improve the reliability of, or to validate, the causal genes extracted from either source. As next steps, we plan to compute a corpus-wide score for each causal relation using truth discovery and other approaches. |
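
The Parkinson's disease step in the abstract (keeping genes predicted to be causal in at least 50 papers) can be illustrated with a minimal sketch. This is not the speaker's implementation: the function name, the `(paper_id, gene, is_causal)` input format, and the placeholder gene and paper identifiers are all assumptions made for illustration, and the toy example uses a threshold of 2 instead of 50.

```python
from collections import defaultdict


def genes_supported_by_many_papers(predictions, min_papers=50):
    """Return {gene: number of distinct papers} for genes predicted causal
    in at least `min_papers` distinct papers.

    `predictions` is an iterable of (paper_id, gene, is_causal) tuples.
    This input format is hypothetical, chosen only for this sketch.
    """
    papers_by_gene = defaultdict(set)
    for paper_id, gene, is_causal in predictions:
        if is_causal:
            # A set ensures each paper counts once per gene, even if the
            # same gene is predicted causal several times in one abstract.
            papers_by_gene[gene].add(paper_id)
    return {
        gene: len(papers)
        for gene, papers in papers_by_gene.items()
        if len(papers) >= min_papers
    }


# Toy data with placeholder identifiers (not real predictions).
toy_predictions = [
    ("paper1", "GENE_A", True),
    ("paper2", "GENE_A", True),
    ("paper2", "GENE_A", True),   # repeated mention in the same paper
    ("paper1", "GENE_B", True),
    ("paper3", "GENE_B", False),
    ("paper3", "GENE_C", False),
]

supported = genes_supported_by_many_papers(toy_predictions, min_papers=2)
print(supported)
```

Counting distinct papers rather than raw predictions is a design choice of this sketch; it keeps a single abstract that mentions a gene many times from inflating that gene's support.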