Title: Enhancing Visual Commonsense Reasoning with Image Captions using Deep Learning Techniques
Speaker: Subham Das (IITM)
Details: Tue, 11 Jun 2024, 11:00 AM @ SSB-233
Abstract: Visual Commonsense Reasoning (VCR) is a challenging task in the domain of visual cognition. The Visual Question Answering (VQA) task involves only selecting a correct answer for a given image and question; the VCR task involves not only answer selection but also the identification of an appropriate rationale supporting the chosen answer. Traditionally, VCR models have relied predominantly on visual data for the reasoning process. However, achieving a comprehensive understanding of an image and utilizing it for proper reasoning remains a challenge, as the task often requires reasoning beyond the visual cues. We address this issue by supplying VCR models with additional information about the input image in the form of its caption. First, we explore different fusion techniques to integrate captions into existing VCR frameworks. Simple image captions are typically brief and may not fully capture the content of an image; to provide more comprehensive information, we propose using a dense caption of the input image for reasoning in the VCR models. We also explore a contrastive learning method for integrating captions into the VCR models. By incorporating a contrastive loss alongside the existing cross-entropy loss, the VCR model can utilize supervision from the caption information before determining the correct answer and rationale. This method avoids the need to merge the caption information into the VCR model, as required in the previously proposed approaches. We conducted experiments on the benchmark VCR dataset and observed an improvement in the reasoning performance of the proposed approaches over the baseline methods.
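
To make the fusion idea concrete, below is a minimal PyTorch sketch of one plausible fusion scheme: projecting a caption embedding into the VCR model's joint feature space and concatenating it with each candidate answer's representation before scoring. All module names, dimensions, and the concatenation-based design are illustrative assumptions, not details confirmed by the talk; a dense caption could be handled the same way, for example by encoding the concatenated region descriptions with the same text encoder.

import torch
import torch.nn as nn

class CaptionFusionHead(nn.Module):
    """Fuses a caption embedding with a VCR model's joint
    image-question-answer features before answer scoring.
    (Hypothetical module; not the speaker's exact architecture.)"""
    def __init__(self, joint_dim=512, caption_dim=768, hidden_dim=512):
        super().__init__()
        # Project the caption embedding into the joint feature space.
        self.caption_proj = nn.Linear(caption_dim, joint_dim)
        # Score each candidate answer from the fused representation.
        self.scorer = nn.Sequential(
            nn.Linear(2 * joint_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, joint_feats, caption_emb):
        # joint_feats: (batch, num_answers, joint_dim) from the base VCR model
        # caption_emb: (batch, caption_dim) from a text encoder
        cap = self.caption_proj(caption_emb)           # (batch, joint_dim)
        cap = cap.unsqueeze(1).expand_as(joint_feats)  # broadcast over answers
        fused = torch.cat([joint_feats, cap], dim=-1)  # concatenation fusion
        return self.scorer(fused).squeeze(-1)          # (batch, num_answers) logits

Other fusion choices, such as element-wise products or cross-attention over caption tokens, would slot in at the same point in place of the concatenation.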
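
The contrastive variant can be sketched as a combined objective. The snippet below uses an InfoNCE-style loss (an assumed formulation; the exact loss used in the work may differ) that aligns each example's ground-truth answer representation with its own caption embedding against in-batch negatives, added to the usual cross-entropy; alpha and temperature are hypothetical hyperparameters, and the caption embeddings are assumed to be already projected into the answer feature space.

import torch
import torch.nn.functional as F

def combined_loss(answer_logits, labels, joint_feats, caption_emb,
                  temperature=0.07, alpha=0.5):
    # Standard answer-classification loss over the candidate answers.
    ce = F.cross_entropy(answer_logits, labels)

    # Representation of the ground-truth answer for each example.
    pos_feats = joint_feats[torch.arange(labels.size(0)), labels]  # (batch, dim)

    # InfoNCE: each answer representation should match its own caption
    # embedding and mismatch the captions of other examples in the batch.
    a = F.normalize(pos_feats, dim=-1)
    c = F.normalize(caption_emb, dim=-1)       # assumed same dim as pos_feats
    logits = a @ c.t() / temperature           # (batch, batch) similarities
    targets = torch.arange(labels.size(0), device=logits.device)
    contrastive = F.cross_entropy(logits, targets)

    return ce + alpha * contrastive

Because the supervision enters only through this auxiliary loss, the caption encoder can be dropped at inference time, which is consistent with the abstract's point that this method avoids merging caption features into the VCR model itself.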