Title | : | Image Captioning using Deep Neural Networks that Dispense with Recurrence |
Speaker | : | Sandeep Narayanan (IITM) |
Details | : | Wed, 20 Nov 2019, 3:00 PM @ AM Turing Hall |
Abstract | : | Describing the contents of an image has long been a fundamental problem in artificial intelligence. The problem has found many applications in aiding visually impaired users, improving human-machine communication, scene ontology, and organizing visual data. An image captioning model learns to describe image content from a dataset of image-sentence pairs. Recent approaches follow an encoder-decoder paradigm, in which a convolutional neural network (CNN) serves as the image encoder. Recurrent neural networks (RNNs), typically enhanced with long short-term memory (LSTM) units, have become the dominant decoder in such frameworks. Although they mitigate the vanishing gradient problem and capture dependencies across time steps, they are inherently sequential in time. Attention mechanisms coupled with encoder-decoder architectures have also become integral to image captioning. In general, current attention mechanisms compute a weighted sum of the encoded image representations and therefore have limited modelling capability. Adding explicit high-level semantic concepts of the input has been shown to substantially improve the performance of image captioning models. The first part of this work presents an approach that dispenses with recurrence entirely by adopting the Transformer, a sequence-generation architecture that relies solely on attention mechanisms. The positional encodings are adapted to exploit the two-dimensional structure of the image, and a regularization component is added during training. The second part presents a novel captioning network with attention-like properties pervading its layers, which uses a CNN to refine and combine representations at multiple levels of the architecture. The model also exploits explicit higher-level semantic information obtained by performing panoptic segmentation on the image. The attention capability of the model is demonstrated visually, and a quantitative experimental evaluation is performed on the MS-COCO dataset. The proposed approaches are efficient and outperform state-of-the-art image captioning architectures. |
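
As background for the attention discussion in the abstract, the soft attention commonly used in captioning decoders can be written as a softmax-weighted sum over the CNN-encoded image regions. The sketch below is a minimal NumPy illustration of that weighted sum; the variable names (V, q) and feature dimensions are illustrative assumptions, not the speaker's implementation.

    # Minimal sketch of soft attention over encoded image regions: a decoder
    # state (query) produces a weighted sum of CNN-encoded region features.
    import numpy as np

    def soft_attention(V, q):
        """V: (num_regions, d) encoded image features, q: (d,) decoder query."""
        scores = V @ q                      # one scalar score per image region
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()            # softmax over regions
        context = weights @ V               # weighted sum of region features
        return context, weights

    V = np.random.randn(49, 512)            # e.g. a flattened 7x7 CNN feature map
    q = np.random.randn(512)
    context, weights = soft_attention(V, q)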
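
The abstract also mentions adapting positional encodings to the two-dimensional nature of the image. One common way to do this, sketched below under the assumption of sinusoidal encodings (the talk's exact formulation may differ), is to encode the row and column indices separately and concatenate the two halves for every spatial position.

    # Sketch of a 2-D sinusoidal positional encoding: half of the channels
    # encode the row index, half encode the column index, each using the
    # standard sine/cosine scheme from the Transformer.
    import numpy as np

    def sinusoid(pos, dim):
        """Standard 1-D sinusoidal encoding for integer positions `pos`."""
        i = np.arange(dim // 2)
        freqs = 1.0 / (10000 ** (2 * i / dim))
        angles = pos[:, None] * freqs[None, :]
        return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

    def positional_encoding_2d(height, width, d_model):
        """Returns (height*width, d_model); d_model must be divisible by 4."""
        rows = sinusoid(np.arange(height), d_model // 2)   # (H, d/2)
        cols = sinusoid(np.arange(width), d_model // 2)    # (W, d/2)
        pe = np.concatenate([
            np.repeat(rows, width, axis=0),   # row part, row-major over the grid
            np.tile(cols, (height, 1)),       # column part for every cell
        ], axis=-1)
        return pe

    pe = positional_encoding_2d(7, 7, 512)     # one encoding per 7x7 grid cell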