Title | : | Deep Learning Models based Approaches to Video Captioning using Multimodal Features |
Speaker | : | Hemalatha M (IITM) |
Details | : | Mon, 19 Jun, 2023 9:30 AM @ MR - I (SSB 233) |
Abstract | : | Video captioning is an important application of computer vision that generates a natural language description for a video. In this thesis, we address a few issues in video captioning and propose deep learning model-based approaches to improve the performance of video captioning. A video contains both spatial and temporal information, and this information is used to train a video captioning model. However, the description of a video typically depends on the domain to which the video belongs, such as sports, music, news, or cooking.

We first propose a domain-specific semantics-guided video captioning model that uses domain-based information. We classify videos into domains using a multi-layer feed-forward neural network. We then use domain-based information, such as domain-specific semantics, domain-specific vocabulary, and static and temporal features, to train a Long Short-Term Memory (LSTM)-based domain-specific decoder for each domain. During testing, each domain-specific decoder generates a description for a video, and the description with the maximum score is chosen as the caption of the video.

Many existing video captioning models trained using the Maximum Likelihood (ML) method, including the domain-specific semantics-guided model above, generate captions that include the most frequently repeated words in the training captions, leading to highly similar descriptions for a particular type of video. We therefore propose a Semantically Contextual Generative Adversarial Network (SC-GAN) that uses Reinforcement Learning (RL)-based training. The LSTM-based generator in SC-GAN is trained to generate descriptions similar to the ground truth descriptions, while the discriminator is trained to distinguish the ground truth descriptions from the generated ones. The generator is trained using two types of rewards: a goal-based reward, which ensures that the generated descriptions are similar to the ground truth descriptions, and a semantics-based reward, which ensures that the semantic keywords are included in the generated description.

The domain-specific semantics-guided and SC-GAN models process every video frame to extract features, but a video can be effectively summarized using only its keyframes. Hence, we propose a multimodal attention-based transformer that uses keyframe features. A bimodal attention block in the encoder captures the attention between the keyframe features and the features of the objects detected in the keyframes. A trimodal attention block in the decoder captures the attention between the semantic keyword embedding features, the word embedding features of the words generated up to the previous time step, and the features from the encoder.
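As an illustration of the kind of computation a trimodal attention block can perform, here is a minimal PyTorch-style sketch. The class name (TrimodalAttention), the use of three separate cross-attention layers, and the fusion by concatenation are assumptions made for exposition; they do not reproduce the exact architecture described in the thesis.

```python
# Hypothetical sketch of a trimodal attention block for a captioning decoder.
# Names and the concatenation-based fusion are illustrative assumptions.
import torch
import torch.nn as nn

class TrimodalAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # One cross-attention layer per modality: semantic keywords,
        # previously generated words, and encoder (keyframe/object) features.
        self.attn_sem = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_word = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_enc = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Fuse the three attended contexts back to the model width.
        self.fuse = nn.Linear(3 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, query, sem_feats, word_feats, enc_feats):
        # query:      (B, T, d_model) decoder states
        # sem_feats:  (B, K, d_model) semantic keyword embeddings
        # word_feats: (B, T, d_model) embeddings of words generated so far
        # enc_feats:  (B, N, d_model) features from the encoder
        c_sem, _ = self.attn_sem(query, sem_feats, sem_feats)
        c_word, _ = self.attn_word(query, word_feats, word_feats)
        c_enc, _ = self.attn_enc(query, enc_feats, enc_feats)
        fused = self.fuse(torch.cat([c_sem, c_word, c_enc], dim=-1))
        return self.norm(query + fused)  # residual connection
```

A decoder layer would call this block with the current decoder states as the query; gated or hierarchical fusion would be equally plausible alternatives to the simple concatenation used here.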
The previous models proposed in this thesis generate single-sentence descriptions for short videos. However, many real-world videos are longer and can be described only using multiple sentences. Thus, we propose a dense video captioning model that uses multimodal features. An audio-visual attention block combines the RGB features and flow features with audio features (a minimal sketch of this fusion is given below). The event proposal generation module uses the output features of the audio-visual attention block to predict the event boundaries. The caption generation module uses multimodal features to generate a description for each proposed event and combines them into a dense caption.

The proposed video captioning models are evaluated on the following benchmark datasets: the Microsoft Video Description (MSVD) corpus, the Microsoft Research Video to Text (MSR-VTT) dataset, and the Charades dataset. The dense video captioning model is evaluated on the ActivityNet dataset. Ablation studies are conducted to analyze the effectiveness of the proposed approaches, and the evaluation results show that the proposed approaches are more effective than other state-of-the-art approaches to video captioning. |

Web Conference Link | : | https://meet.google.com/anr-djsp-mbz |
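As a concrete illustration of the audio-visual fusion referred to above, the following is a minimal PyTorch-style sketch. The class name (AudioVisualAttention), the projection of concatenated RGB and flow features, and the cross-attention from visual to audio features are assumptions made for exposition; they are not taken from the thesis.

```python
# Hypothetical sketch of an audio-visual attention fusion step.
# Layer names and the specific fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn

class AudioVisualAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Project concatenated RGB + flow features to a common width.
        self.visual_proj = nn.Linear(2 * d_model, d_model)
        # Visual features attend over the audio feature sequence.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, rgb, flow, audio):
        # rgb, flow: (B, T, d_model) per-snippet visual features
        # audio:     (B, S, d_model) audio features
        visual = self.visual_proj(torch.cat([rgb, flow], dim=-1))
        attended, _ = self.cross_attn(visual, audio, audio)
        # Fused features can then be passed to an event proposal module.
        return self.norm(visual + attended)
```

In this sketch the fused features keep the visual temporal resolution, so a downstream event proposal module could predict boundaries over the same T snippets; the actual module interfaces in the thesis may differ.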