Title: Video Frame Prediction using Adversarial Techniques
Speaker: Prateep Bhattacharjee (IITM)
Details: Fri, 24 Aug, 2018, 2:15 PM @ A M Turing Hall
Abstract: Video frame prediction has recently become a popular problem in computer vision, as it caters to a wide range of applications including autonomous vehicles, surveillance and robotics. The challenge lies in the fact that real-world scenes tend to be complex, and predicting future events requires modeling complicated internal representations of the ongoing events. The majority of past approaches focused on predicting semantics, which is useful for many decision-making problems. Recent methods, although better than the semantics-based frameworks, often underperform in environments that differ greatly from the training set and when predicting far into the future.
This talk will cover three novel approaches that address the aforementioned shortcomings in video frame synthesis by incorporating Generative Adversarial Networks (GANs). The first of these methods uses multiple stages of GANs trained with two novel objective functions: (a) Normalized Cross-Correlation Loss (NCCL) and (b) Pairwise Contrastive Divergence Loss (PCDL), which effectively capture inter-frame relationships. Performance analysis on the UCF-101 and KITTI datasets shows the effectiveness of the proposed loss functions in predicting future frames of a video. The second method improves over the previously proposed NCCL objective, using the GANs in an encoder-decoder-like setup along with a novel Locally Guided Gram Loss (LGGL) formulation to bring the target and generated frames closer in an intermediate feature space. This Gram-matrix-based minimization results in much more stable training of the network along with visible performance improvements, both qualitative and quantitative. Finally, we shall introduce a novel pixel-graph-based context aggregation layer (PixGraph), which efficiently captures long-range dependencies without the need for any explicit specialized objective functions. This method elegantly handles the complex issue of separate objects moving in different directions and at very dissimilar speeds. We also argue that, for real-world applications of frame prediction in areas such as autonomous vehicles, predicting semantic segmentations of the unseen future is a more challenging and necessary task. We show that our proposed graph-based model generalizes well to predicting segmentations as well as raw RGB values deep into the future, using two benchmark traffic video segmentation datasets: Cityscapes and CamVid.
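To give a rough sense of the NCCL idea: normalized cross-correlation measures similarity between two signals after removing their means and scales. The sketch below is only a minimal global-frame illustration in numpy, not the speaker's exact formulation (which, per the abstract, is designed to capture inter-frame relationships and may operate on local patches across consecutive frames); the function name `ncc_loss` and the global normalization are assumptions for illustration.

```python
import numpy as np

def ncc_loss(pred, target, eps=1e-8):
    """Illustrative loss based on normalized cross-correlation (NCC).

    NCC lies in [-1, 1] (1 = perfectly correlated), so 1 - NCC
    gives a quantity to minimize. This is a global-frame sketch;
    patch-wise variants compute the same statistic over local windows.
    """
    # Zero-mean both frames so correlation ignores brightness offsets.
    p = pred - pred.mean()
    t = target - target.mean()
    # Normalize by the product of norms so scale is factored out.
    denom = np.sqrt((p ** 2).sum()) * np.sqrt((t ** 2).sum()) + eps
    ncc = (p * t).sum() / denom
    return 1.0 - ncc
```

Identical frames yield a loss near 0, while a sign-flipped frame (perfect anti-correlation) yields a loss near 2.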
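Similarly, the Gram-matrix minimization behind LGGL can be illustrated with the standard Gram-matrix construction familiar from neural style transfer: channel-by-channel co-activation statistics of a feature map. This is a hedged sketch assuming feature maps of shape (channels, height, width); the "locally guided" aspect of the actual LGGL formulation is not reproduced here, and the names `gram_matrix`/`gram_loss` are illustrative.

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a (C, H, W) feature map: channel co-activations."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    # (C, C) matrix of inner products between channel activations,
    # normalized by the number of entries for scale stability.
    return f @ f.T / (c * h * w)

def gram_loss(pred_feats, target_feats):
    """Mean squared distance between the two Gram matrices."""
    g_pred = gram_matrix(pred_feats)
    g_target = gram_matrix(target_feats)
    return np.mean((g_pred - g_target) ** 2)
```

Matching the Gram matrices of generated and target features in an intermediate layer pushes the generator toward the target's feature statistics rather than exact pixel values, which is one plausible reason such terms stabilize adversarial training.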