Title | : | Towards Efficient Deep Learning Networks for Superior Video Generation |
Speaker | : | Sonam Gupta (IITM) |
Details | : | Tue, 2 Apr, 2024 4:00 PM @ SSB-233 |
Meeting link | : | meet.google.com/kpi-iezm-nvp |
Abstract | : | The field of artificial intelligence has recently witnessed significant advancements in generating realistic visual data. One of the most challenging and interesting domains is video synthesis, which has numerous real-world applications ranging from content creation (such as text-to-video generation) and sampling more data from the underlying distribution to video editing and representation learning. The task of video generation is challenging due to the presence of the temporal dimension: unlike an image, a video contains a sequence of frames that transition over time. For a video to be perceived as real, the generative model must capture the spatial and temporal dynamics simultaneously, so that the generated frames are visually appealing as well as temporally coherent.
In our earlier work, we aimed to design deep learning architectures that address gaps in existing methods, and proposed architectures that generate videos of superior quality compared to state-of-the-art (SOTA) methods. Specifically, we explore the use of Generative Adversarial Networks (GANs) for video synthesis and propose two novel approaches. The first approach utilizes a wide GAN architecture with complementary feature learning; we hypothesize that a wide network can learn richer global and local features, since its multiple branches can process distinct attributes of the video. The second approach explores a recurrent GAN architecture consisting of a recurrent generator and a convolutional discriminator, along with the proposed TranConv LSTM. This design facilitates longer video generation during inference. We also investigate the generalizability of these approaches on conditional video generation tasks such as class-conditional video generation and text-to-video generation. Results on several benchmark video datasets illustrate the efficiency of the proposed approaches.
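As an illustration of the recurrent-generator idea (not the speaker's implementation), the sketch below unrolls a convolutional recurrent cell over time, so a model trained on fixed-length clips can be asked for more frames at inference. A standard ConvLSTM cell stands in for the proposed TranConv LSTM, and the latent size, channel counts, and 32x32 frame resolution are illustrative assumptions.

```python
# Minimal sketch: a recurrent video generator unrolled over time.
# A plain ConvLSTM cell is used as a stand-in for the TranConv LSTM
# described in the talk; all sizes below are illustrative.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        # One convolution produces the input, forget, output, and cell gates.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size=3, padding=1)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class RecurrentGenerator(nn.Module):
    def __init__(self, z_dim=64, hid_ch=32, feat_hw=8):
        super().__init__()
        self.feat_hw = feat_hw
        self.to_feat = nn.Linear(z_dim, hid_ch * feat_hw * feat_hw)
        self.cell = ConvLSTMCell(hid_ch, hid_ch)
        # Upsample the hidden state to an RGB frame (8x8 -> 32x32 here).
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(hid_ch, hid_ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hid_ch, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, z, num_frames):
        b = z.size(0)
        x = self.to_feat(z).view(b, -1, self.feat_hw, self.feat_hw)
        h, c = torch.zeros_like(x), torch.zeros_like(x)
        frames = []
        for _ in range(num_frames):        # unroll for any length at inference
            h, c = self.cell(x, (h, c))
            frames.append(self.decode(h))
        return torch.stack(frames, dim=1)  # (B, T, 3, 32, 32)

gen = RecurrentGenerator()
video = gen(torch.randn(2, 64), num_frames=24)  # e.g. train on 16 frames, sample 24
print(video.shape)  # torch.Size([2, 24, 3, 32, 32])
```

Because the temporal dependence lives entirely in the recurrent state, the number of generated frames is a free parameter at sampling time rather than being fixed by the architecture.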
Complementary to the above works, where we focus on improving the quality of the generated videos by carefully designing the neural network architecture, we also attempt to build a video representation that can serve as a more robust and powerful backbone for the video generation network. Previous works have relied on implicit neural representations (INRs), in which a video is represented by the weights of a small neural network. This representation is more powerful than the discrete representation of a video since it is agnostic to the video resolution. In our latest work, we propose a Polynomial Implicit Neural Representation of videos, called PNeRV. The polynomial nature of the representation allows the INR to capture the spatio-temporal dynamics effectively while enabling several downstream applications such as video compression, video denoising, and video super-resolution. PNeRV not only addresses the challenges posed by video data in the realm of INRs but also opens new avenues for advanced video processing and analysis.
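As an illustration of the implicit-representation idea (not the PNeRV implementation), the sketch below fits a small MLP that maps a normalised (x, y, t) coordinate to an RGB value, so the video is stored entirely in the network weights and can be queried at any resolution. The simple polynomial feature lift, network width, and training snippet are illustrative assumptions and may differ from the actual PNeRV design.

```python
# Minimal sketch of a coordinate-based video INR with a polynomial feature
# lift; illustrative only, not the PNeRV parameterisation.
import torch
import torch.nn as nn

def poly_features(coords, degree=3):
    # Stack coords, coords**2, ..., coords**degree along the feature axis.
    return torch.cat([coords ** k for k in range(1, degree + 1)], dim=-1)

class VideoINR(nn.Module):
    def __init__(self, degree=3, hidden=128):
        super().__init__()
        self.degree = degree
        self.mlp = nn.Sequential(
            nn.Linear(3 * degree, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),   # RGB in [0, 1]
        )

    def forward(self, coords):                    # coords: (N, 3) in [-1, 1]
        return self.mlp(poly_features(coords, self.degree))

# Fit the INR to one video by regressing pixel colours at sampled (x, y, t)
# coordinates; afterwards it can be queried on a finer grid (super-resolution)
# or only its weights stored (compression).
inr = VideoINR()
coords = torch.rand(4096, 3) * 2 - 1              # random space-time samples
target_rgb = torch.rand(4096, 3)                  # placeholder pixel values
loss = nn.functional.mse_loss(inr(coords), target_rgb)
loss.backward()
```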