Title: Visually Grounded Reasoning in the era of Pre-trained Models
Speaker: Sanjay Subramanian (CS PhD Student, UC Berkeley)
Details: Thu, 21 Dec 2023, 11:00 AM @ SSB-233
Abstract: Two important goals for visual reasoning systems are transferability across visual domains and the ability to handle complex, multi-step queries. We show that recent pre-trained models, combined with carefully designed inference procedures, enable significant progress toward both goals without requiring large amounts of labeled data. In the first part of the talk, we show how to build a zero-shot referring expression comprehension system using CLIP. Our method, ReCLIP, performs well across visual domains, unlike previous methods trained on labeled data from a single domain. In the second part of the talk, we present a method for multi-step visual question answering that consists of a code generation step powered by a language model, followed by a code execution step powered by off-the-shelf vision models. Our approach, CodeVQA, improves over prior work on few-shot visual question answering. Throughout the talk, we will also highlight important limitations and challenges of today's pre-trained models.
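As a rough illustration of the first part, the sketch below shows the core idea of scoring image regions against a referring expression with CLIP and picking the highest-scoring one. The image path, boxes, and checkpoint name are hypothetical, and this is only the basic crop-and-score step; the full ReCLIP method adds further machinery not shown here.

```python
# Minimal sketch: CLIP-based region scoring for referring expression
# comprehension. Hypothetical inputs throughout; not ReCLIP's full method.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")  # hypothetical input image
# Hypothetical region proposals as (left, upper, right, lower) boxes,
# e.g. from an off-the-shelf proposal network.
boxes = [(0, 0, 200, 200), (150, 50, 400, 300), (300, 100, 640, 480)]
expression = "the dog on the left"

# Crop each proposal and score every crop against the expression.
crops = [image.crop(box) for box in boxes]
inputs = processor(text=[expression], images=crops,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds one image-text similarity score per crop.
scores = outputs.logits_per_image.squeeze(-1)
best_box = boxes[scores.argmax().item()]
print("Predicted region:", best_box)
```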
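The second part's generate-then-execute structure can be pictured as in the sketch below. The primitive names (query, get_pos) and the helper generate_code are illustrative assumptions about the interface, not the paper's exact API; each stub would be backed by an off-the-shelf model in a real system.

```python
# Minimal sketch of a CodeVQA-style two-step pipeline: a language model
# writes a short Python program over visual primitives, and executing that
# program produces the answer. All names here are illustrative stand-ins.

def query(image, question: str) -> str:
    # Stand-in for a single-step VQA model applied to `image`.
    raise NotImplementedError("plug in an off-the-shelf VQA model")

def get_pos(image, description: str) -> tuple[float, float]:
    # Stand-in for a grounding model that localizes the described object.
    raise NotImplementedError("plug in an off-the-shelf grounding model")

def generate_code(question: str) -> str:
    # Stand-in for a few-shot-prompted code LM that writes a program
    # answering `question` in terms of the primitives above.
    raise NotImplementedError("plug in a code-generation language model")

def code_vqa(image, question: str) -> str:
    # Step 1: generate a program for this question.
    program = generate_code(question)
    # Step 2: execute it with the visual primitives in scope; the program
    # is expected to store its result in a variable named `answer`.
    scope = {"image": image, "query": query, "get_pos": get_pos}
    exec(program, scope)
    return scope["answer"]

# For "Is the cup left of the laptop?", the LM might generate:
#   cup_x, _ = get_pos(image, "cup")
#   laptop_x, _ = get_pos(image, "laptop")
#   answer = "yes" if cup_x < laptop_x else "no"
```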