Zero-Shot Video Question Answering with Procedural Programs
By Ashlyn Lacovara
Researchers Rohan Choudhury, Kris M. Kitani, and László A. Jeni from the Robotics Institute at Carnegie Mellon University, along with Koichiro Niinuma from Fujitsu Research, have developed a method that answers questions about videos by generating and then executing short programs, arriving at a final answer through a sequence of visual subtasks. While most computer vision methods attempt to answer a question in a single step, their method, Procedural Video Querying (ProViQ), breaks a visual task down into smaller, sequential subtasks. Each subtask is handled by a separate, bespoke visual model used as a tool in service of the larger visual problem, and each step in the program builds on the results of the previous one.
ProViQ leverages a large language model to generate programs from an input question and a set of visual modules provided in the prompt, then executes those programs to obtain the desired output. Recent procedural methods have shown success in image question answering, but video analysis remains significantly more challenging. ProViQ incorporates specialized modules designed for video understanding, which helps it generalize across various video formats.
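To make the idea concrete, here is a minimal sketch of what a generated program and its execution might look like. The module names (find, caption, select_answer), the toy data, and the keyword-matching stubs are illustrative assumptions for this sketch, not ProViQ's actual modules or prompts.

```python
def find(video, object_name):
    """Stub: a real module would run an object detector and return matching frames."""
    return [f for f in video if object_name in f["labels"]]

def caption(frames):
    """Stub: a real module would run a video captioning model on the given frames."""
    return " ".join(f["caption"] for f in frames)

def select_answer(question, context, choices):
    """Stub: a real module would query an LLM; here we just match words in the context."""
    return max(choices, key=lambda c: sum(w in context for w in c.split()))

# Program text an LLM might generate for "What does the person pick up?";
# each line builds on the result of the previous one.
generated_program = """
frames = find(video, "person")
context = caption(frames)
answer = select_answer(question, context, choices)
"""

# Execute the generated program against toy inputs.
scope = {
    "find": find, "caption": caption, "select_answer": select_answer,
    "video": [{"labels": ["person"], "caption": "a person picks up a red mug"}],
    "question": "What does the person pick up?",
    "choices": ["a red mug", "a book"],
}
exec(generated_program, scope)
print(scope["answer"])  # -> "a red mug"
```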
“One advantage of our approach is that we can design bespoke tools for ProViQ to use for certain tasks. The language model can figure out which tools to use for which task simply through prompting. For example, we use video captioning models and LLMs to summarize long videos, which leads to a large improvement on the EgoSchema question-answering benchmark.”
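As a rough illustration of that summarization strategy, the sketch below captions short clips and hands the captions to a language model so a question can be answered against the summary rather than the raw video. The helpers caption_clip and llm are stand-in stubs, not the models ProViQ actually uses.

```python
def caption_clip(clip):
    """Stub: a video captioning model would describe this short clip."""
    return clip["caption"]

def llm(prompt):
    """Stub: a language model would condense the captions into a summary."""
    return "Summary: " + "; ".join(prompt.splitlines()[1:])

def summarize_video(clips):
    # Caption each short clip, then ask the LLM to summarize the sequence.
    captions = [caption_clip(c) for c in clips]
    prompt = "Summarize these clip captions:\n" + "\n".join(captions)
    return llm(prompt)

clips = [{"caption": "person chops vegetables"},
         {"caption": "person stirs a pot on the stove"}]
print(summarize_video(clips))
# -> "Summary: person chops vegetables; person stirs a pot on the stove"
```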
This code generation framework also equips ProViQ for video tasks beyond question answering, including multi-object tracking and basic video editing. ProViQ achieves state-of-the-art performance on a range of academic benchmarks, with improvements of up to 25% on short, long, open-ended, and multimodal video question-answering datasets.
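For intuition, a generated program for a task like tracking could chain a detection module with a simple association step, as in the toy sketch below. The detect and associate helpers are hypothetical and far simpler than ProViQ's actual tracking module.

```python
def detect(frame):
    """Stub: an object detector would return (x, y) box centers for this frame."""
    return frame["centers"]

def associate(tracks, centers, max_dist=5.0):
    """Greedy nearest-neighbor association of new detections to existing tracks."""
    for cx, cy in centers:
        best_id, best_d = None, max_dist
        for tid, pts in tracks.items():
            px, py = pts[-1]
            d = ((cx - px) ** 2 + (cy - py) ** 2) ** 0.5
            if d < best_d:
                best_id, best_d = tid, d
        if best_id is None:          # no track nearby: start a new one
            best_id = len(tracks)
            tracks[best_id] = []
        tracks[best_id].append((cx, cy))
    return tracks

video = [{"centers": [(10, 5), (40, 20)]}, {"centers": [(11, 6), (41, 21)]}]
tracks = {}
for frame in video:                  # each step extends the tracks from the last
    tracks = associate(tracks, detect(frame))
print(tracks)  # -> {0: [(10, 5), (11, 6)], 1: [(40, 20), (41, 21)]}
```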
Please follow this link to read more about this research, which will be presented at ECCV 2024.