From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering


Video understanding has achieved great success in representation learning, with tasks such as video captioning, video object grounding, and descriptive video question answering. However, current methods still struggle with video reasoning, including evidence reasoning and commonsense reasoning. To facilitate deeper video understanding that moves towards video reasoning, we present the task of Causal-VidQA, which includes four types of questions ranging from scene description (description) to evidence reasoning (explanation) and commonsense reasoning (prediction and counterfactual). For commonsense reasoning, we set up a two-step solution: answering the question and providing a proper reason. Through extensive experiments on existing VideoQA methods, we find that state-of-the-art methods are strong at description but weak at reasoning. We hope that Causal-VidQA can guide research on video understanding from representation learning to deeper reasoning. The dataset and related resources are available at https://

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022)