Towards visual reasoning in deep learning – VISA DEEP
In the last decade, Machine Learning and Deep Learning have been at the heart of a technological and scientific revolution. Artificial Intelligence has been reborn, with huge advances in automatic translation, audio understanding and computer vision. These breakthroughs now reach many application fields, a number of which involve complex reasoning tasks.
L. Bottou discusses several approaches to implementing machine reasoning. Following his comments on the differences between first-order logic reasoning, probabilistic reasoning and other forms, this AI chair project focuses on trainable reasoning systems suited to visual understanding tasks.
Trainable reasoning systems in deep learning
In this AI chair, we propose to investigate visual reasoning tasks that go beyond ImageNet classification. This requires building reasoning processes into the visual analysis pipeline. We intend to explore how elementary reasoning blocks can be combined into deep architectures, and to question which types of blocks, structures and rules (if any) can be included.
The main requirement on the structures we consider is that the final hybrid (explicit/implicit) architecture be end-to-end trainable. We have already observed in several contexts that fully trainable architectures improve performance. Requiring the final DNN to be a differentiable function greatly constrains the types of combination and the nature of the reasoning that can be used.
Our first objective is to identify and reduce biases in learning models, in particular for multimodal tasks such as Visual Question Answering (VQA). This task involves answering complex questions about images and requires visual reasoning skills. Many biases have been identified in this task.
Our second objective is to improve the robustness of deep learning networks, not despite the complexity of the models and data, but thanks to their diversity: diversity in predictions for the models, and diversity in distribution for the data. We analysed deep ensembles, in which multiple models are trained independently and their predictions are then averaged to improve performance. We are also considering setups in which multiple domains are given, and we propose a new invariant regularization that promotes the learning of a causal mechanism consistent across domains.
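As an illustration of the averaging step in a deep ensemble, here is a minimal sketch in plain Python. The logits and the function names (`softmax`, `ensemble_predict`) are hypothetical and chosen only for this example; the sketch does not reproduce the methods of the publications listed below.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(member_logits):
    """Average the class probabilities predicted by each ensemble member."""
    probs = [softmax(logits) for logits in member_logits]
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]

# Hypothetical logits from three independently trained models for one input.
member_logits = [
    [2.0, 0.5, 0.1],
    [0.2, 1.5, 0.3],  # this member disagrees with the other two
    [1.8, 0.4, 0.2],
]

avg = ensemble_predict(member_logits)
predicted_class = max(range(len(avg)), key=avg.__getitem__)
print(avg, predicted_class)
```

The second member disagrees with the other two, yet the averaged distribution still favours the class they agree on. Averaging only helps when members make different errors, which is why diversity in predictions matters.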
Scientific Publications
1. MixMo: Mixing Multiple Inputs for Multiple Outputs via Deep Subnetworks, A Rame, R Sun, M Cord, ICCV 2021
2. DICE: Diversity in Deep Ensembles via Conditional Redundancy Adversarial Estimation, A Rame, M Cord, ICLR 2021
3. Beyond Question-Based Biases: Assessing Multimodal Shortcut Learning in Visual Question Answering, C Dancette, R Cadene, D Teney, M Cord, ICCV 2021
4. Overcoming Statistical Shortcuts for Open-ended Visual Counting, C Dancette, R Cadene, X Chen, M Cord, Visual Question Answering workshop, CVPR 2021
5. DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion, A Douillard, A Ramé, G Couairon, M Cord, CVPR 2022
In the last decade, Machine Learning and Deep Neural Networks (DNNs) have been at the heart of a technological and scientific revolution in Artificial Intelligence (AI). In image classification, DNNs have been the state-of-the-art approach since 2012, when the ImageNet competition was won by a deep neural network for the first time.
In this AI chair, we propose to investigate visual reasoning tasks that go beyond ImageNet classification. This requires building reasoning processes into the visual analysis pipeline. We intend to explore how reasoning blocks can be combined into deep architectures, and to question the types of blocks, structures and rules involved. The main requirement on the structures we consider is that the final hybrid (explicit/implicit reasoning) architecture be end-to-end trainable. Requiring the final DNN to be a differentiable function greatly constrains the types of combination and the nature of the reasoning that can be used.
Our first framework deals with the design of such DNNs, leveraging visual reasoning mechanisms. We propose to investigate different aspects of this problem:
– Developing semi-explicit approaches
– Introducing advanced merging modules for DNNs
– Exploring visual attention mechanisms
Recently, the Computer Vision community has developed a very interesting playground for visual reasoning: the Visual Question Answering (VQA) task. We will test our proposals in different contexts, including VQA. We are also attentive to another recurrent problem in machine learning tasks: biases. VQA datasets often have strong correlations between the question and the answer, so models learn to rely mostly on the content of the question and not enough on the image. We would like to investigate learning approaches that limit the effect of these biases.
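To make the question-answer correlation concrete, the following minimal sketch fits a "blind" baseline that never looks at an image: it answers with the most frequent training answer for each question type. The toy samples and the names `fit_question_prior` and `accuracy` are assumptions made for this illustration; they are not taken from any VQA dataset or from the project's code.

```python
from collections import Counter, defaultdict

def fit_question_prior(samples):
    """Map each question type to its most frequent answer (image ignored)."""
    counts = defaultdict(Counter)
    for question_type, answer in samples:
        counts[question_type][answer] += 1
    return {qt: c.most_common(1)[0][0] for qt, c in counts.items()}

def accuracy(prior, samples):
    """Fraction of samples answered correctly from the question type alone."""
    correct = sum(prior.get(qt) == ans for qt, ans in samples)
    return correct / len(samples)

# Hypothetical (question type, answer) pairs with a skewed answer distribution.
train = [
    ("what color", "white"), ("what color", "white"), ("what color", "red"),
    ("how many", "2"), ("how many", "2"), ("how many", "3"),
    ("is there", "yes"), ("is there", "yes"), ("is there", "no"),
]

prior = fit_question_prior(train)
blind_accuracy = accuracy(prior, train)
print(prior, blind_accuracy)
```

On this toy data the blind baseline answers two thirds of the questions correctly without any visual input. A model trained on such data can exploit the same shortcut, so evaluating on a split where the answer distribution per question type differs from training is one way to expose it.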
Our second framework concerns autonomous driving. We want to focus on building driving systems that can provide a clear explanation of their behavior. Ideally, a model should be able to explain its decisions to users. We are motivated by the explanatory capacities that our reasoning models may exhibit. Visualization is closely related to explainability, so developing new visualization strategies will be a major topic in our research project. In particular, we will consider visualizing the internal processes performed by our deep models. With spatial attention, for example, visualizing the saliency maps can provide a human-understandable signal that explains the behavior of a network. We will consider this type of strategy in the context of autonomous driving in order to develop convincing models of decision explanation. More generally, the explanation can take different forms (textual, visual) and should be understandable by the user. For autonomous driving, understanding the car's decisions is an important factor of trust and transparency.
This is a major step in understanding complex visual processing, and then rethinking or adapting deep architectures accordingly.
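The spatial attention case mentioned above can be sketched as follows: each image region receives a score against a query vector, the scores are normalized with a softmax, and the resulting weights are laid out on the image grid as a saliency map. The region features, the query and the function name `attention_map` are hypothetical values chosen for this illustration, not the project's models.

```python
import math

def attention_map(region_features, query, grid_width):
    """Softmax-normalized dot-product scores, reshaped to rows of the grid."""
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_features]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [weights[i:i + grid_width] for i in range(0, len(weights), grid_width)]

# Hypothetical 2x2 grid of regions, each described by a 2-dimensional feature.
regions = [
    [0.1, 0.0], [0.2, 0.1],
    [1.0, 0.9], [0.0, 0.2],
]
query = [1.0, 1.0]  # assumed encoding of what the model is looking for

saliency = attention_map(regions, query, grid_width=2)
for row in saliency:
    print(["%.2f" % w for w in row])
```

The largest weight falls on the bottom-left region, whose features align best with the query. Overlaying such a map on the input image is one simple way to show a user where the network was attending when it made a decision.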
Project coordination
Matthieu CORD (Laboratoire d'informatique de Paris 6)
The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.
Partner
LIP6 Laboratoire d'informatique de Paris 6
ANR funding: 594,000 euros
Beginning and duration of the scientific project: August 2020 - 48 months