DS0707 - Human-machine interaction, connected objects, digital content, big data and knowledge

Learning highly complex visual problems with deep structured models – DEEPVISION

Submission summary

Video data are ubiquitous. With the explosion of embedded devices and sensors, footage is collected from a variety of sources ranging from surveillance cameras to cell-phone cameras to wearable devices. However, video’s usefulness is limited by our ability to interpret and make decisions from it. Fundamental advances in the state of automated computer vision analysis of video data are of paramount importance to render this data useful.

Humans are the dominant subject in nearly all video. Improved algorithms for interpreting their behaviour would allow video data to impact a variety of industrially relevant application domains. Video can be used to understand how people use public spaces, leading to better design. Healthcare delivery and patient outcomes can be improved by video monitoring. First-person (egocentric) video can be used to augment people’s interaction with the world, either in the workplace or at home. Understanding human behaviour will also lead to improved human-computer and human-robot interaction.

Advances in computer vision have made possible applications that seemed almost impossible a few years ago. Real-time automatic gesture recognition is now a frequent sight in millions of homes worldwide, where people interact with gaming consoles in difficult environments such as dimly lit rooms. Automatic face detection is now a standard feature in digital cameras, as is face recognition in certain brands of smartphones.

Machine learning is a major driving force behind this development. The vast amount of visual input data, as well as the inherently large variations in it due to different viewpoints, shapes, acquisition conditions, etc., make automatic learning of predictive models an appealing approach compared to the engineering of hand-crafted models. In particular, the availability of vast amounts of labeled or unlabeled training data, combined with the development of new computational resources (GPUs), has led to an unexpected jump in the performance of methods based on machine learning. The automatic learning of multi-layer representations from large amounts of data, the goal of the field of “deep learning”, has now emerged as a major force in computer vision.

In spite of these advances, the current state of the art in deep learning of representations suffers from several limitations. Current results suggest that reasonably large variations in input data can be learned for important applications like object recognition, gesture recognition, and the classification of short videos. However, the very large variations inherent in many realistic situations are currently out of reach. This concerns inherently structural relationships in the data, such as person-person interactions, person-object interactions, long-running dynamical behavior in video, and deformable and articulated objects. These properties are generally better handled using a family of methods called “structured models”, often based on graphical representations. Although better suited for capturing structural relationships, these models are less suited for automatic learning, making it difficult to extract rich information from large amounts of data.

In this project, we propose to create deep structured models, which combine the advantages of both families. Based on deep learning, they will be capable of learning rich representations from training data, while keeping the ability of structured models to handle complex relationships.

We propose a research program that:
- Is led by a team of the top video analysis researchers in France and Canada.
- Delivers a paradigm-shifting change in vision-based activity recognition.
- Fosters international collaboration through student and postdoc exchanges, workshops and distributed development practices.

Project coordination

Christian Wolf (INSA-Lyon, Laboratoire d'Informatique en Images et Systèmes d'information)

The project coordinator is the author of this summary and is responsible for its content. The ANR declines any responsibility for its contents.


UOG: University of Guelph
SFU: Simon Fraser University
INSA-Lyon/LIRIS: INSA-Lyon, Laboratoire d'Informatique en Images et Systèmes d'information

ANR funding: 447,920 euros
Start and duration of the scientific project: August 2016, 36 months
