Annotation of visual data with semantic descriptions – ViSen
Today, a typical Web document contains a mix of visual and textual content. Most
traditional search and retrieval tools handle textual content well, but are not
prepared to handle heterogeneous documents. This new type of content demands the
development of new, efficient tools for search and retrieval.
The Visual Sense (ViSen) project aims to automatically mine the semantic content of visual
data to enable “machine reading” of images. In recent years, we have witnessed significant
advances in automatic visual concept recognition (VCR). These advances allowed
for the creation of systems that can automatically generate keyword-based image
annotations. The goal of this project is to move a step forward and predict semantic image
representations that can be used to generate more informative sentence-based image
annotations, thereby facilitating search and browsing of large multi-modal collections. More
specifically, the project targets three case studies, namely image annotation, re-ranking for
image search, and automatic image illustration of articles. It will address the following key
open research challenges:
1. To develop methods that can predict a semantic representation of visual content. This
representation will go beyond the detection of objects and scenes and will also recognize a
wide range of object relations.
2. To extend state-of-the-art natural language techniques to the tasks of mining large
collections of multi-modal documents and generating image captions using both semantic
representations of visual content and object/scene type models derived from semantic
representations of the multi-modal documents.
3. To develop learning algorithms that can exploit available multi-modal data to discover
mappings between visual and textual content. These algorithms should be able to leverage
‘weakly’ annotated data and be robust to large amounts of noise.
For this purpose, the project will build on expertise from multiple disciplines, including
computer vision, machine learning and natural language processing (NLP). It gathers four
research groups from the University of Surrey (Surrey, UK), the Institut de Robòtica i
Informàtica Industrial (IRI, Spain), the Ecole Centrale de Lyon (ECL, France), and the
University of Sheffield (Sheffield, UK), each with well-established and complementary
expertise in its area of research.
Project coordination
Krystian MIKOLAJCZYK (University of Surrey/Department of Electronic Engineering) – eranet_K.Mikolajczyk@surrey.ac.uk
The project coordinator is the author of this summary and is responsible for its content. The ANR declines any responsibility for its contents.
Partners
IRII Institut de Robòtica i Informàtica Industrial
UoSu University of Surrey/Department of Electronic Engineering
UoSh University of Sheffield, Department of Computer Science
ECL LIRIS Ecole Centrale de Lyon, Laboratoire d'InfoRmatique en Image et Systèmes d'information
ANR funding: 296,475 euros
Beginning and duration of the scientific project: December 2012 - 42 months