We propose to tackle the problem of analyzing ambiguous visual and textual content by learning and combining their representations and by taking into account existing knowledge about entities. We aim not only to disambiguate one modality by using the other appropriately, but also to disambiguate both jointly by representing them in a common space. The full set of contributions proposed in the project will then be used to solve a new task that we define, namely Multimedia Question Answering (MQA).
In the project, we consider three types of modality: (1) the visual modality, extracted from the pixels of the images; (2) the textual modality, extracted from the natural-language questions, the captions and other textual content that is “near” an image, and the textual documents used to populate a knowledge database about entities; and (3) by a slight misnomer that will ease understanding in the following, the structural modality, which reflects the links identified between entities and recorded in the knowledge database. MEERQAT thus aims at answering a question composed of textual and visual modalities by relying on a knowledge database that contains information on the visual, textual and structural modalities. One of the main challenges consists in relating the three modalities, which do not lie in the same feature space, in order to combine and then compare them. An objective of the MEERQAT project is thus to build a common space for these three modalities.
A weak fusion of modalities can quite easily be considered at a late level. This approach is quite classical in information retrieval, whether textual or visual; an alternative is to rely mainly on one modality and to filter or re-rank the results using the others. A more challenging objective is to infer the similarity between pieces of information from different modalities at an early stage, possibly at the representation level. An ultimate goal would be to relate information independently of the modality used to express it.
In that sense, pushing the frontier from a late fusion of modalities toward an earlier fusion contributes to such a goal, by proposing a new knowledge representation framework that better matches the actual “sense” of queries and entities.
Beyond an efficient combination of the considered modalities to infer the meaning of a question, the performance of a system addressing the MQA task strongly depends on its ability to adequately represent the data extracted in each modality. However, since we ultimately aim at mixing the modalities, it makes sense to include some information coming from the other modalities at an early stage.
The scientific objective of the project is to learn a representation and an appropriate combination of the three modalities (visual, textual, structural) in a common space that will make it possible to disambiguate the visual and textual content, in order to tackle the multimedia question answering task we propose. To this end, several innovative contributions are considered:
• tackling visual and textual entity ambiguities, by proposing new embeddings of each modality in relation with the knowledge base (KB) (WP2-3) but also by combining them in a common space (WP4);
• developing fundamental work on the definition of an entity, in particular with regard to the content to consider to represent it, depending on its type: person, place, organization... (WP4);
• improving the recognition of textual entities using a KB enriched with visual data (WP3-4);
• studying the relation between visual and textual modalities to enable ambiguity resolution and a better understanding of a multimedia input (WP2-4);
• defining a new scientific task, Multimedia Question Answering (MQA), that relies on a KB containing millions of multimedia entities (WP5), and releasing a related public benchmark.
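As an illustration only, the two weak-fusion strategies mentioned above (late fusion of scores, and re-ranking the results of one modality with another) can be sketched in Python as follows. The function names, the entity labels and the scores are hypothetical and are not taken from the project; the weighted sum is one common way to fuse scores, assumed here for simplicity.

```python
def late_fusion(text_scores, image_scores, weight_text=0.5):
    """Late fusion: weighted sum of per-entity scores computed independently
    in each modality. An entity missing from one modality scores 0.0 there."""
    entities = set(text_scores) | set(image_scores)
    fused = {
        e: weight_text * text_scores.get(e, 0.0)
        + (1.0 - weight_text) * image_scores.get(e, 0.0)
        for e in entities
    }
    # Best fused score first; ties are broken alphabetically for determinism.
    return sorted(fused, key=lambda e: (-fused[e], e))


def rerank(primary_ranking, secondary_scores, top_k=10):
    """Re-ranking: keep the top_k candidates found with one modality and
    reorder only those candidates using the scores of another modality."""
    head = primary_ranking[:top_k]
    reranked = sorted(head, key=lambda e: -secondary_scores.get(e, 0.0))
    return reranked + primary_ranking[top_k:]


# Hypothetical scores for the ambiguous mention "Paris" next to a portrait photo.
text_scores = {"Paris (city)": 0.6, "Paris Hilton": 0.5, "Paris (Texas)": 0.2}
image_scores = {"Paris (city)": 0.1, "Paris Hilton": 0.9}

print(late_fusion(text_scores, image_scores))
print(rerank(["Paris (city)", "Paris Hilton", "Paris (Texas)"], image_scores, top_k=2))
```

In both cases the modalities are only combined after each has been scored separately, which is what distinguishes these approaches from the early fusion in a common space targeted by the project.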
to be developed in the project
We propose to tackle the problem of ambiguities in visual and textual content by learning and then combining their representations. As a final use case, we propose to solve a new scientific task, namely Multimedia Question Answering, which requires relying on three different sources of information to answer a (textual) question with regard to visual data as well as an external knowledge base containing millions of unique entities, each being represented by textual and visual content as well as some links to other entities. In practice, we focus on four types of entities, namely persons, organisations (companies, NGOs, intergovernmental organisations...), geographical points of interest (tourist sites, remarkable buildings...) and objects (commercial products...). Achieving such an objective requires progress on the disambiguation of each modality with respect to the other and to the knowledge base. We also propose to merge the representations into a common tri-modal space. An important part of the work will deal with the representation of a particular entity in this common space, in which one should determine the content to associate with an entity to adequately represent it with regard to its type (person, object, organisation, place). Since such an entity can be associated with several vectors, each corresponding to a piece of data that may originally come from a different modality, the challenge consists in defining a representation that is quite compact (for performance) while still expressive enough to reflect the potential links of the entity with a variety of other ones. The project has a potential economic impact in the field of data intelligence, including applications in marketing, security, tourism and cultural heritage. If successful, the output of the MEERQAT project could directly contribute to improving chatbots.
During the project, the direct output will be mainly academic, that is, scientific articles together with the corresponding material to reproduce the experiments. We also plan to release a new benchmark for the proposed task, in the context of an international evaluation campaign.
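To make the entity representation challenge described above more concrete, here is a minimal sketch of one very simple baseline: an entity associated with several vectors is summarized by the normalized mean of those vectors. This is an assumption made for illustration, not the representation developed in the project; the helper names are hypothetical, and the sketch assumes that all vectors already lie in the same common space and have the same dimension.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean norm (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def pool_entity(vectors):
    """Compact representation of an entity: the mean of its unit-normalized
    vectors (e.g. one per image or text snippet), normalized again."""
    if not vectors:
        raise ValueError("an entity needs at least one vector")
    unit = [l2_normalize(v) for v in vectors]
    dim = len(unit[0])
    mean = [sum(v[i] for v in unit) / len(unit) for i in range(dim)]
    return l2_normalize(mean)


def cosine(u, v):
    """Cosine similarity between two vectors of the common space."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))


# Hypothetical entity described by one "visual" and one "textual" vector.
entity = pool_entity([[1.0, 0.0], [0.0, 1.0]])
query = [1.0, 1.0]
print(cosine(entity, query))
```

A single pooled vector is compact but loses detail when the vectors of an entity are very different from one another, which illustrates the trade-off between compactness and expressiveness raised above.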
Mr. Hervé Le Borgne (Laboratoire d'Intégration des Systèmes et des Technologies)
The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.
LIST Laboratoire d'Intégration des Systèmes et des Technologies
IRIT Institut de Recherche en Informatique de Toulouse
Inria Rennes Bretagne - Atlantique Centre de Recherche Inria Rennes - Bretagne Atlantique
LIMSI Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur
ANR funding: 674,270 euros
Beginning and duration of the scientific project: March 2020 - 42 Months