ChairesIA_2019_2 - Research and Teaching Chairs in Artificial Intelligence - wave 2 of the 2019-2020 edition

Intelligent Analysis and Interconnexion of Heterogeneous Contents in Digital Arenas – SourcesSay

Submission summary

Digital data, whether text (news articles), semi-structured (tweets, other social media content) or structured (RDF or CSV files), are produced and shared at very high speed today. As data provides a digital mirror of human activity, in particular democracy and public debate, intelligently exploiting it is of crucial importance. This requires, in particular, algorithms and methods to enable, but also preserve and sanitize, the exchange of information between users and/or institutions. We define an arena as a set of organizations and individuals, together with the structured and unstructured content (data) on a given topic, on which they work or which they exchange.

To check the truthfulness of a statement made by an individual or an organization, one needs to search in the arena's digital content for data which may confirm or deny it, and also to analyze and interpret the context of the statement and of those making it; this requires interconnecting digital data sources.
Further, to support public trust in the result of truthfulness checks, it is of crucial importance to model and preserve in the arena, as a first-class citizen, any data source or fragment thereof. This is because we must be able to immediately show concrete evidence (the source) for each statement and/or piece of extracted information.

To support these goals, we propose to study, develop and deploy Arena Management Systems (AMSs), a new breed of intelligent, learning-based content management systems. Users, such as journalists or citizens interested in a given topic, simply drag-and-drop data sources of any kind (text, tabular, semi-structured such as JSON or RDF) into an AMS, and have them automatically analyzed and integrated as follows.
A graph is built both from the internal structure that data sources may have, and through extraction of entities, relationships, and "weaker signals" in the form of linkable elements, e.g., codes, categories/hashtags, email addresses etc. The interest of linkable elements is to enable interconnections across data sources even when such elements cannot be reliably attached to any entity in a data or knowledge base. Our prior work with journalists from Le Monde has taught us that such low coverage is the norm rather than the exception in their work; for instance, in high-profile investigations such as the Panama Papers or the analysis of Russian interference, connections are made through little-known names of companies, blogger IDs etc. An AMS stores and indexes the graphs, and supports querying by means of keywords (with the semantics of finding interesting trees that connect a given set of keywords), but also through interactive visual exploration.
The AMS goal differs from building a knowledge graph, for two reasons: (1) our practical experience with journalists shows that a lot of value lies in content for which the available information is insufficient (signal too weak) to reliably extract entities and relationships; (2) as a representation and interaction paradigm, we chose to rely on (interconnected) data sources that users identify and can trust, whereas other paradigms (e.g., an extracted graph possibly annotated with probabilities or provenance formulas) are scientifically alluring, but hard to trust or to interpret for the journalist end-users we target. The Le Monde newspaper and WeDoData, a media digital service company, will provide use cases, beta testing and feedback on our research.
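
As an illustration only, the graph construction and keyword search described above could be sketched as follows in Python. This is a minimal sketch under stated assumptions, not the project's implementation: the regular expressions, the function names (build_graph, shortest_path, connecting_edges) and the sample sources are all hypothetical, only two kinds of linkable elements (hashtags and email addresses) are extracted, and the search uses a simple union of shortest paths rather than any ranking of "interesting" trees.

```python
import re
from collections import defaultdict, deque

# Hypothetical patterns for two kinds of linkable elements named in the text.
LINKABLE = {
    "hashtag": re.compile(r"#\w+"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}

def build_graph(sources):
    """Map {source name: text} to an undirected graph (node -> set of neighbours).

    Each source stays a node of its own, so any link can be traced back to it.
    """
    graph = defaultdict(set)
    for name, text in sources.items():
        src = ("source", name)
        graph[src]  # make sure the source exists even if it has no linkable element
        for kind, pattern in LINKABLE.items():
            for match in pattern.findall(text):
                node = (kind, match.lower())
                graph[src].add(node)
                graph[node].add(src)
    return dict(graph)

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes from start to goal, or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(graph[node]):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

def connecting_edges(graph, keywords):
    """Edges linking one matching node per keyword (heuristic, not an optimal tree)."""
    hits = []
    for kw in keywords:
        matches = sorted(n for n in graph if kw.lower() in n[1].lower())
        if not matches:
            return None
        hits.append(matches[0])
    edges = set()
    for target in hits[1:]:
        path = shortest_path(graph, hits[0], target)
        if path is None:
            return None
        edges.update(frozenset(pair) for pair in zip(path, path[1:]))
    return edges

# Made-up sample sources of three different kinds.
sources = {
    "tweet.json": "Leak discussed under #PanamaPapers, contact j.doe@acme-offshore.example.",
    "registry.csv": "Acme Offshore Ltd;j.doe@acme-offshore.example;BVI",
    "article.txt": "Our investigation into #PanamaPapers continues.",
}
graph = build_graph(sources)
edges = connecting_edges(graph, ["registry", "article"])
```

In this toy arena, the registry file and the article share no element directly; they are connected only through the tweet, via an email address on one side and a hashtag on the other. No entity had to be resolved against a knowledge base to make that connection, which is the role the text assigns to linkable elements.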

Realizing the AMS goal crucially relies on, and will result in, classification and learning methods: to extract entities, relationships, and linkable elements; to learn from (a small amount of) user feedback when two nodes or linkable elements should be fused; to tune the heterogeneous graph storage and indexes so that they adapt to the users' interests and data characteristics, in order to provide the most interesting answers for every search as efficiently as possible, and to help surface interesting results that users did not know they wanted.

Ioana Manolescu (Centre de Recherche Inria Saclay - Île-de-France)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Inria Saclay - Île-de-France - CEDAR team, Centre de Recherche Inria Saclay - Île-de-France

ANR funding: 587,980 euros
Beginning and duration of the scientific project: August 2020 - 48 Months
