ChairesIA_2019_2 - Research and Teaching Chairs in Artificial Intelligence - wave 2 of the 2019 edition

Learning data integration, from discrete entities to signals – LearnI

Submission summary

With data science, machine learning is changing how decisions are made in many fields, such as health and business. However, the bottleneck is often not the statistical analysis but the combination of data of different natures or from different sources. Indeed, data integration still relies heavily on human intervention. Data of different natures call for relational-database techniques to represent and transform them via common entities. These representations and operations are inherently discrete, which makes statistical learning challenging: symbols cannot easily express shared information, e.g. to model ambiguity, and optimizing logic rules applied to symbolic data leads to intractable combinatorial problems. Recent successes in text processing have shown that when symbols are modeled as vectors, machine learning, and deep learning in particular, can capture complex information.

A new approach to data integration can replace human curation with machine learning: substituting continuous formulations for the discrete representations and operations used on heterogeneous relational data makes it possible to optimize data assembly. This requires new statistical-learning architectures.

Our challenge is to represent data of vastly different natures in the same metric space, and yet transform them differently. For this, we will:
1) use statistical regularities of databases to embed symbolic entries in vector spaces
2) create statistical-learning models that assemble different data sources with continuous transformations
3) enable transfer learning across databases with related application domains
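
As an illustration of step 1, here is a minimal sketch of one way to turn symbolic entries into vectors using only the statistical regularities of a table. It is not the project's method: the toy table, the function name `embed_entries`, and the choice of positive pointwise mutual information followed by a truncated SVD are all assumptions made for the example.

```python
import numpy as np

# Toy table (hypothetical data): each tuple holds the symbolic entries of one record.
rows = [
    ("Paris", "France", "EUR"),
    ("Lyon", "France", "EUR"),
    ("Berlin", "Germany", "EUR"),
    ("Munich", "Germany", "EUR"),
    ("London", "UK", "GBP"),
    ("Leeds", "UK", "GBP"),
]


def embed_entries(rows, dim=2):
    """Embed each distinct entry from its row-level co-occurrence counts."""
    vocab = sorted({entry for row in rows for entry in row})
    index = {entry: i for i, entry in enumerate(vocab)}

    # Count how often two distinct entries appear in the same record.
    counts = np.zeros((len(vocab), len(vocab)))
    for row in rows:
        for a in row:
            for b in row:
                if a != b:
                    counts[index[a], index[b]] += 1

    # Positive pointwise mutual information: keep only associations
    # that are stronger than expected under independence.
    total = counts.sum()
    row_sum = counts.sum(axis=1, keepdims=True)
    col_sum = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row_sum * col_sum))
    ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

    # Truncated SVD gives a low-dimensional vector per entry.
    u, s, _ = np.linalg.svd(ppmi)
    vectors = u[:, :dim] * np.sqrt(s[:dim])
    return vocab, vectors


vocab, vectors = embed_entries(rows)
for entry, vector in zip(vocab, vectors):
    print(f"{entry:8s}", np.round(vector, 3))
```

In this toy table, "Paris" and "Lyon" appear in exactly the same contexts, so they receive the same vector. This is the kind of shared information that a purely symbolic representation cannot express.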

Automating data integration will boost data-science applications, as many data sources are currently untapped because of the cost of integrating them. This is particularly true for reusing open data or surveying public health.

For this purpose, we will adapt statistical-learning tools that extract and transform continuous representations so that they apply to relational data. These representations will capture relations between entries as well as values, such as numerical attributes, building "neural language models" of the local structure of a database. Alignment across databases will be treated as a domain-adaptation problem, using distribution-matching tools. Building a supervised-learning model from these representations will require strong non-linearities, such as gating mechanisms, to distinguish data of different natures represented in the same vector space. To facilitate transfer learning, we will focus on representations, or transformations, of the data that can easily be reused in many settings, like the "transformer" architecture that recently boosted natural language processing.
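
To make "distribution matching" concrete, the sketch below computes a (biased) estimate of the squared maximum mean discrepancy (MMD) between two sets of vectors. MMD is one standard distribution-matching statistic; the summary does not say which tool the project will use, so this choice, the RBF kernel, the `gamma` value, and the synthetic data are assumptions made for the example.

```python
import numpy as np


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel matrix between the rows of x and the rows of y."""
    sq_dist = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def mmd2(x, y, gamma=1.0):
    """Biased estimate of the squared maximum mean discrepancy."""
    return (
        rbf_kernel(x, x, gamma).mean()
        + rbf_kernel(y, y, gamma).mean()
        - 2 * rbf_kernel(x, y, gamma).mean()
    )


# Synthetic stand-ins for entry embeddings from three databases.
rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(200, 3))
target_close = rng.normal(0.0, 1.0, size=(200, 3))    # same distribution
target_shifted = rng.normal(2.0, 1.0, size=(200, 3))  # shifted distribution

print("same distribution:   ", mmd2(source, target_close))
print("shifted distribution:", mmd2(source, target_shifted))
```

A domain-adaptation method could use such a statistic as a loss term, learning a transformation of one database's representations that brings it closer to another's.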

To provide the very large datasets needed to learn good representations, we will crawl public data. The resulting representations will capture general knowledge and help assemble related data. We will focus on applications to public health, for instance in epidemiological settings. Easier data integration will make it possible to increase the sample sizes of studies by assembling data across sites, such as different hospitals. It will also enable data augmentation, tapping into different sources of information, such as pollution data, to model more risk factors or potential confounders.

We will strive to share the progress made not only as academic papers, but also as tutorials accessible to data scientists outside academia and within high-quality open-source software.

Project coordination

Gael Varoquaux (Centre de Recherche Inria Saclay - Île-de-France)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Partner

Inria Saclay - Ile-de-France - PARIETAL team, Centre de Recherche Inria Saclay - Île-de-France

ANR funding: 489,608 euros
Start and duration of the scientific project: August 2020 - 48 months
