CE23 - Data, Knowledge, Big Data, Multimedia Content, Artificial Intelligence 2018

Taming the Beast of the Preimage in Machine Learning for Structured Data: Signal, Image and Graph – APi

Submission summary

The rise in prominence of Machine Learning (ML) has undoubtedly been driven by kernel machines and the revival of deep neural networks. The cornerstone of these machines is the preprocessing of the data with a (cascade of) nonlinear transformation(s), which embeds them into a feature/latent space where data-processing techniques can be easily carried out. Beyond yielding nonlinear models that outperform linear ones in supervised classification tasks, the nonlinear embedding is unavoidable in numerous application areas. This is the case when dealing with discrete structured spaces, such as in chemoinformatics and bioinformatics, where molecules are represented by strings and graphs, and in signal processing, where data are irregularly sampled time series. In all these settings, since the Euclidean metric is inappropriate, data need to be embedded into a suitable space before Euclidean-defined techniques can be applied.

While the data embedding is essential in ML, the inverse embedding is of great importance in many pattern recognition and data mining problems. Indeed, one often needs to extract patterns in the data space, not in the implicit feature space. For instance, the barycenter of a set of graphs must be estimated in the graph space, not in the embedding space. This challenging inverse embedding is known as the preimage problem. Estimating the preimage is a hard, ill-posed optimization problem, due to the nonlinear and often implicit embedding. It is even harder when dealing with discrete structured spaces, as in all the aforementioned domains (time series, strings and graph data).
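To make the preimage problem concrete, the following is a minimal sketch for the simplest setting only: vectors in a Euclidean space with a Gaussian kernel. It uses the classical fixed-point iteration for Gaussian-kernel preimages to look for a point z whose feature-space image approximates the feature-space mean of a few points. The function names, the data, the bandwidth and the starting point are all hypothetical and chosen for illustration; this is not the method proposed by the project, and it does not carry over to strings or graphs, which is precisely the harder case discussed above.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def preimage_fixed_point(points, alphas, sigma, z0, n_iter=100, tol=1e-10):
    """Approximate a preimage of psi = sum_i alphas[i] * phi(points[i]).

    Fixed-point update for a Gaussian kernel:
        z <- sum_i alpha_i k(x_i, z) x_i / sum_i alpha_i k(x_i, z)
    The result is a local solution that depends on the starting point z0.
    """
    z = list(z0)
    for _ in range(n_iter):
        weights = [a * gaussian_kernel(x, z, sigma)
                   for a, x in zip(alphas, points)]
        total = sum(weights)
        if total == 0.0:
            break  # update undefined; keep the current estimate
        z_new = [sum(w * x[d] for w, x in zip(weights, points)) / total
                 for d in range(len(z))]
        shift = max(abs(a - b) for a, b in zip(z_new, z))
        z = z_new
        if shift < tol:
            break
    return z


# Hypothetical toy data: four corners of a square, equal weights, so the
# target is the mean of the embedded points in the feature space.
points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
alphas = [0.25, 0.25, 0.25, 0.25]

z = preimage_fixed_point(points, alphas, sigma=2.0, z0=(0.5, 0.3))
print(z)  # by symmetry, the estimate should lie close to the center (1, 1)
```

Even in this benign case the solution is obtained iteratively and only locally; with a smaller bandwidth or a different starting point the iteration may settle elsewhere, which illustrates the ill-posed nature of the problem.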

This project aims to unlock the potential of ML for unsupervised learning, by addressing the fundamental issue of the preimage in all its forms in kernel machines and deep learning. To this end, four clearly identified attack points will be investigated:
- Establish a novel class of ML algorithms that do not suffer from the curse of the preimage, by investigating a paradigm shift in nonlinear ML thanks to closed-form solutions using probabilistic models or joint optimization strategies.
- Set up pattern recognition with compact representation and metric learning for time series, and more generally temporal data, by examining recurrent and hierarchical neural networks, temporal kernel machines, and Siamese networks for temporal metric learning, in the light of the preimage problem.
- Explore pattern recognition on discrete structured spaces, namely string and graph data, with an emphasis on the task of synthesizing a molecule from a set of molecules represented by graphs, as in bioinformatics and chemoinformatics (e.g. drug design).
- Associate the preimage problem in kernel machines with two classes of neural networks, autoencoders and the emerging generative adversarial networks (GANs), in order to provide deeper insights into their underlying functioning, and to stimulate the development and exchange of ideas for designing new architectures.

While the major contributions will essentially be theoretical in ML, the project will tackle several application domains of interest to other scientific communities, in particular signal processing, bioinformatics and chemoinformatics. Moreover, since the preimage problem is pervasive in the multi-disciplinary ML field, this project also seeks to broaden the scope of application to new areas, such as multimodal/heterogeneous data fusion.

Each of these working directions will investigate both kernel machines and deep learning. By providing a cross-fertilization of ideas, this project is expected to bridge the gap in unsupervised learning between these two classes of ML algorithms. In order to address all these challenging points, it will create a synergy between renowned research teams. It is worth noting that the consortium members have recently carried out, separately, some encouraging preliminary studies showing the relevance and importance of these working directions.

Paul HONEINE (LABORATOIRE D'INFORMATIQUE, DE TRAITEMENT DE L'INFORMATION ET DES SYSTÈMES - EA 4108)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

LTCI Laboratoire Traitement et Communication de l'Information
LITIS LABORATOIRE D'INFORMATIQUE, DE TRAITEMENT DE L'INFORMATION ET DES SYSTÈMES - EA 4108
LIG Laboratoire d'Informatique de Grenoble

ANR funding: 527,443 euros
Beginning and duration of the scientific project: - 48 Months
