DS0708 - Big data, knowledge, decision, high-performance computing and numerical simulation

Large-scale machine learning and applications – MACARON

Submission summary

Statistical modeling requires representing measurements of a physical phenomenon as computationally manageable data before learning a model that fits the observations. Recently, models involving a very large number of parameters have achieved significant success on difficult prediction tasks, notably in computer vision with the emergence of so-called deep learning techniques. It is therefore appealing to use more general huge-dimensional models to tackle scientific and technological problems, but such approaches raise new methodological challenges: (i) exploiting the prediction capabilities of huge-dimensional models usually requires large amounts of training data, and computational techniques that scale in both model size and data size remain to be developed; (ii) huge-dimensional models are hard to visualize and interpret, which is problematic whenever understanding them matters, e.g., in experimental sciences.

The MACARON project is an endeavor to develop new mathematical and algorithmic tools for addressing the above challenges. Our ultimate goal is to use data for solving scientific problems and to automatically convert data into scientific knowledge by using machine learning techniques. The project therefore has two axes: a methodological one, and an applied one driven by concrete problems. The methodological axis addresses the limitations of current machine learning techniques for dealing simultaneously with large-scale data and huge models. The applied axis addresses open scientific problems in bioinformatics, computer vision, image processing, and neuroscience, where massive amounts of data are currently produced and where huge-dimensional models raise similar computational problems. Our project involves a multidisciplinary team with experts from all of these fields, making it possible to develop machine learning techniques with concrete scientific and technological impact.

In the methodological axis, we propose to explore new directions in machine learning that jointly leverage two principles: (i) stochastic optimization, which is now classical for dealing with large amounts of training data; (ii) sparse estimation in structured model parameter spaces. Sparsity of a problem solution is indeed a crucial asset for interpreting huge-dimensional models, but it can also be exploited to obtain fast algorithms when the parameter space enjoys a particular structure, e.g., hierarchical or low-rank. In the context of our project, the structure will be either learned from data or designed for fast feature selection.
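To make the combination of these two principles concrete, the sketch below shows a stochastic proximal gradient method for a least-squares objective with an l1 penalty, where the soft-thresholding step is the proximal operator that enforces sparsity. This is only a minimal illustration on synthetic data with hypothetical dimensions and step sizes, not the project's actual algorithms or structured penalties.

```python
import numpy as np

# Minimal sketch: stochastic proximal gradient for
#   min_w (1/2n) ||Xw - y||^2 + lam * ||w||_1
# Data, dimensions, and hyperparameters are hypothetical.

def soft_threshold(w, t):
    """Proximal operator of the l1 norm (sets small coefficients to zero)."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def stochastic_proximal_gradient(X, y, lam=0.1, step=0.01, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            # Stochastic gradient of the smooth part, computed on one example.
            grad = (X[i] @ w - y[i]) * X[i]
            # Gradient step followed by the proximal (soft-thresholding) step.
            w = soft_threshold(w - step * grad, step * lam)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 100))
    w_true = np.zeros(100)
    w_true[:5] = 1.0                      # sparse ground-truth parameters
    y = X @ w_true + 0.01 * rng.standard_normal(500)
    w_hat = stochastic_proximal_gradient(X, y)
    print("non-zeros recovered:", np.count_nonzero(np.abs(w_hat) > 1e-3))
```

In structured variants of this scheme, the soft-thresholding step would be replaced by the proximal operator of a structured (e.g., hierarchical or low-rank) penalty, which is where the structure of the parameter space enters the algorithm.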

In the applied axis, we have identified several scientific problems that can benefit from the new techniques developed along the lines of the first axis. We have already obtained promising preliminary results in bioinformatics for next-generation DNA/RNA sequencing, where we perform feature selection in a model of exponential size. We will pursue our efforts in this direction and also address the problems of genotype imputation and haplotype phasing, which can exploit fast matrix completion techniques (see the sketch below). We will also apply new huge-dimensional image models to computer vision, image processing, and neuroscience of the visual cortex. Even though these three fields may seem far apart at first sight, they give rise to prediction tasks that are highly related. For instance, given an input image, we may be interested in predicting its content (computer vision), restoring it (image processing), or predicting neuronal activity when the image is shown to a subject (neuroscience). State-of-the-art predictive models in these three fields all rely on underlying image models. Richer, higher-dimensional models should allow us to achieve better prediction performance and, equally importantly, a better understanding of the data we are analyzing.
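As a minimal illustration of the matrix completion ingredient mentioned above, the sketch below fits a low-rank factorization to the observed entries of a matrix by stochastic gradient descent and uses it to fill in the missing entries. It runs on synthetic data with hypothetical sizes and hyperparameters; it is a generic sketch of the technique, not the project's imputation or phasing pipeline.

```python
import numpy as np

# Minimal sketch: low-rank matrix completion by SGD on observed entries,
#   min_{U,V} sum_{(i,j) observed} (M_ij - U_i . V_j)^2 + reg * (|U_i|^2 + |V_j|^2)
# Synthetic data and hypothetical sizes; not the project's actual pipeline.

def complete_matrix(M, mask, rank=5, step=0.02, reg=0.01, epochs=30, seed=0):
    rng = np.random.default_rng(seed)
    n, m = M.shape
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((m, rank))
    obs = np.argwhere(mask)                     # indices of observed entries
    for _ in range(epochs):
        rng.shuffle(obs)                        # visit observed entries in random order
        for i, j in obs:
            ui = U[i].copy()
            err = M[i, j] - ui @ V[j]
            # SGD update of both factors for this observed entry.
            U[i] += step * (err * V[j] - reg * ui)
            V[j] += step * (err * ui - reg * V[j])
    return U @ V.T                              # completed (imputed) matrix

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    true = rng.standard_normal((200, 4)) @ rng.standard_normal((4, 150))
    mask = rng.random(true.shape) < 0.3         # 30% of entries observed
    M = np.where(mask, true, 0.0)
    est = complete_matrix(M, mask)
    rmse = np.sqrt(np.mean((est[~mask] - true[~mask]) ** 2))
    print(f"RMSE on unobserved entries: {rmse:.3f}")
```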

Project coordination

Julien Mairal (Centre de Recherche Inria Grenoble Rhône-Alpes)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Partner

Centre de Recherche Inria Grenoble Rhône-Alpes

ANR grant: 349,979 euros
Beginning and duration of the scientific project: September 2014 - 42 months
