CE40 - Mathematics, theoretical computer science, automatic control and signal processing

Approximate Bayesian solutions for the interpretation of large datasets and complex models – ABSint

ABSint

This project is motivated by application areas where large datasets and complex models represent a genuine challenge: population genetics and phylogenetic analyses, neuroscience, and astrostatistics.

Our central purpose is to provide a wider and more generic array of statistical tools able to handle "big data" without jeopardising either the depth of the statistical analysis or the precision of the statistical predictions derived from such data. The impacts of the project are two-fold: (1) scientific impacts through the development of sound statistical tools to analyse massive datasets and complex models, and (2) consequences for the applied fields, driving the search for approximate solutions in their ability to analyse more advanced data structures.

ABC: scalability, variance reduction (Rao–Blackwellisation), dependency structures in variational Bayes inference, expectation propagation, dimensionality reduction via random projections and quasi-sufficiency tests, variational approximations for random graph models.

Monte Carlo: deterministic approaches such as quasi-Monte Carlo (QMC), Bayesian quadrature, and functional control variates; variational Bayes; expectation propagation (a minimal QMC sketch follows this list).

Measures of uncertainty: asymptotic properties of posterior distributions in complex and high-dimensional models, the Bernstein–von Mises property, nonparametric Poisson regression, Bayes factors for nonparametrics, sparsity priors, Hawkes processes, hidden Markov models, curve classification.
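
To make the Monte Carlo theme concrete, here is a minimal sketch contrasting plain Monte Carlo with scrambled-Sobol quasi-Monte Carlo on a toy two-dimensional integral; the integrand, sample size, and seeds are our own illustrative choices, not taken from the project.

```python
# Plain Monte Carlo vs quasi-Monte Carlo (scrambled Sobol) on a toy integral.
# Illustrative sketch only; the integrand and sample size are arbitrary choices.
import numpy as np
from scipy.stats import qmc

rng = np.random.default_rng(0)

def f(x):
    # Toy integrand on [0, 1]^2 with known mean: E[exp(x0 + x1)] = (e - 1)^2.
    return np.exp(x.sum(axis=1))

n = 1 << 12  # 4096 points (a power of two, as Sobol sequences prefer)

x_mc = rng.random((n, 2))                                # i.i.d. uniform draws
x_qmc = qmc.Sobol(d=2, scramble=True, seed=0).random(n)  # low-discrepancy draws

truth = (np.e - 1) ** 2
print("MC  error:", abs(f(x_mc).mean() - truth))
print("QMC error:", abs(f(x_qmc).mean() - truth))
```

On smooth integrands like this one, the QMC error typically decays close to O(1/n) rather than the O(1/sqrt(n)) of plain Monte Carlo, which is the motivation behind the deterministic approaches listed above.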

Diffusion of the new version of the DIYABC software, DIYABC-RF (2020).

Organisation of the One World ABC virtual seminars (2020–) and of the ABC in Svalbard virtual workshop (2021).

Chapuis, M.-P., Raynal, L., Plantamp, C., Meynard, C. N., Blondin, L., Marin, J.-M., and Estoup, A. (2020). A young age of subspecific divergence in the desert locust inferred by ABC random forest. Molecular Ecology, 29(23):4542–4558.

Clarté, G., Robert, C. P., Ryder, R. J., and Stoehr, J. (2020). Componentwise approximate Bayesian computation via Gibbs-like steps. Biometrika. To appear.

Collin, F.-D., Durif, G., Raynal, L., Lombaert, E., Gautier, M., Vitalis, R., Marin, J.-M., and Estoup, A. (2020). DIYABC Random Forest v1.0: extending approximate Bayesian computation with supervised machine learning to infer demographic history from genetic polymorphisms. Molecular Ecology Resources. To appear.

Durmus, A., Majewski, S., and Miasojedow, B. (2019). Analysis of Langevin Monte Carlo via convex optimization. The Journal of Machine Learning Research, 20(1):2666–2711.

Liutkus, A., Simsekli, U., Majewski, S., Durmus, A., and Stöter, F.-R. (2019). Sliced-Wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions. In International Conference on Machine Learning, pages 4104–4113. PMLR.

While the 1990s witnessed a tremendous acceleration in the development of powerful computing tools and associated algorithms, primarily thanks to the Markov chain Monte Carlo (MCMC) revolution, the current era of "Big Data" and of complex, parameter-rich models underscores the limits of this paradigm, which has by now become a "traditional" approach. Such limitations stem either from the enormous amount of data to be processed by current models or from the very structure of ever-expanding probabilistic and mechanistic models, for example when they involve too many parameters to allow for inference. Many examples of this difficulty, or even impossibility, of computing statistical procedures and producing feasible statistical inference can be found in biology (genomics, proteomics), in the analysis of networks, and in signal and image processing.

Nonetheless, thanks to these very same tools, Bayesian nonparametric statistics is now an important area of research in statistics and machine learning, and a recognised methodology in applied fields, both for its theoretical developments, with improved convergence guarantees in well-specified and misspecified models alike, and for its methodological contributions. It is clear, however, that the convergence properties associated with such procedures do not extend to a large number of modelling problems and need to be replaced by other structures or procedures.

We have thus reached a turning point for the methodological and algorithmic tools that have made Bayesian analysis particularly successful in many applied fields and that underpin a theoretically valid approach to statistical inference. These tools must adapt or disappear under the present pressure from more rudimentary optimisation tools that manage to offer (partial) snapshots of the model to be estimated in a very short time, much shorter than the production of a standard Bayesian inference. Since we defend the foundational perspective that Bayesian analysis (and statistics as a whole) adds value to machine-learning outputs, by covering both the problem of model selection and the analysis of the uncertainty attached to any inference, this project aims at validating and extending our current tools to overcome this crisis of fundamentals, hence proposing approximate Bayesian methods that have begun to emerge in recent years from specific areas of application such as population genetics.

A first direction of this project thus focuses on approximate Bayesian inference tools, their extensions, their calibration, and their potential validation. The subject must of course be understood in a broad sense covering the specific research areas of the team members, including ABC (approximate Bayesian computation, also known as likelihood-free methods), expectation propagation (EP), and variational Bayes approximations. These techniques share the property of approximating and analysing models whose true likelihood function cannot be evaluated numerically or completed into a manageable model. We aim to unify these methods into a single class, towards the aggregation of multiple nonparametric Bayesian techniques into more efficient approximations, while simultaneously providing a degree of validation for these approximations.
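
As an illustration of the likelihood-free principle these techniques share, here is a minimal sketch of rejection ABC, the simplest member of the family, on a toy Gaussian model; the model, summary statistic, prior, and tolerance are our own illustrative choices, not taken from the project.

```python
# Rejection ABC on a toy Gaussian model: accept a prior draw whenever the
# summary of the simulated data falls within eps of the observed summary.
# Illustrative sketch only; all numerical choices are arbitrary.
import numpy as np

rng = np.random.default_rng(1)

y_obs = rng.normal(loc=2.0, scale=1.0, size=100)  # "observed" data
s_obs = y_obs.mean()                              # summary statistic

def simulate(theta, size=100):
    # Generative model y ~ N(theta, 1): only simulation is required,
    # never an evaluation of the likelihood.
    return rng.normal(loc=theta, scale=1.0, size=size)

eps, accepted = 0.1, []
while len(accepted) < 1000:
    theta = rng.normal(0.0, 10.0)                  # draw from a diffuse prior
    if abs(simulate(theta).mean() - s_obs) < eps:  # compare summaries
        accepted.append(theta)

post = np.array(accepted)
print(f"ABC posterior mean ~ {post.mean():.3f}, sd ~ {post.std():.3f}")
```

Shrinking the tolerance eps brings the ABC output closer to the true posterior at the price of a lower acceptance rate, which is precisely the scalability tension addressed by this direction of the project.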

A second and related theme of this project is the study of the asymptotic properties of posterior distributions in complex, high-dimensional models, towards producing robust Bayesian uncertainty measures such as credible regions. We will study generic approaches in terms of their modelling capabilities and focus more specifically on the two families of sampling problems motivated by the large-scale applications discussed in this project.
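
The Bernstein–von Mises property listed among the themes above is what licenses such credible regions in regular models; the toy check below (our own illustration, not the project's) compares an exact conjugate credible interval with its Gaussian approximation centred at the maximum-likelihood estimate as the sample size grows.

```python
# Bernstein-von Mises in the simplest regular setting: for Binomial(n, p) data
# with a Beta(1, 1) prior, the exact posterior 95% credible interval approaches
# the Gaussian interval centred at the MLE as n grows. Illustrative sketch only.
import numpy as np
from scipy import stats

for n in (20, 200, 2000):
    k = int(0.3 * n)                        # data with empirical frequency 0.3
    post = stats.beta(1 + k, 1 + n - k)     # exact conjugate posterior
    exact = post.ppf([0.025, 0.975])        # exact 95% credible interval
    p_hat = k / n                           # maximum-likelihood estimate
    se = np.sqrt(p_hat * (1 - p_hat) / n)   # asymptotic posterior scale
    approx = stats.norm(p_hat, se).ppf([0.025, 0.975])
    print(f"n={n:5d}  exact={exact.round(4)}  gaussian={approx.round(4)}")
```

In the complex, high-dimensional models targeted by the project, this agreement can fail, which is why robust replacements for such Gaussian-based uncertainty measures are needed.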

Project coordination

Christian Robert (Centre de recherches en mathématiques de la décision)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Partner

University of Oxford / Statistics
CEREMADE Centre de recherches en mathématiques de la décision
CMAP Centre de Mathématiques Appliquées
IMAG Institut Montpelliérain Alexander Grothendieck

ANR grant: 345,150 euros
Beginning and duration of the scientific project: December 2018 - 48 months
