Blanc SIMI 4 - Physics of condensed and dilute matter (Physique des milieux condensés et dilués)

From sequences to structure: statistical-physics methods to infer co-evolutionary constraints in proteins and RNAs – COEVSTAT

Submission summary

In the course of evolution, the structure and function of biomolecules such as proteins and RNA are remarkably conserved, whereas amino-acid or nucleotide sequences vary strongly between homologous, i.e. evolutionarily related, molecules. Structural conservation nevertheless constrains sequence variability and forces different residues (amino acids or nucleotides) to co-evolve: residues that are close in the three-dimensional structure, but possibly distant along the sequence, typically evolve in a correlated way. Thanks to the recent emergence of fast and cheap sequencing technologies, genomic databases are growing exponentially. An important challenge is to use these data, and the empirically observed variability in homologous protein and RNA families, to infer co-evolutionary constraints and subsequently gain insight into biomolecular structure and function from sequence information alone.
In a maximum-entropy formulation, this inference task becomes an inverse problem of statistical physics: starting from the observed correlations between the variables of a Q-state Potts model (Q=5 for nucleotides and Q=21 for amino acids, including alignment gaps), one has to infer the model itself, i.e. the coupling parameters and the local fields appearing in the system's Hamiltonian. This problem, which is intrinsically harder than the direct problem (calculating thermodynamic observables for a given model), has recently attracted much attention in statistical physics and computer science, and has developed into a vibrant research field in its own right.
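As an illustration of the input to this inverse problem, the following minimal sketch computes the empirical single-site frequencies, pairwise frequencies and connected correlations of a toy alignment. The function names, the data layout and the toy sequences are assumptions made for this example and do not come from the project; real analyses typically also reweight sequences and regularize the counts, which is omitted here.

```python
from collections import Counter
from itertools import combinations


def empirical_statistics(msa):
    """Return single-site and pairwise frequencies of aligned sequences.

    msa: list of equal-length strings (hypothetical toy alignment format).
    f1[i][a] is the frequency of symbol a in column i;
    f2[(i, j)][(a, b)] is the joint frequency of (a, b) in columns i < j.
    """
    n_seq = len(msa)
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all sequences must have the same aligned length")

    f1 = [Counter() for _ in range(length)]
    for seq in msa:
        for i, symbol in enumerate(seq):
            f1[i][symbol] += 1.0 / n_seq

    f2 = {}
    for i, j in combinations(range(length), 2):
        counts = Counter()
        for seq in msa:
            counts[(seq[i], seq[j])] += 1.0 / n_seq
        f2[(i, j)] = counts
    return f1, f2


def connected_correlation(f1, f2, i, j, a, b):
    """C_ij(a, b) = f_ij(a, b) - f_i(a) * f_j(b), for columns i < j."""
    return f2[(i, j)][(a, b)] - f1[i][a] * f1[j][b]


# Toy RNA-like alignment: columns 0 and 2 co-vary, column 1 is conserved.
toy_msa = ["GAC", "GAC", "CAG", "CAG"]
f1, f2 = empirical_statistics(toy_msa)
c_02 = connected_correlation(f1, f2, 0, 2, "G", "C")  # co-varying pair
c_01 = connected_correlation(f1, f2, 0, 1, "G", "A")  # conserved column
```

In this toy case the co-varying columns give a non-zero connected correlation, while a pair involving the fully conserved column gives zero. Inferring couplings and fields that reproduce such statistics is the inverse step, which this sketch does not attempt.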
The main objective of this proposal is to exploit this formal equivalence and to bring the methodological wealth of the modern statistical physics of disordered systems to bear on biological inference, going substantially beyond current ideas based on simple mean-field approximations. We aim to develop computationally efficient and highly accurate methods to infer co-evolutionary constraints. We will use them to reconstruct contact maps for proteins on a large scale (the more than 4,000 known protein families that contain enough sequences for statistical inference); these contact maps will subsequently be used to predict tertiary (three-dimensional) protein structures. We will also use the developed techniques to study co-evolutionary constraints in RNA, aiming to go beyond Watson-Crick base pairing in RNA secondary structure.
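To make the step from inferred couplings to a contact map concrete, here is a minimal sketch that ranks residue pairs by the strength of their coupling block. It assumes the couplings have already been inferred by some method. The Frobenius-norm score, the minimum sequence separation and all names are illustrative assumptions, not the scoring used in the project.

```python
import math


def frobenius_score(block):
    """Frobenius norm of one Q x Q coupling block J_ij(a, b)."""
    return math.sqrt(sum(x * x for row in block for x in row))


def rank_contacts(couplings, min_separation=4):
    """Sort residue pairs by coupling strength, strongest first.

    couplings: dict mapping (i, j) with i < j to a Q x Q list of lists
               (hypothetical container for already-inferred couplings).
    min_separation: pairs closer than this along the chain are skipped;
               the value 4 is an assumed cutoff, not taken from the source.
    """
    scored = [
        (frobenius_score(block), i, j)
        for (i, j), block in couplings.items()
        if j - i >= min_separation
    ]
    scored.sort(reverse=True)
    return [(i, j, score) for score, i, j in scored]


# Toy couplings with Q = 2 states, for illustration only.
toy_couplings = {
    (0, 1): [[5.0, 0.0], [0.0, 5.0]],  # strong, but neighbours along the chain
    (0, 5): [[3.0, 0.0], [0.0, 4.0]],  # strong and distant: contact candidate
    (1, 6): [[0.1, 0.0], [0.0, 0.1]],  # weak coupling
}
predicted = rank_contacts(toy_couplings)
```

The highest-ranked pairs would then be read as predicted contacts. The separation filter reflects the fact that residues adjacent along the chain are trivially close in space and so carry little structural information.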
This requires a tight interdisciplinary collaboration and cross-fertilization between statistical physics and computational biology: statistical-physics approaches have to be adapted to the specific needs of sequence-based inference (e.g. integration of prior biological information, interfacing with protein and RNA databases, including gaps and inserts in the statistical-physics formalism), and have to be validated on both artificial and biological data. In turn, the analysis of biological data raises very interesting issues, such as the need to solve inverse problems with poor sampling due to the limited availability of data, the disentangling of phylogenetic (historical) correlations in protein sequences, and the design of extended Potts models in which the number of variables is not fixed, so as to go beyond Hidden Markov Models. Such questions are new and of fundamental interest from a statistical-physics point of view.
In doing so, we aim for a breakthrough (a) in the use of statistical-physics approaches to solve inverse problems, (b) in the interdisciplinary applicability of these methods to the description of complex biological systems, and (c) in the use of co-evolutionary sequence information for the understanding of protein and RNA structure and function. The nature of our project is therefore both fundamental and concrete. We expect a strong impact in statistical physics and in computational biology.

Project coordination

Simona COCCO (UMR 8550 Laboratoire de Physique Statistique de l'Ecole Normale Supérieure)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Partners

LPT-ENS Laboratoire de Physique Théorique de l'Ecole Normale Supérieure
LGM-UPMC Laboratoire de Génomique de Microorganismes de l'Université Pierre et Marie Curie
LPS-ENS (UMR 8550) Laboratoire de Physique Statistique de l'Ecole Normale Supérieure

ANR funding: 257,398 euros
Start and duration of the scientific project: September 2013, 42 months
