CE45 - Mathematics and digital sciences for biology and health 2019

Multi-resolution and multi-scale geometric approaches for the determination of biomolecular structures – multiBioStruct

Our methods and approaches

The text below is organized according to the main objectives of the project.

To study novel relational descriptors for the topological and metric properties of secondary structures, we represented proteins as graphs G whose nodes are amino acids, and for which we extended the notion of "dihedral angle". The problem of determining the secondary structure of a protein (alpha-helices, beta-sheets, etc.) can then be reformulated as a classification problem on the vertices of G. To perform this classification, we trained a neural network using the recursive Machine Learning (ML) method known as "message passing". We experimented with various setups, including one in which a single carbon atom (C_alpha) represents an entire amino acid.
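As a rough illustration of the message-passing idea on a protein graph, the sketch below performs one aggregation round on a toy chain of three residues. It is not the network described above: the function name `message_passing_round`, the `self_weight` parameter, the fixed averaging rule (a trained network would use learned weights), and the one-dimensional features are all assumptions made for this example.

```python
from typing import Dict, List


def message_passing_round(adjacency: Dict[int, List[int]],
                          features: Dict[int, List[float]],
                          self_weight: float = 0.5) -> Dict[int, List[float]]:
    """One round of message passing on a graph.

    Each node receives the mean of its neighbours' feature vectors and
    mixes it with its own vector. The mixing rule is a fixed average
    (an assumption for this sketch), not a learned update.
    """
    updated: Dict[int, List[float]] = {}
    for node, own in features.items():
        neighbours = adjacency.get(node, [])
        if neighbours:
            # Aggregate: component-wise mean of the neighbours' features.
            mean = [sum(features[n][k] for n in neighbours) / len(neighbours)
                    for k in range(len(own))]
        else:
            mean = [0.0] * len(own)
        # Update: convex combination of own features and aggregated message.
        updated[node] = [self_weight * a + (1.0 - self_weight) * m
                         for a, m in zip(own, mean)]
    return updated


# Toy chain of three residues with hypothetical one-dimensional features.
chain = {0: [1], 1: [0, 2], 2: [1]}
h0 = {0: [1.0], 1: [0.0], 2: [0.0]}
h1 = message_passing_round(chain, h0)
print(h1)  # {0: [0.5], 1: [0.25], 2: [0.0]}
```

Repeating the round lets information travel further along the chain; a per-node classifier applied to the final features would then assign a secondary-structure label to each amino acid.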

With the aim of developing error-tolerant algorithms for the Distance Geometry Problem (DGP), we started developing a new software tool in which several distance geometry methods are meant to co-exist and interact. The existing implementation of our algorithms is low-level, which guarantees high performance; in contrast, the new tool is coded in Java, which supports the object-oriented paradigm. This choice of programming language is motivated by our need to compare several approaches and data structures for the DGP, and to combine several existing methods within a single tool.

To study the possible conformations of the SERF1a protein, an intrinsically disordered protein (IDP), we conducted the following experiments: (i) we performed Small and Wide-Angle X-ray Scattering (SWAXS) experiments to obtain curve measurements on our IDP sample; (ii) we developed a new approach for calculating IDP conformations, which combines two approaches previously proposed by one of the partners of this consortium; (iii) we used the combined approach to analyze the SERF1a conformations; (iv) we developed a Gaussian mixture model, fitted with an expectation-maximization algorithm, with the aim of finding conformer populations.
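To make step (iv) more concrete, here is a minimal sketch of expectation-maximization for a one-dimensional Gaussian mixture. It is a generic textbook version, not the model developed in the project: the function names, the scalar descriptor, the initial means, and the toy data are assumptions chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Density of a one-dimensional Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data: Sequence[float], init_means: Sequence[float],
              n_iter: int = 50, min_var: float = 1e-6
              ) -> Tuple[List[float], List[float], List[float]]:
    """Fit a 1D Gaussian mixture by EM; returns (weights, means, variances)."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            likes = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(likes)
            if total == 0.0:
                resp.append([1.0 / k] * k)  # numerical underflow: fall back to uniform
            else:
                resp.append([p / total for p in likes])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0.0:
                continue  # empty component: keep its previous parameters
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)  # avoid a collapsing variance
    return weights, means, variances


# Hypothetical scalar descriptor computed on six conformers (made-up values).
values = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
weights, means, variances = em_gmm_1d(values, init_means=[0.0, 6.0])
print(weights, means)
```

In this reading, the fitted weights play the role of conformer populations and each component describes one family of conformations; a realistic model would work on multi-dimensional descriptors.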

Towards the end of the project, well after the planned end of the activities related to our Machine Learning approach to secondary structures, new work in the same direction was carried out together with a colleague external to the project. This work exploits a Support Vector Machine (SVM) classifier. Because it started so late, it was not finalized by the end of the project.

Results

The neural network developed for the prediction of topological and metric properties for secondary structures of the protein backbones was named Sequoia and published in [hal-03364652]. Another work, more generally related to distances between various entities, but specifically mentioning the application to proteins, was also published [hal-02892020].

Our study on the development of error-tolerant methods for the DGP produced the following main results. We investigated various alternative representations for biological molecules that are inspired by other DGP applications [hal-04183466,hal-04647132,hal-04167543]. In doing so, we found that the opposite direction was also interesting, i.e. using approaches from biology in other disciplines [hal-03688782,hal-03746440]. With the aim of better controlling the error due to round-off approximations, we studied the integration of the Levenberg-Marquardt algorithm in the general Branch-and-Prune (BP) framework for the discretizable DGP [hal-03688777].
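The role that a Levenberg-Marquardt refinement could play can be illustrated on a single candidate position. The sketch below is a simplified, standalone example and not the integration published in [hal-03688777]: it works in dimension 2 rather than 3, it is not embedded in a BP search, and the function name `lm_refine_2d`, the damping schedule, and the data are assumptions.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def lm_refine_2d(anchors: Sequence[Point], dists: Sequence[float],
                 start: Point, n_iter: int = 100, lam: float = 1e-2) -> Point:
    """Refine a 2D position so that its distances to fixed anchors match dists.

    Minimizes sum_i (||p - a_i|| - d_i)^2 with damped Gauss-Newton steps.
    """
    def cost_at(px: float, py: float) -> float:
        return sum((math.hypot(px - ax, py - ay) - d) ** 2
                   for (ax, ay), d in zip(anchors, dists))

    x, y = start
    cost = cost_at(x, y)
    for _ in range(n_iter):
        # Accumulate J^T J (entries a11, a12, a22) and J^T r (entries g1, g2).
        a11 = a12 = a22 = g1 = g2 = 0.0
        for (ax, ay), d in zip(anchors, dists):
            norm = math.hypot(x - ax, y - ay)
            if norm == 0.0:
                continue  # gradient undefined exactly on an anchor: skip the term
            jx, jy = (x - ax) / norm, (y - ay) / norm
            r = norm - d
            a11 += jx * jx
            a12 += jx * jy
            a22 += jy * jy
            g1 += jx * r
            g2 += jy * r
        # Solve (J^T J + lam * I) delta = -J^T r with Cramer's rule.
        b11, b22 = a11 + lam, a22 + lam
        det = b11 * b22 - a12 * a12
        dx = (a12 * g2 - b22 * g1) / det
        dy = (a12 * g1 - b11 * g2) / det
        new_cost = cost_at(x + dx, y + dy)
        if new_cost < cost:
            # Step accepted: move and trust the Gauss-Newton model more.
            x, y, cost = x + dx, y + dy, new_cost
            lam = max(lam * 0.1, 1e-12)
        else:
            # Step rejected: increase damping (closer to gradient descent).
            lam = min(lam * 10.0, 1e12)
    return x, y


# A slightly perturbed candidate is pulled back to the point (1, 1).
anchors = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
dists = [math.sqrt(2.0), math.sqrt(10.0), math.sqrt(5.0)]
print(lm_refine_2d(anchors, dists, start=(1.3, 0.8)))
```

Applied to the positions computed along a branch of the search tree, such a local correction is one way to keep accumulated round-off error from causing feasible branches to be pruned.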

We also worked in abstract setups, where the Euclidean space in which we compute our embeddings has dimension 1 and the data are simply integer numbers [hal-03688779]. In this context, we looked at high-performance computing approaches, such as CUDA programming on GPUs [hal-03746879], and at the possibility of performing the computations in an analog fashion by means of an optical circuit [hal-03636293]. We implemented a well-known heuristic approach, the Variable Neighbourhood Search (VNS), in Hamming space, in order to better deal with DGP instances for which a discretization can be supplied [hal-03688784,hal-04647131].
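The one-dimensional integer setting makes the branch-and-prune idea easy to show in a few lines. The sketch below is only an illustration under stated assumptions, not the code used in the cited works: the function name, the dictionary format for distances (keys `(u, v)` with `u < v`), and the requirement that every vertex `v >= 1` has a known distance to vertex `v - 1` are choices made for this example.

```python
from typing import Dict, List, Tuple


def branch_and_prune_1d(n: int,
                        distances: Dict[Tuple[int, int], int]) -> List[List[int]]:
    """Enumerate all integer embeddings x[0..n-1] on the line with x[0] = 0.

    Assumes that distances[(v - 1, v)] is known for every v >= 1; any other
    known distance (u, v) with u < v is used to prune candidate positions.
    """
    solutions: List[List[int]] = []

    def feasible(x: List[int], v: int) -> bool:
        # Check every known distance between v and an already placed vertex.
        return all(abs(x[v] - x[u]) == d
                   for (u, w), d in distances.items() if w == v and u < v)

    def recurse(x: List[int]) -> None:
        v = len(x)
        if v == n:
            solutions.append(list(x))
            return
        d = distances[(v - 1, v)]
        # Branch: in dimension 1 there are at most two candidate positions.
        for candidate in sorted({x[v - 1] - d, x[v - 1] + d}):
            x.append(candidate)
            if feasible(x, v):  # prune branches that violate a known distance
                recurse(x)
            x.pop()

    recurse([0])
    return solutions


# Toy instance with four vertices; the two solutions mirror each other.
toy = {(0, 1): 2, (1, 2): 3, (2, 3): 1, (0, 2): 1, (1, 3): 4}
print(branch_and_prune_1d(4, toy))  # [[0, -2, 1, 2], [0, 2, -1, -2]]
```

Each left/right choice along a branch can be encoded as one bit, so a partial or complete solution corresponds to a binary string. This is the kind of representation that makes a search in Hamming space natural.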

Simon Hengeveld's PhD thesis [tel-04607576] summarizes some of the main points above, and provides a wider view of the performed activities.

The Gaussian mixture expectation-maximization approach was tested on the IDPs Sic1 and pSic1 [hal-03796134]. Preliminary results on SERF1a have also been published [hal-0418468].

In terms of software, the work performed during the project allowed us to release two new versions of MDjeep [software: hal-04676343, publication: hal-03030154], a software tool for distance geometry that we have been developing for about 15 years (all versions can be found at github.com/mucherino/mdjeep). Another software contribution worth mentioning is the Java code deposited at

- [https://github.com/simonheng/BP_ProteinFileReader]

which was used for the experiments presented in [hal-03636295].

Other publications in line with this project and co-authored by the project participants are

- [hal-03250708], a study on maximum feasible sets in the context of DGP;

- [hal-04680675], a theoretical study on the possibility to predict, a priori, the number of solutions expected for discretizable DGPs;

- [hal-04183511], a study on the covalent geometry of protein conformations.

Perspectives

Many activities that naturally continue this project are already under way at the time of writing. For example, the work performed with the aim of designing an error-tolerant approach for the DGP is currently being ported to the Julia programming language, and initial work on extending some of the methods and algorithms is ongoing.

In the long term, we plan to exploit data obtained from other biophysical experiments, and possibly to combine these new data with the more traditional NMR and SWAXS data. One possibility we are looking at is single-molecule Fluorescence Resonance Energy Transfer (smFRET); another is Electron Paramagnetic Resonance (EPR). Both kinds of experiments can provide geometrical information, in the form of distances, between some of the amino acids forming the protein chain, but we expect the procedures for preparing and conducting them to be more complex than for NMR and SWAXS. In addition, we plan to extend our protein targets to disordered proteins involved in cancer (our current target, SERF1a, is instead important in the context of neurodegenerative diseases).

Part of this future work will be supported by our CNRS-IRP international project (running from 2024 to 2028 and involving three partners of this project, two in France and one in Taiwan), as well as by the ANR-PRC EVARISTE project (2024-2027), which involves two partners of this project and mainly focuses on aspects related to combinatorial optimization and machine learning.

Finally, a LUE (Lorraine Université d’Excellence) PhD project, between our partner in Taiwan and one of us, will be devoted to the development of clustering approaches for describing the conformation space of IDPs, using data science methodologies.

Submission summary

This project is set in the context of structural biology, where distance geometry (DG) has proved to be a relevant tool for the analysis and determination of biological structures such as proteins. The classical use of DG arises in nuclear magnetic resonance (NMR) experiments, where three-dimensional conformations of the biomolecule must be identified from experimentally estimated distances between pairs of atoms. This problem is NP-hard and has historically been tackled with heuristics and meta-heuristic methods. For some years, several partners of this project have been working on a discretization approach for DG that makes it possible to use a branch-and-prune (BP) algorithm to identify three-dimensional conformations. One of the strengths of this discretization approach is that the set of DG solutions can be enumerated exhaustively. The main idea of our project is to improve the robustness of this approach so that it handles uncertain data efficiently, and to extend its domain of applicability to genomic and Hi-C data.

This project is organized into 4 work packages (WPs). WP1 and WP2 focus on methodology, while WP3 and WP4 are related to applications. In particular, the main objective of WP1 is to define features that, given the NMR information and the chemical structure of a protein, make it possible to predict distance information accurate enough to describe the secondary structures of the protein correctly. The main goal of WP2 is to design an error-tolerant BP algorithm that is able, in particular, to handle uncertain data. The goal of WP3 is to exploit the results of WP1 and WP2 to find the three-dimensional structure of disordered proteins using only NMR chemical shifts, while WP4 will apply the results of WP1 and WP2 to genomics and Hi-C data.

The project coordinator has long experience with DG and its applications. His first work on the subject dates back about 10 years, when he was a postdoctoral researcher at LIX (Ecole Polytechnique) under the supervision of Leo Liberti. The collaboration with scientists from the Institut Pasteur, and in particular with Thérèse Malliavin, began at that time. Since then, the main application we have focused on has been protein conformations. The collaboration between Antonio Mucherino and Jung-Hsin Lin is much more recent, but it has become more active lately thanks to a CNRS PRC project covering 2018 and 2019, which allows the two partners to meet regularly and to make rapid progress on the initial ideas for a collaboration.

The consortium brings together scientists from different disciplines and backgrounds, located in France and Taiwan. No two teams in the consortium have similar expertise. Each partner will recruit a temporary researcher who will work full time on the various WPs of this project. The other requested costs relate to the organization of regular meetings between the partners (in France or in Taiwan) and to participation in conferences where we plan to publish our first results.

The excellent results obtained with NMR data in the DG context strongly motivate us to propose this project. If similar results are obtained by the end of the project for disordered proteins, as well as for genomics and Hi-C data, we will be able to make available a robust tool of crucial importance in biotechnology, with a view to molecular biophysics models integrated into the cellular or genomic context.

Antonio Mucherino (Institut de Recherche en Informatique et Systèmes Aléatoires)

The author of this summary is the project coordinator, who is responsible for its content. The ANR therefore declines any responsibility for its content.

IRISA Institut de Recherche en Informatique et Systèmes Aléatoires
BIS INSTITUT PASTEUR
LIX Laboratoire d'Informatique de l'Ecole Polytechnique
RCAS Academia Sinica / Research Center for Applied Sciences
GRC Genomics Research Center of Academia Sinica

ANR funding: 361,800 euros
Start and duration of the scientific project: December 2019 - 48 months
