DS04 - Life, health and well-being

Modeling Alternative Splicing and its Structural Impact during eVolution – MASSIV

Evolution and Structural Impact of Alternative Splicing

Alternative splicing (AS) enriches the protein repertoire by generating multiple transcripts from the same gene. MASSIV sits at the interface of genomics and molecular modeling. It will collect, mine and integrate the large body of sequencing data to better describe the contribution of AS to the evolution of protein structures and functions. Our working hypothesis is that the emergence of new AS transcripts correlates with reshaping of the fold space and rewiring of interaction networks.

Systematic assessment of AS structural impact and the way it shapes protein folds and interactions through evolution

The main ambition of the MASSIV project is to exploit the large and growing body of data available from high-throughput sequencing to shed light on the molecular mechanisms by which AS evolution promotes protein functional diversity. Our objectives are the following:
(A) Collect, curate and integrate various types of information related to the effect of AS on the protein coding sequence (from RNA-Seq, ribosome profiling or Ribo-Seq, mass spectrometry, X-ray and NMR, biochemical/functional experiments),
(B) Reconstruct plausible evolutionary scenarios explaining the transcripts observed in several species (e.g. human, chimpanzee, mouse, zebrafish, drosophila) and detect evolutionarily conserved transcripts,
(C) Characterize the impact of AS on the protein structure, its dynamical behaviour and interactions, and relate it to protein function,
(D) Identify the appearance of new transcripts in evolution, estimate their evolutionary ages and quantitatively relate them to the induced structural changes, in order to gain insight into how protein structures/folds have evolved.

The MASSIV project addresses a number of unresolved and challenging questions linked to AS: What are the evolutionary paths leading to functional innovation? How does AS impact the fold diversity of a protein or protein family? Can proteins evolve completely new folds? Does AS modulate protein-protein interaction (PPI) networks, and to what extent? What are the molecular mechanisms underlying the different biochemical activities or binding affinities of different protein isoforms? To help answer these questions at large scale, we will devise efficient and accurate strategies to predict the structural and dynamical changes associated with sequence variations induced by AS events (ASEs), and to reconstruct transcript phylogenies across several species. The wealth of sequences and the high dimensionality of protein fold space place this project in the big data category. The recent development of high-performance computing resources makes it feasible.

We will develop theoretical models with quantitative predictive power for reconstructing transcript phylogenies and isoform energy/mobility/binding landscapes. This implies collecting, processing and integrating RNA-Seq and potentially Ribo-Seq data, identifying orthologous exons across species (by a combination of pairwise and multiple sequence alignment), inferring plausible evolutionary scenarios by reconstructing forests of transcript trees embedded in the gene tree, and modeling and annotating the 3D structures of the resulting protein isoforms. As a result, we will produce a fully automated open-source computational package implementing and integrating the models.
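As a rough illustration of what inferring an evolutionary scenario on a gene tree can look like, the sketch below applies the classical Fitch small-parsimony pass to the presence (1) or absence (0) of a single exon across species. This is a deliberately simplified stand-in and not the MASSIV method, which reconstructs forests of transcript trees; the tree topology, the `fitch` function and the `exon_present` data are hypothetical toy choices.

```python
from typing import Dict, FrozenSet, Tuple, Union

# A binary tree as nested 2-tuples; leaves are species names (toy example).
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch(tree: Tree, leaf_states: Dict[str, int]) -> Tuple[FrozenSet[int], int]:
    """Bottom-up Fitch pass.

    Returns the set of most-parsimonious candidate states at the root of
    `tree` and the minimum number of state changes needed in the subtree.
    """
    if isinstance(tree, str):  # leaf: the state is observed
        return frozenset({leaf_states[tree]}), 0
    left, right = tree
    left_set, left_cost = fitch(left, leaf_states)
    right_set, right_cost = fitch(right, leaf_states)
    common = left_set & right_set
    if common:  # children agree on at least one state: no extra change
        return common, left_cost + right_cost
    # children disagree: keep both options and count one change
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical data: an exon observed in the three mammals only.
gene_tree: Tree = (((("human", "chimpanzee"), "mouse"), "zebrafish"), "drosophila")
exon_present = {"human": 1, "chimpanzee": 1, "mouse": 1, "zebrafish": 0, "drosophila": 0}

root_states, changes = fitch(gene_tree, exon_present)
print(sorted(root_states), changes)
```

With this toy input, the most parsimonious scenario has the exon absent at the root and gained once, on the branch leading to the mammals.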

We will generalize these computational tools to apply them to the whole human genome. This will produce a database of evolutionary and structural annotations for transcript isoforms that will be made freely available to the scientific community through a web server, along with associated web services, and will be regularly updated.

We will combine big data management and knowledge creation to elucidate the link between the fixation of ASEs and their structural and functional impact on the protein repertoire. For each protein family, we will map the estimated structural changes onto the leaves and ancestral nodes of the transcript phylogeny. We expect that our analysis will make it possible to identify new therapeutic targets, i.e. specific isoforms whose expression is correlated with the appearance of diseases.

A key asset of our approach is the interplay between sequence-based phylogenetic inference and molecular modeling. Consistency between the estimated structural/dynamical/binding changes and the detected evolutionary conservation of ASEs supports the soundness of our results and serves as a form of validation of our predictions. We will also continually seek experimental data against which to test and validate our predictions.

We have developed a phylogeny reconstruction algorithm based on the maximum parsimony principle. It takes as input a set of transcripts, represented as collections of exons, and infers forests of phylogenetic trees embedded in the gene tree. To reconstruct phylogenies automatically over a large number of species, we have also developed ThorAxe, a method that automatically identifies groups of orthologous exons from annotated transcript data from Ensembl. To the best of our knowledge, this is the first tool to automatically define orthologous exon groups while taking alternative splicing into account. It is very easy to get started with, and it can be used for a wide spectrum of applications (annotation, prediction of exons, identification of genes containing an exon, identification of similar transcripts in different organisms, design of primers for exon-exon junctions, estimation of the evolutionary age of each exon, etc.).

We have also developed a homology modeling prototype to reconstruct the 3D structures of protein isoforms. This modeling is based on the exons identified by ThorAxe. We iteratively search for homologous templates until the target is fully covered.

We have built a knowledge base for about twenty gene families, compiling sequence, structural, biochemical and functional data. We verified that all the events are genuine and match those documented in the literature. Beyond its value for validating our methods, this knowledge base helps to develop a global view of the types of ancient alternative splicing events that have a functional impact, the types of associated protein structures and the types of molecular mechanisms involved. Such a global view is currently lacking in the community.

In parallel, we have developed a prototype for integrating RNA-seq data into phylogeny reconstruction, so that we can eventually carry out tissue-specific analyses and reannotate undocumented transcripts.
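The iterative template search used for homology modeling can be pictured as a coverage problem. The sketch below is a minimal greedy illustration under strong assumptions, not the prototype itself: it supposes that candidate template hits are already available as (name, start, end) intervals on the target sequence, and it repeatedly picks the hit that covers the most still-uncovered residues. The function `cover_target` and the template names are hypothetical.

```python
from typing import List, Tuple

# A hit is (template_name, start, end), with a half-open [start, end) interval
# expressed in residue positions of the target isoform (assumed input format).
Hit = Tuple[str, int, int]


def cover_target(target_len: int, hits: List[Hit]) -> Tuple[List[str], List[int]]:
    """Greedily select templates until the target is covered or no hit helps.

    Returns the chosen template names (in selection order) and the sorted
    list of residue positions left uncovered.
    """
    uncovered = set(range(target_len))
    remaining = list(hits)
    chosen: List[str] = []
    while uncovered and remaining:
        # Pick the hit that adds the most newly covered residues.
        best = max(remaining, key=lambda h: len(uncovered & set(range(h[1], h[2]))))
        gain = uncovered & set(range(best[1], best[2]))
        if not gain:  # no remaining template covers anything new
            break
        chosen.append(best[0])
        uncovered -= gain
        remaining.remove(best)
    return chosen, sorted(uncovered)


# Toy example: a 100-residue target and three hypothetical template hits.
hits = [("tmplA", 0, 60), ("tmplB", 50, 100), ("tmplC", 10, 40)]
chosen, missing = cover_target(100, hits)
print(chosen, missing)
```

In this toy case, two templates are enough to cover the whole target and the third is never selected. A real pipeline would presumably launch a new homology search on the uncovered region at each iteration rather than rank a fixed list of hits.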

We will now finalize the development of our tools for predicting structures and apply them to our entire knowledge base. We will also reconstruct phylogenies of transcripts for the entire benchmark. The results will be the subject of a manuscript.

We will then apply ThorAxe to the entire human proteome. We will develop a tool to automatically detect ancient AS events from the ThorAxe results, and we will then model the structures of all human isoforms of ancient origin. These data will be made available to the community. We will also finalize our tool for analyzing and integrating sequencing data (RNA-seq, Ribo-Seq, scRNA-seq), which will feed the analysis of the entire proteome.

We also plan to organize a 2-day conference on AS at Sorbonne University in spring 2020. The sessions will cover a broad spectrum of themes: Evolution, AS and disease, Quantitative data on AS, Functional analysis of AS, Ontologies, Structure and function.

2 research articles in international journals (1 submitted, 1 in preparation), 1 oral presentation (+ poster + travel fellowship) at the leading international bioinformatics conference (ISMB/ECCB 2019, about 1000 participants), and 2 oral presentations, 1 poster and 1 demo at the leading French bioinformatics conference (JOBIM, 2018 and 2019 editions).

Alternative splicing (AS) greatly contributes to functional diversity in multicellular eukaryotes. It augments and enriches the protein repertoire by generating multiple transcript isoforms from the same gene. In humans, AS affects almost all multi-exonic genes and its deregulation is associated with diseases such as cancer.

Although AS mechanisms are well described at the genomic level, the impact of AS events on protein structures has seldom been characterized. Molecular modeling studies have highlighted cases where AS events (ASEs) may induce large structural changes, even fold switches, and rewiring of protein-protein interactions (PPI). This suggests that evolution makes use of AS to produce structural and functional diversity. However, the extent to which the transcript diversity generated by AS translates to the protein level and has functional implications in the cell remains a very challenging question and has been the subject of much debate.

We propose MASSIV, a multidisciplinary bioinformatics project at the interface of genomics and molecular modeling. MASSIV will collect, mine and integrate the large body of data available from high-throughput sequencing (HTS) to provide the community with valuable biological knowledge on the contribution of AS to the evolution of protein structures and functions. Our working hypothesis is that the emergence of new AS transcripts during evolution correlates with reshaping of the protein fold space and rewiring of the PPI networks.

MASSIV will rely on the development and large-scale application of carefully validated computational methods. We will develop the first computational method that reconstructs plausible scenarios to explain a set of transcripts across a set of species and predicts the tertiary structures, dynamical properties and interactions of the corresponding isoforms. This will make it possible to determine which isoforms are likely to play a functional role in the cell and to provide mechanistic explanations for the functional outcome of ASEs. We will apply our methodology to the whole human genome. This will generate a knowledge base that will be made available to the community. Mining this knowledge base will make it possible to identify ASEs that induce major structural changes and to learn how AS navigates the protein fold space over the course of evolution.

The expected results will also open new avenues in medical research (identification of new therapeutic targets, creation of patient-specific signatures). The developed methods will have broad applicability and will be useful to study transcript diversity and conservation among diverse biological entities. The entities could be at the scale of (i) one individual/species (tissue/cell differentiation), (ii) different species (matching cell types) or (iii) a population of individuals affected or not by a multifactorial disorder. The latter case is particularly relevant in the context of medical research.

MASSIV proposes to exploit in a rational and efficient manner the large body of data generated by HTS technologies and to complement experimental approaches dedicated to transcript diversity analysis with computational methods. The rapid development of such approaches is quite recent, and we are just starting to be able to survey in depth the AS-associated complexity across different species, individuals and tissues. We can expect that in a decade or two, recording the transcriptome of any individual in a given cell type will be routinely feasible and personalized medicine will become increasingly available. It is important that we invest now in the methodological efforts needed to treat those personalized data in the best possible way. This is what MASSIV is dedicated to. We believe the methods and approaches we will conceive and develop will become instrumental as experimental evidence accumulates and precise quantitative data become available.

Project coordination

Elodie Laine (Laboratoire de Biologie Computationnelle et Quantitative)

The author of this summary is the project coordinator, who is responsible for its content. The ANR accepts no responsibility for its contents.

Partner

LCQB Laboratoire de Biologie Computationnelle et Quantitative

ANR funding: 205,200 euros
Beginning and duration of the scientific project: December 2017 - 36 Months
