Statistical learning to decipher secretion systems in genomes – SECRET
Functional annotation is often performed with either phylogenomics or machine learning approaches. On the one hand, phylogenomics approaches rely on quantifying sequence similarity and genomic context between the protein of interest and proteins for which functional annotation is available. Although particularly powerful for annotating protein complexes, these methods often require substantial manual work to identify patterns specific to each system, and they are limited to annotating proteins with known homologs. On the other hand, machine learning approaches are based on features extracted from the protein sequence (e.g. secondary structure, physico-chemical properties) and thus overcome the reliance of phylogenomics approaches on sequence similarity. However, they often cast protein function annotation as a "one-protein, one-function" prediction task with broad functional categories. There is thus a gap between genome-aware phylogenomics approaches and protein-centric machine learning approaches.
The objective of this proposal is to develop statistical learning approaches to predict protein function in prokaryotic genomes, with an application to identifying novel secretion systems. The diversity of secretion system types in terms of size (1 to 15 proteins), proteins involved, and genomic context, together with their importance for prokaryotes, makes them a well-suited case study for our project. Our goals are threefold: (1) identify secretion system protein signatures beyond sequence similarity (secondary structure, physico-chemical properties); (2) move beyond the single gene/protein level at which machine learning approaches typically operate by developing semi-supervised, penalized deconvolution methods; (3) automate the learning tasks as far as possible so that they can be solved without human intervention. This interdisciplinary project will thus develop novel statistical learning tools to annotate known secretion systems and discover unknown ones in prokaryotic genomes.
Ms. Nelle Varoquaux (TIMC)
The author of this summary is the project coordinator, who is responsible for its content. The ANR accepts no responsibility for its contents.
ANR funding: 214,248 euros
Start and duration of the scientific project: March 2023, 42 months