CE45 - Mathematics and digital sciences for biology and health

Learning from pangenomes with infinite collections of sequence motifs – PIECES

Submission summary

Genetic variation can have causal effects on a variety of phenotypes
ranging from human health risks to bacterial drug resistance and crop
yield. Unraveling the relationship between genotypes and phenotypes is
therefore crucial for both basic and applied science. Genomes have
historically been treated as small variations around a reference
sequence in computational biology and statistics. Genome-Wide
Association Studies (GWAS), for example, typically start by aligning the
genomes of all samples in a panel against a reference genome. Each
sample is then represented by its set of point mutations, and typical
methods test the statistical association between the presence of a
mutation and a phenotype of interest. In many important cases, however,
alignments are not appropriate. Microbes, for example, sometimes have
entire genes that are not present in all individuals. Most
alignment-free representations rely on the exact presence of
sub-sequences in the genomes. However, genomic variants are often
better described in terms of sequence motifs, indicating frequencies
of each letter at each position. The recently introduced CKN-seq
method implicitly defines infinite sets of genomic features akin to
sequence motifs, and selects the ones that are most relevant for a
learning task. The PIECES project will extend CKN-seq and exploit its
ability to represent unaligned sequences through three tasks:

* GWAS over infinite sets of sequence motifs

CKN-seq selects sequence motifs from an infinite set, based on their
ability to predict a phenotype. However, no procedure exists to
quantify the significance of the association between the selected
motifs and the phenotype. We will propose versions of CKN-seq that are
amenable to hypothesis testing, allowing their use for GWAS over sets
of sequence motifs. The testing procedure will build on
selective inference, a recent and active field in statistics. The
resulting GWAS method will be used in several already established
collaborations with microbiologists to identify genetic determinants
of antimicrobial resistance in bacterial genomes, and with an
industrial partner interested in detecting determinants of human diseases
in gut microbiomes.
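
As a point of comparison, the classical per-feature association test that alignment-free GWAS typically relies on can be sketched as follows. This is a minimal illustration, not the selective-inference procedure proposed in this task: it assumes a single sub-sequence chosen before looking at the phenotype, and the function names and toy genomes are hypothetical.

```python
from math import comb


def has_kmer(genome: str, kmer: str) -> bool:
    """Alignment-free feature: exact presence of a sub-sequence in a genome."""
    return kmer in genome


def fisher_enrichment_pvalue(presence, phenotype):
    """One-sided Fisher exact test for enrichment of a feature among cases.

    presence[i]  is truthy if genome i carries the feature,
    phenotype[i] is truthy if genome i shows the phenotype (e.g. resistance).
    Returns P(X >= observed) under the hypergeometric null distribution.
    """
    if len(presence) != len(phenotype):
        raise ValueError("presence and phenotype must have the same length")
    n_total = len(presence)
    n_with = sum(1 for p in presence if p)    # genomes carrying the feature
    n_cases = sum(1 for y in phenotype if y)  # genomes with the phenotype
    observed = sum(1 for p, y in zip(presence, phenotype) if p and y)
    upper = min(n_with, n_cases)
    tail = sum(
        comb(n_with, k) * comb(n_total - n_with, n_cases - k)
        for k in range(observed, upper + 1)
    )
    return tail / comb(n_total, n_cases)


# Toy panel (made-up sequences): three resistant and three susceptible genomes.
genomes = ["TTGACGTA", "GACGTTTT", "AAGACGAA", "TTTTTTTT", "ACACACAC", "GGGGTTTT"]
resistant = [1, 1, 1, 0, 0, 0]
presence = [has_kmer(g, "GACG") for g in genomes]
p_value = fisher_enrichment_pvalue(presence, resistant)
```

Applying such a test to features that were themselves selected using the phenotype would no longer give valid p-values, which is the gap that the selective inference approach described above is meant to address.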

* Alignment-free, interpretable sequence analysis

The rapidly increasing availability of biological sequence data also
calls for the development of exploratory methods. Most unsupervised
learning methods require sequences to be aligned and are
often slow. On the other hand, existing kernel methods deal with
unaligned sequences and are suited to efficient approximations but
lose access to the features used to perform the analysis, only
returning cluster memberships for clustering or sample projection for
PCA. An important challenge is therefore to provide exploratory
methods which are fast and alignment-free while remaining
interpretable. Accordingly, we will develop an unsupervised version of
CKN-seq performing PCA and clustering, making it possible to interpret
clusters or principal components in terms of associated motifs.
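
To make the notion of a sequence motif concrete, the sketch below scores an unaligned sequence against a small position-specific letter-frequency matrix and reports the best-matching window. It only illustrates what a motif is and how it differs from an exact sub-sequence; it is not CKN-seq, and the matrix values, the uniform background and the function names are assumptions made for the example.

```python
from math import log

# Hypothetical motif of length 3: probability of each letter at each position.
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.4, "G": 0.1, "T": 0.4},
]
BACKGROUND = 0.25  # assumed uniform background frequency for A, C, G, T


def window_score(window, motif=MOTIF):
    """Log-odds score of one window against the motif (letters must be ACGT)."""
    return sum(log(column[letter] / BACKGROUND)
               for letter, column in zip(window, motif))


def best_match(sequence, motif=MOTIF):
    """Return (start position, score) of the best-scoring window.

    No alignment is needed: every window of the sequence is scored.
    """
    width = len(motif)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    positions = range(len(sequence) - width + 1)
    best = max(positions,
               key=lambda i: window_score(sequence[i:i + width], motif))
    return best, window_score(sequence[best:best + width], motif)


position, score = best_match("TTAGCTT")
```

With this matrix, the windows `AGC` and `AGT` receive the same score, whereas a representation based on exact sub-sequences would treat them as two unrelated features.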

* Learning from populations of sequences

We will explore a supervised learning approach to phylogenetic
reconstruction: rather than maximizing the likelihood of a model of
sequence evolution, we will generate evolutionary trees and sequences
from these models and use them to learn a function transforming
(observed) distances between sequences into distances on the
(unobserved) evolutionary tree. This novel paradigm could improve over
existing phylogenetic reconstruction methods, or lead to similar
accuracies on much larger sets of species for which existing methods
are computationally prohibitive. The supervised approach will be able
to exploit both aligned sequences and a novel alignment-free
representation of the gene family relying on the same principle as
CKN-seq (which deals with individual sequences).
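
For context, a textbook example of a function mapping observed distances to evolutionary distances is the Jukes-Cantor correction, which is derived analytically from one specific model of sequence evolution. The sketch below shows this closed-form transform on aligned sequences; the project proposes instead to learn such a function from simulated trees and sequences, so this is an illustration of the kind of mapping involved, not the proposed method. Function names are hypothetical.

```python
from math import log


def p_distance(seq_a: str, seq_b: str) -> float:
    """Observed distance: fraction of differing sites in two aligned sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, non-empty, and of equal length")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor correction: expected substitutions per site given p.

    d = -(3/4) * ln(1 - (4/3) * p), defined for 0 <= p < 3/4.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * log(1.0 - 4.0 * p / 3.0)


observed = p_distance("ACGTACGT", "ACGTACGA")   # one mismatch out of eight sites
corrected = jukes_cantor_distance(observed)
```

The corrected distance is larger than the observed one because multiple substitutions at the same site are hidden in the observed sequences; a learned transform would play the same role without requiring a closed-form derivation.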

We will deliver user-friendly software to maximize the diffusion and
impact of successful methods for all three tasks.

Project coordination

Laurent JACOB (BIOMÉTRIE ET BIOLOGIE EVOLUTIVE)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Partner

LBBE BIOMÉTRIE ET BIOLOGIE EVOLUTIVE

ANR funding: 380,071 euros
Beginning and duration of the scientific project: December 2020 - 48 Months
