DS10 - Défi des autres savoirs 2016

Post hoc approaches for large-scale multiple testing – SansSouci

Submission summary

The number and size of available data sets of different types have increased dramatically over the past twenty years. This "data deluge" has been accompanied by a shift from hypothesis-driven research to data-driven research in many scientific fields, including astronomy, biology, genetics, and medicine. The size and specific characteristics of these data require dedicated tools to be developed for their analysis. In a number of applications, including genomics and neuroimaging, many features are tested for their association with a response variable of interest (a phenotype). This multiple testing situation has triggered the development of specific risk measures such as the widely used False Discovery Rate (FDR). We have identified situations in which there exists a substantial gap between the statistical guarantees provided by state-of-the-art multiple testing procedures and the actual needs of practitioners. We consider three motivating examples:

1. Neuroimaging: detection of regions of the brain specifically activated when performing a given task from functional magnetic resonance imaging (fMRI) data.

2. Differential gene expression analyses in cancer studies.

3. Genome-Wide Association Studies (GWAS): identification of genomic markers associated with a phenotype of interest.

In these three examples, multiple significance testing is used as a first, exploratory step in order to define a list of candidates. This list is then refined and/or interpreted using prior knowledge of the problem at hand. Two major limitations of this type of two-step approach are that (1) the initial selection does not take advantage of the available prior knowledge, and (2) no formal risk assessment can generally be made on the resulting set of markers. Without dedicated statistical approaches, practitioners run the risk of frequently reporting spurious findings as "significant". The statistical community recognizes the problem of poor reproducibility of scientific research.

In order to overcome the above-mentioned limitations of existing multiple testing procedures, the SansSouci project aims at developing mathematically grounded procedures for post hoc inference. By post hoc, we mean that the set of hypotheses to be selected may be defined by the user of the procedure after 'seeing the data'. In particular, these candidate lists may be defined after significance testing has been performed. Therefore, contrary to the risk measures currently used, post hoc approaches enable valid statistical statements to be made simultaneously on any number of arbitrary candidate lists. Post hoc approaches to multiple testing thereby constitute a major paradigm shift in the field of multiple testing, with considerable potential impact on applications to high-throughput data analysis. This is particularly relevant in the above examples, where prior knowledge or complementary data analysis suggests focusing on a set of candidate hypotheses R that does not correspond to "the |R| most significant hypotheses".
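To make the idea of a post hoc guarantee concrete, here is a minimal sketch of one well-known bound of this kind, based on the Simes inequality. It is given as an illustration only and is not necessarily the procedure developed in the project. It assumes that the Simes inequality holds for the p-values of the true null hypotheses (for example, when they are independent). The function name `simes_posthoc_bound` and its parameters are hypothetical choices made for this example. With probability at least 1 - alpha, the returned value is an upper bound on the number of false positives in the selected list, simultaneously for all possible lists.

```python
def simes_posthoc_bound(p_values, selected, alpha=0.05):
    """Upper bound on the number of false positives among `selected`.

    Illustrative sketch: valid simultaneously over all subsets with
    probability >= 1 - alpha, ASSUMING the Simes inequality holds for
    the p-values of the true null hypotheses.

    p_values: p-values of all m tested hypotheses.
    selected: indices of a user-chosen candidate list (may depend on the data).
    """
    m = len(p_values)
    best = len(selected)  # trivial bound: every selected item is a false positive
    for k in range(1, m + 1):
        t_k = alpha * k / m  # k-th Simes threshold
        # Selected hypotheses whose p-value exceeds the k-th threshold
        n_above = sum(1 for i in selected if p_values[i] > t_k)
        # At most k - 1 true nulls fall at or below t_k on the Simes event
        best = min(best, n_above + k - 1)
    return best


p = [0.001, 0.002, 0.003, 0.2, 0.5, 0.8]
# The same guarantee covers every list, including lists chosen after seeing the data
bound_top = simes_posthoc_bound(p, [0, 1, 2])    # the three smallest p-values
bound_mixed = simes_posthoc_bound(p, [0, 3, 4])  # a list motivated by prior knowledge
```

Because the bound holds for all lists at once, a user may evaluate it on as many candidate lists as desired, including lists that are not "the |R| most significant hypotheses", without any further correction.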

The main objective of this project is to develop innovative multiple testing procedures for post hoc inference. In order to do so, we introduce a novel, dedicated risk measure, which we call Joint Risk (JR). JR control makes it possible to build truly post hoc inference procedures. This project covers:

1. theoretical aspects of JR control: the identification of statistical settings which enable this risk measure to be (asymptotically or non-asymptotically) controlled; the development of JR controlling procedures adapted to these settings; the characterization of their statistical properties;

2. the development of computationally efficient post hoc inference procedures tailored to the application problems described above;

3. their application and evaluation on specific instances of these problems;

4. their implementation and dissemination to practitioners via software packages and dedicated graphical user interfaces.

Project coordination

Pierre NEUVIAL (Université Toulouse III Paul Sabatier - Institut de Mathématiques de Toulouse)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Partnership

IMT Université Toulouse III Paul Sabatier - Institut de Mathématiques de Toulouse

ANR funding: 192,834 euros
Beginning and duration of the scientific project: September 2016 - 36 months
