Today, data is being generated at an unprecedented rate. Despite this phenomenal growth, the human ability to comprehend data remains as limited as before. The “Big Data” era therefore faces a widening gap between the growth of data and the human ability to comprehend it. This gap will inevitably prevent Big Data from delivering on its promise, especially for emerging big data communities such as healthcare and scientific computing, where analytical needs are immense but technical know-how in the workforce is limited.
To bridge this gap, we propose Interactive Data Exploration as a new service of a database management system (DBMS), together with a large suite of new algorithms and optimizations to support this service effectively across science, healthcare, and business. Our research advocates a new approach in which the DBMS aids data exploration and automatically learns the user's interest in order to retrieve all objects that match it. Under this approach, the DBMS uses the user's relevance feedback on database samples to model the user's interest, and makes strategic “exploration decisions” over its large content to collect the samples that best expedite model convergence. Our research is expected to advance the state of the art in (1) learning theory for interactive data exploration, with new exploration strategies over large databases and provable results on the model convergence rate, and (2) DBMS design, including many new query processing and optimization techniques to support exploration workloads with interactive performance and high scalability.
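The feedback loop described above can be illustrated with a minimal sketch. It is not the project's algorithm: it assumes a toy 2-D "database" of grid points, a hypothetical rectangular user interest (in_interest), a 1-nearest-neighbour model of that interest, and a simple uncertainty-based exploration decision that asks the user about the tuple closest to the current decision boundary. All names and parameters are illustrative.

```python
import math
import random


def nearest_dist(p, points):
    """Distance from p to its nearest neighbour in points (inf if points is empty)."""
    return min((math.dist(p, q) for q in points), default=math.inf)


def explore(database, ask_user, rounds=30, bootstrap=5, seed=0):
    """Toy interactive exploration loop (illustrative assumption, not the project's method).

    database: list of distinct tuples; ask_user: callable returning True/False feedback.
    Returns the feedback collected and the tuples predicted to match the user interest.
    """
    rng = random.Random(seed)
    # Bootstrap: ask the user about a few random samples.
    labeled = {p: ask_user(p) for p in rng.sample(database, bootstrap)}

    for _ in range(rounds):
        pos = [p for p, y in labeled.items() if y]
        neg = [p for p, y in labeled.items() if not y]
        unlabeled = [p for p in database if p not in labeled]
        if not unlabeled:
            break
        if pos and neg:
            # Exploration decision: pick the tuple the 1-NN model is least sure about,
            # i.e. the one nearly equidistant from relevant and irrelevant samples.
            nxt = min(
                unlabeled,
                key=lambda p: abs(nearest_dist(p, pos) - nearest_dist(p, neg)),
            )
        else:
            # Only one class seen so far: fall back to random sampling.
            nxt = rng.choice(unlabeled)
        labeled[nxt] = ask_user(nxt)

    pos = [p for p, y in labeled.items() if y]
    neg = [p for p, y in labeled.items() if not y]
    # Final retrieval: every tuple the 1-NN model predicts as relevant.
    retrieved = [p for p in database if nearest_dist(p, pos) < nearest_dist(p, neg)]
    return labeled, retrieved


def in_interest(p):
    """Hypothetical hidden user interest: an axis-aligned rectangle."""
    return 5 <= p[0] <= 12 and 8 <= p[1] <= 15


database = [(x, y) for x in range(20) for y in range(20)]
labeled, retrieved = explore(database, in_interest)
print(f"asked the user about {len(labeled)} tuples, retrieved {len(retrieved)}")
```

In this sketch the user answers a few dozen yes/no questions instead of writing a query. A real system would replace the linear scans with index-backed sampling, which is where the query processing and optimization techniques mentioned above would come in.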
We anticipate that this project will generate substantial scientific, social, and economic impacts.
- Scientific: This project will yield a formal approach to data exploration grounded in a rigorous learning framework, together with DBMS algorithms and optimization techniques that ensure interactive performance and scalability. These results will be disseminated via scientific publications, system demos, visits to relevant industry research labs, and the release of open-source code to the scientific community. Furthermore, our project will enable a close synergy between database systems, machine learning, and visualization, and help prepare a larger ERC submission in the future. It will also integrate research and education closely through the timely translation of our research results into classroom teaching.
- Societal and Economic: Our new DBMS service of automated data exploration will be crucial for deriving insights from the large, complex datasets encountered in many big data applications across science, healthcare, and business. The human effort required to explore large datasets will be much reduced, as users will be methodically steered towards their true interests, while the quality of exploration will be significantly improved, as such exploration is grounded in a rigorous learning framework and formal methods with provable results. In particular, our anticipated collaboration with CNAMTS has the potential to bring direct benefits to the French healthcare sector, leading to both societal impact from a biomedical standpoint and economic impact, given that public health is France's largest budget item.
Madame Yanlei Diao (ECOLE POLYTECHNIQUE)
The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.
ECOLE POLYTECHNIQUE
ANR funding: 299,716 euros
Start date and duration of the scientific project: September 2016, 48 months