Coclico is a research project whose aim is to propose a generic method for the multi-scale analysis of large volumes of spatio-temporal data of variable quality, supplied continuously. It consists in studying and implementing an incremental, multi-strategy approach guided by knowledge. It must guarantee the final quality of the results by taking into account the quality of both the data and the knowledge.
The Coclico project aims to develop automatic or semi-automatic methods, tailored to the complexity and rapid evolution of large masses of multi-source spatio-temporal data, relying on advanced data mining and machine learning methods for the analysis and monitoring of complex phenomena. It must therefore meet several challenges, including:
The data volumes are huge, and scaling the algorithms is a particularly important problem when developing incremental approaches that allow models to be updated continuously.
The data may contain errors or aberrations, which are complex to detect and to take into account in the analysis process.
Natural and anthropogenic processes are complex and constantly evolving, so the data are dynamic.
The analysis is carried out at several semantic levels: it must be possible to analyze at both the global and the local level and to articulate these levels.
The knowledge about the phenomena and processes under study, and about the methods to be implemented for this purpose, is complex and barely formalized.
Four original facets:
The method will be multi-strategy and multi-scale. We propose to extend collaborative methods so that they can use new families of algorithms (segmentation, ranking, etc.). The goal is to improve the quality of the results and to enable multi-scale data analysis.
It will be incremental. It is no longer feasible to rebuild the knowledge about the phenomenon under study from scratch each time new data arrive. We propose to implement an incremental method for comparing the extracted knowledge with new experimental results or new hypotheses about the data. The objective is to allow the extracted knowledge to be questioned continuously, in order to meet the specific needs of scientists and domain experts.
It will be guided by knowledge. To minimize the involvement of the user during the process, it is necessary to use the user's knowledge of the entities and their mutual relations, and to define their representations and the mechanisms for their extraction and recognition. We propose to study and implement such a knowledge base. The goal is to help guide, but also to challenge, the collaborative process based on this knowledge.
It will be guided by the quality of data and knowledge. We propose to study and implement a method that integrates a knowledge base with the collaborative process itself. This makes it possible to choose the best data to process according to their own quality and that of related data, and also to select the most appropriate preprocessing methods, the «best» single-strategy methods, and the best configuration for multi-strategy collaboration among the latter. The goal is to make the method robust to noise and to errors in the formalization of domain knowledge.
Will be produced ......
Data mining is an important step in the process that leads from data to knowledge. For example, understanding the processes and development of systems that are more or less shaped by human activity, at various spatial and temporal scales (urbanization pressure on land, biodiversity loss, etc.), from satellite or other data is becoming a major component in areas such as environmental studies and urban planning. However, current analysis techniques are less and less able to cope with the present avalanche of heterogeneous data, which are often incomplete or inaccurate and are increasingly supplied as continuous streams.
While the characteristics of mining methods are generally well known and understood by the analyst, statistician or computer scientist, this is rarely true of the user. Quite often, several algorithms must be tried with different parameters to determine which best suits the question. The user must also take into account the non-determinism of many unsupervised classification methods. Moreover, the variable quality of raw or preprocessed data, the robustness of learning methods to noise, and the sensitivity of results to changes in the methods or parameters of data acquisition and construction must all be considered in order to suggest more appropriate strategies for data cleaning and preprocessing. Finally, because the data are supplied continuously, a dynamic dimension is added, along with the need for incremental learning in a changing environment.
There is currently no surefire way to choose the best method and its parameters, as this choice is strongly tied to the application domain, to prior knowledge about it, and to the data to be processed. One approach increasingly proposed to circumvent this problem is based on the intuition that the methods are complementary, or can at least corroborate one another. Thus, mechanisms for confronting and unifying the results obtained from different methods and data can be used to provide the user with a relevant summary. A promising avenue in this area is based on collaboration between different methods.
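As a purely illustrative sketch of what confronting the results of two methods can look like, the Python snippet below measures how often two clusterings agree on whether a pair of objects belongs together (the classical Rand index). The function name `pairwise_agreement` and the sample labels are assumptions made for this example; they are not part of the Coclico method, which the summary does not specify at this level of detail.

```python
from itertools import combinations


def pairwise_agreement(labels_a, labels_b):
    """Fraction of object pairs on which two clusterings agree (Rand index).

    Two clusterings agree on a pair when both put the two objects in the
    same cluster, or both put them in different clusters. Cluster names
    themselves do not matter, only the grouping.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same objects")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0  # fewer than two objects: nothing to disagree on
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical outputs of two unsupervised methods on the same six objects.
method_1 = [0, 0, 1, 1, 2, 2]
method_2 = ["b", "b", "a", "a", "a", "c"]
score = pairwise_agreement(method_1, method_2)
print(f"agreement: {score:.2f}")
```

A score close to 1 suggests that the two methods corroborate each other, while a low score points to the objects on which their results would need to be reconciled.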
Nevertheless, we learn all the better when what we are studying relates to what we already know and when the objective of the task is known and understood: it is not desirable for data to be interpreted by someone ignorant of the topic. The interpretation process therefore often requires a thematic expert, which is unfortunately very time-consuming. Reducing this cost by directly involving expert knowledge in the process requires modeling and formalizing real-world classes and objects, defining their possible representations in the data space, and finally studying and building mechanisms for extracting and labeling these objects with respect to this knowledge.
Coclico is a research project that aims to study and propose an innovative generic method for the multi-scale analysis of large volumes of spatio-temporal data supplied as a stream of highly variable quality. It implements a multi-strategy approach in which incremental collaboration between different data mining methods is guided by knowledge of both the thematic field (geosciences, geography), formalized in ontologies, and the analysis domain (knowledge of the methods). It guarantees a final quality objective that takes into account the quality of both the data and the knowledge.
Mr. Pierre Gancarski (Laboratoire des Sciences de l'Image de l'Informatique et de la Télédétection) – firstname.lastname@example.org
The author of this summary is the project coordinator, who is responsible for its content. The ANR accepts no responsibility for its content.
LIPN Laboratoire d'Informatique de Paris Nord
AgroParisTech / INRA
ESPACE DEV
LIVE Laboratoire Image, Ville, Environnement
LSIIT Laboratoire des Sciences de l'Image de l'Informatique et de la Télédétection
ANR funding: 1,018,721 euros
Start date and duration of the scientific project: October 2012, 48 months