Blanc SIMI 2 - Computer science and applications

Data Mining and natural language processing cross-fertilization – Hybride

The Hybride research project aims at developing new methods and tools for supporting knowledge discovery from textual data by combining methods from Natural Language Processing (NLP) and Knowledge Discovery in Databases (KDD). A key idea is to design an interacting and convergent process where NLP methods are used for guiding text mining and KDD methods are used for analysing textual documents.

Context, positioning and objectives of the proposal

Hybride is a fundamental research project that aims to develop new methods and tools for supporting knowledge discovery from textual data, with an important application to rare diseases. The key idea of Hybride is to combine Natural Language Processing (NLP) and Knowledge Discovery in Databases (KDD) into an interacting process where NLP methods guide KDD and KDD methods are used for analysing textual documents. Each type of method provides elements to the other for managing large collections of texts and solving problems in text analysis, text annotation, and text mining. For example, NLP methods applied to some texts locate "textual information" that KDD methods can use as constraints for focusing the mining of textual data. Conversely, KDD methods can extract itemsets or sequences that can guide information extraction from texts and text analysis. The combination of NLP and KDD methods for common objectives, e.g. analysing a large collection of textual documents to answer complex questions in information retrieval and text mining, can be considered a "virtuous circle", i.e. a sequence of complex NLP and KDD operations that reinforces itself through a feedback loop. The application to rare diseases is provided by the leading French organisation for knowledge access on rare diseases.
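
As a minimal sketch of how NLP output might act as a constraint on pattern mining, assume that an NLP step has already replaced entity mentions with labels such as GENE or DISEASE. The function name, the labels and the toy sentences below are illustrative assumptions, not part of the Hybride platform, and the patterns are restricted to contiguous token sequences for brevity.

```python
from collections import Counter
from typing import Callable

Pattern = tuple[str, ...]


def constrained_frequent_ngrams(
    sentences: list[list[str]],
    min_support: int,
    max_len: int,
    constraint: Callable[[Pattern], bool],
) -> dict[Pattern, int]:
    """Return contiguous token patterns that are frequent and satisfy the constraint."""
    support = Counter()
    for tokens in sentences:
        # Count each pattern at most once per sentence (support = number of sentences).
        seen = {
            tuple(tokens[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(tokens) - n + 1)
        }
        support.update(seen)
    # The constraint is applied here as a simple post-filter.
    return {p: s for p, s in support.items() if s >= min_support and constraint(p)}


# Toy sentences (invented): entity mentions already replaced by NLP labels.
sentences = [
    ["GENE", "mutation", "causes", "DISEASE"],
    ["GENE", "mutation", "is", "associated", "with", "DISEASE"],
    ["patients", "with", "DISEASE", "show", "SYMPTOM"],
]

# Linguistic constraint (assumed): keep only patterns that mention a gene.
patterns = constrained_frequent_ngrams(
    sentences, min_support=2, max_len=2, constraint=lambda p: "GENE" in p
)
print(patterns)
```

In this sketch the constraint only filters the final result; a constraint-based miner would typically use it to prune the search space during mining.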

An example illustrates and justifies the four main tasks underlying the Hybride project, as well as the cooperation between these tasks for combining NLP and KDD methods. Orphanet experts are in charge of compiling a large set of articles in order to write a synthesis. The different types of information needed must then be identified and extracted.
The role of Task 2 is to introduce into the abstracts the linguistic information necessary for selection, identification, and information extraction. There is a need for named entity recognition (e.g. of a gene), identification of relations...
The texts are mined using the pattern mining methods studied in Task 3. An interaction between Task 2 and Task 3 ensures an incremental enrichment of the extracted information: at each step, new linguistic constraints may be required for guiding and improving sequence mining. Special emphasis will be put on the detection and description of symptoms and their temporal evolution. This is related to "diagnostic wandering", which is a major challenge for rare disease identification.
The patterns mined within Task 3 are interpreted as rules guiding an information extraction tool (e.g. GATE). This step prepares the data for Task 4. The information units extracted from abstracts are encoded within formal contexts to be used by Formal Concept Analysis (FCA). Different types of patterns are used here, such as object-attribute pairs (FCA) and object-relation-object triples.
Abstracts can also be considered as graphs and then processed by subgraph mining or graph pattern structures. The output of Task 4 is a knowledge model based on a concept hierarchy used either for annotating or classifying texts. Finally, the concepts and the knowledge model built in Task 4 can be queried in Task 5, following the needs of experts. The purposes of Task 5 are experimentation and validation in real-world conditions.
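
To make the notion of a formal context more concrete, the following minimal sketch computes the formal concepts of a tiny context in which objects are abstracts and attributes are extracted information units. The abstract identifiers, attribute names and the function `formal_concepts` are invented for illustration, and the naive enumeration shown is only suitable for very small contexts.

```python
def formal_concepts(context: dict[str, set[str]]) -> list[tuple[frozenset, frozenset]]:
    """Enumerate all (extent, intent) pairs of a small formal context."""
    all_attributes = frozenset().union(*context.values())
    # Intents are the full attribute set plus every intersection of object rows.
    intents = {all_attributes}
    for attrs in context.values():
        intents |= {frozenset(attrs) & intent for intent in intents}
    concepts = []
    for intent in intents:
        # The extent gathers every object that has all attributes of the intent.
        extent = frozenset(obj for obj, attrs in context.items() if intent <= attrs)
        concepts.append((extent, intent))
    # Order from the most general concept (largest extent) to the most specific.
    return sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[1])))


# Toy formal context (invented): abstracts described by extracted units.
context = {
    "abs1": {"gene_G", "symptom_S1"},
    "abs2": {"gene_G", "symptom_S2"},
    "abs3": {"symptom_S1"},
}

concepts = formal_concepts(context)
for extent, intent in concepts:
    print(sorted(extent), sorted(intent))
```

Ordering the concepts by extent inclusion yields the concept hierarchy; here the concept with intent {gene_G} groups abs1 and abs2, and sits above the more specific concepts of each abstract.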

Hybride is a fundamental research project, so an important part of the dissemination of its results will take place through scientific communications in high-level conferences and journals. As the project is strongly supported by the rare disease domain, we also plan an impact on the application and publications in medical informatics.
The platform (see Task 5) will be a major result of the project. The Hybride platform should become a shared platform for text mining and medical experimentation. Several open source information extraction platforms exist, such as GATE, UIMA or LinguaStream. Hybride will build on one of these platforms, offering learning algorithms to configure the information extraction process, as well as tools for knowledge extraction and synthesis. Our aim is for the Hybride platform to become an environment for testing and exploiting integrated methods in text analysis, text annotation, and text mining. The platform will be released under an open source licence and made available through web sites, including the Orphanet web portal.

Among the benefits to the academic community, following the cross-fertilization approach, Hybride will reinforce links between the Natural Language Processing community and the Data Mining community. Although isolated cooperations exist, reinforcing these collaborations should benefit both domains. NLP and linguistics will thus benefit from data mining tools into which linguistic constraints can be introduced, making it easier to explore a large corpus of texts. Selecting relevant information in a data mining process is still an open issue. Learning and encoding linguistic information as constraints, and then adapting data mining tools to work under these constraints, is therefore a promising approach to this challenge.
Hybride will also benefit rare disease studies by collecting, from scientific abstracts, elements for filling information files about particular diseases or medical events. Such information is crucial for identifying the symptoms of a disease and thus for diagnosis. Currently, the lack of this information leads to misdiagnosis (the practitioner cannot identify the patient's disease) and delays the identification of the appropriate treatment.

Results could be published in the main conferences on NLP and knowledge discovery, in medical informatics conferences such as AIME or MEDINFO, and in journals.

NLP methods are mainly based on text analysis and the extraction of general and temporal information, while KDD methods are based on pattern mining (e.g. itemsets and sequences), formal concept analysis and its variations, and graph mining. For example, NLP methods applied to some texts locate "textual information" that KDD methods can use as constraints for focusing the mining of textual data.

Conversely, KDD methods can extract itemsets or sequences that can guide information extraction from texts and text analysis. This combination of NLP and KDD methods for common objectives can be viewed as a "virtuous circle", i.e. a sequence of complex NLP and KDD operations that reinforces itself through a feedback loop.

The experimental and validation parts of the Hybride project are provided by an application to the documentation of rare diseases in the context of Orphanet.

The fundamental aspects of the Hybride project can be understood through the main steps of the knowledge discovery loop, seen from an NLP/KDD perspective:
(i) data preparation,
(ii) data mining,
(iii) interpretation and validation of the results,
(iv) knowledge construction.
At each step, new methods have to be designed for achieving this interrelated NLP/KDD loop.
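
The four steps can be sketched as a small bootstrapping loop. This is only a toy illustration under strong assumptions: the function `nlp_kdd_loop`, the use of the preceding word as a "pattern", the `validate` callback standing in for expert validation, and the example sentences are all invented here and are not the methods developed in the project.

```python
from collections import Counter
from typing import Callable


def nlp_kdd_loop(
    sentences: list[list[str]],
    seed_terms: set[str],
    min_support: int = 2,
    validate: Callable[[str], bool] = lambda pattern: True,
    max_rounds: int = 10,
) -> set[str]:
    """Grow a term lexicon by alternating annotation and pattern mining."""
    lexicon = set(seed_terms)
    for _ in range(max_rounds):
        # (i) Data preparation: locate known terms and record the word before each.
        contexts = Counter()
        for tokens in sentences:
            for prev, word in zip(tokens, tokens[1:]):
                if word in lexicon:
                    contexts[prev] += 1
        # (ii) Data mining: keep the contexts that are frequent enough.
        frequent = {c for c, n in contexts.items() if n >= min_support}
        # (iii) Interpretation and validation: an expert (here a callback) filters patterns.
        patterns = {c for c in frequent if validate(c)}
        # (iv) Knowledge construction: validated patterns reveal new candidate terms.
        new_terms = {
            word
            for tokens in sentences
            for prev, word in zip(tokens, tokens[1:])
            if prev in patterns
        } - lexicon
        if not new_terms:
            break  # the loop has converged: nothing new was learned
        lexicon |= new_terms
    return lexicon


# Toy corpus (invented) and seed lexicon of symptom terms.
sentences = [
    ["patient", "presents", "fever"],
    ["patient", "presents", "rash"],
    ["child", "presents", "fatigue"],
    ["fever", "and", "rash", "reported"],
]
lexicon = nlp_kdd_loop(sentences, seed_terms={"fever", "rash"})
print(sorted(lexicon))
```

Here the terms found in one round change what the annotation step recognises in the next round, which is the feedback that the loop relies on.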

The consortium has solid experience in NLP and KDD, but further effort is needed to adapt the classical KDD loop into an actual NLP/KDD loop.

Interaction problems must be solved at each step of the NLP/KDD loop, where interaction means that one process prepares the application of the other.

Finally, a system integrates the operations involved in the whole loop, in the context of Orphanet, for text analysis and the production of new documentation on rare diseases.

The implementation of such a system combines various interrelated aspects, namely natural language processing, knowledge discovery, data mining, and knowledge engineering. This original combination remains a challenge in computer science.

Project coordination

Yannick Toussaint (INRIA - Centre Nancy Grand-Est) – yannick.toussaint@loria.fr

The project coordinator is the author of this summary and is responsible for its content. The ANR declines any responsibility for its contents.

Partners

INSERM INSERM - DELEGATION PARIS VI
MoDYCo CNRS - DELEGATION REGIONALE ILE-DE-FRANCE SECTEUR OUEST ET NORD
GREYC UNIVERSITE DE CAEN - BASSE-NORMANDIE
INRIA NGE INRIA - Centre Nancy Grand-Est

ANR funding: 485,505 euros
Beginning and duration of the scientific project: - 48 Months
