DS0704 -

Graph-based machine learning for linguistic structure prediction – GRASP

Submission summary

Over the last two decades, Natural Language Processing (NLP) has made dramatic advances, due primarily to the intensive use of supervised Machine Learning (ML) methods such as structured output prediction. But NLP now faces a new set of challenges that arise from the constant evolution of the Web and the coming of age of Big Data. The growth of the Web and the continual emergence of new communication media (such as online forums and micro-blogging platforms) have created a much wider diversity of text data in terms of domains, languages, and styles. This diversity is a byproduct of the ever-increasing amount of text data now present on the Web. As this "Big Text data" keeps growing, the need for better retrieval, navigation, and summarization tools becomes all the more urgent, and with it the need for much deeper linguistic analysis. In particular, we need more accurate tools for semantic and pragmatic tasks, such as coreference resolution, temporal ordering prediction, and discourse parsing. Unfortunately, these problems have not yet benefited much from recent advances in Machine Learning, and most systems for these tasks still rely on linguistically motivated heuristics and heavy feature engineering. Even the recent surge of "deep" learning models has so far not brought about much improvement on these tasks.


The main research goal of the GRASP project is to develop a new generation of NLP systems that are better equipped to address these new challenges. The new systems we propose will embody two important shifts. On the one hand, they will rely on ML algorithms that require less annotated data and are able to effectively leverage the large amounts of unlabeled text data that are at our disposal. Accordingly, this project intends to focus on various learning scenarios that include little or no human supervision, such as unsupervised and semi-supervised learning as well as cross-domain and cross-lingual transfer learning, for several NLP tasks. On the other hand, they will rely on more adequate ML formulations for semantic and pragmatic tasks, in particular formulations that are better suited to dealing with very high-dimensional input (and output) structures and that compensate for the lack of wide-coverage static knowledge bases through the use of unlabeled texts.

These two shifts will be addressed within the unifying framework of Graph-based Machine Learning, a recent framework that combines insights from graph theory, linear algebra, and machine learning but that remains largely unused by the NLP community. We identify two main current limitations of this framework in its applicability to NLP problems, which in turn motivate our two main research strands: (i) the integration of graph-based propagation and regularization methods with structured output prediction models, and (ii) the development of graph construction algorithms that take into account the specific learning objective. These two extensions are strongly complementary, as they correspond to the two distinct phases of graph-based learning: label inference and graph construction, respectively. By pursuing these two objectives, we will in effect build bridges between three important areas of machine learning that have so far been considered distinct and rather autonomous subfields, namely: graph-based machine learning, structured output learning, and metric learning.
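To make the two phases named above concrete, here is a minimal sketch of generic graph-based semi-supervised learning, not the methods developed in the project. It assumes NumPy, a k-nearest-neighbour graph with Gaussian edge weights for the graph construction phase, and simple iterative label propagation with clamped seed labels for the label inference phase. The function names (`knn_graph`, `propagate_labels`), the median bandwidth heuristic, and the toy 2-D data are all illustrative assumptions.

```python
import numpy as np


def knn_graph(X, k):
    """Graph construction: symmetric k-NN graph with Gaussian edge weights."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances between all points.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    # Bandwidth set to the median non-zero distance (a heuristic assumption).
    sigma2 = np.median(d2[d2 > 0])
    W = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(d2[i])
        neighbours = order[order != i][:k]  # k closest points, excluding i
        W[i, neighbours] = np.exp(-d2[i, neighbours] / sigma2)
    # Keep an edge if either endpoint selected the other.
    return np.maximum(W, W.T)


def propagate_labels(W, y, n_classes, n_iter=100):
    """Label inference: y[i] is a class index, or -1 for an unlabelled node."""
    seeds = np.flatnonzero(y >= 0)
    F = np.zeros((len(y), n_classes))
    F[seeds, y[seeds]] = 1.0
    clamp = F[seeds].copy()
    # Row-normalised weights: each node averages its neighbours' scores.
    P = W / W.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        F = P @ F
        F[seeds] = clamp  # seed labels stay fixed at every step
    return F.argmax(axis=1)


# Toy data: two well-separated clusters, one labelled seed in each.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
y = np.array([0, -1, -1, 1, -1, -1])

W = knn_graph(X, k=2)
pred = propagate_labels(W, y, n_classes=2)
print(pred)
```

In this sketch the graph is built from raw distances, independently of the prediction task, and each node receives a single atomic label. These are the two limitations that research strands (ii) and (i) respectively set out to address.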

Through these new and important cross-fertilizations between NLP and ML techniques, we anticipate that this project will significantly advance the state of the art in statistical NLP. It will pave the way for the development of systems that provide deeper analysis, perform better, and adapt much more readily to diverse types of input text data, and that are hence more attuned to the needs of today's information society.

Project coordination

Pascal Denis (Inria Lille - Nord Europe)

The author of this summary is the project coordinator, who is responsible for the content of this summary. The ANR declines any responsibility for its contents.

Partner

Inria Lille - Nord Europe

ANR funding: 247,270 euros
Start date and duration of the scientific project: October 2016 - 42 months
