ChairesIA_2019_2 - Chaires de recherche et d'enseignement en Intelligence Artificielle - vague 2 de l'édition 2019

Modeling and Extracting Complex Information from Natural Language Text – NoRDF

NoRDF: Extracting and modeling complex information from text

The NoRDF Project is a scientific project at Télécom Paris that aims to model and extract complex information from natural language text.


We want to enrich knowledge bases with events, causation, conditions, precedence, stories, negation, and beliefs. In particular, we will investigate the expression of sentiment.

We want to extract this type of information at scale from structured and unstructured sources, and we want to allow machines to reason on it. The project brings together research on knowledge representation, on reasoning, and on information extraction, and aims to be useful for applications such as fake news detection, the modeling of controversies, or the analysis of the e-reputation of a company.

To allow a machine to understand natural language text, we use both neural and symbolic methods. We use deep learning to judge whether two sentences contradict each other, logical reasoning to draw conclusions from these contradictions, semantic parsing to represent the meaning of the sentences, and a new formalism to reason on nested sentences.

We have so far produced mainly surveys of the state of the art. Our own techniques are under submission.

We are developing the individual components (reasoning formalism, meaning representation, information extraction), and we hope to be able to assemble them by the end of the project.

We have first produced extensive surveys on the state of the art in all domains that are relevant to the project:
• In “Combining Embeddings and Rules for Fact Prediction” (AIB 2022 tutorial paper), we survey approaches that combine embedding-based methods with symbolic, rule-based methods for predicting facts in knowledge bases.
• In “Reasoning with Transformer-based Models: Deep Learning, but Shallow Reasoning” (AKBC 2021 full paper) we systematically analyze the limits of current BERT-like models when it comes to reasoning.
• In “The Vagueness of Vagueness in Noun Phrases” (AKBC 2021 full paper), we study the types, frequency, and nature of vague noun phrases. We also survey current approaches to deal with such phrases.
• In “Non-named entities - the silent majority” (ESWC 2021 short paper), we do the same analysis for non-named entities.
• In “Extracting Complex Information from Natural Language Text: A Survey” (Semantic Journalism workshop at CIKM 2020) we survey approaches to extract beliefs, hypotheses, etc. from natural language text.
• In “The Need to Move Beyond Triples” (Text2Story workshop at ECIR 2020), we survey approaches to extracting complex information as well as approaches to modeling such information and reasoning on it.
• In “Machine Knowledge: Creation and Curation of Comprehensive Knowledge Bases” (Foundations and Trends in Databases 2021), we survey all major current methods for information extraction in 250 pages.
• A survey on different reasoning methods is in preparation, with 180 surveyed works so far.
• A survey on quality measures of automatically generated stories has been submitted to COLING 2022.
• A survey on semantic parsing approaches has likewise been submitted to COLING 2022.

We have then made first steps into the extraction of complex information and reasoning on that information:
• In “Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost” (ACL 2022), we develop a method to make language models such as BERT robust to misspellings, jargon words, or unknown words.
• A work on disambiguating acronyms in natural language text has been submitted to EMNLP 2022.
• A work on textual inference with negation has likewise been submitted to EMNLP 2022.

Knowledge bases (KBs) have proven indispensable in modern AI applications such as question answering, personal assistants, or fake news identification. However, today's large KBs are limited in their knowledge to simple binary RDF facts between a subject and an object. This leaves out the vast majority of the information that humans know or care about, and cripples the usefulness of these KBs. In this proposal, we endeavor to broaden the current knowledge representation beyond RDF, and to fill it with meaningful, multifaceted facts at large scale. We target negated statements, beliefs, causal relationships, and more generally statements about statements. Applications of this new type of knowledge base include smarter chatbots, a semantic analysis of e-reputation, an automatic understanding of controversies, and the fight against fake news.
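
To illustrate the difference between a binary subject-predicate-object fact and a statement about a statement, here is a minimal Python sketch. It is not the project's formalism: the `Statement` class, the `negated` flag, the helper functions, and all example names are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Statement:
    """A fact whose subject or object may itself be a statement (hypothetical model)."""
    subject: Term
    predicate: str
    obj: Term
    negated: bool = False


# A term is either a plain entity name or a nested statement.
Term = Union[str, Statement]


def depth(term: Term) -> int:
    """Nesting depth: 0 for an entity, 1 for a flat binary fact, more for nested ones."""
    if isinstance(term, str):
        return 0
    return 1 + max(depth(term.subject), depth(term.obj))


def render(term: Term) -> str:
    """Render a term as a parenthesized string for inspection."""
    if isinstance(term, str):
        return term
    negation = "not " if term.negated else ""
    return f"({render(term.subject)} {negation}{term.predicate} {render(term.obj)})"


# A flat binary fact, as a classical triple would express it.
flat = Statement("Bob", "owns", "Car")

# A belief about a negated statement: the object is itself a statement.
belief = Statement("Alice", "believes", Statement("Bob", "owns", "Car", negated=True))

# A causal link between two statements.
cause = Statement(
    Statement("Company", "recalls", "Product"),
    "causes",
    Statement("Customers", "trust", "Company", negated=True),
)

print(render(belief))  # (Alice believes (Bob not owns Car))
```

A flat fact has depth 1, whereas the belief and the causal link have depth 2, which is exactly what a plain binary fact between two entities cannot express directly.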

Project coordination

Fabian Suchanek (Institut Mines-Télécom)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.


LTCI - Télécom Paris Institut Mines-Télécom

ANR funding: 441,720 euros
Beginning and duration of the scientific project: August 2020 - 48 Months
