DS0705 -

Découverte de schémas complexes dans les bases de connaissances – DICOS

Discovery of Complex Schemas for RDF Knowledge Bases

Recent years have seen the rise of large knowledge bases such as DBpedia, YAGO, Freebase, and the Google Knowledge Graph. The advance of the Linked Open Data project, which now contains thousands of knowledge bases, is a case in point. These knowledge bases use RDF and are thus inherently schema-less. We propose to use rule mining to deduce schema constraints automatically from the data.

Mining schemas automatically

Building on recent advances in the field, we propose to enlarge the scope of automated rule mining to numerical and existential rules. The resulting constraints could be used to spot errors in the data or even to predict missing pieces in the knowledge. The particular challenge in the context of knowledge bases is the absence of counterexamples, which requires a new approach to mining rules.

Our central insight is that logical rules provide a general and expressive framework to
mine all of these aspects together. Logical rules take the form
type(x,movie) ∧ stars(x,y) ⇒ type(y,actor)
Such rules typically come with a weight or confidence score. They can be mined efficiently from large
KBs [48, 50]. If the rule language could be extended to numerical attributes, existential rules, and
negated atoms, then we would be able to mine much richer constraints than previously possible with
the approaches in isolation. We could discover, e.g., that people who teach at a university usually have a
doctoral degree, that ISBNs should be unique, that a year together with a title identifies a movie uniquely,
or that race cars are generally faster than standard cars. These rules could also be used to spot and
eliminate erroneous facts. For example, we could automatically detect that “Titanic (movie)” should be
classified as a movie and not as a ship, because it has actors; we could check the taxonomy to make sure
that more general classes (with fewer attributes) include more specific classes (with more attributes); or
we could detect that a birth date must be wrong because it appears after the death date. If we are able to
mine that people usually have no more nationalities than those mentioned in the KB, we could find areas
where the KB is “complete” and where it is not. Learning rules that estimate the completeness of a KB
could open up new ways of reasoning and evaluation, and this topic has not yet been touched by current
research. If such regularities could be found and exploited automatically, that would mean a huge step
forward for the community.
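
To make the notion of a rule with a confidence score concrete, here is a minimal Python sketch that scores the example rule on a toy knowledge base. The triples are invented, and `rule_stats` is a hypothetical helper, not the project's implementation. The standard confidence treats every missing fact as a counterexample. The second score follows the partial completeness assumption (PCA) from the rule-mining literature, which is an assumption brought in here to illustrate one way of coping with the absence of counterexamples: a prediction counts against the rule only if the KB already knows some type for y.

```python
from collections import defaultdict

# Toy knowledge base of (subject, predicate, object) triples.
# All facts are invented for illustration.
KB = {
    ("Titanic", "type", "movie"),
    ("Titanic", "stars", "DiCaprio"),
    ("Titanic", "stars", "Winslet"),
    ("Inception", "type", "movie"),
    ("Inception", "stars", "DiCaprio"),
    ("Inception", "stars", "Page"),
    ("Inception", "stars", "Hardy"),   # Hardy has no type in the KB
    ("DiCaprio", "type", "actor"),
    ("Winslet", "type", "actor"),
    ("Page", "type", "singer"),
}

def rule_stats(kb):
    """Return (support, standard confidence, PCA-style confidence) for
    type(x,movie) ∧ stars(x,y) ⇒ type(y,actor)."""
    types = defaultdict(set)
    for s, p, o in kb:
        if p == "type":
            types[s].add(o)
    # All (x, y) pairs that satisfy the rule body.
    body = [(s, o) for s, p, o in kb if p == "stars" and "movie" in types[s]]
    # Body pairs for which the head fact is also in the KB.
    support = sum(1 for _, y in body if "actor" in types[y])
    # Standard confidence: every missing head fact counts as a counterexample.
    std_conf = support / len(body)
    # PCA-style confidence: only pairs whose y has at least one known type
    # can count as counterexamples.
    known = sum(1 for _, y in body if types[y])
    pca_conf = support / known
    return support, std_conf, pca_conf

print(rule_stats(KB))
```

On this toy data, the untyped entity lowers the standard confidence but not the PCA-style one, which illustrates why the choice of counterexamples matters in an incomplete KB.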

The work progressed along four axes:

Our idea is to spot mistakes in the data of Wikidata, to learn how the contributors of Wikidata corrected these mistakes in the past, and to propose corrections for similar mistakes in the current data. This falls under Work Package 1 of the DICOS project, “Mining Dependencies”.

Dynamic Knowledge Bases
We work on schema mining on dynamic knowledge bases (i.e., knowledge bases that are accessible only through Web services). The idea is to pin down all queries that can be answered given the services. This amounts to a characterization of the part of the knowledge base that is accessible from the outside. This is a special case of Work Package 1 that we decided to treat because it has a clear use case.

Conditional Key Mining
The idea is to mine constraints that identify an entity uniquely in a certain context. For example, a German PhD student can have only a single advisor. Thus, the student uniquely identifies the advisor, but only in Germany. This work falls under Work Package 1 as well. I have worked on this project with two colleagues, one PhD student, and one postdoc.
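
The advisor example can be sketched as a simple check on toy data. This is only an illustration under stated assumptions: the records are invented, and `determines` is a hypothetical helper that tests a given candidate constraint. It does not mine constraints, and it is not the VICKEY algorithm.

```python
# Invented records for illustration only.
RECORDS = [
    {"student": "Anna", "country": "Germany", "advisor": "Prof. Weber"},
    {"student": "Ben", "country": "Germany", "advisor": "Prof. Weber"},
    {"student": "Chloe", "country": "France", "advisor": "Prof. Martin"},
    {"student": "Chloe", "country": "France", "advisor": "Prof. Durand"},
]

def determines(records, condition, determinant, target):
    """Check whether the `determinant` attributes uniquely identify `target`
    among the records that satisfy `condition` (attribute -> required value)."""
    seen = {}
    for r in records:
        # Skip records outside the context given by the condition.
        if any(r.get(k) != v for k, v in condition.items()):
            continue
        key = tuple(r[a] for a in determinant)
        # The first target value seen for a key must match all later ones.
        if seen.setdefault(key, r[target]) != r[target]:
            return False
    return True

# The student identifies the advisor in Germany, but not in general.
print(determines(RECORDS, {"country": "Germany"}, ["student"], "advisor"))
print(determines(RECORDS, {}, ["student"], "advisor"))
```

An empty condition tests the constraint on the whole dataset, so the two calls contrast the conditional constraint with the unconditional one.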

Mining of obligatory attributes
We work on mining obligatory attributes in knowledge bases. The idea is to find out whether an attribute (e.g., “hasNationality” or “isMarried”) applies to all instances of a class in the real world (i.e., whether all people are married in the real world), given only the incomplete knowledge of the knowledge base. This falls under Work Package 2 of the DICOS project, “Mining Rules with Existential Quantifiers”.
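
As a starting point for intuition, the sketch below computes how often an attribute is observed for the instances of a class in a toy knowledge base. The triples and the helper `coverage` are hypothetical. The observed fraction is only a naive signal, not the method of the project: because the KB is incomplete, a low value does not show that the attribute is optional in the real world, which is the difficulty described above.

```python
# Invented triples for illustration only.
KB = {
    ("alice", "type", "person"),
    ("bob", "type", "person"),
    ("carol", "type", "person"),
    ("dave", "type", "person"),
    ("alice", "hasNationality", "France"),
    ("bob", "hasNationality", "Germany"),
    ("carol", "hasNationality", "Italy"),
    ("alice", "isMarried", "bob"),
}

def coverage(kb, cls, attribute):
    """Fraction of the known instances of `cls` that have at least one
    value for `attribute` in the KB, or None if the class is empty."""
    instances = {s for s, p, o in kb if p == "type" and o == cls}
    if not instances:
        return None
    having = {s for s, p, o in kb if p == attribute and s in instances}
    return len(having) / len(instances)

print(coverage(KB, "person", "hasNationality"))
print(coverage(KB, "person", "isMarried"))
```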


Thomas Pellissier Tanon, Camille Bourgaux, Fabian M. Suchanek:
“Learning How to Correct a Knowledge Base from the Edit History”
Full paper at The Web Conference (WWW), 2019

Jonathan Lajus, Fabian M. Suchanek:
“Are All People Married? Determining Obligatory Attributes in Knowledge Bases”
Full paper at The Web Conference (WWW), 2018

Danai Symeonidou, Luis Galárraga, Nathalie Pernelle, Fatiha Saïs, Fabian M. Suchanek:
“VICKEY: Mining Conditional Keys on Knowledge Bases”
Full paper at the International Semantic Web Conference (ISWC), 2017

Fabian M. Suchanek:
“Extraction d’informations”
Book chapter in Les Big Data à découvert, 2017

Recent years have seen a significant increase in the number of large knowledge bases such as DBpedia, YAGO, Freebase, and the Google Knowledge Graph. The success of Linked Open Data, which lists thousands of knowledge bases, shows the scale of this movement. Knowledge bases use RDF to describe their resources and therefore inherently have no associated schema. We propose to use rule mining on the data to deduce schema constraints automatically. Building on recent advances in the field, we propose to extend the scope of rule mining to numerical and existential rules. The resulting constraints could be used to spot errors in the data or even to predict missing pieces in the knowledge bases. The specific challenge in the context of knowledge bases is the absence of counterexamples. New approaches to rule mining must therefore be devised.

Project coordinator

Mr. Fabian Suchanek (Institut Mines-Télécom)

The author of this summary is the project coordinator, who is responsible for its content. The ANR therefore declines any responsibility for its content.


LTCI - TELECOM PARISTECH Institut Mines-Télécom

ANR funding: 250,043 euros
Start and duration of the scientific project: September 2016 - 36 months
