JCJC SIMI 2 - JCJC - SIMI 2 - Science informatique et applications

Learning High-level Representations of large Sparse Tensors – EVEREST

EVEREST

Learning high-level representations of sparse tensors.

Objectives

The general direction of the project is to define new ways of representing multi-relational data, with a focus on knowledge bases: directed graphs whose nodes correspond to concepts and whose edges correspond to relations among them. The main objective of EVEREST is to propose methods to model such data, especially in the large-scale setting (millions of nodes and thousands of relation types), in order to ease their manipulation by completing, visualizing or summarizing them. The connection between these knowledge bases and free text is also studied, with relation extraction as the target task, in order to design new representations that might improve automatic knowledge base construction.

The EVEREST project lies at the intersection of several domains: machine learning, multilinear algebra and data analysis for the modeling, and knowledge management, bioinformatics, the Semantic Web and recommender systems for the applications. Our first works concern the design of new statistical models for relational data. These are mainly based on advances in stochastic optimization, neural networks and matrix factorization.

EVEREST has already produced innovations, in particular a recently proposed method able to model very large-scale relational data (such as the knowledge base Freebase, with up to 1 million concepts and 25k relation types). On these data, this new approach outperforms current state-of-the-art methods in link prediction. This is promising, and further work will be conducted to improve its expressiveness. At the same time, first works on information extraction have also been conducted. A model from the EVEREST group participated in the biomedical event extraction challenge (BioNLP 2013) and ranked 6th out of 12 international competitors. This method has been further improved since the challenge submission and now outperforms the challenge winners.

Future work will follow two lines: (1) improving the proposed approach for modeling multi-relational data in order to get a better encoding of the original data (right now some kinds of relations are better represented than others), and (2) continuing the work on information extraction in order to allow a tighter connection between knowledge bases and free text.

Papers (journals and conference proceedings):
* Irreflexive and Hierarchical Relations as Translations (2013).
Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston and Oksana Yakhnenko. ICML 2013 Workshop on Structured Learning: Inferring Graphs from Structured and Unstructured Inputs (poster), Atlanta, USA, June 2013.
* Biomedical Event Extraction by Multi-class Classification of Pairs of Text Entities (2013).
Xiao Liu, Antoine Bordes and Yves Grandvalet. Proceedings of BioNLP Shared Task 2013 Workshop, ACL publishing, Sofia, Bulgaria, 2013.
* A Semantic Matching Energy Function for Learning with Multi-relational Data (2013).
Antoine Bordes, Xavier Glorot, Jason Weston and Yoshua Bengio. Machine Learning, Springer, DOI: 10.1007/s10994-013-5363-6, May 2013.

Talks:
* Invited talk: Traiter des données relationnelles grâce à l'apprentissage automatique.
Antoine Bordes. Plateforme IA, Lille, July 2013.
* Participation of Alberto Garcia-Duran in the Machine Learning Summer School, Tübingen, August 2013.

Huge amounts of structured and relational data are available in many domains of engineering, industry and research, ranging from the Semantic Web and bioinformatics to recommender systems. As a result, knowledge bases (KBs) such as Freebase, WordNet or GeneOntology have become essential tools for storing, manipulating and accessing information, but they are also incomplete, imprecise and far too large to be used as efficiently and broadly as they could be. Hence, there is a need for methods able to summarize, complete or merge these large databases. This is our main motivation.

KBs are often represented as 3-dimensional tensors, so we will rely on tensor factorization methods to learn compact representations. Unfortunately, most existing tensor factorization approaches are not suitable for this problem because of the specific properties of the data we target: they are large-scale and sparse. By sparse we mean that many entries are unobserved. Hence, we propose to develop novel approaches to tensor factorization based on Deep Learning that will be tailored to this particular problem. Deep Learning concerns the training of deep neural networks: it is an emerging technique in Machine Learning and has demonstrated great capabilities on various tasks in computer vision and natural language processing. Applying it to tensor factorization is original and appealing.

The first phase of the project will consist in developing and evaluating this approach for deriving high-level representations of large sparse tensors. In a second phase, we plan to demonstrate the qualities of these new tensor representations on two concrete problems concerning KBs: link prediction and KB matching. In the latter case we will propose a novel framework explicitly handling uncertainties. We chose these two tasks because they are essential in many applications and could have a large impact. Indeed, link prediction, which is applied to uncover relationships in KBs that probably exist but have not been observed, is crucial for bioinformatics and recommender systems, while matching, which is used to merge heterogeneous KBs developed independently, is essential for the Semantic Web.

The overall objective of the EVEREST project is thus to bring a leap forward in the factorization of large sparse tensors in order to improve the accessibility, completeness and reliability of real-world KBs. This line of research could have a large impact in industry (Semantic Web, biomedical applications, etc.). For that reason, Xerox Research Center Europe is supporting this project and will supply data, provide expertise and ease industrial transfer. This proposal is also consistent with the long-term research direction of its principal partner, Heudiasyc, since it contributes to several aspects of the 10-year LabEx program on “Technological Systems of Systems” started in 2011.
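
As an illustration of the tensor view of a KB described above, the following minimal Python sketch stores a handful of (head, relation, tail) facts as the observed cells of a 3-dimensional tensor in coordinate form. The toy facts, the function names and the (head, relation, tail) index order are assumptions made for this example only; they are not taken from the project's own code or data.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation type, tail entity)


def index_symbols(symbols: List[str]) -> Dict[str, int]:
    """Map each distinct symbol to an integer index (sorted for determinism)."""
    return {s: i for i, s in enumerate(sorted(set(symbols)))}


def build_sparse_tensor(
    triples: List[Triple],
) -> Tuple[Dict[str, int], Dict[str, int], Set[Tuple[int, int, int]]]:
    """Return entity and relation indices plus the set of observed cells.

    Only observed cells are stored; every other cell of the
    (entities x relations x entities) tensor is unobserved.
    """
    entities = index_symbols([h for h, _, _ in triples] + [t for _, _, t in triples])
    relations = index_symbols([r for _, r, _ in triples])
    observed = {(entities[h], relations[r], entities[t]) for h, r, t in triples}
    return entities, relations, observed


def density(entities: Dict[str, int], relations: Dict[str, int],
            observed: Set[Tuple[int, int, int]]) -> float:
    """Fraction of tensor cells that are observed."""
    return len(observed) / (len(entities) ** 2 * len(relations))


# Hypothetical toy knowledge base, for illustration only.
toy_triples: List[Triple] = [
    ("cat", "is_a", "mammal"),
    ("dog", "is_a", "mammal"),
    ("mammal", "is_a", "animal"),
    ("cat", "hunts", "mouse"),
]

entities, relations, observed = build_sparse_tensor(toy_triples)
print(len(entities), len(relations), len(observed))
print(density(entities, relations, observed))
```

Even in this toy case only 4 of the 5 x 2 x 5 = 50 cells are observed. In this representation, link prediction amounts to scoring the unobserved cells, and the number of cells grows quadratically with the number of entities, which is why a dense representation is impractical at the scale of millions of entities.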

Project coordination

Sébastien Destercke (Heuristique et Diagnostic des Systèmes Complexes) – sebastien.destercke@hds.utc.fr

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Partner

Heudiasyc Heuristique et Diagnostic des Systèmes Complexes

ANR funding: 217,015 euros
Beginning and duration of the scientific project: December 2012 - 48 Months
