DS0707 - Human-machine interaction, connected objects, digital content, big data and knowledge

Template acquisition for open event extraction – ASRAEL

Submission summary

The information and communication society has led to the production of huge volumes of content. This content is still mostly unstructured (text, images, videos), and the promise of a "Web of Knowledge" remains far from fulfilled. The situation is evolving with the development of Open Data portals and resources such as DBpedia, which have made it easier to access information stored in databases (economic or demographic statistics, world knowledge contained in Wikipedia infoboxes, etc.). However, most knowledge is still conveyed in textual form. Among the kinds of information that are hard to access in text, information about events is of particular interest, notably in the context of the emergence of data journalism. Data journalism has so far been fed by publicly available statistical data, but has paradoxically made little use of the quintessential journalistic material: events. The ASRAEL project aims to bridge this gap.

Our proposal falls within the general scientific framework of information extraction (IE). We aim to extract events from a large set of textual documents, without prior knowledge about them, and to populate and publish a knowledge base of events. This knowledge base will support a dedicated event search engine.

We define an event in the traditional information extraction sense: a structured representation of something that happens, with a nucleus, a spatio-temporal context and some arguments. An "event type" groups comparable event instances, such as "earthquake", "election" or "car race". Arguments are attribute/value pairs that characterize an event type (for an earthquake: its location, date, magnitude, casualties, etc.). A template is the set of arguments that can describe an event type (earthquake template, election template). The generic representation of an event is based on the rule of the "5 Ws" (What, Who, Where, When, Why) that prevails in the "Anglo-Saxon" way of writing articles. This rule stipulates that a good description of an event must make these five elements explicit.
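As an illustration only, the definitions above can be sketched as a small data structure. The class names, field names and the example values below are hypothetical and are not taken from the project; the sketch simply shows how a template (the set of argument names for an event type) relates to an event instance (a nucleus, a spatio-temporal context and attribute/value arguments).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class EventTemplate:
    """Hypothetical template: the argument names that can describe an event type."""
    event_type: str
    argument_names: Tuple[str, ...]


@dataclass
class Event:
    """Hypothetical event instance: nucleus, spatio-temporal context, arguments."""
    template: EventTemplate
    nucleus: str                      # the "What": the word or phrase naming the event
    where: Optional[str] = None       # spatial context
    when: Optional[str] = None        # temporal context (kept as a plain string here)
    arguments: Dict[str, Any] = field(default_factory=dict)

    def missing_arguments(self) -> List[str]:
        """Template arguments for which this instance has no value."""
        return [n for n in self.template.argument_names if n not in self.arguments]

    def unknown_arguments(self) -> List[str]:
        """Arguments present on the instance but absent from the template."""
        return sorted(set(self.arguments) - set(self.template.argument_names))


# Toy values, invented for the example (not a real event).
earthquake_template = EventTemplate(
    event_type="earthquake",
    argument_names=("location", "date", "magnitude", "casualties"),
)
quake = Event(
    template=earthquake_template,
    nucleus="earthquake",
    where="Turkey",
    when="2000-01-01",
    arguments={"location": "Turkey", "date": "2000-01-01", "magnitude": 8.2},
)
```

Here `quake.missing_arguments()` lists the template slots that were not filled, which is one simple way to see how complete an extracted event is with respect to its template.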

In automatic information extraction, the "Who", "Where" and "When" are extracted with traditional and fairly generic named entity recognition. The "What", on the other hand, is very domain-specific. For this reason, traditional IE systems rely on templates predefined by experts and identify events in texts with either rule-based systems or statistical models. In the general domain, however, the huge number of possible events makes manual definition of these templates impossible; information retrieval ("bag of words") methods take over, but they do not provide a structured answer.

In this project, we aim to tackle the following challenges:
- Automatically discover event templates from very large text corpora, and populate a knowledge base dedicated to events. This requires a mixture of supervised and unsupervised approaches, which becomes necessary as soon as one considers such a generic problem.
- Use this knowledge base to build an event aggregator and a semantic search engine. With this engine, a user (either a journalist or an end user) will be able to query for an event type (e.g. earthquake) and filter on attribute values (location = Turkey, magnitude > 8, etc.). The knowledge base will also be published following linked data principles for others to reuse.
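The kind of query described in the second challenge (an event type plus filters on attribute values) can be sketched as follows. This is a minimal in-memory illustration, not the project's search engine: the function name `search_events`, the dictionary layout of the events and the toy records are all assumptions made for the example.

```python
import operator
from typing import Any, Dict, Iterable, List, Sequence, Tuple

# Comparison operators accepted in a filter (an assumption of this sketch).
OPERATORS = {
    "=": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}

# A filter is (argument name, operator symbol, value), e.g. ("magnitude", ">", 8).
Filter = Tuple[str, str, Any]


def search_events(
    events: Iterable[Dict[str, Any]],
    event_type: str,
    filters: Sequence[Filter] = (),
) -> List[Dict[str, Any]]:
    """Return events of the given type whose arguments satisfy every filter.

    An event lacking a filtered argument is treated as not matching.
    """
    results = []
    for event in events:
        if event.get("type") != event_type:
            continue
        args = event.get("arguments", {})
        if all(
            name in args and OPERATORS[op](args[name], value)
            for name, op, value in filters
        ):
            results.append(event)
    return results


# Toy knowledge base with invented records (not real events).
toy_events = [
    {"id": "toy-1", "type": "earthquake", "arguments": {"location": "Turkey", "magnitude": 8.2}},
    {"id": "toy-2", "type": "earthquake", "arguments": {"location": "Turkey", "magnitude": 6.1}},
    {"id": "toy-3", "type": "earthquake", "arguments": {"location": "Japan", "magnitude": 8.5}},
    {"id": "toy-4", "type": "election", "arguments": {"location": "Turkey"}},
]

matches = search_events(
    toy_events,
    "earthquake",
    [("location", "=", "Turkey"), ("magnitude", ">", 8)],
)
```

A structured query of this form is what distinguishes the approach from "bag of words" retrieval: the filter `magnitude > 8` only makes sense once magnitude has been extracted as a typed argument of the earthquake template.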

Project coordination

Xavier Tannier (Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur)

The project coordinator is the author of this summary and is responsible for its content. The ANR declines any responsibility for its contents.

Partners

AGENCE FRANCE PRESSE
EURECOM
CNRS-LIMSI Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur
CEA LIST Commissariat à l'énergie atomique et aux énergies alternatives

ANR funding: 653,248 euros
Beginning and duration of the scientific project: December 2015 - 42 Months
