CONTINT - Contenus et Interactions 2012

Analyzing Semantics with Frames: Annotation, Lexicon, Discourse and Automation – ASFALDA

Automatic semantic analysis for intelligent access to electronic content

The project will provide resources and tools for automatically retrieving prototypical situations (such as communication, movement, or commercial transactions) and will thus help improve automatic access to the meaning of electronic content, which can give a crucial technical boost to automatic summarization, document classification, and information extraction.

Automatically finding prototypical situations for intelligent access to electronic content

Today, most web searches rely on a very basic representation of electronic content: keywords.
The aim of the ASFALDA project is to improve the semantic analysis of text by providing resources and tools for identifying prototypical semantic frames. The target analysis can be summed up as finding "who did what, when, and why?".
Such an analysis can help meet major challenges in intelligent access to electronic content.

Tools and resources for frame-based semantic analysis: a French FrameNet

The project will build on the FrameNet database, which defines a model for representing prototypical situations (frames) and already contains over 1000 frames, along with annotations on English sentences.
We will use existing machinery to automatically project the English FrameNet onto a French FrameNet, apply automatic pre-annotation based on a syntax-semantics linking lexicon, and manually validate a substantial part of the resulting French FrameNet.

Results

There are no results yet, since the project started 6 months ago.

Prospects

Scientific productions and patents

There are no results yet, since the project started 6 months ago.

Submission summary

The ASFALDA project aims to provide both a French corpus with semantic annotations and an automatic tool for shallow semantic analysis, obtained using semi-supervised machine learning techniques trained on this corpus. The target semantic annotations can be characterized roughly as making explicit "who does what, when, and where", in a way that abstracts away from word order and syntactic variation, and from some of the lexical variation found in natural language.
The project also includes integrating the semantic analyzer into a search engine embedded in a content management tool, and evaluating the impact of semantic indexing on user experience.

The project will help address the major challenge posed by the widespread growth of electronic content, and the resulting need for sophisticated tools:
• that provide access to content in various ways: efficient information retrieval, document summarization, document classification, machine translation, information extraction
• that perform inference over annotated content
To achieve these objectives, we rely on an existing standard for the semantic annotation of predicates and roles (FrameNet), and on an earlier linguistic annotation effort for French (the French Treebank).
The original FrameNet project, which deals with English, provides a structured set of prototypical situations, called "frames", along with a semantic characterization of the participants in these situations (called "roles"). We propose to take advantage of this semantic database, which has proved largely portable across languages, to build a French FrameNet, meaning both a lexicon listing which French lexemes can express which frames, and an annotated corpus in which occurrences of frames and the roles played by participants are made explicit. The addition of semantic annotations to the French Treebank, which already contains morphological and syntactic annotations, will boost its usefulness both for linguistic studies and for machine-learning-based Natural Language Processing applications for French, such as semantic annotation of content, text mining, or information extraction.
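The two components described above, a lexicon mapping lexemes to frames and annotations that make frames and roles explicit, can be pictured with a minimal sketch. It is an illustration only, not the project's actual data format: the class names, the `LEXICON` table, the `frames_for` helper, and the example sentences are all hypothetical, and the frame and role names are used here simply as examples of the kind of labels FrameNet defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """A prototypical situation with its named participant roles."""
    name: str
    roles: tuple[str, ...]


@dataclass
class FrameAnnotation:
    """One occurrence of a frame in a sentence (hypothetical structure)."""
    frame: Frame
    trigger: str                  # lexeme that evokes the frame
    role_fillers: dict[str, str]  # role name -> text span filling it

    def normalized(self):
        # Frame name plus sorted role fillers: independent of the trigger
        # word and of the order in which the roles appeared in the sentence.
        return (self.frame.name, tuple(sorted(self.role_fillers.items())))


# Illustrative frame and lexicon entries (assumed for this sketch).
COMMERCE_BUY = Frame("Commerce_buy", ("Buyer", "Goods", "Seller"))
LEXICON = {"acheter": [COMMERCE_BUY], "achat": [COMMERCE_BUY]}


def frames_for(lexeme):
    """Return the frames a lexeme can express (empty list if unknown)."""
    return LEXICON.get(lexeme, [])


# "Marie a acheté un livre" (active verb)
active = FrameAnnotation(COMMERCE_BUY, "acheter",
                         {"Buyer": "Marie", "Goods": "un livre"})
# "Un livre a été acheté par Marie" (passive verb, different word order)
passive = FrameAnnotation(COMMERCE_BUY, "acheter",
                          {"Goods": "un livre", "Buyer": "Marie"})
# "l'achat d'un livre par Marie" (noun instead of verb)
nominal = FrameAnnotation(COMMERCE_BUY, "achat",
                          {"Goods": "un livre", "Buyer": "Marie"})

same_meaning = active.normalized() == passive.normalized() == nominal.normalized()
```

In this sketch the three surface forms receive the same normalized annotation, which is the sense in which frame-based annotation abstracts away from syntactic variation and from some lexical variation.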
To cope with the intrinsic coverage difficulty of such a project, we adopt a hybrid strategy to obtain both exhaustive annotation for some specific selected concepts (commercial transaction, communication, causality, sentiment and emotion, time), and exhaustive annotation for some highly frequent verbs.

The key scientific aspects of the project are:
• an emphasis on the diversity of ways to express the same frame, including expressions (such as discourse connectors) that cross sentence boundaries,
• an emphasis on semi-supervised techniques for semantic analysis, to generalize over the available annotated data.

The project is ambitious: it could not be achieved by any one partner alone and requires intensive collaboration. The partners offer strong synergy, with expertise in linguistic annotation (LLF, Alpage, IRIT), discourse analysis (IRIT and Alpage), syntactic parsing and machine learning techniques (Alpage, LIF, CEA LIST), and NLP-enhanced search engines (CEA LIST and Ant’inno).

Marie CANDITO (Centre de recherche INRIA Paris - Rocquencourt / EPI Alpage)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

IRIT Institut de Recherche en Informatique de Toulouse
LIF Laboratoire d'Informatique Fondamentale de Marseille
Ant'Inno Société Ant'Inno
LLF Laboratoire de Linguistique Formelle
CEA LIST Commissariat à l'Energie Atomique et aux Energies Alternatives
ALPAGE Centre de recherche INRIA Paris - Rocquencourt / EPI Alpage

ANR funding: 791,706 euros
Start and duration of the scientific project: September 2012 - 36 months
