CE23 - Artificial Intelligence

Distant Supervision for Meeting Minutes with Rhetorical Relations – SUMM-RE

Weak Supervision for Meeting Minutes with Rhetorical Relations

Whereas state-of-the-art approaches to summarization treat discourse as a mere sequence of utterances, we hypothesize that the rich semantic information represented through discourse relations and structure will help to identify the central threads of a conversation and to retrieve important semantic information, such as why a certain decision was made or where there was disagreement about a certain issue.

The general objective of the project is to use the rich semantic information provided by discourse structure to improve algorithms for automatic summarization.

- A central objective of SUMM-RE is to build upon extant work exploiting weak supervision to automatically annotate data sets for discourse structure by extending these methods to spontaneous, conversational speech. Quality discourse annotations generally require linguistic expertise, so fully automatic or crowd-sourced labeling is not a viable alternative, yet manual annotation requires significant effort. The development of weakly supervised approaches for complex NLP tasks would be a game-changer for NLP.
- A second objective is to create a 100-hour audio/video corpus of spoken, multiparty, meeting-like interactions in French that will be valuable for researchers in numerous domains. This objective is motivated not only by the general lack of data sets for NLP tasks in French, but also by the central hypothesis of SUMM-RE, that information encoded in discourse graphs can be exploited to improve automatic summarization. The data set will be designed to study phenomena common to most meetings.
- A third objective of SUMM-RE is to use the annotations for discourse structure and relations generated using weak supervision to improve the automatic production of abstractive topic summaries and meeting minutes. While current approaches to automatic summarization assume that a conversation is merely a linear sequence of utterances, SUMM-RE posits that exploiting information about long-distance attachments will prove crucial for advancing the state of the art for abstractive summarization.
- LINAGORA is currently developing its Conversation Manager (CM): an open-source tool to help users create, in a semi-automatic fashion, detailed summaries of conversations in French or English. For a given conversation, the CM allows users to access, edit and mark up the conversation transcript via the transcript editor. Users can create a summary by importing selected parts of the transcript into the summary helper and, if desired, selecting a template from a proposed list.
While the transcript provided to the transcript editor is automatically produced (and modifiable by the user), there is currently minimal automatic support for the summary helper. A user can draw from automatically proposed keywords/phrases and a short topic summary, but the summaries are often unsatisfying and action items, decisions, etc., must be tagged by the user in the transcript editor and then imported into the summary helper. A further objective of SUMM-RE is thus to incorporate the algorithms for summarization into the CM to improve topic summarization and to allow the CM to automatically identify parts of a transcript (e.g. decisions) relevant for detailed summaries.
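The contrast drawn in the objectives above, between a linear sequence of utterances and a discourse graph with long-distance attachments, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the toy utterances, the relation labels, and the helper names `long_distance` and `thread_to` are hypothetical and do not come from the SUMM-RE corpus or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    source: int  # index of the earlier utterance
    target: int  # index of the utterance attached to it
    label: str   # illustrative relation name, not an official label set


# Toy conversation (hypothetical, in English for readability).
UTTERANCES = [
    "Should we release on Friday?",            # 0
    "The login bug is still open.",            # 1
    "By the way, did anyone book the room?",   # 2
    "Yes, I did.",                             # 3
    "Then let's move the release to Monday.",  # 4
]

# Utterance 4 attaches to utterance 1, skipping the side exchange in 2-3.
RELATIONS = [
    Relation(0, 1, "Comment"),
    Relation(2, 3, "Question-answer pair"),
    Relation(1, 4, "Result"),
]


def long_distance(relations, min_gap=2):
    """Return the attachments that skip over at least one utterance."""
    return [r for r in relations if r.target - r.source >= min_gap]


def thread_to(index, relations):
    """Collect the utterances reachable by following attachments backwards."""
    parents = {}
    for r in relations:
        parents.setdefault(r.target, []).append(r.source)
    seen, stack = set(), [index]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(parents.get(node, []))
    return sorted(seen)


if __name__ == "__main__":
    for i in thread_to(4, RELATIONS):
        print(i, UTTERANCES[i])
```

Following the attachments back from the decision in utterance 4 recovers the question and the reason behind it while leaving out the unrelated exchange about the room, which a purely linear reading of the transcript would not separate.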

Weak supervision/data programming will be used for discourse segmentation (WP1) and relation annotation (WP2); transformer-based architectures will be used for discourse parsing (WP2) and summarization (WP3); and graph neural networks will be used for summarization (WP3).

So far we have 55 hours of recordings of meeting-style conversations in French, and more hours are being added every week. This is the first data set of its kind and should be a highly valuable resource for future research in a variety of domains.

Working on a similar, though smaller, corpus of meetings in French, we have been able to show that using weak supervision/data programming to fine-tune a discourse segmentation model trained on text can be very successful (Gravellier et al. 2021).

Reducing the computational cost of pre-trained language models is of the utmost importance for a variety of reasons, ranging from environmental impact to the accessibility of such models for researchers with limited computational resources. The development of FrugalScore (Kamal Eddine et al. 2022) has shown that such a reduction is possible without significant sacrifices in quality.

The SUMM-RE corpus will serve as a valuable resource for a variety of NLP tasks in French.

No patents have come out of this project at this stage, and none are planned.

It is becoming increasingly realistic to exploit transcriptions of spoken data for tasks that require comprehension of what is said in a conversation. SUMM-RE will combine expertise in theories of discourse interpretation with recent developments in distant supervision to improve the automatic production of meeting summaries and minutes from spoken data.

State of the art approaches to abstractive summarization treat discourse as a mere linear sequence of utterances. SUMM-RE posits that by exploiting information about discourse relations and the rich structures determined by relations between utterances, we can significantly improve models for abstractive summarization. A major hurdle to developing more sophisticated models of discourse structure for spoken, multiparty conversation is a lack of appropriate training data. SUMM-RE will address this problem in two ways. First, it will create a new and unique corpus of meeting-like interactions in French. Second, it will label this corpus and a large corpus of meeting-like interactions in English for discourse structure. The annotation approach will extend recent developments in distant supervision to develop labelling functions that can be used to automatically label large amounts of data. This approach has the very attractive advantage of harnessing linguistic expertise while keeping manual annotation to a minimum.
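The idea of labelling functions mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's method: the three rules, the relation names, and the function `weak_label` are hypothetical, and votes are combined by simple majority, whereas data programming frameworks typically learn how much to trust each labelling function.

```python
from collections import Counter

ABSTAIN = None  # a labelling function abstains when its rule does not apply


def lf_question_answer(first, second):
    # Hypothetical rule: a question followed by a short reply suggests an answer.
    if first.strip().endswith("?") and len(second.split()) <= 8:
        return "Question-answer pair"
    return ABSTAIN


def lf_contrast_marker(first, second):
    # Hypothetical rule: an initial contrastive marker suggests Contrast.
    if second.lower().startswith(("but ", "however")):
        return "Contrast"
    return ABSTAIN


def lf_acknowledgement(first, second):
    # Hypothetical rule: a bare agreement token suggests Acknowledgement.
    if second.lower().strip(" .!") in {"ok", "okay", "right", "yes", "sure"}:
        return "Acknowledgement"
    return ABSTAIN


LABELING_FUNCTIONS = [lf_question_answer, lf_contrast_marker, lf_acknowledgement]


def weak_label(first, second):
    """Combine labelling-function votes by majority; abstain on no vote or a tie."""
    votes = [lf(first, second) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


if __name__ == "__main__":
    print(weak_label("We could ship now.", "But the tests are failing."))
```

Each rule encodes a small piece of linguistic knowledge once and can then be applied to any number of utterance pairs, which is what keeps manual annotation to a minimum; pairs on which the rules disagree or stay silent are simply left unlabelled in this sketch.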

The automatically annotated data will be used to improve algorithms for both short topic summaries and more detailed meeting minutes. These algorithms in turn will be integrated into the lead partner's (LINAGORA's) semi-automatic summarization tool to significantly improve the output for its users. All project results (corpus and algorithms) will be released under an open-source license as a part of LINAGORA's LinTo/Conversation Manager offer.

Project coordination


The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.


LPL Laboratoire Parole et Langage
IRIT Institut de Recherche en Informatique de Toulouse
LIX Laboratoire d'Informatique de l'Ecole Polytechnique

ANR funding: 669,891 euros
Beginning and duration of the scientific project: December 2020 - 42 Months
