CE23 - Artificial Intelligence 2020

artificial text COrpus DEsIgNed Ethically: automatic synthesis of clinical documents – CODEINE

Submission summary

Machine learning methods have become prevalent in language technologies. They rely on annotated corpora to train and evaluate models. The CoDeinE project addresses the lack of shareable corpora in sensitive domains such as health or banking. The key idea of the project is to define methods for paraphrase generation and apply them to confidential corpora, automatically generating synthetic texts that mimic the linguistic properties of real documents while preserving confidentiality. The project addresses important issues in natural language processing and also aims to define confidentiality criteria ensuring that no original confidential information appears in the generated synthetic texts. We will use clinical documents from electronic patient records as a case study. Furthermore, the project will rely on Games With A Purpose and crowdsourcing to validate and annotate the synthesized texts.

Project coordination

Aurélie Névéol (Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Partnership

LORIA Laboratoire lorrain de recherche en informatique et ses applications
LIMSI Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur
LIST Laboratoire d'Intégration des Systèmes et des Technologies
CRC Centre de Recherche des Cordeliers

ANR funding: 558,772 euros
Start date and duration of the scientific project: March 2021 - 48 months
