artificial text COrpus DEsIgNed Ethically: automatic synthesis of clinical documents – CODEINE
Machine learning methods have become prevalent in language technologies. They rely on annotated corpora to train and evaluate models. The CoDeinE project addresses the lack of shareable corpora in sensitive domains such as health or banking. The key idea is to define paraphrase generation methods and apply them to confidential corpora, automatically generating synthetic texts that mimic the linguistic properties of real documents while preserving confidentiality. The project addresses important issues in natural language processing and will also define confidentiality criteria to ensure that no original confidential information appears in the generated synthetic texts. Clinical documents from electronic patient records will serve as a case study. In addition, the project will rely on Games With A Purpose and crowdsourcing to validate and annotate the synthesized texts.
Project coordination
Aurélie Névéol (Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur)
This summary was written by the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.
Partnership
LORIA Laboratoire lorrain de recherche en informatique et ses applications
LIMSI Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur
LIST Laboratoire d'Intégration des Systèmes et des Technologies
CRC Centre de Recherche des Cordeliers
ANR funding: 558,772 euros
Start date and duration of the scientific project:
March 2021 - 48 months