CE23 - Data, Knowledge, Big Data, Multimedia Content, Artificial Intelligence 2018

SEmantic Networks of Data: Utility and Privacy – SENDUP

The amount of data produced by individuals and corporations has increased dramatically over the last decades. This generalized production of data brings opportunities but also new privacy challenges. While open and linked data are growing in importance, the general public expresses growing distrust of personal data exploitation. This leads to an important new societal challenge: how can privacy be preserved while providing useful data?

Objective: Respecting privacy while querying and publishing graphs of semantic data

Nowadays, data are often organized as graphs with underlying semantics to allow efficient querying and support inference engines. This is the case, for example, for linked data and the semantic web, which typically rely on the RDF representation. Yet, while the anonymization of tabular databases and of untyped homogeneous graphs is well researched, anonymity in such semantic databases has received little attention. Their anonymization remains a challenge due to their inherent heterogeneity and semantics. Hence, the SENDUP project focuses on personal data represented as graphs with underlying semantics in general, and in the RDF model in particular. It aims at producing a software suite for:
(1) querying semantic data graphs while preventing illegitimate use of private data
(2) publishing useful semantic data graphs while preserving the privacy of individuals whose personal data are published.

To do so, the SENDUP project will introduce new formalisms and techniques related to the sanitization and update of semantic data graphs.

Approach: Differential privacy and update management through compensation actions in semantic data graphs

SENDUP will first enrich the state of the art on data anonymization by proposing new concepts and techniques that grant formal privacy guarantees in the considered databases. To this end, we will adapt differential privacy approaches to typed graphs with underlying semantics. These adaptations will take into account the heterogeneity of the vertices and the existence of logical and semantic relations within the base. As these techniques degrade the data, appropriate quality metrics will be introduced to validate the usefulness of the sanitized data.
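As a point of reference for the differential privacy approaches mentioned above, the following minimal sketch applies the standard Laplace mechanism to a counting query over RDF-like triples. It is an illustration under stated assumptions, not the project's technique: the function names (`laplace_noise`, `dp_count`), the example triples, and the choice of triple-level neighbouring databases (sensitivity 1) are all hypothetical.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_count(triples, predicate, obj, epsilon, rng=None):
    """Noisy count of triples (?, predicate, obj) using the Laplace mechanism."""
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for (_, p, o) in triples if p == predicate and o == obj)
    # Assumption: neighbouring databases differ by exactly one triple,
    # so the count changes by at most 1 (sensitivity 1).
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical example data.
triples = [
    ("ex:alice", "rdf:type", "ex:Patient"),
    ("ex:bob", "rdf:type", "ex:Patient"),
    ("ex:carol", "rdf:type", "ex:Doctor"),
]
noisy = dp_count(triples, "rdf:type", "ex:Patient", epsilon=0.5)
```

Protecting a whole individual, whose data may span many triples and inferred facts, would require a larger sensitivity; that is precisely the kind of adaptation to typed, semantic graphs the paragraph above describes.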

Anonymizing data requires modifying them. However, databases represented as semantic graphs can be subject to structural or integrity constraints and be associated with inference rules. These constraints must be preserved and these rules taken into account when updating such databases. To support our anonymization techniques, we will therefore define management techniques for updating semantic data graphs. First, we will guarantee that constraints and logical relations remain satisfied when an update is applied. This may lead to situations where an update is not applicable because it would result in a constraint violation. Rejecting the update is, however, unacceptable if it is required by the anonymization process. Second, we will therefore guarantee that any update can be applied, through the generation of compensatory actions. Such actions, called side effects, will take the form of additional updates. For example, the deletion of a fact will be accompanied by the deletion of the facts that would allow it to be re-inferred.
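The deletion example can be sketched as follows. This is a simplified illustration, assuming a single inference rule (subclass propagation of type facts) and one possible compensation strategy (removing the explicit instance facts that entail the target; removing schema facts would be another). The names `closure` and `delete_with_side_effects` are hypothetical.

```python
def closure(facts, subclass_of):
    """Saturate (individual, class) facts under the rule:
    (x, C) and (C subClassOf D) imply (x, D)."""
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        for (x, c) in list(inferred):
            for (sub, sup) in subclass_of:
                if sub == c and (x, sup) not in inferred:
                    inferred.add((x, sup))
                    changed = True
    return inferred

def delete_with_side_effects(facts, subclass_of, target):
    """Delete `target` together with every explicit fact that would
    allow it to be re-inferred. Returns (new_facts, side_effects)."""
    side_effects = {
        f for f in facts
        if f != target and target in closure({f}, subclass_of)
    }
    return facts - {target} - side_effects, side_effects

# Hypothetical example data.
facts = {("alice", "Patient"), ("alice", "Person"), ("bob", "Patient")}
subclass_of = {("Patient", "Person")}
remaining, effects = delete_with_side_effects(facts, subclass_of, ("alice", "Person"))
```

Here the deletion of `("alice", "Person")` triggers the deletion of `("alice", "Patient")`, since the latter would otherwise re-entail the former through the subclass relation.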

Results

At mid-term, the results of the SENDUP project fall along two axes.

Management of RDF/S database updates:
We formalized all atomic updates on RDF/S databases, whether on the instance or on the schema, using graph rewriting rules. We demonstrated that applying any of these rules necessarily preserves the inherent semantics of RDF/S. We proposed algorithms that generate side effects on the instance and the schema to ensure the applicability of any atomic update. These concepts have been implemented in SetUp, a dedicated module of our software suite that can be used independently.
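To give an intuition of an atomic update whose applicability is ensured by a side effect on the schema, here is a toy sketch. It does not reproduce SetUp or its rewriting rules: the `Database` structure, the `insert_typing` function, and the single constraint used (an individual may only be typed by a declared class) are assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Database:
    classes: set = field(default_factory=set)   # schema: declared classes
    typings: set = field(default_factory=set)   # instance: (individual, class)

def insert_typing(db, individual, cls):
    """Insert the fact (individual, cls) and return the side effects
    that were generated to make the insertion applicable."""
    side_effects = []
    if cls not in db.classes:
        # The constraint would be violated: instead of rejecting the update,
        # compensate by declaring the missing class in the schema.
        db.classes.add(cls)
        side_effects.append(("declare_class", cls))
    db.typings.add((individual, cls))
    return side_effects

# Hypothetical usage.
db = Database()
first = insert_typing(db, "alice", "Patient")
second = insert_typing(db, "bob", "Patient")
```

The first insertion yields a schema side effect, while the second needs none because the class is already declared.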

Anonymization and privacy-preserving evaluation of queries:
We have developed a formal semantics of the SPARQL query language that is more uniform than the one proposed in the W3C standard and that allows queries to be formalized for formal validation purposes. We have characterized the trust level and knowledge of the actors in the project's targeted scenarios and extracted four attack models. In general, we consider external analysts observing the published information. In the context of distributed databases, we also consider the absence of a trusted curator, as well as internal actors that are honest, honest but curious, or malicious (lazy or lying). We proposed, implemented, and experimentally validated differentially private solutions for all these attack models as part of a preliminary investigation into the identification of influential nodes in social networks.
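The summary does not detail the algorithms used for influential-node identification. As a generic illustration only, the sketch below ranks nodes by noisy degree under edge-level differential privacy with a trusted curator. It assumes the node set is public and that neighbouring graphs differ by one edge, so the degree vector has L1 sensitivity 2; the function name `noisy_top_k` and the example graph are hypothetical.

```python
import random

def noisy_top_k(nodes, edges, k, epsilon, rng=None):
    """Return the k nodes with the highest Laplace-perturbed degree."""
    if rng is None:
        rng = random.Random()
    degree = {n: 0 for n in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Adding or removing one edge changes two degrees by 1 each,
    # hence an L1 sensitivity of 2 for the whole degree vector.
    scale = 2.0 / epsilon
    noisy = {
        n: d + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        for n, d in degree.items()
    }
    # Ranking the noisy values is post-processing and costs no extra privacy.
    return sorted(noisy, key=noisy.get, reverse=True)[:k]

# Hypothetical example graph.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
top = noisy_top_k(nodes, edges, k=1, epsilon=1.0)
```

The settings without a trusted curator mentioned above would require a different design, for instance local perturbation by each participant, which this sketch does not cover.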

Prospects

SENDUP will provide theoretical and technical solutions for privacy preservation in semantic data graphs. Such graphs are, and will remain in the foreseeable future, a key element of linked data and the semantic web, and as such are very much concerned by privacy issues. While recent technological advances promise exciting societal evolutions, the general public is rightfully worried by their growing intrusiveness and by the exploitation of personal data. Providing appropriate solutions to restore the public's trust in personal data exploitation is a key requisite for implementing these evolutions and for promoting open data movements.

Scientific productions and patents

Publications:

Jacques Chabin, Cédric Eichler, Mirian Halfeld Ferrari, Nicolas Hiot. “Graph Rewriting Rules for RDF Database Evolution Management”. International Conference on Information Integration and Web-based Applications & Services, Nov-Dec 2020, Thailand.

Dominique Duval, Rachid Echahed, Frédéric Prost. “An Algebraic Graph Transformation Approach for RDF and SPARQL”. International Workshop on Graph Computation Models, June 2020, Norway.

Cédric Eichler, Pascal Berthomé, Jacques Chabin, Rachid Echahed, Mirian Halfeld Ferrari, Benjamin Nguyen, Frédéric Prost. “SEmantic Networks of Data: Utility and Privacy”. Atelier sur la Protection de la Vie Privée (APVP'19), July 2019, Cap Hornu, France.

Cédric Eichler, Pascal Berthomé, Jacques Chabin, Rachid Echahed, Mirian Halfeld Ferrari, Benjamin Nguyen, Frédéric Prost. “SEmantic Networks of Data: Utility and Privacy”. RESSI 2019: Rendez-vous de la Recherche et de l'Enseignement de la Sécurité des Systèmes d'Information, May 2019, Erquy, France.

Submission summary

The amount of data produced by individuals and corporations has increased dramatically over the last decades. This generalized gathering of data brings opportunities (e.g., building new knowledge using this "Big Data") but also new privacy challenges. The general public expresses growing distrust of personal data exploitation, which has been met with successively strengthened regulations (e.g., the EU General Data Protection Regulation, GDPR). In the meantime, open data is taking a crucial place within many administrations. The open data policy is a powerful move by public institutions that aims at publishing data collected by public agents. The objective is to manage these data as an asset and make them available, discoverable, and usable by anyone. Both the US and the European Community have foundations to promote this policy. This leads to an important new societal challenge at the crossroads of these social evolutions: how can privacy be preserved while publishing useful data?

Nowadays, data are often organized as graphs with underlying semantics to allow efficient querying and support inference engines. This is the case, for example, for linked data and the semantic web, which typically rely on RDF. The SENDUP project focuses on such databases and pursues two main goals: (1) prevent illegitimate use of private data while querying semantic data graphs and (2) publish useful sensitive semantic data graphs while preserving privacy.

A massive amount of work has focused on privacy in data presented as tables. This work has resulted in multiple well-established models, such as k-anonymity, l-diversity, and differential privacy. More recently, these concepts have been translated and applied to graph representations, but mainly in the context of social networks. These methods usually consider homogeneous nodes with no semantic relation and aim at protecting the graph topology. More often than not, their utility is evaluated experimentally with regard to specific sets of functions and/or graph characteristics (e.g., diameter, maximum degree, and degree distribution). To achieve semantic data graph sanitization, the SENDUP project aims to:

- Introduce knowledge-based and usage-based utility metrics, related to facts present in, or deducible from, the base. Indeed, due to the nature of the targeted graphs, utility metrics and evaluation cannot rely on the preservation of, for example, the diameter of the graph.

- Fully define the side effects of transformations in semantic graph databases and introduce methods and tools to handle them. Indeed, updating instances of semantic data graphs during their sanitization raises many new difficulties, including side effects on the instances but also on their schema and constraints. The sanitization context brings issues that have received little attention in the literature (e.g., updating incomplete databases, triggering schema or constraint evolutions as side effects of instance updates) and even completely new ones (e.g., solving non-deterministic updates as an optimization problem with respect to privacy and utility metrics).

- Introduce new sanitization concepts granting privacy guarantees in semantic graph databases and taking into account vertex heterogeneity and the existence of logical relations and semantic rules between attributes.

- Introduce methods and algorithms for semantic graph database sanitization that integrate the new expanded anonymity concepts, the usage-based and knowledge-based utility metrics, and the side effects of transformations. Efficient techniques should account for side effects during the decision process rather than merely triggering them afterward.

These objectives are to be supported by a suite of software modules, validated in the lab, that implement our proposed metrics and algorithms.
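To illustrate what a knowledge-based utility metric (the first objective above) might look like, the sketch below measures the fraction of facts deducible from the original base that remain deducible after sanitization. This particular ratio, the single subclass inference rule, and the names `saturate` and `knowledge_utility` are assumptions chosen for illustration, not the metrics defined by the project.

```python
def saturate(typings, subclass_of):
    """All (individual, class) facts present in or deducible from the base,
    using the rule: (x, C) and (C subClassOf D) imply (x, D)."""
    known = set(typings)
    frontier = list(known)
    while frontier:
        x, c = frontier.pop()
        for sub, sup in subclass_of:
            if sub == c and (x, sup) not in known:
                known.add((x, sup))
                frontier.append((x, sup))
    return known

def knowledge_utility(original, sanitized, subclass_of):
    """Fraction of originally deducible facts still deducible afterwards."""
    before = saturate(original, subclass_of)
    after = saturate(sanitized, subclass_of)
    return len(before & after) / len(before) if before else 1.0

# Hypothetical example: alice's type is generalized from Patient to Person.
subclass_of = {("Patient", "Person")}
original = {("alice", "Patient"), ("bob", "Patient")}
sanitized = {("alice", "Person"), ("bob", "Patient")}
score = knowledge_utility(original, sanitized, subclass_of)
```

In this example, three of the four originally deducible facts survive the generalization, so the score is 0.75, whereas a purely topological measure such as the graph diameter would not capture this loss of knowledge.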

Cédric EICHLER (EA 4022 LABORATOIRE D'INFORMATIQUE FONDAMENTALE D'ORLÉANS)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

LIG Laboratoire d'Informatique de Grenoble
LIFO EA 4022 LABORATOIRE D'INFORMATIQUE FONDAMENTALE D'ORLÉANS

ANR funding: 218,721 euros
Beginning and duration of the scientific project: - 48 Months
