Utility of Semantic Privacy Notions for Correlated Data

In research on data privacy, differential privacy (DP) is the method of choice to measure privacy leakage while conducting data analysis. A DP algorithm is an algorithm that limits the privacy leakage when its result is released publicly. However, privacy attacks are possible against DP algorithms i...

Descripción completa

Detalles Bibliográficos
Autor: Lange, Martin Otto
Tipo de recurso: tesis de maestría
Fecha de publicación:2024
País:España
Institución:Universitat Politècnica de Catalunya (UPC)
Repositorio:UPCommons. Portal del coneixement obert de la UPC
Idioma:inglés
OAI Identifier:oai:upcommons.upc.edu:2117/422584
Acceso en línea:https://hdl.handle.net/2117/422584
Access Level:acceso abierto
Palabra clave:Correlation (Statistics)
Algorithms
Data sets
Data protection
Private data analysis
correlation
differential privacy
Bayesian differential privacy
privacy-utility trade-off
Correlació (Estadística)
Algoritmes
Conjunts de dades
Protecció de dades
Àrees temàtiques de la UPC::Informàtica::Informàtica teòrica
Descripción
Sumario:In research on data privacy, differential privacy (DP) is the method of choice to measure privacy leakage while conducting data analysis. A DP algorithm is an algorithm that limits the privacy leakage when its result is released publicly. However, privacy attacks are possible against DP algorithms if the underlying data is correlated -- even if the privacy leakage according to DP is low. As a response, researchers have proposed various alternative semantic privacy notions that measure privacy leakage while taking correlation into account. However, increasingly strict privacy measurements call into question whether this allows for sufficient utility for data analysis algorithms. The theoretical limits on their utility remain unclear. Therefore, this thesis investigates the utility of algorithms that limit privacy leakage according to Bayesian differential privacy (BDP), a common extension of DP. We directly prove statements about the utility of BDP algorithms, as well as how much the privacy leakage of a DP algorithm increases when measuring it with BDP. We present novel theoretical results for various correlation models of the data: arbitrary correlation, genomic data, multivariate Gaussian correlation and data following a Markov chain. We find that BDP algorithms cannot provide good utility when the correlation model is completely arbitrary. However, we also show that the increase in measured privacy leakage from DP to BDP is highly dependent on the specific correlation model. These results enable the creation of BDP algorithms with high utility for correlation models where the increase in privacy leakage is relatively small. For example, high utility BDP algorithms on data from a multivariate Gaussian correlation model are possible if the strength of correlation is sufficiently weak -- even if all records in the data are correlated with each other. Additionally, we empirically test the utility of BDP algorithms on real-world data sets and find improvements over the state of the art.