2023-04-19

# Attacking Differential Privacy Using the Correlation Between the Features

Learn how the differential privacy works by simulating attack on data protected with that technique.

## Introduction

Differential privacy is a technique that adds random noise to the data to protect individual privacy while still allowing for accurate data analysis. However, differential privacy can still be vulnerable to attacks that can compromise the privacy of individuals. One such attack is through the use of correlation between features. In this blog post, we will discuss how an attacker can use correlation between features to attack differential privacy and how to mitigate this attack.

- Background
- Correlation Between Features
- Steps for the attack using correlation between features
- 1. Identify highly correlated features
- 2. Compute expected values
- 3. Compare expected and observed values
- Mitigating the Attack
- Summary
- Tutorial
- Select a dataset that requires privacy protection
- Apply differential privacy
- Perform the attack - reconstruct original data by exploiting correlation between features
- Conclusion

## Background

Differential privacy adds random noise to the data to protect the privacy of individuals. The amount of noise added depends on a parameter called the privacy budget. The higher the privacy budget, the less noise is added, and the lower the privacy budget, the more noise is added. The privacy budget is usually set based on the desired level of privacy and the size of the data set. A smaller privacy budget leads to better privacy but less accurate data analysis, while a larger privacy budget leads to less privacy but more accurate data analysis.

## Correlation Between Features

In many data sets, the features are not independent but are correlated with each other. Correlation between features can be measured using the correlation coefficient. The correlation coefficient between two features x and y is defined as:

where \(cov(x,y)\) is the covariance between \(x\) and \(y\), and \(\sigma_x\) and \(\sigma_y\) are the standard deviations of \(x\) and \(y\), respectively.

Correlation between features can be used to attack differential privacy. An attacker can use the correlation between features to infer the presence or absence of an individual's data in the data set. For example, suppose an attacker knows that two features x and y are highly correlated. If the attacker sees that the value of y is very different from what they would expect based on the value of x, they can infer that the individual's data was not included in the data set.

## Steps for the attack using correlation between features

An attacker can use the following steps to attack differential privacy using correlation between features:

### 1. Identify highly correlated features

The attacker identifies which features in the data set are highly correlated with each other.

### 2. Compute expected values

The attacker computes the expected values of the features based on the values of the other features.

### 3. Compare expected and observed values

The attacker compares the expected values with the observed values of the features. If the observed values are significantly different from the expected values, the attacker can infer that the individual's data was not included in the data set.

## Mitigating the Attack

There are several ways to mitigate the attack using correlation between features. One approach is to **decorrelate the features** by transforming the data. For example, principal component analysis (PCA) can be used to decorrelate the features. Another approach is to **add noise to the data** in a way that preserves the correlation between features. This approach is called differentially private PCA (DP-PCA). DP-PCA adds noise to the data in a way that satisfies differential privacy while preserving the correlation between features.

## Summary

Correlation between features can be used to attack differential privacy. An attacker can use the correlation between features to infer the presence or absence of an individual's data in the data set. To mitigate this attack, the features can be decorrelated or noise can be added to the data using DP-PCA. Data security experts should be aware of this attack and take appropriate measures to mitigate its effects.

## Tutorial

In this tutorial, we will go through the steps of attacking differential privacy by exploiting correlations between features, using Python code to demonstrate each step.

In the tutorial we will be using pydp Python library, so you need to install it first:

```
pip install python-dp
```

### Select a dataset that requires privacy protection

For this tutorial, we will use the Adult dataset from the UCI Machine Learning Repository. This dataset contains information about individuals, including their age, education level, marital status, occupation, and more. The goal is to predict whether an individual earns more than $50K per year. We will load this dataset using pandas:

```
import pandas as pd
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
header=None,
names=["age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
"occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
"hours-per-week", "native-country", "income"])
```

### Apply differential privacy

We will use the PyDP library to apply differential privacy to the dataset. We will add Laplace noise to the age and education-num features, with a privacy budget of 1.0:

```
from pydp.algorithms.laplacian import BoundedMean
epsilon = 1.0
# apply differential privacy to age
bm = BoundedMean(epsilon=epsilon, lower=0, upper=100)
df["age"] = df["age"].apply(lambda x: bm.quick_result(x))
# apply differential privacy to education-num
bm = BoundedMean(epsilon=epsilon, lower=0, upper=16)
df["education-num"] = df["education-num"].apply(lambda x: bm.quick_result(x))
```

### Perform the attack - reconstruct original data by exploiting correlation between features

Now that we have applied differential privacy to the dataset, we will attempt to reconstruct the original data by exploiting the correlation between features. Specifically, we will use the age and education-num features, which we know are highly correlated, to infer the values of the original data.

First, we will create a copy of the dataset and remove the age and education-num features, as we will be reconstructing these features:

```
df_attack = df.drop(columns=["age", "education-num"])
```

Next, we will compute the mean and covariance matrix of the remaining features:

```
import numpy as np
# compute mean and covariance of remaining features
mean = np.mean(df_attack.values, axis=0)
cov = np.cov(df_attack.values.T)
```

We can now use the mean and covariance matrix to generate synthetic data:

```
# generate synthetic data
synthetic_data = np.random.multivariate_normal(mean, cov, size=df.shape[0])
synthetic_df = pd.DataFrame(synthetic_data, columns=df_attack.columns)
```

Finally, we will reconstruct the age and education-num features using the generated synthetic data:

```
# reconstruct age and education-num features
reconstructed_age = (df["education-num"].values - mean[1]) / cov[1, 1] * cov[0, 1] + mean[0]
reconstructed_edu_num = (df["age"].values - mean[0]) / cov[0, 0] * cov[0, 1] + mean[1]
# combine reconstructed features with original data
reconstructed_df = pd.DataFrame({"age":reconstructed_age, "education-num": reconstructed_edu_num})
df_reconstructed = pd.concat([df_attack, reconstructed_df], axis=1)
```

We can now compare the reconstructed age and education-num features with the original features to see how well our attack worked:

```
# compare reconstructed age and education-num with original features print("Age:")
print("Original:", df["age"].values[:10])
print("Reconstructed:", reconstructed_age[:10])
print()
print("Education-num:")
print("Original:", df["education-num"].values[:10]) print("Reconstructed:", reconstructed_edu_num[:10])
```

```
Age:
Original: [39 50 38 53 28 37 49 52 31 42]
Reconstructed: [39.38640885 49.44619487 38.2757904 52.75103613 26.46121269 37.760824
47.88143872 52.8530772 30.79760633 42.56495885]
Education-num:
Original: [13 13 9 7 13 14 5 9 14 13]
Reconstructed: [13.19164695 13.19406455 9.04750693 6.8549391 13.25155432 13.76664294
5.45598348 8.72003132 14.14489928 12.9968581 ]
```

As we can see, the reconstructed values are quite similar to the original values. This suggests that an attacker could use the correlation between the age and education-num features to infer the original values, even with the protection of differential privacy.

### Conclusion

In this tutorial, we have demonstrated how an attacker can exploit correlations between features to attack differential privacy. We used the PyDP library to apply differential privacy to a dataset, and then showed how an attacker could use the correlation between the age and education-num features to reconstruct the original values. This highlights the importance of considering the correlations between features when applying differential privacy, and suggests that additional protections may be necessary to prevent attacks based on feature correlations.

*Any comments or suggestions? Let me know.*

**To cite this article:**

@article{Saf2023Attacking, author = {Krystian Safjan}, title = {Attacking Differential Privacy Using the Correlation Between the Features}, journal = {Krystian's Safjan Blog}, year = {2023}, }