Getting Started
This guide walks you through a basic usage example of the privattacks Python package, showing how to load data, define quasi-identifiers, and evaluate both re-identification and attribute inference vulnerabilities.
Data Preparation
We begin by creating a simple synthetic dataset using pandas.
import pandas as pd
import numpy as np
import privattacks
df = pd.DataFrame({
    "age": [20, 30, 30, 30, 30, 55, 55, 55],
    "education": ["Master", "High School", "High School", "PhD", "PhD", "Bachelor", "Bachelor", "Bachelor"],
    "income": ["low", "medium", "low", "medium", "medium", "high", "high", "medium"]
})
display(df)
| age | education | income |
|---|---|---|
| 20 | Master | low |
| 30 | High School | medium |
| 30 | High School | low |
| 30 | PhD | medium |
| 30 | PhD | medium |
| 55 | Bachelor | high |
| 55 | Bachelor | high |
| 55 | Bachelor | medium |
This dataset contains three columns:
age and education are considered quasi-identifiers (QIDs).
income is treated as a sensitive attribute.
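To make the role of the quasi-identifiers concrete, the following sketch groups the eight records by their (age, education) values using only the standard library. It does not use privattacks; the `records` list and the `classes` name are illustrative assumptions that mirror the DataFrame above.

```python
from collections import defaultdict

# Same eight records as the DataFrame above: (age, education, income).
records = [
    (20, "Master", "low"),
    (30, "High School", "medium"),
    (30, "High School", "low"),
    (30, "PhD", "medium"),
    (30, "PhD", "medium"),
    (55, "Bachelor", "high"),
    (55, "Bachelor", "high"),
    (55, "Bachelor", "medium"),
]

# Collect the sensitive values of all records that share the same QID values.
classes = defaultdict(list)
for age, education, income in records:
    classes[(age, education)].append(income)

for qid_values, incomes in classes.items():
    print(qid_values, incomes)
```

Records that share the same QID values cannot be told apart by an attacker who knows only those values, which is why the sizes of these groups matter in the evaluations that follow.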
Defining QIDs and Sensitive Attribute
qids = ["age", "education"]
sensitive = "income" # It's possible to run the attack for a list of sensitive attributes
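The DataFrame is first wrapped in a privattacks Data object, which is then handed to the attack engine. The qids and sensitive variables are passed later, when the vulnerabilities are computed.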
data = privattacks.data.Data(dataframe=df)
attack = privattacks.attacks.Attack(data)
Note
The dataset can also be read directly from a file; see privattacks.data.
Evaluating Prior and Posterior Vulnerabilities
We first calculate the prior and posterior vulnerabilities for:
Re-identification attacks
Attribute inference attacks
prior_reid = attack.prior_vulnerability("reid")
prior_ai = attack.prior_vulnerability("ai", sensitive)
posterior_reid = attack.posterior_vulnerability("reid", qids)
posterior_ai = attack.posterior_vulnerability("ai", qids, sensitive)
print("Re-identification\n"
      f"Prior vulnerability: {prior_reid:.5f}\n"
      f"Posterior vulnerability: {posterior_reid:.5f}")
print(f"\nAttribute inference - {sensitive}\n"
      f"Prior vulnerability: {prior_ai[sensitive]:.5f}\n"
      f"Posterior vulnerability: {posterior_ai[sensitive]:.5f}")

Re-identification
Prior vulnerability: 0.12500
Posterior vulnerability: 0.50000
Attribute inference - income
Prior vulnerability: 0.50000
Posterior vulnerability: 0.75000
This provides an initial assessment of the risk: the prior values correspond to an attacker with no auxiliary information, and the posterior values to an attacker who knows the target's quasi-identifiers.
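The four values above can be reproduced by hand. The sketch below assumes the usual Bayes-vulnerability reading (an attacker who makes a single best guess); it is an independent illustration written with the standard library, not the package's implementation, and all names in it are hypothetical.

```python
from collections import Counter, defaultdict

# Same eight records as the example dataset: (age, education, income).
records = [
    (20, "Master", "low"),
    (30, "High School", "medium"),
    (30, "High School", "low"),
    (30, "PhD", "medium"),
    (30, "PhD", "medium"),
    (55, "Bachelor", "high"),
    (55, "Bachelor", "high"),
    (55, "Bachelor", "medium"),
]
n = len(records)

# Prior: the attacker knows the dataset but nothing about the target.
prior_reid = 1 / n  # guess one record uniformly at random
prior_ai = max(Counter(income for _, _, income in records).values()) / n

# Posterior: the attacker also knows the target's (age, education).
classes = defaultdict(Counter)
for age, education, income in records:
    classes[(age, education)][income] += 1

# Re-identification: one correct guess is expected per group of identical QIDs.
posterior_reid = len(classes) / n
# Attribute inference: guess the most frequent income within each group.
posterior_ai = sum(max(counts.values()) for counts in classes.values()) / n

print(prior_reid, posterior_reid)  # 0.125 0.5
print(prior_ai, posterior_ai)      # 0.5 0.75
```

Under these assumptions, knowing the two QIDs raises the re-identification chance from 1 in 8 to 4 in 8, and the chance of guessing the income correctly from 4 in 8 to 6 in 8.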
Using the Optimized Evaluation Method
For convenience and performance, you can run both attacks in a single call:
posteriors = attack.posterior_vulnerability("all", qids, sensitive)
print("Re-identification\n"
      f"Posterior vulnerability: {posteriors['reid']:.5f}")
print(f"\nAttribute inference - {sensitive}\n"
      f"Posterior vulnerability: {posteriors['ai'][sensitive]:.5f}")
Re-identification
Posterior vulnerability: 0.50000
Attribute inference - income
Posterior vulnerability: 0.75000
Analyzing Individual Vulnerabilities
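You can also inspect the distribution of vulnerabilities per record using the distribution=True flag. With this flag, the method returns the overall vulnerability together with an array holding one vulnerability value per record.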
posterior_reid, hist_reid = attack.posterior_vulnerability("reid", qids, distribution=True)
print("Re-identification - distribution on records\n"
      f"{hist_reid}\nMean of the distribution: {np.mean(hist_reid)}")
posterior_ai, hist_ai = attack.posterior_vulnerability("ai", qids, sensitive, distribution=True)
print("\nAttribute inference - distribution on records\n"
      f"{sensitive}:\n{hist_ai[sensitive]}\nMean of the distribution: {np.mean(hist_ai[sensitive])}")
Re-identification - distribution on records
[1. 0.5 0.5 0.5 0.5 0.33333333
0.33333333 0.33333333]
Mean of the distribution: 0.5
Attribute inference - distribution on records
income:
[1. 0.5 0.5 1. 1. 0.66666667
0.66666667 0.66666667]
Mean of the distribution: 0.75
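The per-record values shown above are consistent with a simple rule: a record's re-identification vulnerability is one over the size of its QID group, and its attribute inference vulnerability is the share of the most frequent income in that group. The sketch below checks this by hand; it is an assumption about how the numbers arise, not the package's code, and the variable names are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean

# Same eight records as the example dataset: (age, education, income).
records = [
    (20, "Master", "low"),
    (30, "High School", "medium"),
    (30, "High School", "low"),
    (30, "PhD", "medium"),
    (30, "PhD", "medium"),
    (55, "Bachelor", "high"),
    (55, "Bachelor", "high"),
    (55, "Bachelor", "medium"),
]

# Count income values within each (age, education) group.
classes = defaultdict(Counter)
for age, education, income in records:
    classes[(age, education)][income] += 1

per_record_reid = []
per_record_ai = []
for age, education, _ in records:
    counts = classes[(age, education)]
    size = sum(counts.values())
    per_record_reid.append(1 / size)                   # uniform guess within the group
    per_record_ai.append(max(counts.values()) / size)  # best guess of the income

print([round(v, 3) for v in per_record_reid], round(mean(per_record_reid), 3))
print([round(v, 3) for v in per_record_ai], round(mean(per_record_ai), 3))
```

Averaging the per-record values recovers the overall posterior vulnerabilities, which matches the means printed in the output above.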
Optimized Method with Distributions
The optimized method also supports distributions:
posteriors = attack.posterior_vulnerability("all", qids, sensitive, distribution=True)
posterior_reid, hist_reid = posteriors["reid"]
posterior_ai, hist_ai = posteriors["ai"]
print("Re-identification histogram\n"
      f"{hist_reid}")
print("\nAttribute inference histogram\n"
      f"{sensitive}:\n"
      f"{hist_ai[sensitive]}")
Re-identification histogram
[1. 0.5 0.5 0.5 0.5 0.33333333
0.33333333 0.33333333]
Attribute inference histogram
income:
[1. 0.5 0.5 1. 1. 0.66666667
0.66666667 0.66666667]
Evaluating Multiple Combinations of QIDs
You can evaluate the vulnerabilities for all combinations of the QIDs (e.g., single attributes, pairs, etc.):
combinations = list(range(1, len(qids)+1)) # Sizes 1 to len(qids)
results_reid = attack.posterior_vulnerability(
    atk="reid",
    qids=qids,
    combinations=combinations,
    n_processes=2
)
display(results_reid)
|   | n_qids | qids | posterior_reid |
|---|---|---|---|
| 0 | 1 | age | 0.375000000 |
| 1 | 1 | education | 0.500000000 |
| 2 | 2 | age,education | 0.500000000 |
results_ai = attack.posterior_vulnerability(
    atk="ai",
    qids=qids,
    sensitive=sensitive,
    combinations=combinations,
    distribution=True,
    n_processes=2
)
display(results_ai)
|   | n_qids | qids | posterior_income | posterior_income_record |
|---|---|---|---|---|
| 0 | 1 | age | 0.750000000 | [1.00000000, 0.50000000, 0.50000000, 1.0000000… |
| 1 | 1 | education | 0.750000000 | [1.00000000, 0.50000000, 0.50000000, 1.0000000… |
| 2 | 2 | age,education | 0.750000000 | [1.00000000, 0.50000000, 0.50000000, 1.0000000… |
You can run both types of attack simultaneously for all combinations:
results = attack.posterior_vulnerability(
    atk="all",
    qids=qids,
    sensitive=sensitive,
    combinations=combinations,
    n_processes=2
)
display(results)
|   | n_qids | qids | posterior_reid | posterior_income |
|---|---|---|---|---|
| 0 | 1 | age | 0.375000000 | 0.750000000 |
| 1 | 1 | education | 0.500000000 | 0.750000000 |
| 2 | 2 | age,education | 0.500000000 | 0.750000000 |
This approach provides a comprehensive evaluation of how different combinations of quasi-identifiers affect vulnerability.
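As a cross-check of the re-identification column, the sketch below enumerates the QID subsets with itertools.combinations and counts the distinct value combinations for each subset. It is a hand-rolled illustration under the same single-best-guess assumption, independent of privattacks; the helper `posterior_reid_for` and the `columns` tuple are hypothetical names.

```python
from itertools import combinations

columns = ("age", "education", "income")
# Same eight records as the example dataset, in column order.
records = [
    (20, "Master", "low"),
    (30, "High School", "medium"),
    (30, "High School", "low"),
    (30, "PhD", "medium"),
    (30, "PhD", "medium"),
    (55, "Bachelor", "high"),
    (55, "Bachelor", "high"),
    (55, "Bachelor", "medium"),
]
qids = ["age", "education"]


def posterior_reid_for(rows, qid_subset):
    """Number of distinct QID value combinations divided by the number of rows."""
    idx = [columns.index(q) for q in qid_subset]
    distinct = {tuple(row[i] for i in idx) for row in rows}
    return len(distinct) / len(rows)


# Evaluate every subset of size 1 up to len(qids).
results = {}
for size in range(1, len(qids) + 1):
    for subset in combinations(qids, size):
        results[subset] = posterior_reid_for(records, subset)

for subset, value in results.items():
    print(",".join(subset), value)
```

In this dataset, adding age to education does not increase the re-identification vulnerability, because each education level already occurs with a single age.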