Getting Started

This guide walks you through a basic usage example of the privattacks Python package, showing how to load data, define quasi-identifiers, and evaluate both re-identification and attribute inference vulnerabilities.

Data Preparation

We begin by creating a simple synthetic dataset using pandas.

import pandas as pd
import numpy as np
import privattacks

df = pd.DataFrame({
    "age":[20,30,30,30,30,55,55,55],
    "education":["Master", "High School", "High School", "PhD", "PhD", "Bachelor", "Bachelor", "Bachelor"],
    "income":["low", "medium", "low", "medium", "medium", "high", "high", "medium"]
})
display(df)

age	education	income
20	Master	low
30	High School	medium
30	High School	low
30	PhD	medium
30	PhD	medium
55	Bachelor	high
55	Bachelor	high
55	Bachelor	medium

This dataset contains three columns:

age and education are considered quasi-identifiers (QIDs).
income is treated as a sensitive attribute.
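To make the role of the quasi-identifiers concrete, the sketch below groups the same eight records by their (age, education) pair in plain Python. It does not call privattacks; the names `records` and `classes` are illustrative only.

```python
from collections import defaultdict

# Same eight records as the DataFrame above, as (age, education, income) tuples.
records = [
    (20, "Master", "low"),
    (30, "High School", "medium"),
    (30, "High School", "low"),
    (30, "PhD", "medium"),
    (30, "PhD", "medium"),
    (55, "Bachelor", "high"),
    (55, "Bachelor", "high"),
    (55, "Bachelor", "medium"),
]

# Records that share the same QID values are indistinguishable to an
# adversary who only knows age and education.
classes = defaultdict(list)
for age, education, income in records:
    classes[(age, education)].append(income)

for key, incomes in classes.items():
    print(key, len(incomes), incomes)
```

The dataset splits into four groups of sizes 1, 2, 2 and 3. The single (20, Master) record is unique on its QIDs, which is why it turns out to be the most exposed record in the evaluations that follow.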

Defining QIDs and Sensitive Attribute

qids = ["age", "education"]
sensitive = "income" # It's possible to run the attack for a list of sensitive attributes

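The dataframe is wrapped in a privattacks Data object, which is then used to build the attack object. The qids and sensitive variables are passed later, when calling the vulnerability methods.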

data = privattacks.data.Data(dataframe=df)
attack = privattacks.attacks.Attack(data)

Note

The dataset can also be read directly from a file; see privattacks.data.

Evaluating Prior and Posterior Vulnerabilities

We first calculate the prior and posterior vulnerabilities for:

Re-identification attacks
Attribute inference attacks

prior_reid = attack.prior_vulnerability("reid")
prior_ai = attack.prior_vulnerability("ai", sensitive)
posterior_reid = attack.posterior_vulnerability("reid", qids)
posterior_ai = attack.posterior_vulnerability("ai", qids, sensitive)

print(f"Re-identification\n"+\
      f"Prior vulnerability: {prior_reid:.5f}\n"+\
      f"Posterior vulnerability: {posterior_reid:.5f}")

print(f"\nAttribute inference - {sensitive}\n"+\
      f"Prior vulnerability: {prior_ai[sensitive]:.5f}\n"+\
      f"Posterior vulnerability: {posterior_ai[sensitive]:.5f}")

Re-identification
Prior vulnerability: 0.12500
Posterior vulnerability: 0.50000

Attribute inference - income
Prior vulnerability: 0.50000
Posterior vulnerability: 0.75000

This gives an initial assessment of the risk: the prior vulnerability reflects an adversary with no auxiliary information, and the posterior vulnerability reflects an adversary who knows the quasi-identifiers.
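The four values reported in this section can be cross-checked by hand. The following is a minimal sketch in plain Python, assuming the usual expected-success definitions: a uniform guess over records for the re-identification prior, one correct guess per QID group for the posterior, and a majority-value guess for attribute inference. It is not the privattacks implementation, only an independent calculation that reproduces the reported values on this dataset.

```python
from collections import Counter, defaultdict

# Same eight records as the DataFrame above, as (age, education, income) tuples.
records = [
    (20, "Master", "low"),
    (30, "High School", "medium"),
    (30, "High School", "low"),
    (30, "PhD", "medium"),
    (30, "PhD", "medium"),
    (55, "Bachelor", "high"),
    (55, "Bachelor", "high"),
    (55, "Bachelor", "medium"),
]
n = len(records)
qid_keys = [(age, edu) for age, edu, _ in records]
incomes = [income for _, _, income in records]

# Re-identification (assumed definition): with no information the adversary
# guesses one of n records; knowing the QIDs, one guess per group succeeds.
prior_reid = 1 / n
posterior_reid = len(set(qid_keys)) / n

# Attribute inference (assumed definition): guess the most common income
# overall (prior) or the most common income within each QID group (posterior).
prior_ai = Counter(incomes).most_common(1)[0][1] / n
by_class = defaultdict(Counter)
for key, income in zip(qid_keys, incomes):
    by_class[key][income] += 1
posterior_ai = sum(max(c.values()) for c in by_class.values()) / n

print(f"reid: prior={prior_reid:.5f}, posterior={posterior_reid:.5f}")
print(f"ai:   prior={prior_ai:.5f}, posterior={posterior_ai:.5f}")
```

On this dataset the sketch yields 0.125 and 0.5 for re-identification and 0.5 and 0.75 for attribute inference, in agreement with the library output.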

Using the Optimized Evaluation Method

For convenience and performance, you can run both attacks in a single call:

posteriors = attack.posterior_vulnerability("all", qids, sensitive)

print(f"Re-identification\n"+\
      f"Posterior vulnerability: {posteriors['reid']:.5f}")

print(f"\nAttribute inference - {sensitive}\n"+\
      f"Posterior vulnerability: {posteriors['ai'][sensitive]:.5f}")

Re-identification
Posterior vulnerability: 0.50000

Attribute inference - income
Posterior vulnerability: 0.75000

Analyzing Individual Vulnerabilities

You can also inspect the distribution of vulnerabilities per record using the distribution=True flag.

posterior_reid, hist_reid = attack.posterior_vulnerability("reid", qids, distribution=True)
print(f"Re-identification - distribution on records\n"+\
      f"{hist_reid}\nMean of the distribution: {np.mean(hist_reid)}")

posterior_ai, hist_ai = attack.posterior_vulnerability("ai", qids, sensitive, distribution=True)
print("\nAttribute inference - distribution on records\n"+\
      f"{sensitive}:\n{hist_ai[sensitive]}\nMean of the distribution: {np.mean(hist_ai[sensitive])}")

Re-identification - distribution on records
[1.         0.5        0.5        0.5        0.5        0.33333333
0.33333333 0.33333333]
Mean of the distribution: 0.5

Attribute inference - distribution on records
income:
[1.         0.5        0.5        1.         1.         0.66666667
0.66666667 0.66666667]
Mean of the distribution: 0.75
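The per-record values can also be derived by hand. The sketch below is plain Python under the assumption that a record's re-identification vulnerability is one over the size of its QID group, and that its attribute inference vulnerability is the share of the most common income within that group. The helper names are illustrative, and this is not how privattacks computes the values internally.

```python
from collections import Counter, defaultdict

# Same eight records as the DataFrame above, as (age, education, income) tuples.
records = [
    (20, "Master", "low"),
    (30, "High School", "medium"),
    (30, "High School", "low"),
    (30, "PhD", "medium"),
    (30, "PhD", "medium"),
    (55, "Bachelor", "high"),
    (55, "Bachelor", "high"),
    (55, "Bachelor", "medium"),
]
qid_keys = [(age, edu) for age, edu, _ in records]

# Size of each QID group, and income counts inside each group.
class_size = Counter(qid_keys)
income_counts = defaultdict(Counter)
for (age, edu, income) in records:
    income_counts[(age, edu)][income] += 1

# Assumed per-record definitions (see the lead-in above).
reid_per_record = [1 / class_size[k] for k in qid_keys]
ai_per_record = [max(income_counts[k].values()) / class_size[k] for k in qid_keys]

print([round(v, 4) for v in reid_per_record])
print([round(v, 4) for v in ai_per_record])
print(sum(reid_per_record) / len(records), sum(ai_per_record) / len(records))
```

Averaging each list gives back the posterior vulnerabilities (0.5 and 0.75, up to floating-point rounding), which is the relationship the "Mean of the distribution" lines illustrate.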

Optimized Method with Distributions

The optimized method also supports distributions:

posteriors = attack.posterior_vulnerability("all", qids, sensitive, distribution=True)
posterior_reid, hist_reid = posteriors["reid"]
posterior_ai, hist_ai = posteriors["ai"]

print("Re-identification histogram\n"+\
      f"{hist_reid}")

print("\nAttribute inference histogram\n"+\
      f"{sensitive}:\n"+\
      f"{hist_ai[sensitive]}")

Re-identification histogram
[1.         0.5        0.5        0.5        0.5        0.33333333
0.33333333 0.33333333]

Attribute inference histogram
income:
[1.         0.5        0.5        1.         1.         0.66666667
0.66666667 0.66666667]

Evaluating Multiple Combinations of QIDs

You can evaluate the vulnerabilities for all combinations of the QIDs (e.g., single attributes, pairs, etc.):

combinations = list(range(1, len(qids)+1))  # Sizes 1 to len(qids)

results_reid = attack.posterior_vulnerability(
    atk="reid",
    qids=qids,
    combinations=combinations,
    n_processes=2
)
display(results_reid)

	n_qids	qids	posterior_reid
0	1	age	0.375000000
1	1	education	0.500000000
2	2	age,education	0.500000000

results_ai = attack.posterior_vulnerability(
    atk="ai",
    qids=qids,
    sensitive=sensitive,
    combinations=combinations,
    distribution=True,
    n_processes=2
)
display(results_ai)

	n_qids	qids	posterior_income	posterior_income_record
0	1	age	0.750000000	[1.00000000, 0.50000000, 0.50000000, 1.0000000…
1	1	education	0.750000000	[1.00000000, 0.50000000, 0.50000000, 1.0000000…
2	2	age,education	0.750000000	[1.00000000, 0.50000000, 0.50000000, 1.0000000…

You can run both types of attack simultaneously for all combinations:

results = attack.posterior_vulnerability(
    atk="all",
    qids=qids,
    sensitive=sensitive,
    combinations=combinations,
    n_processes=2
)
display(results)

	n_qids	qids	posterior_reid	posterior_income
0	1	age	0.375000000	0.750000000
1	1	education	0.500000000	0.750000000
2	2	age,education	0.500000000	0.750000000

This approach provides a comprehensive evaluation of how different combinations of quasi-identifiers affect vulnerability.
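As a cross-check on the posterior_reid and posterior_income columns, the sketch below enumerates the same QID subsets with itertools.combinations and recomputes both posteriors in plain Python. It assumes the expected-success definitions (number of QID groups over n for re-identification, per-group majority counts over n for attribute inference); the function `posteriors_for` and the `results` dictionary are hypothetical helpers, not part of privattacks.

```python
from collections import Counter, defaultdict
from itertools import combinations

columns = ("age", "education", "income")
records = [
    (20, "Master", "low"),
    (30, "High School", "medium"),
    (30, "High School", "low"),
    (30, "PhD", "medium"),
    (30, "PhD", "medium"),
    (55, "Bachelor", "high"),
    (55, "Bachelor", "high"),
    (55, "Bachelor", "medium"),
]
qids = ["age", "education"]
sensitive = "income"


def posteriors_for(subset):
    """Return (re-identification, attribute inference) posteriors for one QID subset."""
    idx = [columns.index(name) for name in subset]
    s_idx = columns.index(sensitive)
    by_class = defaultdict(Counter)
    for rec in records:
        key = tuple(rec[i] for i in idx)
        by_class[key][rec[s_idx]] += 1
    n = len(records)
    reid = len(by_class) / n                                  # one correct guess per group
    ai = sum(max(c.values()) for c in by_class.values()) / n  # majority guess per group
    return reid, ai


# Subsets of size 1 up to len(qids), mirroring the combinations argument.
results = {}
for size in range(1, len(qids) + 1):
    for subset in combinations(qids, size):
        results[",".join(subset)] = posteriors_for(subset)

for name, (reid, ai) in results.items():
    print(f"{name}: reid={reid:.3f}, ai={ai:.3f}")
```

For this dataset the sketch gives 0.375 / 0.75 for age, 0.5 / 0.75 for education, and 0.5 / 0.75 for the pair. Adding age to education does not change either value here because every education level already determines a single age.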