Getting Started

This guide walks you through a basic usage example of the privattacks Python package, showing how to load data, define quasi-identifiers, and evaluate both re-identification and attribute inference vulnerabilities.

Data Preparation

We begin by creating a simple synthetic dataset using pandas.

import pandas as pd
import numpy as np
import privattacks

df = pd.DataFrame({
    "age":[20,30,30,30,30,55,55,55],
    "education":["Master", "High School", "High School", "PhD", "PhD", "Bachelor", "Bachelor", "Bachelor"],
    "income":["low", "medium", "low", "medium", "medium", "high", "high", "medium"]
})
display(df)

age  education    income
20   Master       low
30   High School  medium
30   High School  low
30   PhD          medium
30   PhD          medium
55   Bachelor     high
55   Bachelor     high
55   Bachelor     medium

This dataset contains three columns:

  • age and education are considered quasi-identifiers (QIDs).

  • income is treated as a sensitive attribute.
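To see why age and education act as quasi-identifiers, it helps to group the records by those two columns and look at which income values fall into each group. The sketch below uses only the standard library and is independent of privattacks; the names records and groups are chosen here for illustration.

```python
from collections import defaultdict

# Same eight records as the DataFrame above: (age, education, income).
records = [
    (20, "Master", "low"),
    (30, "High School", "medium"),
    (30, "High School", "low"),
    (30, "PhD", "medium"),
    (30, "PhD", "medium"),
    (55, "Bachelor", "high"),
    (55, "Bachelor", "high"),
    (55, "Bachelor", "medium"),
]

# Collect the sensitive values observed for each QID tuple (age, education).
groups = defaultdict(list)
for age, education, income in records:
    groups[(age, education)].append(income)

for key, incomes in groups.items():
    print(key, incomes)
```

The single record in the (20, "Master") group is unique on its QIDs, while the three (55, "Bachelor") records are indistinguishable from each other on age and education alone.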

Defining QIDs and Sensitive Attribute

qids = ["age", "education"]
sensitive = "income" # It's possible to run the attack for a list of sensitive attributes

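The DataFrame is wrapped in a privattacks Data object, which is then handed to the attack engine. The qids and sensitive variables are supplied later, when calling the vulnerability methods.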

data = privattacks.data.Data(dataframe=df)
attack = privattacks.attacks.Attack(data)

Note

The dataset can also be read directly from a file; see privattacks.data.

Evaluating Prior and Posterior Vulnerabilities

We first calculate the prior and posterior vulnerabilities for:

  • Re-identification attacks

  • Attribute inference attacks

prior_reid = attack.prior_vulnerability("reid")
prior_ai = attack.prior_vulnerability("ai", sensitive)
posterior_reid = attack.posterior_vulnerability("reid", qids)
posterior_ai = attack.posterior_vulnerability("ai", qids, sensitive)

print(f"Re-identification\n"+\
      f"Prior vulnerability: {prior_reid:.5f}\n"+\
      f"Posterior vulnerability: {posterior_reid:.5f}")

print(f"\nAttribute inference - {sensitive}\n"+\
      f"Prior vulnerability: {prior_ai[sensitive]:.5f}\n"+\
      f"Posterior vulnerability: {posterior_ai[sensitive]:.5f}")

Re-identification
Prior vulnerability: 0.12500
Posterior vulnerability: 0.50000

Attribute inference - income
Prior vulnerability: 0.50000
Posterior vulnerability: 0.75000

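These values give an initial assessment of the risk. The prior vulnerability reflects an attacker with no auxiliary information, and the posterior vulnerability reflects an attacker who knows the quasi-identifiers. Here, knowing age and education raises the re-identification vulnerability from 0.125 to 0.5 and the attribute inference vulnerability from 0.5 to 0.75.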
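The four numbers above can be reproduced by hand. The following is a minimal sketch that assumes the usual Bayes-vulnerability reading of these metrics (the attacker makes one best guess); it is not taken from the privattacks source and may differ from how the library computes them internally, although it yields the same values on this dataset.

```python
from collections import Counter, defaultdict

# Same eight records as the DataFrame above: (age, education, income).
records = [
    (20, "Master", "low"),
    (30, "High School", "medium"),
    (30, "High School", "low"),
    (30, "PhD", "medium"),
    (30, "PhD", "medium"),
    (55, "Bachelor", "high"),
    (55, "Bachelor", "high"),
    (55, "Bachelor", "medium"),
]
n = len(records)

# Prior (no QIDs known): guess a record uniformly at random, or guess the
# most frequent income value in the whole dataset.
prior_reid = 1 / n
prior_ai = max(Counter(income for _, _, income in records).values()) / n

# Posterior (QIDs known): partition the records by (age, education).
groups = defaultdict(Counter)
for age, education, income in records:
    groups[(age, education)][income] += 1

# Re-identification: one correct guess per group, so (number of groups) / n.
posterior_reid = len(groups) / n
# Attribute inference: guess the most frequent income inside each group.
posterior_ai = sum(max(counts.values()) for counts in groups.values()) / n

print(prior_reid, posterior_reid)  # 0.125 0.5
print(prior_ai, posterior_ai)      # 0.5 0.75
```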

Using the Optimized Evaluation Method

For convenience and performance, you can run both attacks in a single call:

posteriors = attack.posterior_vulnerability("all", qids, sensitive)

print(f"Re-identification\n"+\
      f"Posterior vulnerability: {posteriors['reid']:.5f}")

print(f"\nAttribute inference - {sensitive}\n"+\
      f"Posterior vulnerability: {posteriors['ai'][sensitive]:.5f}")

Re-identification
Posterior vulnerability: 0.50000

Attribute inference - income
Posterior vulnerability: 0.75000

Analyzing Individual Vulnerabilities

You can also inspect the distribution of vulnerabilities per record using the distribution=True flag.

posterior_reid, hist_reid = attack.posterior_vulnerability("reid", qids, distribution=True)
print(f"Re-identification - distribution on records\n"+\
      f"{hist_reid}\nMean of the distribution: {np.mean(hist_reid)}")

posterior_ai, hist_ai = attack.posterior_vulnerability("ai", qids, sensitive, distribution=True)
print("\nAttribute inference - distribution on records\n"+\
      f"{sensitive}:\n{hist_ai[sensitive]}\nMean of the distribution: {np.mean(hist_ai[sensitive])}")

Re-identification - distribution on records
[1.         0.5        0.5        0.5        0.5        0.33333333
0.33333333 0.33333333]
Mean of the distribution: 0.5

Attribute inference - distribution on records
income:
[1.         0.5        0.5        1.         1.         0.66666667
0.66666667 0.66666667]
Mean of the distribution: 0.75
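The per-record values printed above are consistent with a simple group-based rule: a record's re-identification vulnerability is 1 divided by the size of its QID group, and its attribute inference vulnerability is the share of the most frequent income value in that group. The sketch below is an illustration under that assumption, written with the standard library only; it is not the library's implementation.

```python
from collections import Counter, defaultdict

# Same eight records as the DataFrame above: (age, education, income).
records = [
    (20, "Master", "low"),
    (30, "High School", "medium"),
    (30, "High School", "low"),
    (30, "PhD", "medium"),
    (30, "PhD", "medium"),
    (55, "Bachelor", "high"),
    (55, "Bachelor", "high"),
    (55, "Bachelor", "medium"),
]

# Count income values inside each (age, education) group.
groups = defaultdict(Counter)
for age, education, income in records:
    groups[(age, education)][income] += 1

reid_per_record = []
ai_per_record = []
for age, education, _ in records:
    counts = groups[(age, education)]
    size = sum(counts.values())
    # Every record in a group of k records can be singled out with chance 1/k.
    reid_per_record.append(1 / size)
    # The attacker's best guess is the most frequent income in the group.
    ai_per_record.append(max(counts.values()) / size)

print(reid_per_record)
print(ai_per_record)
print(f"{sum(reid_per_record) / len(records):.5f}")
print(f"{sum(ai_per_record) / len(records):.5f}")
```

Averaging each list gives back the dataset-level posterior vulnerabilities, which is why the means reported above equal 0.5 and 0.75.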

Optimized Method with Distributions

The optimized method also supports distributions:

posteriors = attack.posterior_vulnerability("all", qids, sensitive, distribution=True)
posterior_reid, hist_reid = posteriors["reid"]
posterior_ai, hist_ai = posteriors["ai"]

print("Re-identification histogram\n"+\
      f"{hist_reid}")

print("\nAttribute inference histogram\n"+\
      f"{sensitive}:\n"+\
      f"{hist_ai[sensitive]}")

Re-identification histogram
[1.         0.5        0.5        0.5        0.5        0.33333333
0.33333333 0.33333333]

Attribute inference histogram
income:
[1.         0.5        0.5        1.         1.         0.66666667
0.66666667 0.66666667]

Evaluating Multiple Combinations of QIDs

You can evaluate the vulnerabilities for all combinations of the QIDs (e.g., single attributes, pairs, etc.):

combinations = list(range(1, len(qids)+1))  # Sizes 1 to len(qids)

results_reid = attack.posterior_vulnerability(
    atk="reid",
    qids=qids,
    combinations=combinations,
    n_processes=2
)
display(results_reid)

   n_qids  qids           posterior_reid
0  1       age            0.375000000
1  1       education      0.500000000
2  2       age,education  0.500000000

results_ai = attack.posterior_vulnerability(
    atk="ai",
    qids=qids,
    sensitive=sensitive,
    combinations=combinations,
    distribution=True,
    n_processes=2
)
display(results_ai)

   n_qids  qids           posterior_income  posterior_income_record
0  1       age            0.750000000       [1.00000000, 0.50000000, 0.50000000, 1.0000000…
1  1       education      0.750000000       [1.00000000, 0.50000000, 0.50000000, 1.0000000…
2  2       age,education  0.750000000       [1.00000000, 0.50000000, 0.50000000, 1.0000000…

You can run both types of attack simultaneously for all combinations:

results = attack.posterior_vulnerability(
    atk="all",
    qids=qids,
    sensitive=sensitive,
    combinations=combinations,
    n_processes=2
)
display(results)

   n_qids  qids           posterior_reid  posterior_income
0  1       age            0.375000000     0.750000000
1  1       education      0.500000000     0.750000000
2  2       age,education  0.500000000     0.750000000

This approach provides a comprehensive evaluation of how different combinations of quasi-identifiers affect vulnerability.
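To make the enumeration concrete, the sketch below lists every QID subset of sizes 1 to len(qids) with itertools.combinations and computes the re-identification posterior for each as the number of distinct QID tuples divided by the number of records. The helper posterior_reid and the columns mapping are hypothetical names introduced for this illustration; the code does not call privattacks and is only assumed to mirror its results, which it does on this dataset.

```python
from itertools import combinations

# Same eight records as the DataFrame above: (age, education, income).
records = [
    (20, "Master", "low"),
    (30, "High School", "medium"),
    (30, "High School", "low"),
    (30, "PhD", "medium"),
    (30, "PhD", "medium"),
    (55, "Bachelor", "high"),
    (55, "Bachelor", "high"),
    (55, "Bachelor", "medium"),
]
columns = {"age": 0, "education": 1}  # column name -> position in each tuple
qids = ["age", "education"]


def posterior_reid(subset):
    """Distinct QID tuples divided by record count (illustrative helper)."""
    keys = {tuple(rec[columns[q]] for q in subset) for rec in records}
    return len(keys) / len(records)


results = {}
for size in range(1, len(qids) + 1):
    for subset in combinations(qids, size):
        results[",".join(subset)] = posterior_reid(subset)

for name, value in results.items():
    print(f"{name}: {value:.9f}")
```

On this dataset, education alone already yields the same re-identification vulnerability as age and education together, because every education value maps to a single age.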