Getting Started
===============

This guide walks through a basic usage example of the ``privattacks`` Python package. It shows how to load data, define quasi-identifiers, and evaluate both re-identification and attribute inference vulnerabilities.

Data Preparation
----------------

We begin by creating a small synthetic dataset with ``pandas``.

.. code-block:: python

   import pandas as pd
   import numpy as np
   import privattacks
   from IPython.display import display  # built in to Jupyter; needed elsewhere

   df = pd.DataFrame({
       "age": [20, 30, 30, 30, 30, 55, 55, 55],
       "education": ["Master", "High School", "High School", "PhD", "PhD",
                     "Bachelor", "Bachelor", "Bachelor"],
       "income": ["low", "medium", "low", "medium", "medium",
                  "high", "high", "medium"]
   })
   display(df)

.. list-table::
   :header-rows: 1
   :widths: 10 20 10

   * - age
     - education
     - income
   * - 20
     - Master
     - low
   * - 30
     - High School
     - medium
   * - 30
     - High School
     - low
   * - 30
     - PhD
     - medium
   * - 30
     - PhD
     - medium
   * - 55
     - Bachelor
     - high
   * - 55
     - Bachelor
     - high
   * - 55
     - Bachelor
     - medium

This dataset contains three columns:

- ``age`` and ``education`` are considered *quasi-identifiers (QIDs)*.
- ``income`` is treated as a *sensitive attribute*.

Defining QIDs and Sensitive Attribute
-------------------------------------

.. code-block:: python

   qids = ["age", "education"]
   # The attack can also be run for a list of sensitive attributes
   sensitive = "income"

The dataset is wrapped in a ``privattacks`` data object, which is then handed to the attack engine. The ``qids`` and ``sensitive`` variables are passed to the attack methods in the sections below.

.. code-block:: python

   data = privattacks.data.Data(dataframe=df)
   attack = privattacks.attacks.Attack(data)

.. note::
   The dataset can also be read directly from a file; see :mod:`privattacks.data`.

Evaluating Prior and Posterior Vulnerabilities
----------------------------------------------

We first calculate the prior and posterior vulnerabilities for:

- Re-identification attacks
- Attribute inference attacks

.. code-block:: python

   prior_reid = attack.prior_vulnerability("reid")
   prior_ai = attack.prior_vulnerability("ai", sensitive)
   posterior_reid = attack.posterior_vulnerability("reid", qids)
   posterior_ai = attack.posterior_vulnerability("ai", qids, sensitive)

   print(
       "Re-identification\n"
       f"Prior vulnerability: {prior_reid:.5f}\n"
       f"Posterior vulnerability: {posterior_reid:.5f}"
   )

   print(
       f"\nAttribute inference - {sensitive}\n"
       f"Prior vulnerability: {prior_ai[sensitive]:.5f}\n"
       f"Posterior vulnerability: {posterior_ai[sensitive]:.5f}"
   )

.. code-block:: text

   Re-identification
   Prior vulnerability: 0.12500
   Posterior vulnerability: 0.50000

   Attribute inference - income
   Prior vulnerability: 0.50000
   Posterior vulnerability: 0.75000

The prior vulnerability measures the risk posed by an attacker with no auxiliary information, and the posterior vulnerability measures the risk once the attacker knows the quasi-identifiers. Here, knowing ``age`` and ``education`` raises the re-identification vulnerability from 0.125 to 0.5 and the attribute inference vulnerability from 0.5 to 0.75.

Using the Optimized Evaluation Method
-------------------------------------

For convenience and performance, you can run both attacks in a single call:

.. code-block:: python

   posteriors = attack.posterior_vulnerability("all", qids, sensitive)

   print(
       "Re-identification\n"
       f"Posterior vulnerability: {posteriors['reid']:.5f}"
   )

   print(
       f"\nAttribute inference - {sensitive}\n"
       f"Posterior vulnerability: {posteriors['ai'][sensitive]:.5f}"
   )

.. code-block:: text

   Re-identification
   Posterior vulnerability: 0.50000

   Attribute inference - income
   Posterior vulnerability: 0.75000

Analyzing Individual Vulnerabilities
------------------------------------

You can also inspect the per-record distribution of vulnerabilities by setting ``distribution=True``. The call then returns the overall vulnerability together with one value per record.

.. code-block:: python

   posterior_reid, hist_reid = attack.posterior_vulnerability("reid", qids, distribution=True)
   print(
       "Re-identification - distribution on records\n"
       f"{hist_reid}\nMean of the distribution: {np.mean(hist_reid)}"
   )

   posterior_ai, hist_ai = attack.posterior_vulnerability("ai", qids, sensitive, distribution=True)
   print(
       "\nAttribute inference - distribution on records\n"
       f"{sensitive}:\n{hist_ai[sensitive]}\n"
       f"Mean of the distribution: {np.mean(hist_ai[sensitive])}"
   )

.. code-block:: text

   Re-identification - distribution on records
   [1. 0.5 0.5 0.5 0.5 0.33333333 0.33333333 0.33333333]
   Mean of the distribution: 0.5

   Attribute inference - distribution on records
   income:
   [1. 0.5 0.5 1. 1. 0.66666667 0.66666667 0.66666667]
   Mean of the distribution: 0.75

Optimized Method with Distributions
-----------------------------------

The optimized method also supports distributions:

.. code-block:: python

   posteriors = attack.posterior_vulnerability("all", qids, sensitive, distribution=True)
   posterior_reid, hist_reid = posteriors["reid"]
   posterior_ai, hist_ai = posteriors["ai"]

   print(
       "Re-identification histogram\n"
       f"{hist_reid}"
   )

   print(
       "\nAttribute inference histogram\n"
       f"{sensitive}:\n"
       f"{hist_ai[sensitive]}"
   )

.. code-block:: text

   Re-identification histogram
   [1. 0.5 0.5 0.5 0.5 0.33333333 0.33333333 0.33333333]

   Attribute inference histogram
   income:
   [1. 0.5 0.5 1. 1. 0.66666667 0.66666667 0.66666667]

Evaluating Multiple Combinations of QIDs
----------------------------------------

You can evaluate the vulnerabilities for *all combinations* of the QIDs (e.g., single attributes, pairs, etc.):

.. code-block:: python

   combinations = list(range(1, len(qids) + 1))  # Sizes 1 to len(qids)

   results_reid = attack.posterior_vulnerability(
       atk="reid",
       qids=qids,
       combinations=combinations,
       n_processes=2
   )
   display(results_reid)

.. list-table::
   :header-rows: 1
   :widths: 5 15 20 20

   * -
     - n_qids
     - qids
     - posterior_reid
   * - 0
     - 1
     - age
     - 0.375000000
   * - 1
     - 1
     - education
     - 0.500000000
   * - 2
     - 2
     - age,education
     - 0.500000000

.. code-block:: python

   results_ai = attack.posterior_vulnerability(
       atk="ai",
       qids=qids,
       sensitive=sensitive,
       combinations=combinations,
       distribution=True,
       n_processes=2
   )
   display(results_ai)

.. list-table::
   :header-rows: 1
   :widths: 5 15 20 20 35

   * -
     - n_qids
     - qids
     - posterior_income
     - posterior_income_record
   * - 0
     - 1
     - age
     - 0.750000000
     - [1.00000000, 0.50000000, 0.50000000, 1.0000000...
   * - 1
     - 1
     - education
     - 0.750000000
     - [1.00000000, 0.50000000, 0.50000000, 1.0000000...
   * - 2
     - 2
     - age,education
     - 0.750000000
     - [1.00000000, 0.50000000, 0.50000000, 1.0000000...

You can run both types of attack simultaneously for all combinations:

.. code-block:: python

   results = attack.posterior_vulnerability(
       atk="all",
       qids=qids,
       sensitive=sensitive,
       combinations=combinations,
       n_processes=2
   )
   display(results)

.. list-table::
   :header-rows: 1
   :widths: 5 15 20 20 20

   * -
     - n_qids
     - qids
     - posterior_reid
     - posterior_income
   * - 0
     - 1
     - age
     - 0.375000000
     - 0.750000000
   * - 1
     - 1
     - education
     - 0.500000000
     - 0.750000000
   * - 2
     - 2
     - age,education
     - 0.500000000
     - 0.750000000

This approach shows how each combination of quasi-identifiers affects vulnerability.
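
To build intuition for the numbers above, the following sketch recomputes them with plain ``pandas``. It is an independent illustration, not the ``privattacks`` implementation. It assumes a deterministic attacker who guesses uniformly among the records that share the same QID values (re-identification) and who guesses the most frequent sensitive value in that group (attribute inference). The helper names ``reid_posterior`` and ``ai_posterior`` are hypothetical and are not part of the package.

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({
       "age": [20, 30, 30, 30, 30, 55, 55, 55],
       "education": ["Master", "High School", "High School", "PhD", "PhD",
                     "Bachelor", "Bachelor", "Bachelor"],
       "income": ["low", "medium", "low", "medium", "medium",
                  "high", "high", "medium"]
   })
   qids = ["age", "education"]
   sensitive = "income"
   n = len(df)

   def reid_posterior(frame, qid_cols):
       # Assumption: one correct guess is expected per group of records
       # sharing the same QID values, so the result is (#groups / #records).
       return frame.groupby(qid_cols).ngroups / len(frame)

   def ai_posterior(frame, qid_cols, target):
       # Assumption: within each QID group the attacker guesses the most
       # frequent sensitive value, so count those best guesses per group.
       counts = frame.groupby(qid_cols + [target]).size()
       best = counts.groupby(level=qid_cols).max()
       return best.sum() / len(frame)

   # Without QIDs: a uniform guess over records, or the most common income.
   prior_reid = 1 / n
   prior_ai = df[sensitive].value_counts().max() / n

   # Per-record re-identification vulnerability: 1 / (size of the record's group).
   per_record_reid = 1 / df.groupby(qids)[sensitive].transform("size")

   print(prior_reid)                         # 0.125
   print(prior_ai)                           # 0.5
   print(reid_posterior(df, qids))           # 0.5
   print(ai_posterior(df, qids, sensitive))  # 0.75
   print(reid_posterior(df, ["age"]))        # 0.375

Under these assumptions the results match the values reported in this guide, including the ``age``-only row of the combinations table, and the mean of ``per_record_reid`` equals the overall re-identification posterior.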