Privattacks modules

privattacks.attacks

class privattacks.attacks.Attack(data: Data)[source]

Bases: object

posterior_vulnerability(atk, qids, sensitive=[], distribution=False, combinations: list[int] | None = None, save_file=None, zip_save=False, n_processes=1, return_results=True, verbose=False)[source]

Posterior vulnerability.

Parameters:
  • atk (str) – Either ‘ai’ for attribute inference attack, ‘reid’ for re-identification or ‘all’ for both attacks.

  • qids (list[str]) – List of quasi-identifiers.

  • sensitive (str or Sequence[str], optional) – A single sensitive attribute or a list of sensitive attributes for the attribute inference attack. Default is [].

  • distribution (bool, optional) – Whether to return the distribution of posterior vulnerability per record. Default is False.

  • combinations (list[int], optional) – Subset sizes of QIDs to attack, instead of attacking only the full list of QIDs given in the parameter ‘qids’. The attack is run for every subset of QIDs whose size appears in this list. Default is None.

  • zip_save (bool, optional) – Save the results in a zip file instead of CSV. Default is False.

  • save_file (str, optional) – File name to save the results. They will be saved in CSV format. Works only when ‘combinations’ is given.

  • n_processes (int, optional) – Number of processes used to run the method in parallel with the multiprocessing package. Default is 1. Works only when ‘combinations’ is given.

  • return_results (bool, optional) – Whether to return the results or not. Default is True. Works only when ‘combinations’ is given.

  • verbose (bool, optional) – Show the progress. Default is False. Works only when ‘combinations’ is given.
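To make the ‘combinations’ parameter concrete, the sketch below enumerates the QID subsets that a list of sizes selects. It uses only the standard library; the helper name qid_subsets and the column names are hypothetical and are not part of privattacks.

```python
from itertools import combinations


def qid_subsets(qids, sizes):
    """Yield every subset of `qids` whose size appears in `sizes`."""
    for size in sizes:
        for subset in combinations(qids, size):
            yield list(subset)


# Hypothetical quasi-identifiers; sizes [1, 2] select all singletons and pairs.
qids = ["age", "sex", "zipcode"]
subsets = list(qid_subsets(qids, [1, 2]))
```

With three QIDs and sizes [1, 2] this yields three singletons and three pairs, so the number of attacks grows quickly as more sizes are requested.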

Returns:

if atk == ‘reid’:

float or (float, list): If distribution is False, returns the posterior vulnerability. If distribution is True, returns a pair (<posterior vulnerability>, <distribution>). Example of output when distribution is False:

0.75

Example of output when distribution is True:

(0.75, [0.5, 0.5, 1.0, 1.0, 0.75])

if atk == ‘ai’:

dict[str, float] or (dict[str, float], dict[str, list]): If distribution is False, returns a dictionary containing the posterior vulnerability for each sensitive attribute. If distribution is True, returns a pair (<posterior vulnerability>, <distribution for each sensitive attribute>). Example of output when distribution is False:

{'disease': 0.3455, 'income': 0.7}

Example of output when distribution is True:

({'disease': 0.3455, 'income': 0.7},
 {'disease': [0.1, 0.1, 0.3, 0.4, 0.8275],
  'income': [0.6, 0.7, 0.7, 0.7, 0.8]})

if atk == ‘all’:

dict: Dictionary with keys ‘reid’ and ‘ai’ and their respective posterior vulnerabilities.

if combinations:

vulnerabilities: Pandas DataFrame with posterior vulnerabilities for every combination of n QIDs, where n ranges over the sizes provided in the parameter ‘combinations’.
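For intuition about what a posterior vulnerability measures, here is a minimal pure-Python sketch. It assumes the usual Bayes-vulnerability reading: for re-identification, an adversary who knows a record's QID values picks uniformly among the records sharing them; for attribute inference, the adversary guesses the most frequent sensitive value in that group. This is an assumption about the definition, not the library's implementation, and the function names and toy records are hypothetical.

```python
from collections import Counter, defaultdict


def reid_posterior(records, qids):
    """Return (mean, per-record) re-identification success probabilities."""
    keys = [tuple(r[q] for q in qids) for r in records]
    group_sizes = Counter(keys)
    # A record hidden among k look-alikes is singled out with probability 1/k.
    per_record = [1 / group_sizes[k] for k in keys]
    return sum(per_record) / len(per_record), per_record


def ai_posterior(records, qids, sensitive):
    """Return the expected success of guessing `sensitive` from the QIDs."""
    groups = defaultdict(Counter)
    for r in records:
        groups[tuple(r[q] for q in qids)][r[sensitive]] += 1
    # Within each QID group the best guess is the most frequent value.
    correct = sum(max(counts.values()) for counts in groups.values())
    return correct / len(records)


records = [
    {"age": 30, "sex": "F", "disease": "flu"},
    {"age": 30, "sex": "F", "disease": "cold"},
    {"age": 41, "sex": "M", "disease": "flu"},
    {"age": 52, "sex": "F", "disease": "flu"},
]
reid_vuln, reid_dist = reid_posterior(records, ["age", "sex"])
ai_vuln = ai_posterior(records, ["sex"], "disease")
```

In this toy data two records share the QID values (30, 'F'), so each of them is re-identified with probability 0.5 while the other two records are unique.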

prior_vulnerability(atk, sensitive=[])[source]

Prior vulnerability.

Parameters:
  • atk (str) – Either ‘ai’ for attribute inference attack, ‘reid’ for re-identification or ‘all’ for both attacks.

  • sensitive (str or Sequence[str], optional) – A single sensitive attribute or a list of sensitive attributes for the attribute inference attack. Default is [].

Returns:

if atk == ‘reid’:

float: Prior vulnerability.

if atk == ‘ai’:

dict[str, float]: Dictionary containing the prior vulnerability for each sensitive attribute (keys are sensitive attribute names and values are prior vulnerabilities).

if atk == ‘all’:

dict: Dictionary with keys ‘reid’ and ‘ai’ and their respective prior vulnerabilities.
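As a point of comparison with the posterior, the sketch below shows one common way to define a prior vulnerability, assuming an adversary who sees no QIDs: a uniform guess among all records for re-identification, and the globally most frequent value for attribute inference. Whether privattacks uses exactly these definitions is an assumption; the helper names and the nested shape of the combined dictionary are hypothetical.

```python
from collections import Counter


def reid_prior(n_rows):
    """Chance of picking the right record with no side information."""
    return 1 / n_rows


def ai_prior(values):
    """Chance of guessing a sensitive value by always naming the mode."""
    counts = Counter(values)
    return max(counts.values()) / len(values)


# Hypothetical sensitive column with four records.
diseases = ["flu", "flu", "cold", "flu"]
prior = {
    "reid": reid_prior(len(diseases)),
    "ai": {"disease": ai_prior(diseases)},
}
```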

privattacks.data

class privattacks.data.Data(file_name=None, cols=None, cols_to_ignore=None, sep_csv=',', encoding='utf-8', dataframe=None, matrix=None, domains=None, na_values=-1)[source]

Bases: object

A class for handling datasets. The supported formats are ‘csv’, ‘rdata’ and ‘sas7bdat’.

Parameters:
  • file_name (str, optional) – Dataset file path.

  • cols (list, optional) – Dataset columns. If omitted when file_name is given, all columns in the file are read.

  • cols_to_ignore (list, optional) – Columns to skip in the conversion to integers from 0 to domain_size-1. Use it only for columns that already contain integer values.

  • sep_csv (str, optional) – CSV delimiter, default is “,”.

  • encoding (str, optional) – Encoding to use when reading/writing (e.g., ‘utf-8’, ‘latin1’). Default is ‘utf-8’.

  • dataframe (pandas.DataFrame, optional) – Pandas dataframe containing the dataset.

  • matrix (numpy.ndarray, optional) – Numpy 2d matrix containing the dataset.

  • domains (dict[str, list], optional) – Domain of columns. If not given, the domains will be taken from data. Keys are column names and values are lists.

  • na_values (int, optional) – Value to fill missing data (NaN) with, default is -1.

dataset

Numpy matrix of integers.

Type:

numpy.ndarray

n_rows

Number of rows (records) in the dataset.

Type:

int

n_cols

Number of columns (attributes) in the dataset.

Type:

int

cols

List of column names in the dataset, in the same order as the columns of the dataset matrix.

Type:

list

domains

Column domains. Keys are column names and values are lists. To generate the numpy matrix each original value will be converted to its index in the domain’s list.

Type:

dict[str, list]

col2int(col) int[source]

Index of a column in the dataset numpy matrix.

df2np(dataframe: DataFrame) ndarray[source]

Converts a pandas dataframe to a numpy.ndarray of integers in “standard” type: for every column c, each original value is replaced by its index in the domain list of c, so the values of c range from 0 to size(c)-1.

Parameters:

dataframe (pandas.DataFrame) – Dataset.

Returns:

Dataset in standard type.

Return type:

dataset (numpy.ndarray)
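The index-based encoding described for df2np can be sketched without pandas or numpy. The example below is only an illustration of the mapping rule (value to its position in the domain list); encode_rows, the column names and the domains are hypothetical, and plain lists stand in for the dataframe and the matrix.

```python
def encode_rows(rows, cols, domains):
    """Replace each value by its index in the column's domain list."""
    # Precompute value -> index lookups once per column.
    index = {c: {v: i for i, v in enumerate(domains[c])} for c in cols}
    return [[index[c][row[c]] for c in cols] for row in rows]


cols = ["sex", "disease"]
domains = {"sex": ["F", "M"], "disease": ["cold", "flu", "hiv"]}
rows = [
    {"sex": "M", "disease": "flu"},
    {"sex": "F", "disease": "hiv"},
    {"sex": "F", "disease": "cold"},
]
encoded = encode_rows(rows, cols, domains)
```

Each column's codes stay within 0 to len(domain)-1, which matches the range stated for cols_to_ignore.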

np2df() DataFrame[source]

Convert the numpy matrix to the dataset original domains.

Returns:

df (pandas.DataFrame): Dataset with original domains.
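Going back to the original domains is the inverse lookup: each integer code selects the value at that position in the column's domain list. The sketch below illustrates the idea on plain lists; decode_rows and the sample data are hypothetical and do not reproduce the library's code.

```python
def decode_rows(matrix, cols, domains):
    """Map integer codes back to the original values, column by column."""
    return [
        {c: domains[c][code] for c, code in zip(cols, row)}
        for row in matrix
    ]


cols = ["sex", "disease"]
domains = {"sex": ["F", "M"], "disease": ["cold", "flu", "hiv"]}
matrix = [[1, 1], [0, 2]]
decoded = decode_rows(matrix, cols, domains)
```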

privattacks.util

privattacks.util.create_histogram(ind_posteriors, bin_size=1) dict[source]

Generate a histogram of individual posterior vulnerabilities, grouped into bins of the given size.

Parameters:
  • ind_posteriors – Individual posterior vulnerabilities for all records in the dataset.

  • bin_size – Histogram bin size, in percentage points. For instance, if bin_size=5 then bin 0 = [0, 0.05), bin 1 = [0.05, 0.1), …, bin 19 = [0.95, 1]. Default is 1.

Returns:

A dictionary containing the histogram. Keys are strings (e.g., ‘[0, 0.05)’, ‘[0.95,1]’) and values are the counts of the respective bins.

Return type:

hist (dict)
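The binning rule can be sketched as follows, assuming bin_size is expressed in percentage points and divides 100 evenly. This is an illustrative reimplementation, not the library's code: the function name histogram_sketch is hypothetical, and the exact key formatting (spacing inside the brackets) is an assumption.

```python
def histogram_sketch(ind_posteriors, bin_size=5):
    """Count values in [0, 1] per bin of width bin_size/100."""
    if bin_size <= 0 or 100 % bin_size != 0:
        raise ValueError("bin_size must be a positive divisor of 100")
    n_bins = 100 // bin_size
    labels = []
    for i in range(n_bins):
        lo = i * bin_size / 100
        hi = (i + 1) * bin_size / 100
        # Only the last bin is closed on the right, so 1.0 has a home.
        closing = "]" if i == n_bins - 1 else ")"
        labels.append(f"[{lo:g}, {hi:g}{closing}")
    hist = {label: 0 for label in labels}
    for p in ind_posteriors:
        # Rounding guards against float noise such as 0.29 * 100.
        i = min(int(round(p * 100, 9) // bin_size), n_bins - 1)
        hist[labels[i]] += 1
    return hist


hist = histogram_sketch([0.5, 0.5, 1.0, 1.0, 0.75], bin_size=5)
```

Empty bins are kept with a count of zero so that every histogram built with the same bin_size has the same keys.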