Privattacks modules

privattacks.attacks

class privattacks.attacks.Attack(data: Data)[source]

Bases: object

posterior_vulnerability(atk, qids, sensitive=[], distribution=False, combinations: list[int] | None = None, save_file=None, zip_save=False, n_processes=1, return_results=True, verbose=False)[source]

Posterior vulnerability.

Parameters:

atk (str) – Either ‘ai’ for attribute inference attack, ‘reid’ for re-identification or ‘all’ for both attacks.
qids (list[str]) – List of quasi-identifiers.
sensitive (str or Sequence[str], optional) – A single sensitive attribute or a list of sensitive attributes for the attribute inference attack. Default is [].
distribution (bool, optional) – Whether to return the distribution of posterior vulnerability per record. Default is False.
combinations (list[int], optional) – List of QID subset sizes. When given, the attack is run on every subset of ‘qids’ whose size appears in the list, instead of only on the full list given in ‘qids’. Default is None.
zip_save (bool, optional) – Save the results in a zip file instead of CSV. Default is False.
save_file (str, optional) – File name to save the results. They will be saved in CSV format. Works only when ‘combinations’ is given.
n_processes (int, optional) – Number of processes to run the method in parallel using multiprocessing package. Default is 1. Works only when ‘combinations’ is given.
return_results (bool, optional) – Whether to return the results or not. Default is True. Works only when ‘combinations’ is given.
verbose (bool, optional) – Show the progress. Default is False. Works only when ‘combinations’ is given.
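
The effect of ‘combinations’ on the set of attacked QID subsets can be illustrated with a short standard-library sketch. The helper name qid_subsets and the column names are hypothetical; this is not the library's code, only an enumeration consistent with the parameter description above.

```python
from itertools import combinations


def qid_subsets(qids, sizes):
    """Return every subset of `qids` whose size appears in `sizes`.

    Hypothetical helper for illustration only; not part of privattacks.
    """
    subsets = []
    for size in sorted(set(sizes)):
        # itertools.combinations yields tuples that keep the order of `qids`.
        subsets.extend(combinations(qids, size))
    return subsets


# With three QIDs and sizes [1, 2] there are 3 + 3 = 6 subsets to attack.
subsets = qid_subsets(["age", "zip", "sex"], [1, 2])
for subset in subsets:
    print(subset)
```

The number of subsets grows combinatorially with the number of QIDs, which is presumably why save_file, n_processes and verbose apply only when ‘combinations’ is given.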

Returns:

if atk == ‘reid’:

float or (float, list): If distribution is False, returns the posterior vulnerability.

If distribution is True, returns a pair (<posterior vulnerability>, <distribution>). Example of output when distribution is False:

0.75

Example of output when distribution is True:

(0.75, [0.5, 0.5, 1.0, 1.0, 0.75])

if atk == ‘ai’:

dict[str, float] or (dict[str, float], dict[str, list]): If distribution is False, returns a dictionary containing the posterior vulnerability for each sensitive attribute. If distribution is True, returns a pair (<posterior vulnerability>, <distribution for each sensitive attribute>). Example of output when distribution is False:

{'disease': 0.3455, 'income':0.7}

Example of output when distribution is True:

({'disease': 0.3455, 'income':0.7},
 {'disease': [0.1, 0.1, 0.3, 0.4, 0.8275],
  'income': [0.6, 0.7, 0.7, 0.7, 0.8]})

if atk == ‘all’:

dict: Dictionary with keys ‘reid’ and ‘ai’ and their respective posterior vulnerabilities.

if combinations:

vulnerabilities (pandas.DataFrame): Posterior vulnerabilities for every combination of n QIDs, where n ranges over the sizes provided in the parameter ‘combinations’.

Return type:

float or (float, list) if atk == ‘reid’; otherwise as described under Returns.
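
To make the returned quantities concrete, the following is a minimal sketch under the usual Bayes-vulnerability reading: an adversary who sees a record's QID values guesses uniformly among the matching records (re-identification) or picks the most frequent sensitive value among them (attribute inference). This reading, the function names and the toy records are assumptions; the source does not state the formulas privattacks uses.

```python
from collections import Counter, defaultdict


def reid_posterior(records, qids):
    """Assumed re-identification posterior: mean of 1 / (group size) per record."""
    keys = [tuple(rec[q] for q in qids) for rec in records]
    sizes = Counter(keys)
    # A record sharing its QID values with k records is guessed with chance 1/k.
    per_record = [1 / sizes[key] for key in keys]
    return sum(per_record) / len(per_record), per_record


def ai_posterior(records, qids, sensitive):
    """Assumed attribute-inference posterior for one sensitive attribute."""
    keys = [tuple(rec[q] for q in qids) for rec in records]
    groups = defaultdict(Counter)
    for key, rec in zip(keys, records):
        groups[key][rec[sensitive]] += 1
    # The adversary guesses the most frequent sensitive value in the group.
    per_record = [
        max(groups[key].values()) / sum(groups[key].values()) for key in keys
    ]
    return sum(per_record) / len(per_record), per_record


# Toy data (hypothetical): two zip codes, one sensitive column.
records = [
    {"zip": "A", "disease": "flu"},
    {"zip": "A", "disease": "cold"},
    {"zip": "B", "disease": "flu"},
    {"zip": "B", "disease": "flu"},
]
reid_value, reid_dist = reid_posterior(records, ["zip"])
ai_value, ai_dist = ai_posterior(records, ["zip"], "disease")
print(reid_value, reid_dist)
print(ai_value, ai_dist)
```

Each function returns a pair shaped like the distribution=True output described above: the overall value followed by one entry per record.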

prior_vulnerability(atk, sensitive=[])[source]

Prior vulnerability.

Parameters:

atk (str) – Either ‘ai’ for attribute inference attack, ‘reid’ for re-identification or ‘all’ for both attacks.
sensitive (str or Sequence[str], optional) – A single sensitive attribute or a list of sensitive attributes for the attribute inference attack. Default is [].

Returns:

if atk == ‘reid’: float: Prior vulnerability.
if atk == ‘ai’: dict[str, float]: Dictionary containing the prior vulnerability for each sensitive attribute (keys are sensitive attribute names and values are prior vulnerabilities).
if atk == ‘all’: dict: Dictionary with keys ‘reid’ and ‘ai’ and their respective prior vulnerabilities.

Return type:

float if atk == ‘reid’; otherwise dict, as described under Returns.
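
For intuition, prior vulnerability is what an adversary achieves before observing any quasi-identifier. The sketch below assumes the standard Bayes-vulnerability definitions (a uniform guess over records for re-identification, the most frequent value for attribute inference); the function names are hypothetical and the library's exact computation is not stated in the source.

```python
from collections import Counter


def reid_prior(n_rows):
    """Assumed re-identification prior: a blind guess among n_rows records."""
    return 1 / n_rows


def ai_prior(values):
    """Assumed attribute-inference prior: share of the most frequent value."""
    counts = Counter(values)
    return max(counts.values()) / len(values)


# Toy sensitive column (hypothetical data).
diseases = ["flu", "cold", "flu", "flu"]
prior = {
    "reid": reid_prior(len(diseases)),
    "ai": {"disease": ai_prior(diseases)},
}
print(prior)
```

The nested dictionary mirrors the atk == ‘all’ return shape described above.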

privattacks.data

class privattacks.data.Data(file_name=None, cols=None, cols_to_ignore=None, sep_csv=',', encoding='utf-8', dataframe=None, matrix=None, domains=None, na_values=-1)[source]

Bases: object

A class for handling datasets. The supported formats are ‘csv’, ‘rdata’ and ‘sas7bdat’.

Parameters:

file_name (str, optional) – Dataset file path.
cols (list, optional) – Dataset columns. If omitted when file_name is given, all columns in the file are read.
cols_to_ignore (list, optional) – Columns to exclude from the conversion to integers from 0 to domain_size-1. Use it only for columns that already hold integer values.
sep_csv (str, optional) – CSV delimiter, default is “,”.
encoding (str, optional) – Encoding to use when reading/writing files (e.g., ‘utf-8’, ‘latin1’). Default is ‘utf-8’.
dataframe (pandas.DataFrame, optional) – Pandas dataframe containing the dataset.
matrix (numpy.ndarray, optional) – Numpy 2d matrix containing the dataset.
domains (dict[str, list], optional) – Domain of columns. If not given, the domains will be taken from data. Keys are column names and values are lists.
na_values (int, optional) – Value to fill missing data (NaN) with, default is -1.

dataset

Numpy matrix of integers.

Type: numpy.ndarray

n_rows

Number of rows (records) in the dataset.

Type: int

n_cols

Number of columns (attributes) in the dataset.

Type: int

cols

List of column names, in the same order as the columns of the dataset matrix.

Type: list

domains

Column domains. Keys are column names and values are lists. To generate the numpy matrix each original value will be converted to its index in the domain’s list.

Type: dict[str, list]

col2int(col) → int[source]

Index of a column in the dataset numpy matrix.

df2np(dataframe: DataFrame) → ndarray[source]

Converts a pandas dataframe to a numpy.ndarray. The matrix contains integers in “standard” type, i.e., for every column c, the original values from the domain of c are converted to integers from 0 to size(c)-1. Each original value is converted to its index in the domain list of its column.

Parameters: dataframe (pandas.DataFrame) – Dataset.
Returns: Dataset in standard type.
Return type: dataset (numpy.ndarray)
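
The index-based encoding described above can be sketched without pandas or numpy. The helper encode_column, the toy columns and the choice of sorted domains are assumptions made for illustration; the source only says that domains are taken from the data when not supplied.

```python
def encode_column(values, domain):
    """Map each value to its position in `domain` (hypothetical helper)."""
    index_of = {value: i for i, value in enumerate(domain)}
    return [index_of[value] for value in values]


# Toy dataset stored column-wise (hypothetical data).
columns = {
    "sex": ["F", "M", "F"],
    "city": ["Rio", "Lima", "Rio"],
}

# Assumption: derive each domain from the data, sorted for determinism.
domains = {name: sorted(set(values)) for name, values in columns.items()}

encoded = {
    name: encode_column(values, domains[name])
    for name, values in columns.items()
}
print(domains)
print(encoded)
```

Every code lies between 0 and the domain size minus 1, which matches the range stated for cols_to_ignore.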

np2df() → DataFrame[source]

Converts the numpy matrix back to a dataframe with the original domain values.

Returns: df (pandas.DataFrame): Dataset with original domains.
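
The reverse mapping amounts to looking up each integer code in the domain list of its column. The sketch below uses plain lists in place of numpy and pandas; the variable names and toy values are hypothetical.

```python
# Column order of the integer matrix, as in the `cols` attribute.
cols = ["sex", "city"]

# Domain lists: a code is the index of the original value in its list.
domains = {"sex": ["F", "M"], "city": ["Lima", "Rio"]}

# Integer matrix, one row per record (hypothetical data).
matrix = [
    [0, 1],
    [1, 0],
    [0, 1],
]

# Decode each cell by indexing into the domain of its column.
decoded = [
    {col: domains[col][row[j]] for j, col in enumerate(cols)}
    for row in matrix
]
for record in decoded:
    print(record)
```

Here cols.index("city") plays the role that col2int is described as playing for the real matrix.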

privattacks.util

privattacks.util.create_histogram(ind_posteriors, bin_size=1) → dict[source]

Generate a histogram of posterior vulnerabilities given partition sizes.

Parameters:

ind_posteriors – Individual posterior vulnerabilities for all records in the dataset.
bin_size – Histogram bin size. For instance, if bin_size=5 then bin 0 = [0, 0.05), bin 1 = [0.05, 0.1), …, bin 19 = [0.95, 1]. Default is 1.

Returns:

A dictionary containing the histogram. Keys are strings (e.g., ‘[0, 0.05)’, ‘[0.95,1]’) and values are the counts of the respective bins.

Return type:

hist (dict)
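
A binning scheme consistent with the description above can be sketched as follows. It assumes that bin_size is expressed in hundredths (so bin_size=5 gives bins of width 0.05), that bin_size divides 100, and that empty bins are kept in the result; the function name histogram_sketch and the exact key formatting are hypothetical rather than the library's.

```python
def histogram_sketch(ind_posteriors, bin_size=1):
    """Count values in [0, 1] into bins of width bin_size / 100."""
    n_bins = 100 // bin_size
    labels = []
    for i in range(n_bins):
        low = i * bin_size / 100
        high = (i + 1) * bin_size / 100
        # Only the last bin is closed on the right, so that 1.0 is counted.
        closing = "]" if i == n_bins - 1 else ")"
        labels.append(f"[{low:g}, {high:g}{closing}")

    hist = {label: 0 for label in labels}
    for value in ind_posteriors:
        # Rounding guards against float noise such as 0.29 * 100 < 29.
        scaled = round(value * 100, 9)
        index = min(int(scaled // bin_size), n_bins - 1)
        hist[labels[index]] += 1
    return hist


hist = histogram_sketch([0.5, 0.5, 1.0, 1.0, 0.75], bin_size=25)
print(hist)
```

The input reuses the per-record distribution from the posterior_vulnerability example output.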