Synthetic Data Generation and Differential Privacy in R


Synthetic Data Generation and Differential Privacy in R

SMOTE (Synthetic Minority Over-sampling Technique) is a technique used for generating synthetic data in order to address the problem of class imbalance in machine learning. It is a type of data augmentation for the minority class and is used to oversample the minority class by creating synthetic examples. SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space, and drawing a new sample at a point along that line. Specifically, a random example from the minority class is chosen, then k of the nearest neighbors for that example are found (typically k=5) and a synthetic example is created at a randomly selected point between the two examples in feature space. The algorithm, first, selects a random example from the minority class and selects random neighbors using K Nearest Neighbors and a synthetic example is created between two examples in the feature space. SMOTE algorithm offers a systematic approach to synthesizing data points that effectively uses the principles of proximity and feature space representation to generate synthetic examples by strategically interpolating between existing minority class samples. By selecting random neighbors using K Nearest Neighbors and creating synthetic instances along the line connecting these neighbors, SMOTE ensures that the generated data points closely resemble the characteristics of the minority class while maintaining the integrity of the original dataset. This approach not only improves the robustness of machine learning models by providing a more balanced training set but also contributes to the overall accuracy and reliability of classification tasks in scenarios where class imbalance is a challenge. There are some variations of SMOTE, such as Adaptive Synthetic Sampling Method, a modification of SMOTE that generates more synthetic examples near the boundary of the minority class. In the litterature SMOTE is presented as a widely used solution for imbalanced data, but it has a drawback. It does not consider the majority class while creating synthetic examples, which can cause issues where there is a strong overlap between the classes. Therefore, the original SMOTE paper suggests combining oversampling (SMOTE) with the undersampling of the majority class, as SMOTE does not consider the majority class while creating new samples. To better underestand the use of this algorithm below is an example code in R for synthetic data generation using SMOTE:

Code Example

library(smotefamily)
data_example = sample_generator(10000,ratio = 0.80)
genData = SMOTE(data_example[,-3],data_example[,3])
genData_2 = SMOTE(data_example[,-3],data_example[,3],K=7)
genData$syn_data
genData_2$syn_data

Code Explanation

  • Load the smotefamily library
  • sample_generator() is a function from smotefamily that generates a synthetic imbalanced dataset.
  • data_example is now a synthetic dataset with 10,000 samples and an 80:20 imbalance between majority and minority classes.
  • SMOTE() is a function used to apply the SMOTE technique to the data, generating new synthetic samples of the minority class.
    • data_example[,-3]: This selects all columns except the 3rd one (i.e., features). In the data_example, the 3rd column contains the target labels (class).
    • data_example[,3]: This selects only the 3rd column, which contains the class labels (the target variable).
  • genData now contains both the original data and the new synthetic minority class samples generated by SMOTE. SMOTE works by creating new instances of the minority class based on the nearest neighbors of existing minority class instances.
  • SMOTE can be applied again, but this time with K=7. The K parameter specifies the number of nearest neighbors to be considered when generating synthetic samples.
    • K=7 means that when creating synthetic minority class samples, the algorithm will look at the 7 nearest neighbors of each minority class sample to create new synthetic data points.
  • genData_2 contains the results of this SMOTE operation with a different number of neighbors used for sample generation.
  • genData$syn_data: This extracts the synthetic data generated by SMOTE from the genData object.
  • genData_2$syn_data: This extracts the synthetic data generated by SMOTE from the genData_2 object, where a different K value was used.

Although originally developed for improving classification performance on imbalanced medical datasets, SMOTE has found applications beyond its initial domain. One notable application is in synthetic data generation for medical purposes. For example, one study used SMOTE to generate synthetic data for predictive modeling in low- to middle-income countries, demonstrating its versatility across different healthcare settings. Another study utilized SMOTE to create high-fidelity synthetic patient data, which was instrumental in assessing machine learning healthcare software. Furthermore, researchers have proposed innovative variations of SMOTE, such as SMOTE-ENC, designed specifically to generate synthetic data for both nominal and continuous features in imbalanced medical datasets.

SMOTE has some limitations, such as the potential creation of synthetic samples that closely resemble existing minority class samples, potentially resulting in a model that struggles to generalize to unseen data. This happens when synthetic samples are generated far from the decision boundary, leading to a less discriminative model and ultimately impacting performance negatively. Additionally, in datasets where class distributions overlap in the feature space, SMOTE may introduce noise and blur the decision boundary, further complicating classification tasks and reducing model accuracy. Moreover, the computational complexity of SMOTE is a practical challenge, particularly for large datasets. The algorithm’s reliance on the k-nearest neighbors approach can result in significant computational increase, making it less scalable and more time-consuming, especially in resource-constrained environments. Furthermore, the choice of the parameter k, which determines the number of nearest neighbors considered during sampling, significantly influences the quality of synthetic data generated by SMOTE. Selecting an inappropriate value for k can lead to suboptimal results, highlighting the importance of parameter tuning and careful consideration when implementing SMOTE in practice. These drawbacks underscore the need for a nuanced understanding of SMOTE’s limitations and the importance of selecting appropriate techniques tailored to the specific characteristics of the dataset and problem at hand.

As mentioned before, Differential privacy is a framework for ensuring data privacy when sharing analysis results with third parties. Its popularity comes from a set of general mechanisms that can privatize various non-private data functions such as statistics, estimation procedures, and learners. The R package DPpack offers a comprehensive toolkit for performing differentially private analyses. The current version of DPpack includes three widely-used mechanisms for ensuring differential privacy: Laplace, Gaussian, and exponential. In addition, DPpack provides a range of privacy-preserving descriptive statistical functions, such as mean, variance, covariance, quantiles, histograms, and contingency tables. It also features user-friendly implementations of privacy-preserving logistic regression, support vector machines (SVM), and linear regression, along with differentially private hyperparameter tuning for these models. This broad array of differentially private statistics and models allows users to easily apply differential privacy principles to routine statistical analyses. Future development of DPpack aims to expand its capabilities by incorporating more differentially private machine learning techniques, statistical modeling, and inference methods.

Code Example

install.packages("DPpack")
library("DPpack")
n <- 100 
c0 <- 5 
c1 <- 10 
D <- runif(n, c0, c1) 
f <- function(D) c(mean(D), var(D)) 
sensitivities <- c((c1-c0)/n, (c1-c0)^2/n)
epsilon <- 1
private.vals <- LaplaceMechanism(f(D), epsilon, sensitivities) 
cat("Privacy preserving values: ", private.vals, "\nTrue values: ", f(D))
private.vals <- LaplaceMechanism(f(D), epsilon, sensitivities, alloc.proportions = c(0.25, 0.75)) 
cat("Privacy preserving values: ", private.vals, "\nTrue values: ", f(D))
  • Initialization of Variables
    • n is set to 100, representing the number of data points.
    • c0 and c1 are set to 5 and 10, respectively. These values are the bounds for generating random numbers, defining the range [5, 10].
  • Generating Data
    • D is a vector of n (100) uniformly distributed random numbers between c0 (5) and c1 (10).
  • Defining a Function f
    • f is a function that takes the dataset D as input and returns a vector containing the mean and variance of D.
  • Calculating Sensitivities
    • sensitivities is a vector containing the sensitivities of the mean and variance functions.
    • Sensitivity is a measure of how much the output of a function can change when a single data point in the dataset is modified. Here:
      • The sensitivity of the mean is calculated as (c1-c0)/n.
      • The sensitivity of the variance is calculated as (c1-c0)^2/n.
  • Setting Privacy Parameter (Epsilon)
    • epsilon is set to 1, which is a parameter that controls the privacy level. Lower values of epsilon indicate stronger privacy guarantees.
  • Applying the Laplace Mechanism
    • LaplaceMechanism is applied to the function f(D). This mechanism adds noise drawn from the Laplace distribution to the true values (mean and variance) to ensure differential privacy.
    • private.vals contains the privacy-preserving values of the mean and variance after noise has been added.
  • Displaying Results
    • This line prints the privacy-preserving values (with added noise) and the true values (without noise) for comparison.
  • Splitting the Privacy Budget
    • Here, the privacy budget epsilon is split between the mean and variance calculations.
    • alloc.proportions = c(0.25, 0.75) means that 25% of epsilon is used for the mean and 75% for the variance.
    • The LaplaceMechanism is then reapplied with this adjusted privacy budget allocation.
  • Displaying Results Again
    • This line prints the new privacy-preserving values after the budget has been split, along with the true values.
    • The code demonstrates how to apply differential privacy to statistical calculations (mean and variance) using the Laplace mechanism.
    • The privacy budget (epsilon) is adjusted to provide different levels of privacy for the mean and variance.
    • The sensitivities of the mean and variance are calculated to determine the appropriate amount of noise to add.
D <- rnorm(500, mean=3, sd=2) 
lower.bound =-3 # 3 standard deviations below mean 
upper.bound = 9 # 3 standard deviations above mean


private.mean <- meanDP(D, 1, lower.bound, upper.bound) 
cat("Privacy preserving mean: ", private.mean, "\nTrue mean: ", mean(D)) 

 private.var <- varDP(D, 0.5, lower.bound, upper.bound, which.sensitivity = "unbounded", mechanism = "Gaussian", delta = 0.01) 
 cat("Privacy preserving variance: ", private.var, "\nTrue variance: ", var(D)) 
 
private.sd <- sdDP(D, 0.5, lower.bound, upper.bound, mechanism="Gaussian", delta=0.01, type.DP="pDP") 
cat("Privacy preserving standard deviation: ", private.sd, "\nTrue standard deviation: ", sd(D)) 
  • Generating Data
    • D is a dataset of 500 random numbers generated from a normal distribution with a mean of 3 and a standard deviation of 2.
  • Defining Bounds
    • lower.bound is set to -3, which is 3 standard deviations below the mean (3 - 3*2 = -3).
    • upper.bound is set to 9, which is 3 standard deviations above the mean (3 + 3*2 = 9).
  • Differentially Private Mean
    • meanDP computes a differentially private estimate of the mean of D.
    • Parameters:
      • D: The dataset.
      • 1: The privacy budget epsilon (with a value of 1).
      • lower.bound and upper.bound: The bounds within which the data lies.
    • private.mean stores the privacy-preserving mean.
  • Differentially Private Variance
    • varDP computes a differentially private estimate of the variance of D.
    • Parameters:
      • D: The dataset.
      • 0.5: The privacy budget epsilon (with a value of 0.5).
      • lower.bound and upper.bound: The bounds within which the data lies.
      • which.sensitivity = "unbounded": Indicates that the sensitivity is computed assuming no bounds on the data.
      • mechanism = "Gaussian": Specifies the use of the Gaussian mechanism (as opposed to the Laplace mechanism).
      • delta = 0.01: A parameter for the Gaussian mechanism related to the probability of breaching privacy.
    • private.var stores the privacy-preserving variance.
  • Differentially Private Standard Deviation
    • sdDP computes a differentially private estimate of the standard deviation of D.
    • Parameters:
      • D: The dataset.
      • 0.5: The privacy budget epsilon.
      • lower.bound and upper.bound: The bounds within which the data lies.
      • mechanism = "Gaussian": Specifies the use of the Gaussian mechanism.
      • delta = 0.01: A parameter for the Gaussian mechanism.
      • type.DP = "pDP": Specifies the type of differential privacy, in this case, “pure Differential Privacy” (pDP).
    • private.sd stores the privacy-preserving standard deviation.
D1 <- sort(rnorm(500, mean=3, sd=2))
D2 <- sort(rnorm(500, mean=-1, sd=0.5)) 
lb1 <--3 
# 3 std devs below mean 
lb2 <--2.5 
# 3 std devs below mean
ub1 <- 9 
# 3 std devs above mean 
ub2 <- .5 
# 3 std devs above mean


private.cov <- covDP(D1, D2, 1, lb1, ub1, lb2, ub2) 

cat("Privacy preserving covariance: ", private.cov, "\nTrue covariance: ", cov(D1, D2))


D3 <- sort(rnorm(200, mean=3, sd=2)) 
D4 <- sort(rnorm(200, mean=-1, sd=0.5)) 
M1 <- matrix(c(D1, D2), ncol=2) 
M2 <- matrix(c(D3, D4), ncol=2)
private.pooled.cov <- pooledCovDP(M1, M2, eps = 1, lower.bound1 = lb1, lower.bound2 = lb2, upper.bound1 = ub1, upper.bound2 = ub2)

  • Generating Data for Covariance
    • D1 and D2 are two datasets each containing 500 random numbers generated from normal distributions.
      • D1 is generated with a mean of 3 and a standard deviation of 2.
      • D2 is generated with a mean of -1 and a standard deviation of 0.5.
  • Defining Bounds for Covariance
    • lb1 and ub1 are the lower and upper bounds for D1, calculated as 3 standard deviations below and above the mean, respectively.
    • lb2 and ub2 are the lower and upper bounds for D2, also based on 3 standard deviations below and above the mean.
  • Calculating Differentially Private Covariance
    • covDP computes a differentially private estimate of the covariance between D1 and D2.
    • Parameters:
      • D1, D2: The datasets between which covariance is calculated.
      • 1: The privacy budget epsilon (with a value of 1).
      • lb1, ub1, lb2, ub2: The lower and upper bounds for D1 and D2.
    • private.cov stores the privacy-preserving covariance.
  • Generating Additional Data for Pooled Covariance
    • D3 and D4 are additional datasets, each containing 200 random numbers generated from normal distributions.
      • D3 has a mean of 3 and a standard deviation of 2.
      • D4 has a mean of -1 and a standard deviation of 0.5.
    • M1 is a matrix with two columns: D1 and D2.
    • M2 is a matrix with two columns: D3 and D4.
  • Calculating Differentially Private Pooled Covariance
    • pooledCovDP computes a differentially private estimate of the pooled covariance matrix between the two matrices M1 and M2.
    • Parameters:
      • M1, M2: The matrices containing the datasets for which pooled covariance is calculated.
      • eps = 1: The privacy budget epsilon.
      • lower.bound1, upper.bound1: The lower and upper bounds for the first column of data in the matrices.
      • lower.bound2, upper.bound2: The lower and upper bounds for the second column of data in the matrices.
    • private.pooled.cov stores the privacy-preserving pooled covariance.

References