Skip to content

Investigating Differential Privacy in a Variational AutoEncoder for Synthetic Data Generation

The initial creation of a variational autoencoder with differential privacy for generating single table tabular gaussian data. This work demonstrated the feasibility of this approach for healthcare and fed into further interactions of the code base.

Diagram showing the flow of data in a variational autoencoder. Starts with an input labeled as "x" on the left, which is passed through an NN arrow. This encoder outputs two posterior parameters: μ and variance σ². These are then used to sample a latent variable "z" from a normal distribution z ~ N(μ, σ²). "z" is passed to an NN decoder, which reconstructs the input, producing the reconstruction labeled as "x̂" on the right. Figure 1: Schematic Representation of a Variational Autoencoder

This project investigates the potential suitability of Variational Autoencoders (VAEs) as a synthetic data generation tool in the context of the NHS. To effectively address this direction, this work focussed on four key aspects: quality, privacy, ease of use, and interpretability.

We evaluate the performance of the VAE approach alongside five alternative methods available in July/August 2021, namely Gaussian Copula, CTGAN, CopulaGAN, SDV’s TVAE and Independent (a model which assumes independence across variables). Evaluating this set of models provides context to the performance of the VAE with respect to both basic (e.g. Independent) and complex (e.g. CTGAN) approaches.

We then tested how the metrics and visualisations changed when differential privacy was incorporated into the variational autoencoder as a function of differential levels of privacy (increased privacy budget).

Results

We found that a variational autoencoder could indeed generate medium to high fidelity synthetic data for a single tabular table with numerical and categorical gaussian variables.

As the privacy budget increases, we see the quality decrease as expected. However, the level of privacy increase associated with increasing the privacy budget appears not to have a direct correlation. This warrants further work as this might be down to implementation, the metrics being used for evaluation or may point to a feature of the VAE not incorporating the differential privacy correctly.

Output Link
Open Source Code & Documentation Github
Case Study Awaiting Sign-Off
Technical report Here