Developing our SynthVAE code

Improving our variational autoencoder to consider fairness and to run on non-Gaussian distributions

This project continues the previous development of our variational autoencoder (VAE), correcting an error discovered after the last project finished. The error appears when generating data for continuous variables that follow non-Gaussian distributions. Previously, standard scaling had been used to normalise these variables, which caused the non-Gaussian variables to be synthesised poorly. This was replaced with a Gaussian mixture model from the RDT Python library, which scales and transforms these variables into ones with a Gaussian distribution.
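The mixture-based transform can be sketched as mode-specific normalisation: fit a Gaussian mixture to the variable, then standardise each value by the mean and standard deviation of its most likely component. The sketch below is illustrative only, using scikit-learn's BayesianGaussianMixture rather than the actual RDT transformer; the function names and the simple per-mode standardisation are our own simplification of the idea.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def fit_mode_normaliser(x, max_components=5, seed=0):
    """Fit a Bayesian GMM to a 1-D continuous variable."""
    gmm = BayesianGaussianMixture(n_components=max_components, random_state=seed)
    gmm.fit(x.reshape(-1, 1))
    return gmm

def transform(gmm, x):
    """Standardise each value by its most likely mixture component."""
    modes = gmm.predict(x.reshape(-1, 1))
    means = gmm.means_.ravel()[modes]
    stds = np.sqrt(gmm.covariances_.reshape(-1)[modes])
    return (x - means) / stds, modes

def inverse_transform(gmm, z, modes):
    """Map standardised values back to the original scale."""
    means = gmm.means_.ravel()[modes]
    stds = np.sqrt(gmm.covariances_.reshape(-1)[modes])
    return z * stds + means
```

A bimodal variable that a single standard scaler would distort round-trips cleanly through this transform, since each value is rescaled relative to its own mode.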

The second phase of this work focused on understanding the different ways of measuring and implementing fairness within synthetic data.

Results

The Gaussian mixture model was able to cope with non-Gaussian variables, greatly extending the range of datasets we can generate from. Additional hyper-parameter tuning and general coding improvements have increased the reusability and performance of the code.

Regarding fairness, there are many metrics to choose from and, to make the situation more complex, not all metrics are compatible with one another: you might be able to satisfy an equalised odds metric for a group but not equal prediction rates for the same group. This means that the level of fairness required is project specific and has to be re-evaluated depending on the research needs.
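To illustrate why such metrics can conflict, the toy example below (our own construction, not code from the project) computes two common group-fairness metrics for a binary classifier. The predictions satisfy demographic parity exactly (equal positive-prediction rates across groups) while failing equal opportunity (unequal true-positive rates).

```python
import numpy as np

def demographic_parity_diff(y_pred, group):
    """Absolute difference in positive-prediction rates between the two groups."""
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def equal_opportunity_diff(y_true, y_pred, group):
    """Absolute difference in true-positive rates between the two groups."""
    tpr = lambda g: y_pred[(group == g) & (y_true == 1)].mean()
    return abs(tpr(0) - tpr(1))

# Toy labels and predictions for two groups of four people each.
y_true = np.array([1, 1, 0, 0, 1, 0, 0, 0])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])

print(demographic_parity_diff(y_pred, group))         # 0.0 (parity holds)
print(equal_opportunity_diff(y_true, y_pred, group))  # 0.5 (equal opportunity fails)
```

Both groups receive positive predictions at the same rate (0.5), yet the true-positive rate is 0.5 in one group and 1.0 in the other, so no threshold choice here satisfies both criteria at once.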

Further work will explore the adoption of directed acyclic graphs to control for fairness, and the impact this has on quality and privacy.

Output Links
Open Source Code & Documentation Github
Case Study Awaiting Sign-Off
Technical report Here