
Building a Tool to Assess the Privacy Risk of Text Data - Extended

Can we build a modular tool to score the privacy risk of healthcare free-text data using open-source LLMs and NER models?

Warning

This codebase is a proof of concept and is under constant development, so it should only be used for demonstration purposes within a controlled environment.

Figure 1: High-level overview of Privacy Fingerprint using open-source models.

Aim

To develop a modular tool to score the privacy risk of healthcare free-text data with open-source tools.

Background

This concept builds upon the "Building a Tool to Assess the Privacy Risk of Text Data" article, in which ChatGPT-3.5 was used to generate synthetic unstructured data and Amazon's AWS Comprehend Medical was used to extract the entities contributing to re-identification risk.

In this project we use open-source Large Language Models (LLMs) to generate synthetic unstructured data locally and open-source named entity recognition (NER) models to extract our entities. The next step is to develop a range of experiments so that we can assess the risk associated with various scenarios.

Methodology

The current pipeline is broken down into six components (illustrative code sketches for components 2 to 6 follow the list):

  1. Generating Synthetic Patient Data using Synthea: Synthea-international is an expansion of Synthea, an open-source synthetic patient generator that produces de-identified health records for synthetic patients.
  2. Generating Synthetic Patient Medical Notes: Uses Llama 2 to generate synthetic medical notes.
  3. Re-extracting Entities from the Patient Medical Notes: Uses UniversalNER, an open-source generative large language model trained to extract a range of entities when prompted with a list of the entity types to extract.
  4. Normalising Entities Extracted for Scoring: A standardisation step to ensure that the extracted outputs can be assessed consistently, e.g. all dates are formatted the same way.
  5. Scoring the Uniqueness of the Standardised Extracted Entities: (Py)CorrectMatch is used to estimate the uniqueness of individual records, and the global uniqueness across a whole dataset, by fitting a Gaussian copula model to the data.
  6. Calculating SHAP Values from the Transformed Data Generated by the Fitted Gaussian Copula Model: SHAP (SHapley Additive exPlanations) is used to interpret and explain the results produced by (Py)CorrectMatch, which evaluates the uniqueness of data rows across the whole dataframe.
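
As a rough illustration of component 2, the sketch below generates a note with a locally hosted Llama 2 chat model via Hugging Face transformers. The model id, prompt wording, and generation settings are illustrative assumptions rather than the project's actual configuration.

```python
# Illustrative sketch of component 2: generating one synthetic medical note locally.
# Assumes the (gated) meta-llama/Llama-2-7b-chat-hf weights are available; any
# locally hosted Llama 2 variant could be swapped in.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",
)

# In the pipeline, fields from a Synthea record would be templated into the
# prompt; the values below are placeholders.
prompt = (
    "Write a short clinical note for the following synthetic patient.\n"
    "Name: Jane Example\nDate of birth: 1974-03-12\n"
    "Condition: type 2 diabetes\nMedication: metformin 500mg\n\nNote:"
)

note = generator(prompt, max_new_tokens=200, do_sample=True, temperature=0.7)
print(note[0]["generated_text"])
```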
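
For component 3, the sketch below prompts a UniversalNER checkpoint for one entity type at a time. The model id (Universal-NER/UniNER-7B-type) and the conversation-style prompt follow the template published by the UniversalNER authors, but both should be verified against the current release; the note text is a placeholder.

```python
# Illustrative sketch of component 3: prompting UniversalNER for one entity
# type at a time. The conversation template below mirrors the one published by
# the UniversalNER authors and should be checked against the current release.
from transformers import pipeline

ner_llm = pipeline(
    "text-generation",
    model="Universal-NER/UniNER-7B-type",
    device_map="auto",
)

synthetic_note = (
    "Jane Example, born 12 March 1974, attended clinic on 02/05/2023 and was "
    "prescribed metformin 500mg for type 2 diabetes."
)

def extract(note: str, entity_type: str) -> str:
    prompt = (
        "A virtual assistant answers questions from a user based on the "
        "provided text.\n"
        f"USER: Text: {note}\n"
        "ASSISTANT: I've read this text.\n"
        f"USER: What describes {entity_type} in the text?\n"
        "ASSISTANT:"
    )
    output = ner_llm(prompt, max_new_tokens=128, do_sample=False)
    # The model answers after the final "ASSISTANT:" marker, typically as a
    # JSON-style list of strings.
    return output[0]["generated_text"].split("ASSISTANT:")[-1].strip()

for entity_type in ["person", "date", "medication"]:
    print(entity_type, "->", extract(synthetic_note, entity_type))
```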
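
Component 4 can be illustrated with a simple date example: free-text dates extracted by the NER model are parsed and re-emitted in one canonical format so that downstream scoring treats them as the same value. The sketch uses python-dateutil; the exact normalisation rules in the project may differ.

```python
# Illustrative sketch of component 4: coercing differently formatted date
# strings into a single ISO representation before scoring.
from dateutil import parser

raw_dates = ["12 March 1974", "1974-03-12", "12/03/1974"]

normalised = [parser.parse(d, dayfirst=True).date().isoformat() for d in raw_dates]
print(normalised)  # ['1974-03-12', '1974-03-12', '1974-03-12']
```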
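
Component 5 relies on (Py)CorrectMatch, which fits a Gaussian copula so that uniqueness can be estimated for the wider population rather than just the sample. The sketch below only illustrates the underlying notion of record uniqueness, computed empirically with pandas on placeholder columns; it is not the CorrectMatch API.

```python
# Illustrative sketch of the idea behind component 5: a record is unique if no
# other record shares the same combination of (normalised) entity values.
# CorrectMatch goes further by fitting a Gaussian copula to estimate population
# uniqueness; this pandas version measures it empirically on the sample only.
import pandas as pd

records = pd.DataFrame(
    {
        "birth_year": [1974, 1974, 1980, 1980],
        "postcode_district": ["LS1", "LS1", "M1", "M1"],
        "condition": ["diabetes", "asthma", "asthma", "asthma"],
    }
)

is_unique = ~records.duplicated(keep=False)   # per-record uniqueness flag
print(is_unique.tolist())                     # [True, True, False, False]
print(is_unique.mean())                       # 0.5 -> half the records are unique
```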
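
For component 6, the sketch below shows the general shape of a SHAP explanation: a model that predicts a per-record score is attributed back to its input columns. A scikit-learn random forest on random placeholder data stands in for the copula-transformed data and the CorrectMatch uniqueness scores used in the actual pipeline.

```python
# Illustrative sketch of component 6: attributing a per-record score back to
# the entity columns with SHAP. The random-forest surrogate and the random
# placeholder data stand in for the copula-transformed data and CorrectMatch
# uniqueness scores used in the real pipeline.
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 3)).astype(float)  # encoded entity columns
y = rng.random(200)                                   # placeholder per-record scores

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
print(shap_values.shape)  # (200, 3): one contribution per record per column
```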

Results

TBC

Output                           Link
Open Source Code                 GitHub
Open Documentation               GitHub.io
Case Study                       NA
Technical report                 NA
Algorithmic Impact Assessment    NA