Building a Tool to Assess the Privacy Risk of Text Data - Extended
Can we build a modular tool to score the privacy risk of healthcare free-text data using open-source LLMs and NER models?
Warning
This codebase is a proof of concept and is under constant development; it should only be used for demonstration purposes within a controlled environment.
Aim
To develop a modular tool that scores the privacy risk of healthcare free-text data using open-source models and libraries.
Background
This concept builds upon the "Building a Tool to Assess the Privacy Risk of Text Data" article, where ChatGPT-3.5 was used to generate synthetic unstructured data and Amazon Comprehend Medical was used to extract the entities contributing to re-identification risk.
In this project we use open-source Large Language Models (LLMs) to generate synthetic unstructured data locally, and open-source named-entity recognition (NER) models to extract our entities. The next step is to develop a range of experiments so we can assess the risk associated with various scenarios.
Methodology
The current pipeline has been broken down into six components; illustrative code sketches for the main steps follow the list:
- Generating Synthetic Patient Data using Synthea: Synthea-international is an expansion of Synthea, an open-source synthetic patient generator that produces de-identified health records for synthetic patients.
- Generating Synthetic Patient Medical Notes: Utilizes Llama 2 to generate synthetic medical notes.
- Re-extracting Entities from the Patient Medical Notes: Utilizes UniversalNER, an open-source generative large language model trained to extract a range of entities when prompted with the entity types of interest.
- Normalising Entities Extracted for Scoring: A standardisation step that ensures the extracted outputs can be consistently assessed, e.g., all dates are formatted the same way.
- Scoring the Uniqueness of the Standardised Entities Extracted: (Py)CorrectMatch is used to estimate the uniqueness of individual records, and the global uniqueness across the whole dataset, by fitting a Gaussian copula model to the data.
- Calculating SHAP Values from the Transformed Data of a Fitted Gaussian Copula Model: SHAP (SHapley Additive exPlanations) is used to interpret and explain the results produced by (Py)CorrectMatch, which evaluates the uniqueness of data rows across the whole dataframe.
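As an illustration of the note-generation step, here is a minimal sketch that prompts a Llama 2 chat checkpoint through Hugging Face transformers. The checkpoint name, prompt, and sampling settings are assumptions for demonstration, not the project's exact configuration:

```python
# Minimal sketch: generate one synthetic medical note with a Llama 2 chat model.
# The checkpoint is gated on Hugging Face and requires access approval.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",  # assumed checkpoint
    device_map="auto",
)

# Illustrative prompt built from fields a synthetic patient record might provide.
prompt = (
    "Write a short clinical progress note for the following synthetic patient:\n"
    "Name: Jane Example, DOB: 1982-03-03, Condition: type 2 diabetes, "
    "Medication: metformin 500mg."
)

note = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)
print(note[0]["generated_text"])
```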
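The extraction step prompts the NER model one entity type at a time. The sketch below follows the conversation-style prompt template published with UniversalNER; the checkpoint name and generation settings are assumptions:

```python
# Minimal sketch: ask a UniversalNER checkpoint for one entity type.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Universal-NER/UniNER-7B-type"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

def extract_entities(text: str, entity_type: str) -> str:
    # Conversation-style template from the UniversalNER repository.
    prompt = (
        "A virtual assistant answers questions from a user based on the provided text.\n"
        f"USER: Text: {text}\n"
        "ASSISTANT: I've read this text.\n"
        f"USER: What describes {entity_type} in the text?\n"
        "ASSISTANT:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128)
    # Return only the newly generated tokens (the model's answer).
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

print(extract_entities("Jane Example was seen on 3rd March 2024.", "date"))
```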
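Normalisation can be as simple as coercing every extracted date string into one ISO format. A minimal sketch with python-dateutil, assuming UK-style day-first dates:

```python
# Minimal sketch: normalise free-text dates to ISO 8601 (YYYY-MM-DD).
from dateutil import parser

def normalise_date(raw: str) -> str:
    """Parse a free-text date and return it in a single consistent format."""
    try:
        # dayfirst=True reflects an assumed UK-style "03/04/2024" = 3 April.
        return parser.parse(raw, dayfirst=True).date().isoformat()
    except (ValueError, OverflowError):
        return raw  # leave unparseable values untouched for manual review

print(normalise_date("3rd March 1982"))  # -> 1982-03-03
print(normalise_date("03/04/2024"))      # -> 2024-04-03
```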
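Uniqueness scoring then follows the PyCorrectMatch interface: fit a Gaussian copula to the table of normalised, discretised entities, and query both global and per-record uniqueness. The input file name and population size below are illustrative, and the package calls into a Julia CorrectMatch backend that must be installed separately:

```python
# Sketch of uniqueness scoring with PyCorrectMatch; function names follow the
# package README, while the input file and population size are assumptions.
import correctmatch
import pandas as pd

df = pd.read_csv("normalised_entities.csv")  # hypothetical output of the previous step

correctmatch.precompile()      # warm up the Julia backend
fitted = correctmatch.fit(df)  # fit a Gaussian copula to the discrete records

# Global uniqueness: the fraction of records expected to be unique.
print(correctmatch.uniqueness(df))

# Individual uniqueness of the first record in an assumed population of 1,000.
record = list(df.iloc[0])
print(correctmatch.individual_uniqueness(fitted, record, 1000))
```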
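Finally, per-record uniqueness scores can be attributed back to individual attributes with a model-agnostic explainer. The sketch below wraps the scoring function in SHAP's KernelExplainer, reusing `fitted` and `df` from the previous snippet; the background sample size and population size are assumptions:

```python
# Sketch: explain which attributes drive a record's uniqueness score.
import numpy as np
import shap

def score_records(rows: np.ndarray) -> np.ndarray:
    # Hypothetical wrapper: per-record uniqueness from the fitted copula.
    return np.array([
        correctmatch.individual_uniqueness(fitted, list(r), 1000) for r in rows
    ])

background = shap.sample(df.values, 50)  # background set for KernelExplainer
explainer = shap.KernelExplainer(score_records, background)
shap_values = explainer.shap_values(df.values[:10])  # explain the first 10 records

# Attribute-level view of what makes records identifiable.
shap.summary_plot(shap_values, df.iloc[:10], feature_names=list(df.columns))
```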
Results
TBC
| Output | Link |
|---|---|
| Open Source Code | GitHub |
| Open Documentation | GitHub.io |
| Case Study | NA |
| Technical report | NA |
| Algorithmic Impact Assessment | NA |