3. De-Identification Methods

De-identification techniques are methods used to remove or mask personally identifiable information (PII) from data while retaining its utility for analysis or other purposes.

The methods covered are Presidio, AnonCAT, MASK, and the Cloud Data Loss Prevention (DLP) API.

Presidio is a de-identification SDK maintained by Microsoft.

Uses regular expressions to recognise patterns.
Uses a named-entity recognition (NER) model (the default is en_core_web_lg; any spaCy model is supported).
Validates matched patterns (for example, with checksums).
Uses surrounding context words to increase detection confidence.
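
As a rough illustration of how the regex, validation, and context steps fit together, here is a minimal standard-library sketch. It does not use Presidio's own API; the card pattern, the context words, the scores, and the names `detect_cards` and `luhn_valid` are all assumptions made for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Finding:
    entity_type: str
    text: str
    start: int
    end: int
    score: float


# Hypothetical pattern and context words, chosen for illustration only.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")
CONTEXT_WORDS = {"card", "credit", "visa"}


def luhn_valid(number: str) -> bool:
    """Checksum validation: True if the digits pass the Luhn check."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def detect_cards(text, base_score=0.5, context_boost=0.35, window=30):
    """Regex match, then checksum validation, then a context-based score boost."""
    findings = []
    for m in CARD_PATTERN.finditer(text):
        if not luhn_valid(m.group()):
            continue  # validation step: drop matches that fail the checksum
        before = text[max(0, m.start() - window):m.start()].lower()
        words = set(re.findall(r"[a-z]+", before))
        boost = context_boost if words & CONTEXT_WORDS else 0.0
        score = min(base_score + boost, 1.0)
        findings.append(Finding("CREDIT_CARD", m.group(), m.start(), m.end(), score))
    return findings
```

With these assumed values, a valid card number preceded by "credit card" scores higher than the same number with no context, and a number that fails the checksum is not reported at all.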

AnonCAT is a transformer-based approach for redacting text from electronic health records.

It uses an NER model, en_core_sci_md, to detect all medical terms. Each entity is then assigned an ID in a biomedical database (UMLS) to normalise the outputs (for example, to decipher whether a term refers to a new diagnosis, history, or reason for admission).
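
The normalisation step can be pictured with a toy sketch in which different surface forms of the same concept map to one shared ID. This is only an assumption-laden illustration: a dictionary lookup stands in for the NER model, the vocabulary is tiny, and the concept IDs are made-up placeholders rather than real UMLS identifiers.

```python
import re

# Hypothetical mini-vocabulary: surface form -> made-up concept ID.
# A real system would link to UMLS identifiers; these IDs are placeholders.
CONCEPTS = {
    "heart attack": "CONCEPT_001",
    "myocardial infarction": "CONCEPT_001",
    "high blood pressure": "CONCEPT_002",
    "hypertension": "CONCEPT_002",
}


def link_terms(text):
    """Return (surface_form, concept_id, start, end) for each known term, in text order."""
    lowered = text.lower()
    found = []
    # Try longer terms first so they win over any shorter overlapping term.
    for term in sorted(CONCEPTS, key=len, reverse=True):
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            overlaps = any(m.start() < e and s < m.end() for _, _, s, e in found)
            if overlaps:
                continue
            found.append((text[m.start():m.end()], CONCEPTS[term], m.start(), m.end()))
    return sorted(found, key=lambda f: f[2])
```

Because "heart attack" and "myocardial infarction" share one placeholder ID here, downstream code can treat them as the same concept regardless of how the clinician wrote it.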

MASK is Manchester University's de-identification framework, built on named-entity recognition.

BiLSTM layer: considers the possible tags for each token (beginning of an entity, inside an entity, no entity) and assigns each a confidence value to help determine the token's label.
CRF (Conditional Random Field) layer: uses the observed data to predict the labels of the sequence while taking into account the dependencies between neighbouring labels.
GloVe embeddings: learn a word vector space that incorporates both the local context of words and their co-occurrence with other words across the corpus.
ELMo embeddings: handle words that have the same spelling but different meanings (polysemy) by taking the entire input sentence into account when calculating each word's embedding.
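
To show why a CRF layer on top of per-token scores matters, the sketch below runs Viterbi decoding over the three tags described above. It is a minimal illustration, not MASK's implementation: the emission scores stand in for what a BiLSTM might output, and every number, including the penalty on the O-to-I transition, is invented for the example.

```python
TAGS = ["B", "I", "O"]  # beginning of entity, inside entity, no entity


def viterbi(emissions, transitions, start):
    """Return the highest-scoring tag sequence.

    emissions: list of {tag: score}, one dict per token
    transitions: {(prev_tag, cur_tag): score}
    start: {tag: score} for the first token
    """
    # best[t][tag] = best score of any path that ends in `tag` at token t
    best = [{tag: start[tag] + emissions[0][tag] for tag in TAGS}]
    back = []
    for em in emissions[1:]:
        scores, pointers = {}, {}
        for cur in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p] + transitions[(p, cur)])
            scores[cur] = best[-1][prev] + transitions[(prev, cur)] + em[cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t])
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


# Invented per-token scores for the tokens "John", "Smith", "arrived".
emissions = [
    {"B": 1.0, "I": 0.2, "O": 1.2},
    {"B": 0.3, "I": 1.5, "O": 0.4},
    {"B": 0.1, "I": 0.2, "O": 2.0},
]
# Neutral transitions, except that "I" may not follow "O" or start a sequence.
transitions = {(p, c): 0.0 for p in TAGS for c in TAGS}
transitions[("O", "I")] = -10.0
start = {"B": 0.0, "I": -10.0, "O": 0.0}

greedy = [max(TAGS, key=em.get) for em in emissions]
decoded = viterbi(emissions, transitions, start)
```

Picking the top tag per token independently gives O, I, O, which is not a valid labelling because an entity cannot continue without having begun. Taking the neighbouring-label dependencies into account, the decoder returns B, I, O instead.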

The Cloud Data Loss Prevention (DLP) API is Google's API for detecting privacy-sensitive fragments in text, images, and Google Cloud Platform storage repositories.

This API implements a range of techniques; some notable ones are:

Using basic regex matching for some personally identifiable data (PID), e.g. phone numbers.
Using a hotword rule to instruct Sensitive Data Protection to adjust the likelihood of a finding, depending on whether a hotword occurs near that finding.
Using exclusion rules to exclude false or unwanted findings (by identifying custom substrings within a string). For example, when a name appears inside an email address, the email address is kept as the finding and the name is not.
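
The three techniques above can be mimicked in a few lines of plain Python to make their interaction concrete. This sketch does not call the DLP client library; the regexes, the names list, the hotwords, the window size, and the function name `inspect` are assumptions for illustration, and the likelihood labels are used only as plain strings.

```python
import re

# Hypothetical detectors, for illustration only.
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
NAMES = re.compile(r"\b(?:alice|bob)\b", re.IGNORECASE)
HOTWORD = re.compile(r"\b(?:phone|tel|call)\b", re.IGNORECASE)


def inspect(text, window_before=20):
    """Return (info_type, matched_text, likelihood) tuples."""
    findings = []

    # Regex matching plus a hotword rule: raise the likelihood when a
    # hotword appears shortly before the match.
    for m in PHONE.finditer(text):
        nearby = text[max(0, m.start() - window_before):m.start()]
        likelihood = "VERY_LIKELY" if HOTWORD.search(nearby) else "POSSIBLE"
        findings.append(("PHONE_NUMBER", m.group(), likelihood))

    email_spans = []
    for m in EMAIL.finditer(text):
        email_spans.append(m.span())
        findings.append(("EMAIL_ADDRESS", m.group(), "LIKELY"))

    # Exclusion rule: drop a name finding that lies inside an email finding.
    for m in NAMES.finditer(text):
        inside_email = any(s <= m.start() and m.end() <= e for s, e in email_spans)
        if inside_email:
            continue
        findings.append(("PERSON_NAME", m.group(), "POSSIBLE"))

    return findings
```

For a sentence such as "Bob can be reached by phone on 555-010-0199 or at alice@example.com.", this sketch reports the phone number with a raised likelihood, the email address, and "Bob", while "alice" is excluded because it sits inside the email finding.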