3. Skip to content

3. De-Identification Methods

Deidentification techniques refer to methods used to remove or mask personally identifiable information (PII) from data, while still retaining its utility for analysis or other purposes.

De-Identification Methods

Presido is a de-identification SDK owned by microsoft.

  • Regex to recognise patterns.
  • Use Named-entity recognition model (default is set to en_core_web_lg and supports any Spacy model.)
  • Validating patterns
  • Uses context to increase detection confidence.

AnonCAT is a transformer based approach fo redacting text from electronic health records.

It uses a NER model, en_core_sci_md, to detect all medical terms. Then they assign each entitity to an ID in a biomedical databases (UMLS) to normalise the outputs. (decipher new diagnosis, history, or reason for admission.)

MASK is Manchester University de-identification framework for named-entitity-recognition.

  • BiLSTM layer - essentially considers each of the entities (beginning, in the middle, no entity) - and assigns a confidence value to help determine it's label across.
  • CRFs Layer use the observed data to predict the labels of the sequence, while taking into account the dependencies between neighbouring labels. (Conditional Random Field)
  • GLoVe embeddings - determines a word vector space that incorporates both the local context of words but also their co-occurence with other words across the space.
  • ELMo embeddings - considers where words have the same spelling but different meaning (Polysemy.) - takes the word representations and then take the entire input sentence into equation for calculating the word embeddings.

DLP API is googles API for detection of privacy-sensitive fragments in text, images, and Google Cloud Platform storage repositories.

There is a range of techniques that are implemented in this API, some noticeable ones being:

  • Using basic RegexMatching for some PID data. etc. phone numbers.
  • Using a hotword rule to instruct Sensitive Data Protection to adjust the likelihood of a finding, depending on whether a hotword occurs near that finding.
  • Using exclusion rules to exclude false or unwanted finding (identifying custom substrings within a string.) - for example name inside an email.(Take the email, and not the name)