ePMA Auto-Coding

A rule-based data processing pipeline that maps free text medication descriptions to SNOMED codes using the dm+d standard.

The Problem

The ePMA (electronic prescribing and medicine administration) data collection is a person-level data asset containing information on what medication is prescribed and administered in secondary care settings. The ePMA data that hospitals submit contain a number of free text fields including: medication name, dosage, route and frequency. This creates some challenges when aiming to collect and eventually disseminate this data:

Free text can potentially contain patient identifiable data (PID), which is not permitted
Free text descriptions of medication make analysis hard because the same medication can be described in multiple ways

Our Solution

The ePMA auto-coding pipeline maps the free text descriptions of medications to valid SNOMED codes using dm+d (dictionary of medicines and devices). The pipeline runs a series of deterministic steps, shown here at a high level only:

Filter submissions - records that have been mapped before or deemed to be unmappable are not run through the pipeline again
Preprocessing - convert to lowercase, replace words and punctuation
Read data - load the input ePMA data together with the dm+d reference data at VTM (chemical), VMP (product) and AMP (branded product) level
Exact matching - does the input exactly match any of the dm+d reference data?
Entity matching - split the medication description into moiety, unit, strength and dose form and attempt to match to dm+d reference data
Fuzzy matching* - compare the input against the dm+d reference data using fuzzy logic
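
The preprocessing, exact matching and entity matching steps above can be sketched in Python as follows. This is a minimal illustration only, not the pipeline's actual code: the replacement rules, the reference lookup, the placeholder codes, the regular expression and all function names are assumptions made for the example.

```python
import re

# Hypothetical word replacements; the real pipeline's rules are not shown here.
REPLACEMENTS = {"tabs": "tablets", "caps": "capsules"}

# Hypothetical reference data: description -> (dm+d level, placeholder code).
REFERENCE = {
    "paracetamol": ("VTM", "CODE-VTM-1"),
    "paracetamol 500mg tablets": ("VMP", "CODE-VMP-1"),
}

# Assumed pattern: moiety, then strength and unit, then an optional dose form.
ENTITY_PATTERN = re.compile(
    r"^(?P<moiety>[a-z ]+?)\s+(?P<strength>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>mcg|mg|ml|g)\s*(?P<form>[a-z ]*)$"
)


def preprocess(text):
    """Lowercase, drop punctuation (keeping decimal points), replace words."""
    text = text.lower()
    text = re.sub(r"[^\w\s.]|\.(?!\d)", " ", text)
    return " ".join(REPLACEMENTS.get(word, word) for word in text.split())


def split_entities(text):
    """Split a cleaned description into moiety, strength, unit and form."""
    match = ENTITY_PATTERN.match(text)
    return match.groupdict() if match else None


def map_description(raw):
    """Try exact matching first, then fall back to matching on the moiety."""
    text = preprocess(raw)
    hit = REFERENCE.get(text)
    if hit:
        return {"step": "exact", "level": hit[0], "code": hit[1]}
    entities = split_entities(text)
    if entities:
        hit = REFERENCE.get(entities["moiety"])
        if hit:
            return {"step": "entity", "level": hit[0], "code": hit[1]}
    return None  # would be passed on to fuzzy matching


print(map_description("Paracetamol 500mg Tabs."))
print(map_description("Paracetamol 250mg caps"))
```

In this sketch the second description has no exact match, so it falls back to the moiety and maps at VTM level even though a strength and form were supplied.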

*Fuzzy matching uses distance metrics to measure how different two pieces of text are. In its simplest form it compares two words by counting how many characters need to be removed, added or changed. For example, the fuzzy matching score between "internationalization" and "internationalisation" is 95.00 because one character needs changing; a score of 100 means the words are identical. This helps to pick up spelling mistakes and spelling differences. More complex versions of fuzzy matching allow whole sentences to be compared, with options to ignore word order, duplicated words and partial matches within a sentence.
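
To make the scoring concrete, here is a small standard-library sketch of an edit-distance score and a word-order-insensitive variant. It assumes a score of 100 × (1 − edits / length of the longer string); the pipeline may use a different library or metric, and the function names are illustrative only.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Score from 0 to 100, where 100 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (longest - levenshtein(a, b)) / longest


def token_sort_similarity(a, b):
    """Compare after sorting words, so that word order is ignored."""
    return similarity(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


print(similarity("internationalization", "internationalisation"))  # 95.0
print(token_sort_similarity("tablets paracetamol", "paracetamol tablets"))  # 100.0
```

Both words in the first comparison are 20 characters long and differ by one substitution, which gives 100 × 19 / 20 = 95.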

A Sankey diagram showing the logical flow of submissions through the pipeline

Results

A review of the most common prescriptions and administrations found:

93% correct
4% wrong granularity - meaning extra information such as dose and form was provided but auto-coding only mapped to the parent level
3% incorrect

Granularity:

Level	Proportion
Virtual Therapeutic Moiety (VTM)	49.1%
Virtual Medicinal Product (VMP)	42.6%
Actual Medicinal Product (AMP)	8.3%

General:

Richer data more useful for analysis and other downstream applications
Pipeline capable of processing many thousands of records at a time
Rules can be modified when new ePMA supplier systems start submitting data

Incorrect matches:

Sometimes the autocoding will make a small number of incorrect matches. Where incorrect matches are discovered during regular reviews, they are excluded from future mappings and the cause is published as part of our known potential issues section, for analysts to be aware of the limitations of the model.

Review process:

The data science team is scheduled to review a sample of the mapping results every six months to re-assess our accuracy information and monitor for model drift.

Outputs

Output	Link
Published repo	GitHub Repo