
Considerations for Building a Language Corpus with a Focus on the NHS

Which available tools could be used to build an NHS-focussed collection of texts that would help developers build better NLP tools for the healthcare system?

Ingest box containing the Scrapy logo and a screenshot of the NHS.uk website; Enrich box containing logos for Helin, brat and doccano; Share box containing Hugging Face and a database. Beneath the boxes are the Docker, SQLPad, Elasticsearch and Caddy logos.

Figure 1: Open source tools used in each functional setting

We aimed to explore how to build an Open, Representative, Extensible and Useful set of tools to curate, enrich and share sources of healthcare text data in an appropriate manner.

Results

Although we developed a tool stack that achieved many of our objectives, the key learning points concerned knowledge gaps at both the data and tooling levels, which need to be addressed before these data can be brought together.

Output | Link
Open Source Code & Documentation | GitHub
Case Study | n/a
Blog | Here