2. Standardising Entity Types and their Values

This GitHub repository collects a large body of material on named-entity recognition, including methodologies for entity disambiguation and linking.

2.1 Preprocessing Entity Values

Lowercasing
Removing Punctuation
Spell Correction
- PySpellChecker: Uses a Levenshtein Distance algorithm to find permutations within an edit distance of 2 from the original word.
- TextBlob: Spelling correction is based on Peter Norvig’s “How to Write a Spelling Corrector” as implemented in the pattern library. It is about 70% accurate.
- autocorrect: Also based on Peter Norvig's work.
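The three preprocessing steps above can be sketched with the standard library alone. This is a minimal illustration, not the implementation used by PySpellChecker, TextBlob, or autocorrect: the function names and the toy word-frequency table `COUNTS` are assumptions made for the example, and the corrector follows the general idea of Norvig's approach (generate candidates within edit distance 2, keep the most frequent known word).

```python
import re
import string
from collections import Counter


def normalise(value: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    value = value.lower()
    value = value.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", value).strip()


def edits1(word: str) -> set[str]:
    """All strings one delete, transpose, replace, or insert away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word: str, counts: Counter) -> str:
    """Return the most frequent known word within edit distance 2, else `word`."""
    if word in counts:
        return word
    one_edit = edits1(word)
    known = {w for w in one_edit if w in counts}
    if not known:
        # Fall back to candidates two edits away.
        known = {w2 for w1 in one_edit for w2 in edits1(w1) if w2 in counts}
    return max(known, key=counts.get) if known else word


# Hypothetical toy vocabulary; a real corrector needs a large frequency list.
COUNTS = Counter({"hospital": 50, "patient": 40, "diabetes": 30})

print(normalise("  St. Mary's   HOSPITAL! "))  # st marys hospital
print(correct("hospitl", COUNTS))              # hospital
```

Normalising before spell correction keeps the candidate search small, since the corrector only ever sees lowercase alphabetic tokens.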

Ab3P: Biomedical-specific acronym expansion tool trained on PubMed abstracts.
Spacy AbbrX: Uses pre-trained spaCy models to expand acronyms.
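To illustrate what acronym expansion involves, the sketch below uses a naive initial-letter heuristic: it looks for a parenthesised acronym and checks whether the initials of the preceding words spell it out. This is not the algorithm used by Ab3P or Spacy AbbrX; the function names are hypothetical, and real tools handle far more cases (stop words inside the long form, hyphens, lowercase acronyms).

```python
import re


def find_acronym_definitions(text: str) -> dict[str, str]:
    """Map each parenthesised acronym to the preceding words whose initials spell it."""
    definitions = {}
    for match in re.finditer(r"\(([A-Z]{2,10})\)", text):
        acronym = match.group(1)
        words = text[:match.start()].split()
        candidate = words[-len(acronym):]
        # Accept only if there is one word per letter and every initial matches.
        if len(candidate) == len(acronym) and all(
            word[0].upper() == letter for word, letter in zip(candidate, acronym)
        ):
            definitions[acronym] = " ".join(candidate)
    return definitions


def expand_acronyms(text: str, definitions: dict[str, str]) -> str:
    """Replace standalone acronyms with their long form, keeping the '(ACR)' definition."""
    for acronym, long_form in definitions.items():
        # The lookarounds skip the parenthesised occurrence that defines the acronym.
        pattern = rf"(?<!\()\b{re.escape(acronym)}\b(?!\))"
        text = re.sub(pattern, long_form, text)
    return text


text = (
    "Patients with chronic kidney disease (CKD) were enrolled. "
    "CKD stages were recorded."
)
definitions = find_acronym_definitions(text)
print(definitions)
print(expand_acronyms(text, definitions))
```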

Coreference resolution: Implemented in the Neo4j pipeline, this step replaces each pronoun with the entity it refers to, which helps resolve relationships between entities.
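Pronoun replacement of this kind can be sketched as follows, assuming a coreference model has already produced clusters of mentions. The cluster format (lists of character spans, with the first span treated as the representative mention), the `PRONOUNS` set, and the function name are all assumptions for illustration; they do not reflect the output format of any particular library or of the Neo4j pipeline.

```python
# Subject and object pronouns only; possessives would need "'s" handling.
PRONOUNS = {"he", "she", "it", "they", "him", "them"}


def resolve_pronouns(text: str, clusters: list[list[tuple[int, int]]]) -> str:
    """Replace pronoun mentions with the first (representative) mention of their cluster."""
    replacements = []
    for cluster in clusters:
        rep_start, rep_end = cluster[0]
        representative = text[rep_start:rep_end]
        for start, end in cluster[1:]:
            if text[start:end].lower() in PRONOUNS:
                replacements.append((start, end, representative))
    # Apply from right to left so that earlier character offsets stay valid.
    for start, end, representative in sorted(replacements, reverse=True):
        text = text[:start] + representative + text[end:]
    return text


text = "Marie Curie won the Nobel Prize. She shared it with Pierre Curie."
# Hypothetical model output: one cluster per entity, as (start, end) offsets.
clusters = [
    [(0, 11), (33, 36)],   # "Marie Curie", "She"
    [(16, 31), (44, 46)],  # "the Nobel Prize", "it"
]
print(resolve_pronouns(text, clusters))
```

After this step every sentence names its entities explicitly, so a downstream relation extractor does not need to look across sentence boundaries.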

DateParser: Parses dates written in different formats into a single, consistent format.
Microsoft Recognizers-Text: Open source and can be run locally. Supports multiple languages and a range of data types, as detailed in the README of its GitHub repository.
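The idea of parsing dates into one format can be illustrated with the standard library. The sketch below is not the DateParser or Recognizers-Text API: it simply tries a fixed list of patterns and emits ISO 8601. The `FORMATS` tuple and the function name are assumptions, slash-separated dates are assumed to be day-first, and month names are matched in the process locale (English is assumed here).

```python
from datetime import datetime
from typing import Optional

# Assumed input patterns, tried in order; extend as needed for real data.
FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %B %Y", "%B %d, %Y", "%d %b %Y")


def to_iso_date(value: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no known pattern matches."""
    cleaned = " ".join(value.split())  # trim and collapse whitespace
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None


for raw in ["03/04/2021", "March 4, 2021", "4 Mar 2021", "not a date"]:
    print(raw, "->", to_iso_date(raw))
```

A dedicated library is preferable in practice because it also handles relative expressions ("yesterday", "in two weeks") and many locales, which a fixed pattern list cannot.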

ExtEnD (Extractive Entity Disambiguation): Can be integrated with the spaCy framework to disambiguate entity mentions using the surrounding context.

Wikification

REL (Radboud Entity Linker): Uses the English Wikipedia as a knowledge source. Maps entities to Wikipedia IDs and normalizes outputs.
Wikimapper: Small Python library that maps Wikipedia page titles to Wikidata IDs and vice versa.
spacy_entity_linker: Uses a knowledge base (Wikipedia) to find similar entities.
Neo4j (uses Bloom): Resolves entities.
BLINK: Facebook's architecture for linking entities to Wikipedia.
SpacyFishing: Framework to fetch wiki IDs for extracted entity values.
BENT: Open-source repo aimed at resolving entity ambiguity and linking for biomedical terms.
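Most of the linkers listed above follow the same two-stage pattern: generate candidate entries for a mention, then pick the candidate that best fits the surrounding context. The sketch below shows that pattern on a toy scale. The knowledge base `KB`, its identifiers, and the word-overlap scoring are hypothetical placeholders; none of the listed tools is claimed to work this way internally, and real systems use learned rankers over Wikipedia or Wikidata.

```python
import re
from typing import Optional

# Hypothetical toy knowledge base: IDs and descriptions are placeholders,
# not real Wikipedia or Wikidata identifiers.
KB = {
    "KB:paris_city": {
        "aliases": {"paris"},
        "description": "capital city of france on the seine",
    },
    "KB:paris_myth": {
        "aliases": {"paris", "alexander"},
        "description": "trojan prince in greek mythology who abducted helen",
    },
}


def tokens(text: str) -> set[str]:
    """Lowercased alphabetic tokens of `text`."""
    return set(re.findall(r"[a-z]+", text.lower()))


def link(mention: str, context: str) -> Optional[str]:
    """Return the ID of the candidate whose description overlaps most with the context."""
    # Stage 1: candidate generation by alias lookup.
    candidates = [eid for eid, entry in KB.items() if mention.lower() in entry["aliases"]]
    if not candidates:
        return None
    # Stage 2: disambiguation by counting shared words.
    context_tokens = tokens(context)
    return max(
        candidates,
        key=lambda eid: len(context_tokens & tokens(KB[eid]["description"])),
    )


print(link("Paris", "Paris is the capital of France."))
print(link("Paris", "Paris abducted Helen, starting the Trojan war."))
```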

Entity Embed: A PyTorch library for embedding entities into vectors to support scalable record linkage and entity resolution.
StarSpace: Facebook tool for embedding entities and link prediction in knowledge bases.
Spacy Dependency Parser: Extracts nouns from text, uses knowledge bases, and derives roots from texts.

ICD10cm Augmented: John Snow Labs tool for resolving clinical text using ICD-10 codes.
Healthcare Relation Extraction: Notebook demonstrating relation extraction in healthcare data using John Snow Labs tools with ICD-10 codes.
SNOMED CT Entity Linking Challenge: Competition for linking text spans in clinical notes with specific topics in the SNOMED CT clinical terminology.

Jellyfish: Library for approximate and phonetic matching of strings.
PyStringMatching: Library for string matching.
TextDistance: Library for measuring text similarity.
StringCompare: Library for comparing string structures.
Abydos: Supports phonetic algorithms, string distance metrics, stemmers, and string fingerprints.
FuzzyWuzzy: Uses Levenshtein distance to identify close strings.
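Several of the libraries above build on edit distance. The sketch below implements plain Levenshtein distance with dynamic programming and uses it to map a noisy entity value onto a list of canonical values. It does not reproduce the API or scoring of FuzzyWuzzy or any other listed library; the function names, the length-normalised similarity, and the 0.8 threshold are assumptions chosen for the example.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,                       # deletion
                    current[j - 1] + 1,                    # insertion
                    previous[j - 1] + (char_a != char_b),  # substitution
                )
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def best_match(value: str, canonical: list[str], threshold: float = 0.8) -> Optional[str]:
    """Return the closest canonical value, or None if nothing is similar enough."""
    best = max(canonical, key=lambda c: similarity(value.lower(), c.lower()))
    return best if similarity(value.lower(), best.lower()) >= threshold else None


drugs = ["acetaminophen", "ibuprofen", "aspirin"]
print(levenshtein("kitten", "sitting"))    # 3
print(best_match("Acetaminophin", drugs))  # acetaminophen
print(best_match("zzz", drugs))            # None
```

The threshold trades recall for precision: a lower value merges more spelling variants but also risks collapsing distinct entities with similar names.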

BIOSYN with Synonym Marginalisation: Method and framework to train and identify synonyms in biomedical text.
Text Normalization Using Encoder–Decoder Networks Based on the Causal Feature Extractor: Research on text normalization.
EL-Chatbot: Paper outlining entity linking using a chatbot, with an associated GitHub repo.