2. Standardising Entity Types and their Values

This GitHub repository collects material on named-entity recognition (NER), including methods for entity disambiguation and entity linking.

2.1 Preprocessing Entity Values

2.1.1 General NLP Cleaning

  • Lowercasing
  • Removing Punctuation
  • Spell Correction
    • PySpellChecker: Uses Levenshtein distance to generate candidate words within an edit distance of 2 of the original word, then compares those candidates against a word-frequency list.
    • TextBlob: Spelling correction is based on Peter Norvig’s “How to Write a Spelling Corrector” as implemented in the pattern library. Its documentation reports about 70% accuracy.
    • autocorrect: Also based on Peter Norvig's work.
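
The cleaning steps above can be sketched with the standard library alone. This is a minimal illustration in the spirit of Norvig's corrector, not the code of any library listed here; the vocabulary WORD_COUNTS and the helper names clean, correct, and normalise are hypothetical.

```python
import re
import string
from collections import Counter

# Hypothetical toy vocabulary with frequencies; a real corrector
# would build this from a large corpus.
WORD_COUNTS = Counter({"hospital": 50, "london": 40, "university": 30})
LETTERS = string.ascii_lowercase


def clean(text):
    """Lowercase, remove ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the candidates that appear in the vocabulary."""
    return {w for w in words if w in WORD_COUNTS}


def correct(word):
    """Return the most frequent known word within edit distance 2."""
    candidates = (
        known([word])
        or known(edits1(word))
        or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
        or {word}  # unknown words are returned unchanged
    )
    return max(candidates, key=lambda w: WORD_COUNTS[w])


def normalise(text):
    """Clean the text, then spell-correct each token."""
    return " ".join(correct(token) for token in clean(text).split())


print(normalise("Lodnon Hosptial!"))
```

With such a small vocabulary, words that are not within two edits of a known word pass through unchanged, which is usually the safer default for entity values.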

2.1.2 Expanding Acronyms

  • Ab3P: Biomedical-specific acronym expansion tool trained on PubMed abstracts.
  • Spacy AbbrX: Uses pre-trained spaCy models to expand acronyms.
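
As a rough illustration of what acronym expansion involves, the sketch below looks for a parenthesised short form such as "(NER)" and accepts the preceding words as its long form when their initials match. It is a deliberately simple heuristic, not the method used by Ab3P or any other tool above; find_acronyms and expand_acronyms are hypothetical names.

```python
import re


def find_acronyms(text):
    """Map short forms like 'NER' to the long form written just before '(NER)'."""
    mapping = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        short = match.group(1)
        # Words preceding the parenthesis; hyphenated words are split.
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = words[-len(short):]
        initials = "".join(w[0].upper() for w in candidate)
        if initials == short:
            mapping[short] = " ".join(candidate)
    return mapping


def expand_acronyms(text):
    """Replace every defined acronym with its long form."""
    for short, long_form in find_acronyms(text).items():
        # Drop the defining parenthetical, then expand standalone uses.
        text = text.replace(f" ({short})", "")
        text = re.sub(rf"\b{re.escape(short)}\b", long_form, text)
    return text


sample = "Named entity recognition (NER) is common. NER models vary."
print(expand_acronyms(sample))
```

Acronyms that are never defined in the text, or whose letters do not line up with word initials, are left untouched by this heuristic.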

2.1.3 Coreference Resolution

Coreference resolution, implemented in the Neo4j pipeline, replaces each pronoun with the entity it refers to, which makes relationships between entities easier to resolve.
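
The replacement step can be sketched as follows, assuming a coreference model has already produced clusters of mention spans. The cluster format (token spans with an exclusive end, first span treated as the representative mention) and the function name resolve_pronouns are assumptions made for this illustration, not the Neo4j pipeline's actual interface.

```python
# Subject and object pronouns only; possessives ("his", "their") would
# need extra handling and are left untouched in this sketch.
PRONOUNS = {"he", "she", "it", "they", "him", "them"}


def resolve_pronouns(tokens, clusters):
    """Replace pronoun mentions with their cluster's representative mention.

    clusters: list of clusters, each a list of (start, end) token spans
    with an exclusive end; the first span is taken as the representative.
    """
    replacements = {}
    for cluster in clusters:
        rep_start, rep_end = cluster[0]
        representative = tokens[rep_start:rep_end]
        for start, end in cluster[1:]:
            mention = tokens[start:end]
            if len(mention) == 1 and mention[0].lower() in PRONOUNS:
                replacements[start] = representative

    resolved = []
    for index, token in enumerate(tokens):
        resolved.extend(replacements.get(index, [token]))
    return resolved


tokens = "Marie Curie won the prize because she discovered radium .".split()
clusters = [[(0, 2), (6, 7)]]  # "Marie Curie" <- "she"
print(" ".join(resolve_pronouns(tokens, clusters)))
```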

2.1.4 Expanding Specific Entity Types

2.1.4.1 Expanding Names

2.1.4.2 Expanding Location

  • LibPostal: Library for parsing and formatting postal addresses.

2.1.5 Formatting Entity Types

2.1.5.1 Dates

  • DateParser: Parses date strings written in varied formats into a single, consistent representation.
  • Microsoft Recognizers-Text: Open source and can be run locally. Supports multiple languages and several data types, as detailed in the README of its GitHub repository.
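
To show what "the same format" means in practice, here is a minimal standard-library sketch that tries a fixed list of formats and emits ISO 8601 dates. It is a stand-in for the libraries above, not their API; KNOWN_FORMATS and normalise_date are hypothetical, and the day-first reading of slash-separated dates is an assumption that should match your data.

```python
from datetime import datetime

# Hypothetical list of formats expected in the data; extend as needed.
# Month names are matched using the current locale (English by default).
KNOWN_FORMATS = [
    "%Y-%m-%d",   # 2021-03-03
    "%d/%m/%Y",   # 03/04/2021, read day-first (an assumption)
    "%d %B %Y",   # 3 March 2021
    "%B %d, %Y",  # March 3, 2021
    "%d %b %Y",   # 3 Mar 2021
]


def normalise_date(value):
    """Return the date as an ISO 8601 string, or None if no format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


for raw in ["3 March 2021", "March 3, 2021", "03/04/2021", "not a date"]:
    print(raw, "->", normalise_date(raw))
```

A fixed format list is predictable but brittle; the dedicated libraries listed above exist to cover relative expressions and many more languages.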

2.1.5.2 Phrases based on Context

2.2 Resolving Entity Values

2.2.1 Entity Disambiguation and Entity Linking

Wikification

  • REL (Radboud Entity Linker): Uses the English Wikipedia as a knowledge source. Maps entities to Wikipedia IDs and normalizes outputs.

  • Wikimapper: Small Python library that maps Wikipedia page titles to Wikidata IDs and vice versa.

  • spacy_entity_linker: Uses a knowledge base (Wikipedia) to find similar entities.

  • Neo4j (uses Bloom): Resolves entities.

  • BLINK: Facebook's architecture for linking entities to Wikipedia.

  • SpacyFishing: Framework to fetch wiki IDs for extracted entity values.

  • BENT: Open-source repo aimed at resolving entity ambiguity and linking for biomedical terms.

2.2.2 Entity Embedding-based Normalisation Approaches

  • Entity Embed: A PyTorch library for embedding entities into vectors to support scalable record linkage and entity resolution.
  • StarSpace: Facebook tool for embedding entities and link prediction in knowledge bases.
  • Spacy Dependency Parser: Extracts nouns from text, uses knowledge bases, and derives roots from texts.
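
The general idea behind embedding-based normalisation is to map each mention to a vector and pick the canonical entity whose vector is closest. The sketch below uses character trigram counts as a crude stand-in for learned embeddings, so it illustrates only the nearest-neighbour step, not how Entity Embed or StarSpace train their vectors; embed, cosine, and nearest are hypothetical names.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Crude stand-in embedding: a sparse vector of character n-gram counts."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(mention, canonical_names):
    """Return the canonical name whose vector is closest to the mention."""
    vector = embed(mention)
    return max(canonical_names, key=lambda name: cosine(vector, embed(name)))


canonical = ["Microsoft Corporation", "Apple Inc.", "Alphabet Inc."]
print(nearest("Microsft Corp", canonical))
```

In a real system the embed function would be replaced by a trained model, and the linear scan by an approximate nearest-neighbour index.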

2.2.3 Medical Code Resolution

2.2.4 String Comparisons between Entities

  • Jellyfish: Library for approximate and phonetic matching of strings.
  • PyStringMatching: Library for string matching.
  • TextDistance: Library for measuring text similarity.
  • StringCompare: Library for comparing string structures.
  • Abydos: Supports phonetic algorithms, string distance metrics, stemmers, and string fingerprints.
  • FuzzyWuzzy: Uses Levenshtein distance to identify close strings.
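
Several of the libraries above build on Levenshtein (edit) distance. The following pure-Python sketch shows the standard dynamic-programming computation and one simple way to turn it into a similarity score; the normalisation by the longer string's length is an assumption for illustration and is not necessarily the ratio any listed library reports.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Score in [0, 1]: 1.0 for identical strings, lower as edits accumulate."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(similarity("jon", "john"))
```

Entity values whose similarity exceeds a chosen threshold can then be treated as candidate duplicates; the threshold itself has to be tuned on the data.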

2.3 Research Paper Concepts