3. Standardising with Knowledge Graphs

3.1 From Document to Knowledge Graph

3.1.1 Entity Type - Entity Value Relationship Extraction

  1. KG-Completion from the Graph4NLP codebase

  2. OpenNRE: An open-source toolkit for inferring the relation between entities in a given sentence.

  3. Zett: A zero-shot entity relation extraction repository. You supply a template for the relation you expect, and it extracts the connecting values from the text.

  4. GliREL: You define the allowed connections between entity types using "glirel labels", e.g. a diagnosis is "treated with" a medication.

  5. GoLLie: A zero-shot approach to extraction in which you provide the general relations you expect to see, and the model extracts those relations between entities.

  6. Research Paper Concepts:

3.1.2 Document to Triplets

  1. Text2Graph: This is a pre-trained model on HuggingFace that has been trained by ChatGPT to identify triplets in text.

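  2. REBEL: A pre-trained sequence-to-sequence model on HuggingFace that extracts triplets from text. (It is BART-based rather than BERT-based; its input length is capped, and these notes assume a limit of 512 tokens, so longer documents need chunking.)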

  3. Joint Entity and Relation Extraction: A paper describing the creation of a medical dataset used to fine-tune REBEL so that it extracts medical entities and relations more accurately.

  4. OpenIE Standalone Github Repository: A repository for OpenIE, a tool that extracts entities and their relationships from text.

  5. RTE: An annotation tool that uses OpenIE to extract triplets.
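
REBEL returns its triplets as one linearised string rather than as structured records, so the output has to be parsed before it can be used downstream. The sketch below assumes the marker layout described on the REBEL model card (<triplet> head <subj> tail <obj> relation). The function name parse_rebel_output and the sample string in the usage note are illustrative only and are not part of any library.

```python
def parse_rebel_output(decoded: str) -> list[dict[str, str]]:
    """Parse a REBEL-style linearised string into head/relation/tail dicts."""
    # Drop sequence markers, then pad the field markers so they split cleanly.
    for special in ("<s>", "</s>", "<pad>"):
        decoded = decoded.replace(special, " ")
    for marker in ("<triplet>", "<subj>", "<obj>"):
        decoded = decoded.replace(marker, f" {marker} ")

    triplets = []
    head, tail, relation = [], [], []
    state = None  # which field the next plain token belongs to

    def flush():
        # Only emit a triplet once all three fields have content.
        if head and tail and relation:
            triplets.append({
                "head": " ".join(head),
                "relation": " ".join(relation),
                "tail": " ".join(tail),
            })

    for token in decoded.split():
        if token == "<triplet>":      # a new head entity starts
            flush()
            head, tail, relation = [], [], []
            state = "head"
        elif token == "<subj>":       # a new tail for the current head
            flush()
            tail, relation = [], []
            state = "tail"
        elif token == "<obj>":        # the relation for the current pair
            state = "relation"
        elif state == "head":
            head.append(token)
        elif state == "tail":
            tail.append(token)
        elif state == "relation":
            relation.append(token)
    flush()
    return triplets
```

For example, a made-up output such as "<s><triplet> Aspirin <subj> headache <obj> treats</s>" would yield one dictionary with head "Aspirin", relation "treats" and tail "headache".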

3.1.3 Triplets to Graph

Structure:

  1. NetworkX: Python package used to create graph data structures.

  2. graph-tool: Python package that provides a number of features for handling directed/undirected graphs and complex networks.
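
Whichever package is chosen, the core step is the same: each (head, relation, tail) triplet becomes a directed, labelled edge. The dependency-free sketch below uses a plain dictionary in place of a graph library to show that structure. The build_graph helper and the sample triplets are hypothetical.

```python
from collections import defaultdict


def build_graph(triplets):
    """Map each node to a list of (relation, tail) outgoing edges."""
    graph = defaultdict(list)
    for head, relation, tail in triplets:
        edge = (relation, tail)
        if edge not in graph[head]:   # skip exact duplicate triplets
            graph[head].append(edge)
        graph.setdefault(tail, [])    # make sure sink nodes exist too
    return dict(graph)


# Illustrative triplets only; real ones would come from an extraction step.
sample = [
    ("diabetes", "treated_with", "metformin"),
    ("metformin", "is_a", "medication"),
    ("diabetes", "treated_with", "metformin"),
]
graph = build_graph(sample)
```

With NetworkX installed, the same triplets could instead be loaded into a MultiDiGraph by calling add_edge(head, tail, label=relation) for each one.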

Visualisations:

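  1. Graphviz: Graph-drawing software that can be driven from Python through wrapper packages.

  2. PyVis: Python package for building interactive network visualisations.

  3. igraph: Network analysis library with a Python interface that also includes plotting.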
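
As a minimal illustration of the visualisation step, triplets can be written out in Graphviz's DOT language without any extra Python dependency. The helper name triplets_to_dot and the graph name "kg" are assumptions made for this sketch.

```python
def triplets_to_dot(triplets, name="kg"):
    """Render (head, relation, tail) triplets as a DOT digraph string."""

    def quote(text):
        # DOT quoted strings only need embedded double quotes escaped.
        return '"' + text.replace('"', '\\"') + '"'

    lines = [f"digraph {name} {{"]
    for head, relation, tail in triplets:
        lines.append(
            f"  {quote(head)} -> {quote(tail)} [label={quote(relation)}];"
        )
    lines.append("}")
    return "\n".join(lines)


dot_text = triplets_to_dot([("diabetes", "treated with", "metformin")])
```

Saved to a file such as kg.dot, the text can be rendered with the Graphviz command line, e.g. dot -Tpng kg.dot -o kg.png.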

Graph Databases:

  1. Neo4j: The Community Edition is free, but commercial use would need to be paid for.

  2. JanusGraph: Fully open-source under the Apache 2 license - but it only supports Linux, and data storage requires a cost-based platform.

  3. ArangoDB: The Community Edition is free, but commercial use would need to be paid for.

  4. OrientDB: The Community Edition is free, but commercial use would need to be paid for.
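
For Neo4j specifically, triplets are typically loaded with Cypher MERGE statements. The sketch below only builds the query text and its parameters, so it runs without a database. Cypher does not allow a relationship type to be passed as a parameter, so the relation text is sanitised into an identifier first. The Entity label, the name property, and both helper names are assumptions for illustration.

```python
import re


def relation_to_type(relation: str) -> str:
    """Turn free-text relation wording into a safe relationship type."""
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", relation).strip("_").upper()
    if not cleaned:
        raise ValueError("relation must contain at least one letter or digit")
    if cleaned[0].isdigit():
        cleaned = "REL_" + cleaned  # identifiers should not start with a digit
    return cleaned


def triplet_to_cypher(head: str, relation: str, tail: str):
    """Return (query, parameters) that upsert one triplet."""
    query = (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        f"MERGE (h)-[:{relation_to_type(relation)}]->(t)"
    )
    return query, {"head": head, "tail": tail}


query, params = triplet_to_cypher("diabetes", "treated with", "metformin")
```

Passing the entity names as parameters rather than formatting them into the query keeps quoting problems and injection out of the node values.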

3.2 Entity Resolution Pipelines

3.2.1 Neo4j

  1. Neo4j Entity Resolution Example: A GitHub repository with examples of using Neo4j for entity resolution.

  2. Neo4j Whitepaper on Graph Databases: A whitepaper explaining the use of graph databases like Neo4j for various applications, including entity resolution.

  3. Neo4j Pipeline: Outlines a process by which entities can be resolved:

    • Coreference Resolution: Replacing all pronouns with the referenced entity.
    • NER: Extracting out the named entities from the text provided.
    • Entity Disambiguation and Entity Linking: Mapping each mention to a canonical entry, e.g. by linking to Wikipedia IDs ("Wikification"), so that different surface forms of the same entity resolve to one node.
    • Co-Occurrence Graphs: This is inferring relationships between a pair of entities based on their presence within a specified unit of text.
    • Relationship Extraction:
      • Rule-based extraction: use grammatical dependencies to extract relationships out.
      • Model-based extraction: use a trained NLP model to extract relationships between pairs of entities.
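
The co-occurrence step in the pipeline above can be sketched with the standard library alone. This version takes the sentence as the unit of text and counts how often two known entities appear in the same sentence. The function name, the naive sentence splitter, and the substring matching are simplifying assumptions; a real pipeline would use the NER output instead.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_edges(text, entities):
    """Count entity pairs that appear together in the same sentence."""
    counts = Counter()
    # Naive split on sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        # Case-insensitive substring match stands in for real NER spans.
        present = sorted(e for e in entities if e.lower() in lowered)
        for a, b in combinations(present, 2):
            counts[(a, b)] += 1
    return counts


# Illustrative text only.
doc = (
    "Metformin is used for diabetes. "
    "Diabetes can cause neuropathy. "
    "Metformin lowers glucose in diabetes."
)
edges = cooccurrence_edges(doc, ["metformin", "diabetes", "neuropathy"])
```

The counts can serve as edge weights, with a minimum count used to drop pairs that co-occur only by chance.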

3.2.2 TigerGraph and Zingg

  1. Entity Resolution with TigerGraph: An article discussing how to use TigerGraph and Zingg for entity resolution.

  2. Using a Graph Database for Big Data Entity Resolution: A blog post from TigerGraph on using their graph database for big data entity resolution.

  3. Zingg Github Repository: The GitHub repository for Zingg, a tool for entity resolution and matching records.

3.2.3 PyJedAI

  1. PyJedAI CleanCleanER: A tutorial for using PyJedAI for entity matching and clustering.

  2. PyJedAI Similarity Joins: A tutorial for using PyJedAI for similarity joins in entity resolution.

  3. ER Evaluation Framework: A framework for evaluating entity resolution systems.
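
To make the idea of a similarity join concrete, the sketch below links records from two sources (each assumed to be duplicate-free, as in a Clean-Clean setting) whenever their token-level Jaccard similarity reaches a threshold. It does not use the PyJedAI API; the function names, the threshold of 0.5, and the sample records are all assumptions.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lower-cased token sets of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def similarity_join(left, right, threshold=0.5):
    """Return (left_index, right_index, score) for pairs above the threshold."""
    matches = []
    # Brute-force comparison of every pair; fine for a small illustration.
    for i, left_record in enumerate(left):
        for j, right_record in enumerate(right):
            score = jaccard(left_record, right_record)
            if score >= threshold:
                matches.append((i, j, score))
    return matches


# Illustrative records only.
source_a = ["John A Smith London", "Maria Garcia Madrid"]
source_b = ["Smith John London", "Peter Jones Leeds"]
matches = similarity_join(source_a, source_b)
```

Comparing every pair is quadratic in the number of records, which is why dedicated tools avoid exhaustive comparison on large datasets.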

3.2.4 REBEL + Llama Index

REBEL extracts triplets from text. The text is chunked first so that each chunk fits within REBEL's input limit.
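
A minimal sketch of the chunking step, assuming whitespace-separated words as a rough proxy for model tokens. The real token count depends on the model's tokenizer, so the max_words and overlap defaults below are placeholder assumptions rather than recommended values. The overlap keeps a few words of context so that a relation spanning a chunk boundary is less likely to be lost.

```python
def chunk_words(text, max_words=100, overlap=20):
    """Split text into overlapping chunks of at most max_words words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


pieces = chunk_words(" ".join(str(i) for i in range(10)), max_words=4, overlap=1)
```

Each chunk would then be passed to the extraction model separately and the resulting triplets merged, with duplicates from the overlapping regions removed.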

3.2.5 KnowledgeGraph

KnowledgeGraph: This demonstrates a framework for going from document to graph, although the codebase would likely need reworking.

  1. Use Mistral7B OpenOrca hosted by Ollama: For extracting out triplets.

  2. NetworkX to make graphs.

  3. PyVis to visualise the graphs.

3.2.6 Graph_Maker: Requires GROQ

  1. Define your own ontology i.e. your entities and a description of what those entities are.

  2. Run the Graph-maker using a large language model to create your graph.

  3. You can then use the graph it has built over your documents.

  4. Tutorial

3.2.7 Instructor

  1. Might support Ollama
  2. You can follow this tutorial but use the Ollama implementation.