3. Standardising with Knowledge Graphs
3.1 From Document to Knowledge Graph
3.1.1 Entity Type - Entity Value Relationship Extraction
-
OpenNRE: An open-source toolkit for inferring the relation between entities in a given sentence.
-
Zett: A zero-shot entity relation extraction repository. You supply the structure you expect the relation to take, and it extracts the values that fill that structure from the text.
-
GLiREL: You define the allowed connections between entities using "glirel labels", e.g. you could say a diagnosis is "treated with" a medication.
-
GoLLIE: A zero-shot approach to information extraction. You provide the general relations you expect to see, and it extracts those relations between entities.
-
Research Paper Concepts:
-
Generative Type Oriented Named Entity Extraction: A research paper on a generative approach to named entity extraction.
-
Co-attention Network for Joint Entity and Relation Extraction: A research paper on using a co-attention network for joint entity and relation extraction, with provided code.
-
3.1.2 Document to Triplets
-
Text2Graph: A pre-trained model on HuggingFace that has been trained with ChatGPT to identify triplets in text.
-
REBEL: A pre-trained model on HuggingFace that extracts triplets from text. (BART-based sequence-to-sequence model: you would be limited to 512 tokens per input, so longer documents need chunking.)
-
Joint Entity and Relation Extraction: A paper outlining the creation of a medical dataset used to fine-tune the REBEL model so that it extracts medical entities more reliably.
-
OpenIE Standalone GitHub Repository: A repository for OpenIE, a tool that extracts entities and their relationships from text.
-
There is also an annotation tool called RTE which uses OpenIE to extract triplets.
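REBEL returns its triplets as one linearised string rather than as structured data, so a small parser is needed before the output can be used. The sketch below assumes the marker format shown on the REBEL model card (`<triplet> head <subj> tail <obj> relation`); the function name `parse_rebel_output` is illustrative and is not part of any library.

```python
def parse_rebel_output(decoded: str) -> list[dict[str, str]]:
    """Turn REBEL's linearised text into head/type/tail dictionaries.

    Assumed format: <triplet> HEAD <subj> TAIL <obj> RELATION,
    where further <subj> ... <obj> ... groups reuse the same HEAD.
    """
    # Drop sequence markers that the tokenizer may leave in the decoded string.
    for special in ("<s>", "</s>", "<pad>"):
        decoded = decoded.replace(special, " ")

    triplets: list[dict[str, str]] = []
    head: list[str] = []
    tail: list[str] = []
    relation: list[str] = []
    state = None  # which field the next plain tokens belong to

    def emit() -> None:
        # Only record a triplet once all three parts have been seen.
        if head and tail and relation:
            triplets.append({
                "head": " ".join(head),
                "type": " ".join(relation),
                "tail": " ".join(tail),
            })

    for token in decoded.split():
        if token == "<triplet>":
            emit()
            head, tail, relation = [], [], []
            state = "head"
        elif token == "<subj>":
            emit()
            tail, relation = [], []
            state = "tail"
        elif token == "<obj>":
            relation = []
            state = "relation"
        elif state == "head":
            head.append(token)
        elif state == "tail":
            tail.append(token)
        elif state == "relation":
            relation.append(token)

    emit()
    return triplets
```

The returned dictionaries can be converted to `(head, type, tail)` tuples for the graph-building step in 3.1.3.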
3.1.3 Triplets to Graph
Structure:
-
NetworkX: Python package used to create graph data structures.
-
graph-tool: Python package that provides a number of features for handling directed/undirected graphs and complex networks.
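To make the triplets-to-graph step concrete, here is a dependency-free sketch that stores triplets in an adjacency dictionary. It is a stand-in written in plain Python, not NetworkX or graph-tool code; with NetworkX the same data would typically be held in a `MultiDiGraph` with the relation stored as an edge attribute. The names `build_graph` and `neighbours` are illustrative.

```python
from collections import defaultdict

Triplet = tuple[str, str, str]  # (head, relation, tail)


def build_graph(triplets: list[Triplet]) -> dict[str, list[tuple[str, str]]]:
    """Map each node to its outgoing (relation, tail) edges, skipping duplicates."""
    graph: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for head, relation, tail in triplets:
        edge = (relation, tail)
        if edge not in graph[head]:
            graph[head].append(edge)
        # Make sure the tail exists as a node even if it has no outgoing edges.
        graph.setdefault(tail, [])
    return dict(graph)


def neighbours(graph: dict[str, list[tuple[str, str]]], node: str) -> list[str]:
    """Return the nodes directly reachable from `node`."""
    return [tail for _, tail in graph.get(node, [])]
```

Keeping the relation on the edge (rather than as a separate node) matches how the triplets are produced and makes the later move to a graph library or database straightforward.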
Visualisations:
-
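Graphviz: Graph visualisation software with Python bindings, which renders graphs as static images.
-
PyVis: Python package for building interactive graph visualisations.
-
igraph: Network analysis library with a Python interface that also includes plotting support.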
Graph Databases:
-
Neo4j: The Community Edition is free, but commercial use would need to be paid for.
-
JanusGraph: Fully open-source under the Apache 2 license, but it is primarily run on Linux and relies on a separate storage backend (such as Cassandra or HBase), which can add hosting costs.
-
ArangoDB: The Community Edition is free, but commercial use would need to be paid for.
-
OrientDB: The Community Edition is free, but commercial use would need to be paid for.
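For the graph-database route, triplets are usually loaded with a parametrised query rather than by building query strings by hand. The sketch below prepares a Cypher `MERGE` statement and its parameters for Neo4j; the node label `Entity`, the relationship type `RELATED` and the property names are assumptions made for illustration. Cypher does not allow the relationship type itself to be a parameter, so the relation is stored as a property here.

```python
# Assumed schema: every node is an :Entity with a `name`,
# and every edge is a :RELATED relationship carrying the relation in `type`.
MERGE_TRIPLET = (
    "MERGE (h:Entity {name: $head}) "
    "MERGE (t:Entity {name: $tail}) "
    "MERGE (h)-[:RELATED {type: $type}]->(t)"
)


def to_cypher_params(triplets: list[tuple[str, str, str]]) -> list[dict[str, str]]:
    """Build one parameter dictionary per (head, relation, tail) triplet."""
    return [{"head": head, "type": relation, "tail": tail}
            for head, relation, tail in triplets]
```

Each parameter dictionary can then be passed together with `MERGE_TRIPLET` to whichever Neo4j client is in use; `MERGE` means that re-running the load does not create duplicate nodes or edges.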
3.2 Entity Resolution Pipelines
3.2.1 Neo4j
-
Neo4j Entity Resolution Example: A GitHub repository with examples of using Neo4j for entity resolution.
-
Neo4j Whitepaper on Graph Databases: A whitepaper explaining the use of graph databases like Neo4j for various applications, including entity resolution.
-
Neo4j Pipeline: Outlines a process by which entities can be resolved:
- Coreference Resolution: Replace all pronouns with the entity they refer to.
- NER: Extract the named entities from the text provided.
- Entity Disambiguation and Entity Linking: e.g. you could use Wikipedia ID linking ("Wikification"), which tries to resolve different words with the same meaning to a single entity.
- Co-Occurrence Graphs: Infer a relationship between a pair of entities based on their presence within a specified unit of text.
- Relationship Extraction:
  - Rule-based extraction: Use grammatical dependencies to extract relationships.
  - Model-based extraction: Use a trained NLP model to extract relationships between pairs of entities.
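The co-occurrence step in this pipeline can be sketched with the standard library alone. The example assumes the "unit of text" is a sentence and that entities have already been found by an NER step; it matches entity names by simple case-insensitive substring search, which is a simplification, and the function name `cooccurrence_edges` is illustrative.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_edges(text: str, entities: list[str]) -> Counter:
    """Count how often each pair of entities appears in the same sentence."""
    counts: Counter = Counter()
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        # Sort so each pair is always stored in the same order.
        present = sorted({e for e in entities if e.lower() in lowered})
        for first, second in combinations(present, 2):
            counts[(first, second)] += 1
    return counts
```

The counts can be used as edge weights, with a minimum count acting as a filter for weak or accidental relationships.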
3.2.2 TigerGraph and Zingg
-
Entity Resolution with TigerGraph: An article discussing how to use TigerGraph and Zingg for entity resolution.
-
Using a Graph Database for Big Data Entity Resolution: A blog post from TigerGraph on using their graph database for big data entity resolution.
-
Zingg GitHub Repository: The GitHub repository for Zingg, a tool for entity resolution and matching records.
3.2.3 PyJedAI
-
PyJedAI CleanCleanER: A tutorial for using PyJedAI for entity matching and clustering.
-
PyJedAI Similarity Joins: A tutorial for using PyJedAI for similarity joins in entity resolution.
-
ER Evaluation Framework: A framework for evaluating entity resolution systems.
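To show what similarity joins and clustering do in an entity resolution pipeline, here is a small plain-Python sketch. It does not use the PyJedAI API: it compares every pair of records in a single collection with Jaccard similarity over word tokens, then groups matched records with union-find. The threshold of 0.5 and all function names are illustrative assumptions.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_join(records: list[str], threshold: float = 0.5) -> list[tuple[int, int]]:
    """Return index pairs whose token-set similarity meets the threshold."""
    token_sets = [set(record.lower().split()) for record in records]
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(token_sets[i], token_sets[j]) >= threshold
    ]


def cluster(size: int, pairs: list[tuple[int, int]]) -> list[list[int]]:
    """Group record indices into connected components using union-find."""
    parent = list(range(size))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)

    groups: dict[int, list[int]] = {}
    for i in range(size):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Comparing all pairs is quadratic in the number of records, which is why dedicated tools add blocking or indexing before the matching step.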
3.2.4 REBEL + LlamaIndex
REBEL extracts triplets from text: The text is chunked first so that each piece fits within REBEL's input limit.
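A minimal sketch of the chunking step is shown below. It splits on whitespace and uses word counts as a rough proxy for model tokens, which is an assumption: a real pipeline would count tokens with the model's own tokenizer. The defaults of 200 words per chunk and 20 words of overlap are illustrative values, not settings taken from REBEL or LlamaIndex.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping chunks of at most `max_words` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current chunk has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

The overlap reduces the chance that a relation is lost because its head and tail fall on either side of a chunk boundary, at the cost of some duplicate triplets that need to be removed afterwards.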
3.2.5 KnowledgeGraph
KnowledgeGraph: This demonstrates a framework for going from document to graph, although the codebase would likely need reworking.
-
Use Mistral7B OpenOrca hosted by Ollama to extract triplets.
-
NetworkX to make graphs.
-
PyVis to visualise the graphs.
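The LLM extraction step can be sketched as building a request for a locally hosted Ollama model and parsing its JSON reply. This is not the KnowledgeGraph repository's code: the prompt wording, the `triplets` response schema and the function names are illustrative assumptions, and the model tag `mistral-openorca` is assumed to be the name under which the model was pulled. The HTTP call itself is left out so that the example stays self-contained.

```python
import json

# Ollama's default local endpoint for single-prompt generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(chunk: str, model: str = "mistral-openorca") -> dict:
    """Build the JSON payload for a non-streaming, JSON-formatted generation."""
    prompt = (
        "Extract knowledge-graph triplets from the text below. "
        'Reply with JSON of the form {"triplets": '
        '[{"head": "...", "relation": "...", "tail": "..."}]}.\n\n'
        f"Text: {chunk}"
    )
    return {"model": model, "prompt": prompt, "stream": False, "format": "json"}


def parse_triplets(response_body: str) -> list[tuple[str, str, str]]:
    """Pull well-formed triplets out of the body returned by the endpoint."""
    generated = json.loads(response_body).get("response", "")
    try:
        data = json.loads(generated)
    except json.JSONDecodeError:
        return []  # the model did not return valid JSON
    if not isinstance(data, dict):
        return []
    triplets = []
    for item in data.get("triplets", []):
        # Skip entries that are missing any of the three required fields.
        if isinstance(item, dict) and all(item.get(k) for k in ("head", "relation", "tail")):
            triplets.append((item["head"], item["relation"], item["tail"]))
    return triplets
```

The payload from `build_request` would be sent as a POST request to `OLLAMA_URL`, and the resulting tuples can be passed to NetworkX and PyVis as in the steps listed above.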
3.2.6 Graph_Maker (requires Groq)
-
Define your own ontology, i.e. your entity types and a description of what each of them means.
-
Run Graph_Maker with a large language model to create your graph.
-
You can then use the graph it has created from your documents.
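The idea of an ontology constraining the graph can be illustrated without the library. The sketch below is hypothetical and does not use the Graph_Maker API: it represents the ontology as a mapping from entity type to description, turns it into prompt text for the language model, and discards extracted edges whose entity types fall outside the ontology. All names and the example types are assumptions.

```python
# Hypothetical ontology: entity type -> description given to the model.
ONTOLOGY = {
    "Person": "A named individual mentioned in the text.",
    "Organisation": "A company, institution or agency.",
}

# An extracted edge: (head_type, head_name, relation, tail_type, tail_name)
Edge = tuple[str, str, str, str, str]


def ontology_prompt(ontology: dict[str, str]) -> str:
    """Describe the allowed entity types in a form suitable for a prompt."""
    lines = [f"- {label}: {description}" for label, description in ontology.items()]
    return "Only use these entity types:\n" + "\n".join(lines)


def keep_valid_edges(edges: list[Edge], ontology: dict[str, str]) -> list[Edge]:
    """Drop edges whose head or tail type is not defined in the ontology."""
    return [edge for edge in edges if edge[0] in ontology and edge[3] in ontology]
```

Filtering after generation is useful because a language model may still return types that were never listed in the prompt.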
3.2.7 Instructor
- Might support Ollama.
- You can follow this tutorial, but use the Ollama implementation instead.