Mastering GraphRAG with Graph Studio and LangChain — Preserving Semantics Every Step of the Way
Leveraging semantically rich graphs for LLM grounding — How to safeguard against losing meaning
The rise of Generative AI has revolutionized industries, and yet LLM-based solutions still struggle with accuracy, often producing incorrect or misleading results due to their reliance on statistical patterns rather than true domain understanding. Knowledge Graphs offer a powerful way to enhance LLM-based search by integrating curated, domain-specific information. This not only improves result quality but also reduces costs associated with model fine-tuning, human fact-checking, and error correction.
Agentic workflows, which aim for greater AI autonomy and seamless interaction, face similar challenges as AI assistants but demand even higher accuracy. While numerous open-source frameworks exist for building such workflows — including LangChain, LlamaIndex, Atomic Agents, AutoGen, Guidance, CrewAI, and Semantic Kernel — the role of Knowledge Graphs remains crucial. By grounding LLMs with structured, verified knowledge, Knowledge Graphs help ensure more reliable and context-aware AI-driven solutions.
TL;DR
This is a simple guide to GraphRAG with Graph Studio. If you want the shortest possible advice on the topic, it would be this: avoid SPARQL SELECT queries when accessing data in the Knowledge Graph… unless you enjoy watching your LLM guess why the method is called GraphRAG ;)
Altair Graph Studio (AGS) is a full-stack Knowledge Graph platform used in various enterprise deployments as a foundation for LLM-based applications. In this example we will show how Graph Studio can be used in such a context with LangChain and OpenAI. Please note that the focus of this tutorial is to help users who work with popular GenAI frameworks or other Python-based environments use Graph Studio as a knowledge base.
There are three interesting ways to leverage Knowledge Graphs as a foundation for GenAI-enhanced applications:
- Enhance LLM prompts with semantically enriched knowledge from the graph, a methodology frequently referred to as GraphRAG. This is what we will discuss in this article.
- Talk to your Knowledge Graph using the LLM as a natural-language-to-SPARQL translator (link to Altair CoPilot demo). This approach uses a variant of GraphRAG, retrieving ontologies to allow the LLM to perform the translation task with more precision based on facts.
- Use the Knowledge Graph as a fact base for generating fine-tuning data. Details are outlined in my previous article: https://medium.com/@shalumov-boris/in-an-era-where-ai-strategy-hinges-on-data-quality-knowledge-graphs-arent-optional-1ca38d914ebf
So how do you do GraphRAG in AGS? This article will guide you through a few examples using Python. You need a Graph Studio instance up and running, or you can simply follow the guide for a conceptual understanding of the approach.
We are going to use an RDF Knowledge Graph built in Graph Studio for our example. It contains information about tubes along with the corresponding BOM and sensor data.
Before we get started
You need a Python 3.11.9+ environment and a requirements.txt with the following content. Run pip install -r requirements.txt.
urllib3
python-dotenv
rdflib
langchain
langchain-openai
pyanzo
pandas
Create a config.json template and populate it with your connection details and the namespaces required to replace prefixes later:
{
"connection": {
"domain": "copilot.anzo-solutions.com",
"port": "443",
"username": "boris.shalumov",
"password": "xxx",
"graphmart": "http://altair.com/Graphmart/4a7e0c0420e1401c870e10f6e4f78316"
},
"namespaces": {
"mfg": "http://altair.com/ontologies/MFG_Tubes#",
"tube": "http://altair.com/MFG_Tubes/Tube/"
}
}
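The LangChain step at the end of this article also expects an OPENAI_API_KEY environment variable, loaded from a .env file. A minimal .env next to your config.json could look like this (the key value is a placeholder):
OPENAI_API_KEY=sk-...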
We can leverage rdflib and pyanzo to define a few essential functions for our GraphRAG application.
import json
import logging
import pandas as pd
from typing import Dict
from pyanzo import AnzoClient
def load_json(file_path: str) -> Dict:
"""Loads a JSON file."""
try:
with open(file_path, 'r') as file:
return json.load(file)
except Exception as e:
logging.error(f"Error loading JSON file at {file_path}: {e}")
raise
def load_query(query_file: str) -> str:
"""Loads a query string from a file."""
try:
with open(query_file, 'r') as file:
return file.read()
except Exception as e:
logging.error(f"Error loading query file at {query_file}: {e}")
raise
def initialize_client(config: Dict) -> AnzoClient:
"""Initializes the Anzo client."""
return AnzoClient(
domain=config['connection']['domain'],
port=config['connection']['port'],
username=config['connection']['username'],
password=config['connection']['password']
)
def query_graphmart_SELECT(client: AnzoClient, graphmart: str, query: str) -> pd.DataFrame:
"""Runs a graphmart query and returns the results as a DataFrame."""
try:
results = client.query_graphmart(graphmart, query)
return pd.DataFrame(results.as_table_results().as_record_dictionaries())
except Exception as e:
logging.error(f"Error querying graphmart: {e}")
        raise

def query_graphmart_DESCRIBE_CONSTRUCT(client: AnzoClient, graphmart: str, query: str, result_format: str = 'json'):
"""Runs a graphmart DESCRIBE query and returns the results in the specified format."""
try:
results = client.query_graphmart(graphmart, query)
if result_format == 'json':
return results.as_quad_store().as_anzo_json_list()
elif result_format == 'records':
return results.as_quad_store().as_record_dictionaries()
elif result_format == 'rdf':
return results.as_quad_store().as_rdflib_graph().serialize(format='json-ld')
else:
raise ValueError(f"Unsupported result format: {result_format}")
except Exception as e:
logging.error(f"Error querying graphmart: {e}")
        raise

def get_namespaces(config: Dict) -> Dict[str, str]:
"""Loads namespaces from a config file."""
namespaces = {}
for prefix, uri in config['namespaces'].items():
namespaces[prefix] = uri
return namespaces
def replace_prefixes(df: pd.DataFrame, namespaces: Dict[str, str]) -> pd.DataFrame:
    """Replaces full namespace URIs in all DataFrame cells with their short prefixes."""
    for col in df.columns:
        df[col] = df[col].apply(lambda x: next((x.replace(ns, f"{prefix}:") for prefix, ns in namespaces.items() if isinstance(x, str) and x.startswith(ns)), x))
    return df
Let’s start!
Import required libraries and disable warnings. Then create the connection by pulling details from your config.json.
import os
import json
import urllib3
from dotenv import load_dotenv
from rdflib import Namespace
from functions import *
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
# Create connection and load config
config = load_json('config.json')
anzo_client = initialize_client(config)
namespaces = get_namespaces(config)
If you want to get data out of your Knowledge Graph and provide it as context for your LangChain workflow, you can use SELECT, DESCRIBE or CONSTRUCT queries.
- SELECT queries: use these when you want a table as output and know exactly how to write the SPARQL query. You might introduce a semantic gap with this representation though, as a two-dimensional view of a graph flattens the structure and makes some context harder to interpret (see the illustration after this list).
- CONSTRUCT queries: use these when you want a graph as output and know how to write the SPARQL query. The resulting graph can be converted into JSON or text as input for an LLM prompt.
- DESCRIBE queries: use these when you want to provide the URI of a graph node and get information about it. The output is a graph, as with CONSTRUCT.
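To make that semantic gap concrete, here is an illustration with made-up values: a SELECT over the tube data returns flat rows, while the graph keeps the path from tube via sensor to reading explicit.
# SELECT result (flattened table; the tube/sensor/reading path is gone):
# ?tube         | ?materialId | ?temperature
# tube:TA-00109 | SP-0029     | 27.4
#
# Graph result (structure preserved):
# tube:TA-00109 --con:Has_Sensor--> sensor --sen:temperatureReading--> reading --sen:temperature--> 27.4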
Here are some examples you can add to your code based on the Tube Manufacturing data:
select_query = """
PREFIX : <http://altair.com/ontologies/MFG_Tubes#>
PREFIX con: <http://altair.com/ontologies/Connecting_Model#p_>
PREFIX sen: <http://altair.com/ontologies/Sensor#>
SELECT ?tube ?materialId ?temperature
WHERE {
?tube a :Tube;
:materialId ?materialId ;
con:Has_Sensor/sen:temperatureReading/sen:temperature ?temperature .
FILTER(?temperature>25)
} LIMIT 10
"""
describe_query = """
DESCRIBE <http://altair.com/MFG_Tubes/Tube/TA-00109>
"""
construct_query = """
PREFIX : <http://altair.com/ontologies/MFG_Tubes#>
PREFIX con: <http://altair.com/ontologies/Connecting_Model#p_>
PREFIX sen: <http://altair.com/ontologies/Sensor#>
CONSTRUCT {
    ?tube a :Tube ;
          :materialId ?materialId .
}
WHERE {
?tube a :Tube;
:materialId ?materialId ;
con:Has_Sensor/sen:temperatureReading/sen:temperature ?temperature .
FILTER(?temperature>25)
} LIMIT 10
"""
To run these queries we need different functions, as the logic and output format of SELECT differ from CONSTRUCT/DESCRIBE. The latter two will often be more useful when working with LLMs, as their graph output can easily be transformed into sentences.
# SELECT EXAMPLE
results_select_query = query_graphmart_SELECT(anzo_client, config['connection']['graphmart'], select_query)
results_select_query = replace_prefixes(results_select_query, namespaces)
# DESCRIBE EXAMPLE
results_describe_query = query_graphmart_DESCRIBE_CONSTRUCT(anzo_client, config['connection']['graphmart'], describe_query, 'json')
# CONSTRUCT EXAMPLE
results_construct_query = query_graphmart_DESCRIBE_CONSTRUCT(anzo_client, config['connection']['graphmart'], construct_query, 'json')
You can decide how you want to output the results for DESCRIBE and CONSTRUCT queries:
- json — outputs triples as a JSON list in Graph Studio's quad format
- rdf — outputs the graph serialized via rdflib (JSON-LD in the helper function above)
- records — outputs a list of Graph Studio records
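For example, the record form shown further below can be fetched like this (the variable name is just illustrative):
results_records_query = query_graphmart_DESCRIBE_CONSTRUCT(anzo_client, config['connection']['graphmart'], describe_query, 'records')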
print(json.dumps(results_construct_query, indent=4))
From here on you can refer to the LangChain documentation, but one option for using the output is to provide it to the prompt:
# Load environment variables from .env file
load_dotenv()
# Get the OpenAI API key from environment variables
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OpenAI API key not found in environment variables")
model = ChatOpenAI(model_name="gpt-4", openai_api_key=api_key)
# Define a prompt template
prompt_template = PromptTemplate(
input_variables=["question", "results_construct_query"],
template="Answer the following question: {question} using the semantic context provided in {results_construct_query}"
)
# Create an LLMChain with the LLM and the prompt template
chain = LLMChain(llm=model, prompt=prompt_template)
# Define the question
question = "What is the temperature of the tube with materialId TA-00109?"
# Run the chain with the provided question and results from the construct query
answer = chain.run({"question": question, "results_construct_query": json.dumps(results_construct_query)})
# Print the answer
print(answer)
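Note that LLMChain and chain.run are deprecated in newer LangChain releases. The same flow can be expressed with the runnable pipe (LCEL) syntax; a minimal sketch:
# Equivalent flow using LangChain's runnable (LCEL) syntax
chain = prompt_template | model
answer = chain.invoke({"question": question, "results_construct_query": json.dumps(results_construct_query)})
print(answer.content)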
The LLM can be sensitive to the format you choose. A classic json output can be great for post-processing but is difficult to interpret:
[
{
"namedGraphUri": "",
"subject": {
"value": "http://altair.com/MFG_Tubes/Tube/TA-00280",
"objectType": "uri"
},
"predicate": "http://altair.com/ontologies/MFG_Tubes#materialId",
"object": {
"value": "SP-0028",
"objectType": "literal"
}
}
]
The list of records is easier to read: still JSON, but the structure remains complex for LLMs, which are best at processing written text.
[
{
"s": "http://altair.com/MFG_Tubes/Tube/TA-00109",
"p": "http://altair.com/ontologies/MFG_Tubes#materialId",
"o": "SP-0029"
},
{
"s": "http://altair.com/MFG_Tubes/Tube/TA-00109",
"p": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
"o": "http://altair.com/ontologies/MFG_Tubes#Tube"
}
]
In most cases OpenAI's models and other popular ones will prefer the N3 format. You can consider removing the prefixes completely if your ontology structure does not cause confusion with similar terminology from different ontologies.
@prefix mfg: <http://altair.com/ontologies/MFG_Tubes#> .
mfg:TA-00109 a mfg:Tube ;
mfg:materialId "SP-0029" .
In some cases we have been transforming the N3 triples into sentences by using a call to the LLM as an intermediate step. The resulting improvement strongly depends on the complexity of the data, so this is not necessarily a critical step. You can also use a rule-based approach, which might be slightly more fragile but is still useful:
from rdflib import Graph

def n3_to_text(n3_data: str) -> str:
    """Converts N3 triples into simple natural-language sentences."""
    g = Graph()
    g.parse(data=n3_data, format='n3')
    text_output = []
    for subj, pred, obj in g:
        # Use the prefixed form (e.g. mfg:TA-00109) and keep only the local name
        subj_label = subj.n3(g.namespace_manager).split(':')[-1]
        pred_label = pred.n3(g.namespace_manager).split(':')[-1].replace('_', ' ')
        obj_label = obj.n3(g.namespace_manager).split(':')[-1].strip('"')
        text_output.append(f"{subj_label} {pred_label} {obj_label}.")
    return "\n".join(text_output)
n3_example = """
@prefix mfg: <http://altair.com/ontologies/MFG_Tubes#> .
mfg:TA-00109 a mfg:Tube ;
mfg:materialId "SP-0029" .
"""
print(n3_to_text(n3_example))
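For the example above this prints something like the following (triple order may vary):
TA-00109 materialId SP-0029.
TA-00109 type Tube.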
Usually I prefer to define CONSTRUCT templates in the graph and ontology itself (this can be as simple as a string in a dc:description). If you need the context for an instance of a specific class, you then look up the stored template for that class, with the relevant attributes and graph shape to extract as context for GraphRAG.
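A minimal sketch of that pattern, assuming (hypothetically) that the mfg:Tube class carries such a template in its dc:description and that the template uses an ?instance placeholder:
# Hypothetical convention: the CONSTRUCT template is stored in the class's dc:description
template_query = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?template WHERE {
    <http://altair.com/ontologies/MFG_Tubes#Tube> dc:description ?template .
}
"""
df = query_graphmart_SELECT(anzo_client, config['connection']['graphmart'], template_query)
construct_template = df['template'].iloc[0]

# Bind the placeholder to the instance we need context for
context_query = construct_template.replace("?instance", "<http://altair.com/MFG_Tubes/Tube/TA-00109>")
context = query_graphmart_DESCRIBE_CONSTRUCT(anzo_client, config['connection']['graphmart'], context_query, 'json')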
Wow!
If you’ve made it this far, give yourself a round of applause — you’ve unlocked the power of two game-changing technologies that will redefine the future of enterprise landscapes and agent-driven workflows!