In an era where AI strategy hinges on data quality, Knowledge Graphs aren’t optional — they’re your foundation.
LLM fine-tuning is deeply affected by data quality and semantics, making it crucial to adopt a sustainable data preparation strategy. Using traditional databases as sources can often lead to wasted resources.
A Knowledge Graph (KG) is a knowledge representation technology that captures the relationships between the key business concepts of an enterprise ecosystem. In pharmaceutical manufacturing, for example, a Knowledge Graph integrates critical elements such as production processes, raw materials, equipment, regulatory requirements, and product results, making wide-ranging impact chains explicit.
Fine-tuning large language models without grounding the approach in data from knowledge graphs is like setting out on a journey without a map: you might make progress, but you’ll likely waste time retracing your steps and correcting mistakes along the way.
Please keep in mind that for many business applications, full fine-tuning may be over-engineered. It is quite resource-intensive and often unnecessary if lighter approaches like GraphRAG or LoRA-based adaptation can achieve your goals.
For instance, you probably want to use GraphRAG when real-time fact retrieval and explainability are needed, and LoRA when small behavioral tweaks are sufficient. Full fine-tuning comes into play whenever task-specific reasoning, strict formatting, or extreme domain expertise is required. Here are some examples:
- GraphRAG: Retrieving updated GMP (Good Manufacturing Practice) regulations dynamically instead of embedding them in the model. Regulations change frequently; keeping them synchronized in the Knowledge Graph and retrieving them at query time is cheap, whereas fine-tuning an LLM at that frequency is expensive.
- LoRA: Adapting an LLM’s response style to generate compliance reports in a structured regulatory format.
- Fine-Tuning: Training an LLM for supply-chain risk assessment. Pharma companies face risks like raw material shortages, geopolitical disruptions, or temperature excursions in cold chains. These require real-time synthesis of multi-modal data (weather, logistics APIs, supplier news, IoT sensor alerts). Here GraphRAG won't be ideal, as it relies on comparatively static knowledge graphs (e.g., supplier databases) and has limitations in interpreting unstructured signals (e.g., news about a port strike in Vietnam). At the same time, LoRA likely cannot sufficiently encode domain-specific risk patterns (e.g., linking humidity sensor data to vaccine stability thresholds).
Full fine-tuning should almost certainly not be your first LLM-related project. Start with GraphRAG or try parameter-efficient fine-tuning methods like LoRA first. Move to full fine-tuning only once you know exactly where those approaches hit their limits and where you expect fully fine-tuned models to perform better.
If you actually do decide to build fine-tuned models in-house, which is likely what most companies will want due to data privacy concerns, the task demands more than brittle, one-off fine-tuning. Let’s start by understanding what is needed for fine-tuning — training data, preferably in JSONL, CSV, or similar formats.
JSONL (JSON Lines) is a format where each line contains a single JSON object, making it efficient for fine-tuning LLMs. It’s required because fine-tuning models like GPT, LLaMA, and Mistral expect structured, line-by-line training examples instead of a single large JSON file. Each entry typically includes a prompt (input) and response (output) to teach the model domain-specific tasks.
{"prompt": "What are the critical parameters for tablet compression?", "response": "Critical parameters include tablet weight, hardness, thickness, and compression force."}
{"prompt": "Explain a corrective action for high humidity in drug storage.", "response": "Increase dehumidification, adjust air handling systems, and monitor stability per GMP guidelines."}
{"prompt": "What is the purpose of cleanroom classification?", "response": "It ensures contamination control in pharmaceutical manufacturing by regulating particle count and airflow."}
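As a minimal sketch of how such a file is produced (the file name and example pairs are illustrative), Python's standard json module guarantees exactly one object per line:
import json

# Illustrative prompt/response pairs, e.g. exported from your data pipeline
examples = [
    {"prompt": "What are the critical parameters for tablet compression?",
     "response": "Critical parameters include tablet weight, hardness, thickness, and compression force."},
    {"prompt": "What is the purpose of cleanroom classification?",
     "response": "It ensures contamination control in pharmaceutical manufacturing by regulating particle count and airflow."},
]

# JSONL means exactly one JSON object per line, with no enclosing array
with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")

# Sanity check: every line must parse and contain both keys
with open("training_data.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        record = json.loads(line)
        assert {"prompt", "response"} <= record.keys(), f"line {i} is malformed"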
At this point we do not only need high-quality, semantically enriched data; we also need to ensure that these characteristics are preserved when the data is extracted from the database and presented to an LLM. Keep in mind that JSON is just a serialization format and can preserve some hierarchical and object relationships. But that is not what matters here, since in most cases LLMs respond best to fine-tuning data where the questions and answers are natural-language sentences, merely organized in JSON. Natural language preserves more semantics than any other type of knowledge representation. (In a previous article, I explain why semantic KGs are the only way to capture business logic at scale, a prerequisite for reliable fine-tuning: https://shalumov-boris.medium.com/why-semantic-knowledge-graphs-are-the-only-way-to-build-an-enterprise-data-fabric-68f991eb4116)
While JSON and CSV can technically store KG data, they lack built-in semantics. For example, a CSV column named ‘containsRawMaterial’ doesn’t tell the LLM whether this is a mandatory quality control parameter or a casual reference.
Knowledge graphs hold this knowledge quite naturally, while relational databases require a complex labyrinth of primary and foreign keys to achieve the same. This is also why we don’t want relational formats to hold the knowledge at any point, either initially or later in the process, before feeding it to the LLM.
Image generated by author using AI
Let’s walk through how domain-specific data moves from a graph structure to fine-tuning-ready prompts and responses.
@prefix : <http://example.com/pharma#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
# Classes
:TabletProduction a rdfs:Class .
:RawMaterial a rdfs:Class .
:QualityControl a rdfs:Class .
# Instances
:TabletBatch001 a :TabletProduction ;
    :hasBatchNumber "TB001" ;
    :hasManufacturingDate "2025-04-01" ;
    :containsRawMaterial :Paracetamol ;
    :qualityControlledBy :TabletWeightTest .
:Paracetamol a :RawMaterial ;
    :hasName "Paracetamol" ;
    :isSourcedFrom :SupplierABC .
:TabletWeightTest a :QualityControl ;
    :isConductedOn :TabletBatch001 ;
    :hasStandardValue "500 mg" .
We can convert this Knowledge Graph into natural language stored in JSONL format while preserving the semantic context of the data and the overarching ontology!
There are a few approaches. In most cases you should add guidelines to your ontology on how a specific class should be treated during fine-tuning. The most popular approach is to send the graph, along with the fine-tuning specifications in the ontology, as part of a request to an LLM and ask it to generate JSONL examples for you. Ideally you have a property on each class that serves as a template stating which attributes are relevant for fine-tuning. You may need format/content validation as a subsequent step.
- Enhance the RDF graph with metadata: Add a property (like :relevantForFineTuning) to each class that specifies which attributes are relevant for fine-tuning. There are other ways using node templates or SHACL, not covered in this example, that follow a similar logic.
- Select the subset of relevant data and send the RDF to the LLM: The graph is sent to an LLM in a structured format, asking it to generate the appropriate JSONL examples.
- Validate the output: After receiving the generated JSONL from the LLM, validate the format (e.g., correct prompts and responses) and the content (e.g., ask the LLM whether a specific triple can be found in the knowledge base at all).
Your graph would look something like this after step 1.
@prefix : <http://example.com/pharma#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
# Classes
:TabletProduction a rdfs:Class ;
    :relevantForFineTuning :hasBatchNumber, :hasManufacturingDate, :containsRawMaterial .
:RawMaterial a rdfs:Class ;
    :relevantForFineTuning :hasName, :isSourcedFrom .
:QualityControl a rdfs:Class ;
    :relevantForFineTuning :hasStandardValue, :isConductedOn .
# Instances
:TabletBatch001 a :TabletProduction ;
    :hasBatchNumber "TB001" ;
    :hasManufacturingDate "2025-04-01" ;
    :containsRawMaterial :Paracetamol ;
    :qualityControlledBy :TabletWeightTest .
:Paracetamol a :RawMaterial ;
    :hasName "Paracetamol" ;
    :isSourcedFrom :SupplierABC .
:TabletWeightTest a :QualityControl ;
    :isConductedOn :TabletBatch001 ;
    :hasStandardValue "500 mg" .
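If you maintain the ontology programmatically rather than editing Turtle by hand, a small sketch with rdflib can add these annotations (rdflib is my choice here; any RDF library works, and the file names are assumptions):
from rdflib import Graph, Namespace

PHARMA = Namespace("http://example.com/pharma#")

g = Graph()
g.parse("pharma.ttl", format="turtle")  # the graph from the previous listing

# Which attributes of each class are relevant for fine-tuning
relevant = {
    PHARMA.TabletProduction: [PHARMA.hasBatchNumber, PHARMA.hasManufacturingDate,
                              PHARMA.containsRawMaterial],
    PHARMA.RawMaterial: [PHARMA.hasName, PHARMA.isSourcedFrom],
    PHARMA.QualityControl: [PHARMA.hasStandardValue, PHARMA.isConductedOn],
}

for cls, properties in relevant.items():
    for prop in properties:
        g.add((cls, PHARMA.relevantForFineTuning, prop))

g.serialize("pharma_annotated.ttl", format="turtle")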
In step 2, you can get the relevant data out using SPARQL and provide the output along with a prompt describing the task.
PREFIX : <http://example.com/pharma#>
# Use CONSTRUCT or DESCRIBE to retain RDF semantics instead of SELECT
CONSTRUCT {
    ?instance ?property ?value
}
WHERE {
    # Get the class for each instance
    ?instance a ?class .
    # Get the relevantForFineTuning properties of the class
    ?class :relevantForFineTuning ?property .
    # Get the value for each relevant property
    ?instance ?property ?value .
}
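Here is a sketch of running this query with rdflib, assuming the annotated graph from step 1 was saved as pharma_annotated.ttl:
from rdflib import Graph

g = Graph()
g.parse("pharma_annotated.ttl", format="turtle")

# The CONSTRUCT query from above: extract only attributes flagged as relevant
query = """
PREFIX : <http://example.com/pharma#>
CONSTRUCT { ?instance ?property ?value }
WHERE {
    ?instance a ?class .
    ?class :relevantForFineTuning ?property .
    ?instance ?property ?value .
}
"""

# The result of a CONSTRUCT query is itself an RDF graph
subgraph = g.query(query).graph

# Serialize it to a Turtle string to embed in the LLM prompt
rdf_payload = subgraph.serialize(format="turtle")
print(rdf_payload)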
Prompt: Given the RDF instance and its context, generate a question and its corresponding answer based on the data. The answer should be derived directly from the RDF information and context. The response should follow the question-answer format in JSONL, where the model outputs a question and its respective answer.
Example Input
:TabletBatch001 a :TabletProduction ;
:hasBatchNumber "TB001" .
Example Output
{"prompt": "What is the batch number of TabletBatch001?", "response": "The batch number of TabletBatch001 is TB001."}
Send the results of the CONSTRUCT query as part of the LLM request to generate JSONL lines. Afterwards, use JSON schema validators to check the formatting, and run SPARQL queries against your KG to verify that responses align with ground-truth triples (e.g., confirm that ‘TabletBatch001 was manufactured on 2025-04-01.’ exists as a triple in the KG before allowing it into the training data). A sketch of this validation follows the examples below.
{"prompt": "What is the batch number of TabletBatch001?", "response": "The batch number of TabletBatch001 is TB001."}
{"prompt": "When was TabletBatch001 manufactured?", "response": "TabletBatch001 was manufactured on 2025-04-01."}
{"prompt": "What raw material is used in TabletBatch001?", "response": "TabletBatch001 contains Paracetamol as the raw material."}
{"prompt": "What is the name of the raw material Paracetamol?", "response": "The raw material Paracetamol is known as Paracetamol."}
{"prompt": "Who supplies Paracetamol?", "response": "Paracetamol is sourced from SupplierABC."}
{"prompt": "What is the standard value for TabletWeightTest?", "response": "The standard value for TabletWeightTest is 500 mg."}
{"prompt": "What is TabletWeightTest conducted on?", "response": "TabletWeightTest is conducted on TabletBatch001."}
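Here is a sketch of the validation step described above, using jsonschema for the format check and a SPARQL ASK query for the content check (the library choices, file names, and helper function are assumptions, not the only way to do this):
import json
from jsonschema import validate
from rdflib import Graph

# Format check: every generated line must be a prompt/response object
schema = {
    "type": "object",
    "required": ["prompt", "response"],
    "properties": {"prompt": {"type": "string"}, "response": {"type": "string"}},
}

g = Graph()
g.parse("pharma_annotated.ttl", format="turtle")

def triple_exists(subject: str, predicate: str, obj: str) -> bool:
    """Content check: confirm a claimed fact exists as a triple in the KG."""
    ask = f"""
    PREFIX : <http://example.com/pharma#>
    ASK {{ :{subject} :{predicate} {obj} }}
    """
    return bool(g.query(ask).askAnswer)

with open("generated.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)  # raises if the LLM emitted broken JSON
        validate(record, schema)   # raises if keys or types are wrong

# e.g. ground the manufacturing date before letting that pair into training data
assert triple_exists("TabletBatch001", "hasManufacturingDate", '"2025-04-01"')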
By converting structured graphs into natural-language prompt-response pairs, you ensure your training data preserves the critical logic and avoids the ambiguity that the Knowledge Graph had already resolved. This isn’t just about better outputs; it’s about auditability, scalability, and escaping the hallucination trap. In an era where AI strategy hinges on data quality, Knowledge Graphs aren’t optional — they’re your foundation. And yes, graphs aren’t free either: they require upfront modeling. But that investment pays off quite quickly in auditability, reusability, and model accuracy.