Introduction #
For a long time I thought algorithms and models were the core of AI. It took seeing multiple projects to realize that data is what truly determines the ceiling of any AI system. I have watched teams use the exact same open-source models, yet deliver wildly different results purely because of how they prepared their data. That realization pushed me to dig deeper into everything from annotation standards to knowledge graphs. The deeper I went, the more I realized how much nuance hides in this space. This post is my attempt to lay out a complete picture of what “data-driven AI” really means.
Data Annotation: Far More Than “Slapping Labels on Things” #
Simple annotation tasks really are low-barrier. Deciding whether a review is positive or negative? An annotator can glance at it and decide. You probably don’t even need documentation.
Complex scenarios are a completely different story. Say you’re building a medical consultation system and need to annotate symptom-disease mappings. This is not something you handle by intuition. You need to produce a complete annotation document that covers at least:
- Annotation objectives: What are we labeling? Why are we defining it this way?
- Annotation guidelines: When do we label something A versus B? How do we handle edge cases?
- Review process: How do we quality-check after annotation? What’s the sampling ratio? How do we resolve disagreements?
The quality of your annotation document directly determines the quality of your training data, which in turn directly determines your model’s ceiling [1]. This cannot be overstated.
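To make the review step more concrete, here is a minimal sketch of two checks a review process might include: drawing a random sample for manual QA, and measuring how often two annotators agree beyond chance. Cohen's kappa is my own choice of agreement measure, not something the workflow above prescribes, and the labels and the 10% ratio are made-up illustrations.

```python
import random
from collections import Counter


def sample_for_review(items, ratio, seed=0):
    """Pick a reproducible random subset of annotated items for manual QA."""
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    k = max(1, round(len(items) * ratio))
    return random.Random(seed).sample(items, k)


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on the same six records.
annotator_1 = ["flu", "flu", "cold", "cold", "flu", "cold"]
annotator_2 = ["flu", "cold", "cold", "cold", "flu", "flu"]

kappa = cohen_kappa(annotator_1, annotator_2)
# Indices where the annotators disagree; these go to an adjudicator.
disagreements = [i for i, (a, b) in enumerate(zip(annotator_1, annotator_2)) if a != b]
review_batch = sample_for_review(list(range(100)), ratio=0.1)
```

A low kappa usually points back at the guidelines rather than at the annotators: if two careful people read the same rule differently, the rule needs another edge case written down.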
Fine-Tuning Data: Learning to “Read the Schema” #
One piece of work that’s easy to overlook but critically important during LLM fine-tuning is checking the schema of your fine-tuning data. Specifically, check whether the user intent annotations are consistent with your intent taxonomy.
For example: your intent taxonomy has a “loan early repayment” category, but the annotated data contains many instances of “want to pay off my loan early,” “can I pay less interest,” and “how do I make an early repayment” all labeled as different intents. That’s annotation inconsistency, and it needs to be corrected immediately.
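A check like this can be partly automated. The sketch below assumes each sample is a dict with `text` and `intent` keys and that the taxonomy is a plain set of label names; all of these names, including the `early_payoff` label, are hypothetical.

```python
from collections import defaultdict

# Hypothetical taxonomy; "loan_early_repayment" stands in for the category above.
INTENT_TAXONOMY = {"loan_early_repayment", "repayment_method_change", "balance_inquiry"}

samples = [
    {"text": "want to pay off my loan early", "intent": "loan_early_repayment"},
    {"text": "how do I make an early repayment", "intent": "early_payoff"},
    {"text": "Want to pay off my  loan early", "intent": "balance_inquiry"},
]


def find_unknown_labels(rows, taxonomy):
    """Return rows whose intent label is not defined in the taxonomy."""
    return [r for r in rows if r["intent"] not in taxonomy]


def find_conflicting_labels(rows):
    """Return normalized texts that were given more than one intent label."""
    labels_by_text = defaultdict(set)
    for r in rows:
        # Lowercase and collapse whitespace so trivial variants compare equal.
        key = " ".join(r["text"].lower().split())
        labels_by_text[key].add(r["intent"])
    return {t: sorted(ls) for t, ls in labels_by_text.items() if len(ls) > 1}


unknown = find_unknown_labels(samples, INTENT_TAXONOMY)
conflicts = find_conflicting_labels(samples)
```

This only catches labels outside the taxonomy and near-identical texts with different labels. Paraphrases that should share an intent still need a human reviewer or a similarity-based pass.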
Then there’s data quality assessment. Are there outliers? Lots of missing values? Duplicate samples? These might seem like a data engineer’s job, but if you can’t judge data quality yourself, you have no way to ensure the model will perform as expected.
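As a rough illustration of those three questions, the sketch below counts missing values, exact duplicate texts, and outliers in one numeric field. The field names (`text`, `intent`, `token_count`) and the 1.5 × IQR outlier rule are assumptions I picked for the example, not a standard your data must follow.

```python
import statistics


def quality_report(rows, required_fields, numeric_field):
    """Count missing values, exact duplicate texts, and IQR-based outliers."""
    missing = sum(
        1 for r in rows if any(r.get(f) in (None, "") for f in required_fields)
    )
    seen, duplicates = set(), 0
    for r in rows:
        key = r.get("text")
        if key in seen:
            duplicates += 1
        seen.add(key)
    values = [r[numeric_field] for r in rows if r.get(numeric_field) is not None]
    q1, _, q3 = statistics.quantiles(values, n=4)  # needs at least two values
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    outliers = [v for v in values if v < low or v > high]
    return {"missing": missing, "duplicates": duplicates, "outliers": outliers}


# Hypothetical fine-tuning rows: one missing label, one duplicate, one huge sample.
counts = [10, 11, 11, 12, 12, 12, 13, 400]
rows = [
    {"text": f"utterance {i}", "intent": "loan_early_repayment", "token_count": c}
    for i, c in enumerate(counts)
]
rows[2]["intent"] = None    # missing label
rows.append(dict(rows[0]))  # exact duplicate of the first row

report = quality_report(rows, ["text", "intent"], "token_count")
```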
In one sentence: data quality directly determines model performance. This is not a slogan. It’s an iron law.
Data-Driven Model Iteration #
A common misconception is that once a model goes live, the hard part is over. On the contrary, going live is where iteration begins.
A model’s performance in production often differs significantly from its performance in test environments. You need data to measure real-world performance, identify problems, and guide the next iteration. Concretely, this means breaking model performance down into quantifiable metrics:
- User intent recognition accuracy: Is the model correctly understanding what users want?
- Human escalation rate: Are users asking for human agents because the model’s answers are bad? Is this rate going down?
- Error intent ratio: Which intents are frequently misrecognized? Is there a pattern, or is it random?
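The three metrics above can be computed from conversation logs in a few lines. This is a sketch under an assumed log format: each entry has a human-verified `true_intent`, the model's `predicted_intent`, and an `escalated` flag. Those field names and the sample entries are hypothetical.

```python
from collections import Counter, defaultdict


def intent_metrics(logs):
    """Compute accuracy, escalation rate, per-intent accuracy, and confusions."""
    if not logs:
        raise ValueError("logs must not be empty")
    totals, correct = defaultdict(int), defaultdict(int)
    confusions = Counter()
    for entry in logs:
        truth, pred = entry["true_intent"], entry["predicted_intent"]
        totals[truth] += 1
        if pred == truth:
            correct[truth] += 1
        else:
            confusions[(truth, pred)] += 1  # which intent was mistaken for which
    return {
        "accuracy": sum(correct.values()) / len(logs),
        "escalation_rate": sum(bool(e["escalated"]) for e in logs) / len(logs),
        "per_intent_accuracy": {i: correct[i] / totals[i] for i in totals},
        "confusions": dict(confusions),
    }


logs = [
    {"true_intent": "repayment_method_change",
     "predicted_intent": "repayment_method_change", "escalated": False},
    {"true_intent": "repayment_method_change",
     "predicted_intent": "loan_early_repayment", "escalated": True},
    {"true_intent": "loan_early_repayment",
     "predicted_intent": "loan_early_repayment", "escalated": False},
    {"true_intent": "balance_inquiry",
     "predicted_intent": "balance_inquiry", "escalated": False},
]
metrics = intent_metrics(logs)
```

The `confusions` counter is what answers "is there a pattern, or is it random": if one (true, predicted) pair dominates, the errors are systematic.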
A practical example. Say you own a banking loan customer service chatbot. A week after launch, the dashboard shows that the “repayment method change” intent has only 65% recognition accuracy, far below the overall average of 85%. Diving into the data reveals that many users say “I want to switch my loan to equal principal and interest,” and the model fails to map this to the “repayment method change” intent.
What do you do? Targeted data augmentation and strategy adjustment: add more training samples with these phrasings, or insert a synonym mapping layer before intent recognition.
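A synonym mapping layer can be as simple as a lookup applied before the utterance reaches the intent model. The mapping entries below are invented for illustration; in practice they would come from the misrecognized phrasings found in your logs.

```python
import re

# Hypothetical phrase -> canonical wording, mined from misrecognized utterances.
SYNONYMS = {
    "switch my loan to": "change repayment method to",
    "equal principal and interest": "equal installment repayment",
}


def normalize_utterance(text, synonyms=SYNONYMS):
    """Rewrite known phrasings into canonical wording before intent recognition."""
    normalized = " ".join(text.lower().split())
    # Replace longer phrases first so shorter ones cannot split them.
    for phrase in sorted(synonyms, key=len, reverse=True):
        pattern = rf"\b{re.escape(phrase)}\b"
        normalized = re.sub(pattern, lambda _m, rep=synonyms[phrase]: rep, normalized)
    return normalized


canonical_text = normalize_utterance(
    "I want to switch my loan to equal principal and interest"
)
```

A lookup like this is cheap to ship and easy to audit, but it does not generalize. Adding the same phrasings as training samples is the more durable fix, so the two approaches work best together.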
This is data-driven iteration. It’s not “I feel like the performance is bad.” It’s letting data precisely tell you what’s wrong, by how much, and how to fix it [2].
Another practical tip: build a data dashboard. Don’t run ad-hoc SQL queries every time. Build a dashboard that engineering, product, and operations can all see, so everyone aligns on the same numbers. The dashboard isn’t for making reports look pretty. It’s for making sure everyone makes decisions from the same data.
Data Granularity: The Deciding Factor for AI Analysis Precision #
Data granularity is the most easily overlooked aspect of all.
Why do so many companies get poor results from AI? It’s not that the models aren’t powerful enough. It’s that their data granularity is too coarse.
A financial institution wanted to use LLMs to analyze loan approval bottlenecks. The department head said: “Let’s use AI. Let the model tell us how to improve loan approval efficiency.” The model’s answer: “Consider replacing the head of the risk control team.”
Can you blame the model? No. The data fed to it only contained coarse information like “average loan approval time increased by 15%.” The model could only arrive at a macro-level conclusion like “there’s a problem with the risk control process.”
What granularity of data do you need for the model to produce valuable analysis?
- Which specific loan product, approval stage, and risk control checkpoint has the problem?
- What is the customer’s credit profile? What is the approved credit limit? What is the guarantee method?
- When did the approval backlog start? How long has it lasted?
- Have risk control parameters (overdue thresholds, credit scores, debt-to-income ratios) fluctuated?
The finer the data granularity, the more precise the model’s conclusions. With the above fine-grained data, an LLM could tell you: “The third-party approval stage for personal consumer loans saw a 12% increase in average processing time after February 15th, caused by a batch of customers with generally lower credit scores leading to more manual reviews. Recommend optimizing auto-approval rules or adjusting the credit score threshold.”
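To see why granularity matters, compare a single "approval time went up" number with per-stage records. The sketch below assumes one record per approval step with `product`, `stage`, `date`, and `hours` fields; the field names, the year, and every value are made up for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def stage_slowdown(records, cutoff):
    """Per (product, stage): relative change in mean hours after `cutoff`."""
    before, after = defaultdict(list), defaultdict(list)
    for r in records:
        bucket = after if r["date"] > cutoff else before
        bucket[(r["product"], r["stage"])].append(r["hours"])
    return {
        key: mean(after[key]) / mean(before[key]) - 1
        for key in before
        if key in after
    }


# Hypothetical step-level records: (product, stage, date, processing hours).
raw = [
    ("personal_consumer", "third_party_approval", date(2024, 2, 1), 10.0),
    ("personal_consumer", "third_party_approval", date(2024, 2, 10), 10.0),
    ("personal_consumer", "third_party_approval", date(2024, 2, 20), 11.2),
    ("personal_consumer", "third_party_approval", date(2024, 2, 25), 11.2),
    ("personal_consumer", "document_check", date(2024, 2, 1), 5.0),
    ("personal_consumer", "document_check", date(2024, 2, 20), 5.0),
]
records = [dict(zip(("product", "stage", "date", "hours"), row)) for row in raw]

changes = stage_slowdown(records, cutoff=date(2024, 2, 15))
worst_stage = max(changes, key=changes.get)
```

With only the aggregate, neither you nor a model can say where the delay comes from. With step-level records, the slow stage falls out of a simple group-by, and that breakdown is the kind of input an LLM needs to say something specific.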
A company’s data foundation is a prerequisite for using AI [3]. Many companies rush into AI before they’ve even properly accumulated data. The result is inevitable: garbage in, garbage out.
Knowledge Graphs: The Structured Tool That Keeps AI Honest #
When talking about data capabilities, you can’t skip knowledge graphs.
The Hallucination Problem #
LLMs are trained largely on public internet data. That data contains both accurate and inaccurate information, both current and outdated content. This is one reason models sometimes confidently state things that are simply wrong. The academic term is “hallucination” [4].
In casual conversation scenarios, hallucinations might just cause an embarrassing moment. But in high-stakes domains like healthcare and finance, hallucinations are unacceptable. Inaccurate answers could lead to serious health problems or even life-threatening situations.
How Knowledge Graphs Address Hallucinations #
A knowledge graph is essentially a structured knowledge base. Instead of letting the model freestyle, it draws a “fence” around the model, restricting it to reasoning only within known factual relationships. In practice, knowledge graphs are typically combined with retrieval-augmented generation (RAG), a combination commonly known as GraphRAG [5].
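One way to picture the “fence” is a prompt that contains only facts retrieved from the graph. The sketch below is a deliberately simplified illustration of graph-grounded retrieval and is not Microsoft's GraphRAG pipeline. The triples, entity names, and prompt wording are all hypothetical, and the call to an actual model is left out.

```python
# Hypothetical triples (subject, relation, object); not real medical facts.
TRIPLES = [
    ("DrugA", "contains", "IngredientX"),
    ("IngredientX", "affects", "OrganY"),
    ("DrugB", "treats", "DiseaseZ"),
]


def retrieve_facts(question, triples):
    """Return triples whose subject or object is mentioned in the question."""
    q = question.lower()
    return [t for t in triples if t[0].lower() in q or t[2].lower() in q]


def build_grounded_prompt(question, triples):
    """Build a prompt that restricts the model to the retrieved facts."""
    facts = retrieve_facts(question, triples)
    fact_lines = "\n".join(f"- {s} {r} {o}" for s, r, o in facts)
    return (
        "Answer using only the facts below. "
        "If they are insufficient, say you do not know.\n"
        f"Facts:\n{fact_lines or '- (no matching facts)'}\n"
        f"Question: {question}"
    )


prompt = build_grounded_prompt("Is DrugA safe for me?", TRIPLES)
```

Matching entities by substring is the crudest possible retrieval step. A real system would use entity linking and follow edges beyond the first hop, but the shape is the same: the model sees graph facts and is told not to go beyond them.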
Understanding Knowledge Graphs Through a Medical Scenario #
Suppose there’s a medical knowledge graph containing these relationships:
- Drug-to-disease relationships
- Drug-to-ingredient relationships
- Ingredient-to-organ relationships
- Drug-to-side-effect relationships
- Symptom-to-disease relationships
A user asks: “Is it safe for a hypertension patient to take Boloffi?”
Without a knowledge graph, the LLM might give a vague answer based on scattered training data, or even fabricate non-existent facts.
With a knowledge graph, the reasoning chain looks like this:
Boloffi → releases chemoxydase → affects prostate → …… → damages kidney → leads to kidney disease
Combined with the LLM’s reasoning and language generation capabilities, the system delivers an accurate risk assessment: taking Boloffi carries a kidney damage risk for hypertension patients and is not recommended.
The knowledge graph plays three key roles here:
- Precisely linking complex relationships: From drug ingredients to organ impacts, every step is traceable.
- Enabling multi-hop reasoning: No need to manually predefine “A is related to B.” The system can automatically reason from A to B, then B to C, until reaching a conclusion.
- Improving explainability: Instead of giving the user just a conclusion, it shows the complete reasoning chain, building trust.
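Multi-hop reasoning and traceability are easy to demonstrate on a toy graph. The sketch below uses breadth-first search, which is my choice for the example and only one of several ways to traverse a graph. The entities are placeholders rather than the drug chain above, since that chain elides its middle steps.

```python
from collections import deque

# Hypothetical edges with placeholder names; not real medical facts.
EDGES = [
    ("DrugA", "releases", "CompoundX"),
    ("CompoundX", "affects", "OrganY"),
    ("OrganY", "damage_leads_to", "DiseaseZ"),
    ("DrugA", "treats", "DiseaseQ"),
]


def find_reasoning_path(edges, start, goal):
    """Breadth-first search for the shortest chain of triples from start to goal."""
    adjacency = {}
    for subj, rel, obj in edges:
        adjacency.setdefault(subj, []).append((rel, obj))
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in adjacency.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None  # no chain of known facts connects the two entities


path = find_reasoning_path(EDGES, "DrugA", "DiseaseZ")
# Render the chain so it can be shown to the user alongside the conclusion.
explanation = " → ".join([path[0][0]] + [f"{rel} {obj}" for _, rel, obj in path])
```

Nobody declared that `DrugA` is related to `DiseaseZ`; the link emerges from three separate facts. Returning the path itself, not just a yes or no, is what makes the answer explainable.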
Knowledge Graphs Go Far Beyond Healthcare #
- Social networks: Relationship networks between people, predicting hidden connections, recommending potential contacts.
- Law enforcement: Building graphs from call records and location data for case analysis.
- Education: Knowledge point relationship graphs for personalized learning path recommendations.
- Taxation: Invoice flow analysis, building knowledge graphs to identify tax evasion.
Six Capability Dimensions of Knowledge Graphs #
To effectively use knowledge graphs, you need at least a working understanding of six dimensions:
- Knowledge modeling: Defining the graph’s structure. What entities and relationships exist?
- Knowledge extraction: Extracting entities and relationships from unstructured data.
- Knowledge fusion: Merging knowledge from different sources and resolving conflicts.
- Knowledge visualization: Making complex graph relationships intuitive and understandable.
- Knowledge computation: Performing reasoning and analysis based on the graph.
- Knowledge application: Landing graph capabilities in concrete product scenarios.
You don’t need to master every dimension, but you should understand what each one does and why it matters. That’s how you make the right decisions in a project.
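For a feel of what the first and third dimensions involve, here is a tiny sketch of knowledge modeling (a schema of allowed relationships) and knowledge fusion (merging two sources under that schema). Everything in it is hypothetical: the schema, the entity types, and the alias table. It also only deduplicates and filters; resolving genuinely conflicting facts takes more than this.

```python
# Hypothetical schema: allowed (subject type, relation, object type) combinations.
SCHEMA = {("Drug", "contains", "Ingredient"), ("Ingredient", "affects", "Organ")}
ENTITY_TYPES = {"DrugA": "Drug", "IngredientX": "Ingredient", "OrganY": "Organ"}
ALIASES = {"drug a": "DrugA", "Drug-A": "DrugA"}  # surface form -> canonical name


def canonical(name):
    """Map a known alias to its canonical entity name."""
    return ALIASES.get(name, name)


def is_valid(triple):
    """Check a triple against the schema using the entities' declared types."""
    subj, rel, obj = triple
    return (ENTITY_TYPES.get(subj), rel, ENTITY_TYPES.get(obj)) in SCHEMA


def fuse(*sources):
    """Merge triples from several sources, dropping duplicates and schema violations."""
    merged = []
    for triples in sources:
        for subj, rel, obj in triples:
            triple = (canonical(subj), rel, canonical(obj))
            if is_valid(triple) and triple not in merged:
                merged.append(triple)
    return merged


source_1 = [("Drug-A", "contains", "IngredientX")]
source_2 = [
    ("DrugA", "contains", "IngredientX"),   # same fact under the canonical name
    ("IngredientX", "affects", "OrganY"),
    ("OrganY", "contains", "DrugA"),        # violates the schema, so it is dropped
]
graph = fuse(source_1, source_2)
```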
[1] There’s a widely circulated saying in the industry: “Garbage in, garbage out.” Data quality sets the ceiling for machine learning performance; algorithms merely approach that ceiling.
[2] The data-driven iteration approach shares its philosophical roots with A/B testing in traditional internet products. The core idea is always the same: replace subjective judgment with data.
[3] McKinsey’s 2023 report noted that the biggest barrier to enterprise AI adoption is not technology, but the maturity of data infrastructure.
[4] LLM hallucination is one of the most active research directions in the field. A definitive cure remains an open problem.
[5] GraphRAG is a method proposed by Microsoft in 2024 that combines knowledge graphs with retrieval-augmented generation, significantly improving accuracy on complex question-answering tasks.