Introduction #
How do you evaluate models so you can choose the right one? And how do you keep tracking its performance after deployment? These two questions have bothered me for a long time. The LLM landscape is insanely competitive right now, with new models dropping every month and benchmark rankings flying around everywhere. But there’s often a huge gap between benchmark scores and real-world business performance. You deploy a model, the accuracy sits at 60%, and you have no idea why. Is it the model? The data? The labeling?
I spent a long time figuring out both evaluation/selection and performance tracking, and I realized they’re really two phases of the same pipeline: first, use systematic evaluation to pick the model that best fits your business, then use continuous performance tracking to make sure the model actually delivers in production. Today I’m writing down this complete methodology, covering everything from evaluation frameworks and metrics systems to the closed loop of continuous optimization.
The Starting Point: Not the Best Model, but the Right One #
Let’s get one thing straight: the goal of evaluation is not to find the “best model,” but to find the model that best fits your business scenario. These are completely different things.
A model might crush every benchmark out there, but if it falls apart in your vertical domain (say, legal contract review or medical diagnosis), then it’s not a good model for you. Conversely, a small model that’s been fine-tuned for your specific use case and costs a fraction to run might be your best choice.
So the core logic of evaluation is: select the most suitable model for your business scenario at the lowest possible cost. This requires a systematic evaluation plan.
SuperCLUE Benchmark: Convenient but Not Transparent Enough #
SuperCLUE[^1] is a widely followed evaluation benchmark in the AI space. It tests models with 660 questions across dimensions like image quality, consistency, creativity, and complex adaptability. For product managers, its value lies in providing objective evaluation data at low cost. You don’t need to run evaluations yourself; just check the leaderboard to get a rough sense of where mainstream models stand.
But it has an obvious limitation: the evaluation data is not open-source. You can only see the final rankings, not how each model performs on individual questions. This means you can’t drill down into whether a model is strong or weak on the dimensions you actually care about. You lack specific case evidence for technical reports. And you can’t do attribution analysis, figuring out why a model got a particular question wrong.
Think of SuperCLUE like a university ranking. You know Tsinghua and Peking University are at the top, but you can’t tell which one has the better computer science program for your specific needs. For precise selection, you need to dig into specific dimensions.
Open-Source Evaluation: EvalScope in Practice #
What do you do when SuperCLUE doesn’t give you enough detail? You turn to open-source evaluation tools. The core value of open-source evaluation is that it’s reproducible, decomposable, and attributable. Using industry-standard open datasets, the entire evaluation pipeline is transparent, and you can trace every answer back to the original question.
But open-source evaluation has a pain point: high barrier to entry. It involves massive datasets, statistical scripts, and metric calculations. Just getting the evaluation pipeline running can take forever.
EvalScope[^2] (an open-source evaluation framework from the ModelScope community) exists to solve this problem. It provides three key capabilities:
Automated evaluation. The entire pipeline is cleanly packaged: install dependencies, configure the model under test, select datasets, run the script. No need to write evaluation code manually or download and process datasets one by one.
Visual reports. After about 20 minutes of evaluation, you get a visual report with diagnostic conclusions. Instead of cold spreadsheets, you get graphical comparative analysis showing exactly where each model excels and where it falls short.
Attribution analysis. This is the most valuable part. It doesn’t just tell you the score; it tells you why the model got that score. Which types of questions did it get wrong? Is the reasoning ability insufficient, or is the knowledge coverage incomplete? Does it perform worse in Chinese or English?
With attribution analysis, you can make targeted decisions: switch models, fine-tune for specific scenarios, or adjust your prompt strategy. Here’s a concrete example: suppose you’re building a legal Q&A product. Evaluation shows Model A has a higher overall score than Model B, but attribution analysis reveals Model A has a 40% error rate on “legal reasoning” questions, while Model B only has 15%. If you only looked at total scores, you’d pick Model A. But with attribution analysis, Model B is the clear winner.
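To make the legal Q&A example concrete, here is a minimal sketch that breaks the error rate down by question category. The record layout (`category`, `correct`) and the toy data are hypothetical and are not EvalScope's report format; a real run would read per-question results from whatever your evaluation tool exports.

```python
from collections import defaultdict

def overall_accuracy(results):
    """Fraction of questions answered correctly."""
    return sum(1 for r in results if r["correct"]) / len(results)

def error_rate_by_category(results):
    """Return {category: error_rate} from per-question results.

    Each result is a dict with two hypothetical keys:
    'category' (str) and 'correct' (bool).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        if not r["correct"]:
            errors[r["category"]] += 1
    return {cat: errors[cat] / totals[cat] for cat in totals}

# Toy data: Model A wins overall but is weak on legal reasoning.
model_a = (
    [{"category": "legal_reasoning", "correct": i >= 8} for i in range(20)]
    + [{"category": "statute_lookup", "correct": True} for _ in range(30)]
)
model_b = (
    [{"category": "legal_reasoning", "correct": i >= 3} for i in range(20)]
    + [{"category": "statute_lookup", "correct": i >= 6} for i in range(30)]
)

for name, results in (("A", model_a), ("B", model_b)):
    print(name, overall_accuracy(results), error_rate_by_category(results))
```

In this toy data, Model A has the higher overall accuracy, yet its error rate on the `legal_reasoning` category is 40% against 15% for Model B, which is exactly the pattern a total score hides.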
Evaluation in Practice: Four Steps #
Beyond tools, here are some practical guidelines.
Define Your Evaluation Dimensions #
Don’t run a full evaluation right out of the gate. First figure out what matters most for your business:
- Accuracy-first (medical diagnosis, legal consulting): focus on knowledge accuracy and reasoning ability
- Fluency-first (content creation, marketing copy): focus on language quality and creative ability
- Stability-first (customer service chatbots): focus on output consistency and format compliance
- Cost-first (high-concurrency scenarios): focus on token consumption and inference speed
Build a Business-Specific Evaluation Set #
General leaderboards are just references. The most convincing evaluation uses your own business data. Sample representative questions from real user interactions, manually label gold-standard answers, and create your own evaluation set. This beats any general benchmark.
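As a rough sketch of what such an evaluation set could look like in code, the snippet below samples interactions for labeling and scores a model by exact match against gold answers. The function names, the `(question, gold_answer)` layout, and the lookup-table "model" are all assumptions for illustration; exact match only suits short, closed-form answers, and free-form answers usually still need human review.

```python
import random

def sample_eval_set(interactions, k, seed=42):
    """Randomly sample up to k interactions to send for gold-answer labeling."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(list(interactions), min(k, len(interactions)))

def exact_match_accuracy(eval_set, predict):
    """Fraction of items where predict(question) equals the gold answer.

    eval_set: list of (question, gold_answer) pairs.
    predict: any callable str -> str (a stand-in for a real model call).
    """
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = sum(
        1 for question, gold in eval_set
        if predict(question).strip().lower() == gold.strip().lower()
    )
    return hits / len(eval_set)

# Toy usage with a fake "model" backed by a lookup table.
gold = [
    ("Can I repay early?", "yes"),
    ("Is there a late fee?", "yes"),
    ("Can I change my due date?", "no"),
]
fake_answers = {
    "Can I repay early?": "Yes",
    "Is there a late fee?": "no",
    "Can I change my due date?": "No",
}
accuracy = exact_match_accuracy(gold, lambda q: fake_answers[q])
print(f"exact-match accuracy: {accuracy:.2f}")
```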
Compare Across Multiple Dimensions #
A high BLEU[^3] score doesn’t mean a model performs well in your actual business. Combine multiple metrics with human evaluation and run A/B tests in production. Results are what matter.
Evaluate Continuously #
Models iterate, businesses evolve, and user needs change. Evaluation isn’t a one-time thing. Build an evaluation pipeline that automatically runs a round of evaluation after every model update or business change.
The endpoint of evaluation isn’t a report; it’s a decision. Good evaluation should directly answer: how does this model perform on core scenarios? How much improvement over the current solution? Is inference cost within acceptable range? Are there specific weaknesses that need to be addressed through prompt optimization or fine-tuning?
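One way to force an evaluation to end in a decision is to encode the questions above as an explicit gate. The sketch below is only an illustration: `EvalResult`, `should_switch`, and the default thresholds are hypothetical names and numbers that you would replace with your own launch criteria.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    accuracy: float              # accuracy on the business evaluation set
    cost_per_1k_requests: float  # inference cost, in whatever currency you use

def should_switch(current, candidate, min_gain=0.02, max_cost=None):
    """Return (decision, reason) for replacing `current` with `candidate`.

    min_gain: smallest accuracy improvement worth a migration (assumed value).
    max_cost: optional cost ceiling per 1k requests.
    """
    gain = candidate.accuracy - current.accuracy
    if gain < min_gain:
        return False, f"accuracy gain {gain:+.3f} is below {min_gain}"
    if max_cost is not None and candidate.cost_per_1k_requests > max_cost:
        return False, "candidate exceeds the cost budget"
    return True, f"accuracy gain {gain:+.3f} within budget"

decision, reason = should_switch(
    EvalResult(accuracy=0.80, cost_per_1k_requests=1.0),
    EvalResult(accuracy=0.85, cost_per_1k_requests=1.2),
    max_cost=1.5,
)
print(decision, reason)
```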
From Selection to Tracking: A Universal Algorithm Performance Process #
Once the model is selected, how do you track its performance? Here’s a three-step universal process: define requirements, track performance, and optimize continuously.
Define Algorithm Requirements #
- Specify inputs and outputs: Algorithm engineers need to know exactly what you’re giving them and what they need to return.
- Define business metrics and launch thresholds: “90% accuracy before we can launch” needs to be a hard number, not a vague target.
- Hand off a requirements document to the engineering team: Get it in writing to avoid verbal misunderstandings.
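A requirements document can be as simple as a small structured record. The sketch below uses a hypothetical `AlgorithmRequirement` dataclass; the field names are assumptions, and the example values are borrowed from the sentiment recognition case study later in this article.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AlgorithmRequirement:
    name: str
    input_description: str   # what the product hands to the algorithm
    output_labels: tuple     # what the algorithm must return
    metric: str              # the business metric used for sign-off
    launch_threshold: float  # hard number, not a vague target

    def ready_to_launch(self, measured):
        """True once the measured metric reaches the agreed threshold."""
        return measured >= self.launch_threshold

sentiment_req = AlgorithmRequirement(
    name="sentiment_recognition",
    input_description="one user message (plain text)",
    output_labels=("positive", "negative", "neutral"),
    metric="accuracy",
    launch_threshold=0.90,
)

print(sentiment_req.ready_to_launch(0.87))  # below the 90% bar
```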
If your company has an algorithm platform, check whether existing capabilities meet your needs before building from scratch. For complex scenarios, you may need algorithm engineers to assess feasibility early on.
Track Performance #
Submitting requirements is just the beginning:
- Maintain ongoing communication with algorithm engineers about their implementation approach
- Provide labeled data (this is critically important; more on this below)
- Track experimental results continuously, recording metric changes with each model iteration
- Conduct multi-dimensional data validation before launch
Optimize Continuously #
Launching the model doesn’t mean the work is done. Continuously collect Bad Cases in production, work with algorithm engineers to retrain on those cases, and iteratively improve business metrics.
Let me walk through a concrete example.
Case Study: Sentiment Recognition Algorithm #
In a customer service chatbot scenario, we discovered a major experience issue: users were clearly emotional, but the bot kept responding with matter-of-fact replies.
Consider this example:
User: “This is ridiculous! I submitted my loan application seven days ago and still no approval? I’m filing a complaint!”
Bot: “Loan application received. Under review.”
How do you think the user felt reading that? It’s like pouring gasoline on a fire. The user is already anxious and frustrated, and the bot just throws out a cold “under review” line. It feels like talking to a wall.
What we wanted was: acknowledge the emotion first, then provide the answer. “I’m sorry to hear you’re frustrated. Let me check on this right away…” Same loan approval process, but acknowledging the user’s emotion first and then sharing the approval status makes a world of difference. To make this happen, we needed a sentiment recognition algorithm: take the user’s message as input, output a sentiment label (positive/negative/neutral), and target 90%+ accuracy.
Classification Algorithms and Data Labeling #
The core technology behind sentiment recognition is classification algorithms. Binary classification is like determining whether a sheep is black or white, with only two possible outcomes. Multi-class classification is like sorting user messages into three piles: positive, negative, and neutral. Sentiment recognition is a classic multi-class classification task.
The training process can be summarized in one sentence: humans label data, the algorithm learns from labeled data, a model is trained, and the model classifies new sentences. It’s like teaching a child to recognize animals: you show them lots of cat photos and tell them “this is a cat.” After enough examples, the child can recognize cats they’ve never seen before.
Why AI Product Managers Need to Understand Classification #
Consider this example:
Sentence A: “I can’t take this anymore, I’m filing a complaint”
Sentence B: “Is this ever going to work? I’m really anxious”
As a human, you can immediately tell A is more negative than B. A literally says “complaint.” But to a classification algorithm, these two sentences might produce similar negativity scores. Why? Because classification algorithms fundamentally look at word frequency and statistical features. They can’t really understand that “complaint” carries a stronger negative signal than “anxious.”
If you don’t understand this technical limitation, you might think the algorithm engineer is “not good enough.” But if you do understand it, you know you can add a keyword rule as a supplement. This is the “when the algorithm falls short, use product thinking to fill the gap” approach.
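A keyword rule of this kind can sit on top of whatever label the classifier returns. The following is a minimal sketch, not a production rule engine: `ESCALATION_KEYWORDS` and `final_sentiment` are hypothetical names, the keyword list is illustrative, and plain substring matching will need refinement for real traffic.

```python
# Illustrative keyword list; "complain" also matches "complaint" and "complaining".
ESCALATION_KEYWORDS = ("complain", "lawsuit")

def final_sentiment(message, model_label):
    """Combine a classifier label with a keyword override.

    Returns (label, escalated). If the message contains a strong
    negative keyword, force 'negative' and flag it for escalation,
    regardless of what the model predicted.
    """
    text = message.lower()
    if any(keyword in text for keyword in ESCALATION_KEYWORDS):
        return "negative", True
    return model_label, False

sentence_a = "I can't take this anymore, I'm filing a complaint"
sentence_b = "Is this ever going to work? I'm really anxious"

# Suppose the model scored both as ordinary negatives; the rule separates them.
print(final_sentiment(sentence_a, "negative"))
print(final_sentiment(sentence_b, "negative"))
```

The model still does the bulk of the work; the rule only catches the small set of phrases where a miss is expensive.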
Data Labeling: The Ceiling of Model Performance #
When it comes to algorithm effectiveness, data labeling is unavoidable. The core takeaway is: the accuracy of labeled data sets the upper bound of model performance. If you label 20,000 sentences and 95% are labeled correctly while 5% are wrong or ambiguous, then no matter how much you tune the model, the maximum accuracy will be around 95%. Why? Because the “correct answers” the model learns from have 5% errors built in. How can it surpass its own textbook?
AI’s three foundational pillars are data, compute, and algorithms, but data accounts for about 70% of the weight[^4]. Here’s a telling example: have a fresh graduate and a senior algorithm engineer train classification models using the same data, and the accuracy difference will be only 1-2%. This means data quality matters far more than the algorithm engineer’s experience.
The recommended approach: have at least two people label the same dataset independently, keep the labels they agree on, and have a third person (or the original pair together) review and resolve the disagreements.
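The double-labeling workflow is easy to sketch in code. The example below assumes each annotator's work is a dict from item id to label; the function name and the toy labels are hypothetical.

```python
def split_by_agreement(labels_a, labels_b):
    """Split items into agreed labels and disagreements.

    labels_a, labels_b: dicts mapping item id -> label, one per annotator.
    Returns (agreed, disputed): agreed maps id -> label; disputed maps
    id -> (label_a, label_b) for items that need a joint or third review.
    """
    if labels_a.keys() != labels_b.keys():
        raise ValueError("both annotators must label the same items")
    agreed, disputed = {}, {}
    for item_id, label_a in labels_a.items():
        label_b = labels_b[item_id]
        if label_a == label_b:
            agreed[item_id] = label_a
        else:
            disputed[item_id] = (label_a, label_b)
    return agreed, disputed

annotator_1 = {1: "negative", 2: "neutral", 3: "positive", 4: "negative"}
annotator_2 = {1: "negative", 2: "negative", 3: "positive", 4: "negative"}

agreed, disputed = split_by_agreement(annotator_1, annotator_2)
agreement_rate = len(agreed) / len(annotator_1)
print(f"agreement: {agreement_rate:.0%}, to review: {disputed}")
```

A low agreement rate is itself a signal: it often means the labeling guidelines are ambiguous, not that the annotators are careless.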
Confusion Matrix: Reading the Algorithm’s “Report Card” #
After model training, the algorithm engineer will give you a “report card.” To read it, you need to understand the confusion matrix[^5].
How to Split the Dataset #
You don’t use all your data for training. The standard split is: 80% for the training set (say, 16,000 out of 20,000 samples) for the model to learn from; 20% for the validation/test set (4,000 samples) to evaluate how well the model learned. Think of it like school: 16,000 practice problems and 4,000 exam problems. The exam problems are ones the model has never seen, so they give a true reflection of performance.
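A minimal sketch of the 80/20 split, using only the standard library. The function name and the fixed seed are assumptions; for imbalanced classes, a stratified split that preserves label proportions may be a better fit.

```python
import random

def train_test_split(samples, test_ratio=0.2, seed=0):
    """Shuffle a copy of `samples` and split it into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

# 20,000 sample ids stand in for 20,000 labeled sentences.
train, test = train_test_split(range(20000))
print(len(train), len(test))
```

Shuffling before splitting matters: if the data is ordered by date or by business line, taking the last 20% as-is would test the model on a different distribution than it was trained on.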
What the Confusion Matrix Looks Like #
Using “positive sentiment prediction” as an example:
| | Predicted: Positive | Predicted: Negative/Neutral |
|---|---|---|
| Actual: Positive | TP (True Positive) | FN (False Negative) |
| Actual: Negative/Neutral | FP (False Positive) | TN (True Negative) |
Let me translate:
- TP (True Positive): Actually positive, model predicted positive. Got it right.
- FN (False Negative): Actually positive, model predicted negative. Missed it.
- FP (False Positive): Actually negative, model predicted positive. False alarm.
- TN (True Negative): Actually negative, model predicted negative. Got it right.
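The four cells can be counted directly from paired actual and predicted labels. This is a small sketch with made-up labels; `confusion_counts` is a hypothetical helper that treats one class as the target and everything else as "not positive".

```python
def confusion_counts(actual, predicted, positive="positive"):
    """Count TP, FN, FP, TN, treating `positive` as the target class
    and every other label (negative, neutral) as 'not positive'."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == positive:
            counts["TP" if p == positive else "FN"] += 1
        else:
            counts["FP" if p == positive else "TN"] += 1
    return counts

# Made-up labels for six test sentences.
actual    = ["positive", "positive", "negative", "neutral", "positive", "negative"]
predicted = ["positive", "negative", "positive", "neutral", "positive", "negative"]

counts = confusion_counts(actual, predicted)
print(counts)
```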
Three Core Metrics #
Based on the confusion matrix, you can calculate three core metrics:
Precision: Out of all the samples the model called “positive,” how many were actually positive? In plain terms: of the positive sentiments the model claims to have found, how many are truly positive?
Recall: Out of all actually positive samples, how many did the model recover? In plain terms: did the model miss any of the real positive sentiments?
F1 Score: The harmonic mean of precision and recall, serving as a comprehensive metric.
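In terms of the confusion matrix cells, precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is 2 × precision × recall / (precision + recall). A short sketch with made-up counts (the numbers are illustrative, not from a real model):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion matrix counts.

    Returns 0.0 for a metric whose denominator would be zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts: 80 true positives, 20 false alarms, 40 misses.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With these counts, precision is 0.8 and recall is about 0.667. F1 lands at about 0.727, closer to the lower of the two, which is the point of using a harmonic mean.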
Precision and recall often trade off against each other. Push precision higher, and the model becomes more conservative, potentially missing some positives (recall drops). Push recall higher, and the model becomes more aggressive, potentially misclassifying negatives as positives (precision drops). The F1 score helps you find the balance between the two. Higher is better for all three, but don’t chase perfection. Finding the right balance for your business scenario matters more.
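The trade-off is easiest to see by moving the decision threshold on a model's confidence score. The sketch below assumes the model outputs a score between 0 and 1 for the positive class; the scores and labels are made up for illustration.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold.

    scores: model confidence for the positive class (assumed 0..1).
    labels: True where the sample is actually positive.
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y)
    fp = sum(1 for s, y in pairs if s >= threshold and not y)
    fn = sum(1 for s, y in pairs if s < threshold and y)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [True, True, False, True, True, False, True, False]

for threshold in (0.85, 0.50, 0.25):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, the strict threshold (0.85) gives perfect precision but recovers only 2 of the 5 positives, while the loose threshold (0.25) recovers all 5 at the cost of two false alarms.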
Multi-Dimensional Data Validation: The Final Check Before Launch #
The model performs well on the test set. Ready to launch? Not quite. You need multi-dimensional data validation to verify the model performs consistently across different scenarios:
- Different business lines: Sample test cases from both pre-sales and after-sales to check stability
- Different user channels: App and web users may express themselves differently
- Different phrasing styles: For the same issue, some users say “early repayment” while others say “I don’t want the loan anymore.” Can the model handle both?
Why bother? Because a model might perform brilliantly on one type of data but poorly on another. If you only look at aggregate metrics, these weaknesses get buried in the averages. If validation reveals significant performance gaps across scenarios, you may need to train separate models per scenario or industry rather than using a one-size-fits-all approach.
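Slicing the test results by scenario is a few lines of code. The sketch below assumes each test record carries a slice field such as `channel`; the field names, the 5-point gap threshold, and the data are hypothetical.

```python
from collections import defaultdict

def accuracy_by_slice(records, slice_key):
    """Return {slice_value: accuracy} for records grouped by `slice_key`.

    records: dicts with 'actual', 'predicted', and the slice field.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r[slice_key]] += 1
        if r["actual"] == r["predicted"]:
            hits[r[slice_key]] += 1
    return {s: hits[s] / totals[s] for s in totals}

def flag_weak_slices(slice_accuracy, overall, max_gap=0.05):
    """Slices whose accuracy trails the overall number by more than max_gap."""
    return sorted(s for s, acc in slice_accuracy.items() if overall - acc > max_gap)

def make(channel, correct, n):
    """Build n toy records for one channel, all right or all wrong."""
    predicted = "negative" if correct else "neutral"
    return [{"channel": channel, "actual": "negative", "predicted": predicted}] * n

records = make("app", True, 9) + make("app", False, 1) \
        + make("web", True, 6) + make("web", False, 4)

overall = sum(r["actual"] == r["predicted"] for r in records) / len(records)
per_channel = accuracy_by_slice(records, "channel")
print(overall, per_channel, flag_weak_slices(per_channel, overall))
```

Here the aggregate accuracy of 75% hides a 90% versus 60% split between app and web traffic, and only the web slice is flagged.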
Continuous Optimization: A Never-Ending Loop #
After launch, continuous optimization never stops. Focus on two things.
Collect Production Bad Cases #
Bad Cases are instances where the model got it wrong, for example a clearly negative message classified as positive. These cases are gold mines. Collect them, re-label them, and retrain the model on them, and accuracy should improve step by step.
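A sketch of the collection step, assuming production predictions are logged and a sample of them is later reviewed by a human. The log layout (`message`, `predicted`, `reviewed_label`) and both function names are hypothetical.

```python
def find_bad_cases(production_log):
    """Return reviewed entries where the human label differs from the model's.

    Entries with reviewed_label=None have not been reviewed and are skipped.
    """
    return [
        entry for entry in production_log
        if entry["reviewed_label"] is not None
        and entry["reviewed_label"] != entry["predicted"]
    ]

def merge_into_training(training, bad_cases):
    """Return a new training dict (message -> label) with corrected labels added."""
    updated = dict(training)  # leave the original training set untouched
    for case in bad_cases:
        updated[case["message"]] = case["reviewed_label"]
    return updated

production_log = [
    {"message": "Great, thanks!", "predicted": "positive", "reviewed_label": "positive"},
    {"message": "Oh great, another delay", "predicted": "positive", "reviewed_label": "negative"},
    {"message": "When is my due date?", "predicted": "neutral", "reviewed_label": None},
]

bad_cases = find_bad_cases(production_log)
training = merge_into_training({"Great, thanks!": "positive"}, bad_cases)
print(len(bad_cases), training)
```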
Regularly Update Training Data #
Business keeps evolving. The phrases users commonly used three months ago may be completely different from what they use today. If the model keeps running on stale data, performance will gradually decay. So you need to regularly collect fresh production data for labeling and model updates.
These two actions form a continuously spinning flywheel: discover problems, collect cases, label data, train model, deploy and validate, discover new problems. The faster this flywheel spins, the better your product becomes.
Evaluation and performance tracking aren’t the exclusive domain of the tech team. They’re decision-making processes you must be involved in. Whether you’re selecting models or optimizing them, remember one principle: the endpoint of evaluation isn’t a report; it’s a decision. No matter how beautiful the report looks, if it doesn’t help you make better product decisions, it’s a wasted evaluation.
References #
[^1]: SuperCLUE is a comprehensive evaluation benchmark for Chinese general-purpose large language models, regularly publishing leaderboard rankings across multiple dimensions and scenarios.