Intent Recognition Deep Dive: From Requirement Analysis to Algorithm Deployment

Introduction

I’ve been looking into smart customer service solutions lately, and the deeper I dug, the more I realized that intent recognition is far more complex than it seems. It’s not just a simple classification problem — it’s a full engineering methodology, from single-case analysis to building a global intent taxonomy. This article captures my thinking, in case it helps anyone else working on conversational products.

What Intent Recognition Really Means

Let’s start with a core insight: intent is about purpose, not expression.

A real-life example. A friend invites you to dinner, you don’t want to go, so you say “maybe another time.” What you said is “another time,” but your purpose is to decline. If the friend follows up with “how about next Thursday?”, you’d think they’re being clueless. You were clearly declining, but they took the words literally.

The core capability of an AI product is inferring intent from user behavior data. Users don’t speak in templates. Real expressions are extremely colloquial, casual, and often emotional. A single user message could be interpreted in four or more directions, and you need the ability to pinpoint the right one.

I like to condense intent analysis into a formula:

Intent Analysis = Purpose + Follow-up Action

This formula has two layers of meaning. First, analyzing intent isn’t about parsing the sentence itself. It’s about figuring out two things: what’s the user’s purpose, and what’s the most appropriate action to take. Second, the ultimate goal of intent recognition is the follow-up action. Every intent category should map to a set of concrete actions. If you recognize an intent but have no action to handle it, the recognition is pointless¹.
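One way to make the second point concrete is to keep an explicit intent-to-action registry and check it for intents that nothing handles. The sketch below is only an illustration: the intent names, the two action functions, and the `INTENT_ACTIONS` registry are hypothetical and not taken from any real system.

```python
from typing import Callable, Dict, List

# Hypothetical follow-up actions; a real system would call business services here.
def freeze_card(user_id: str) -> str:
    return f"card frozen for {user_id}"

def issue_replacement_card(user_id: str) -> str:
    return f"replacement card ordered for {user_id}"

# Assumed registry: each intent maps to the ordered actions meant to satisfy it.
INTENT_ACTIONS: Dict[str, List[Callable[[str], str]]] = {
    "card_loss_report": [freeze_card, issue_replacement_card],
    "complaint": [],  # recognized, but no action handles it yet
}

def intents_without_actions(registry: Dict[str, list]) -> List[str]:
    """Return the intents that have no follow-up action, sorted by name."""
    return sorted(name for name, actions in registry.items() if not actions)

def handle(intent: str, user_id: str) -> List[str]:
    """Run every follow-up action registered for the intent, in order."""
    return [action(user_id) for action in INTENT_ACTIONS.get(intent, [])]

print(intents_without_actions(INTENT_ACTIONS))
print(handle("card_loss_report", "u1"))
```

Any intent reported by `intents_without_actions` is a candidate for either getting an action or being dropped from the label set.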

With this formula in mind, let’s look at a classic example.

“Why Hasn’t My Loan Been Approved Yet”: One Case That Explains Everything

This is a bank customer service scenario. A user sends this message:

“Why hasn’t my loan been approved yet, I submitted everything days ago”

What’s the user’s intent? Let’s not rush. First, list four possible interpretations.

Interpretation 1: Progress inquiry. The user is simply asking about the approval timeline and wants to know how much longer they need to wait. If we label the intent as “progress inquiry,” the follow-up action is to check the current approval status and provide an estimated completion date. The result? The agent says “Your loan is currently being reviewed, please wait patiently.” The user might get even more anxious: “I’ve been waiting for days and all you can tell me is to keep waiting?”

Interpretation 2: Emotional venting. The user is clearly anxious and impatient. It’s been days with no result, and they’re feeling uneasy. If the system judges it as emotional venting, the follow-up is purely soothing language. But what good is saying “We understand how you feel”? The user’s core need isn’t to be comforted. They want to know when the loan will actually come through.

Interpretation 3: Urgent escalation. The user might urgently need this loan and be counting on the money. We escalate to the approval department for expedited processing. But here’s the problem: first, some loan products inherently have longer approval cycles, and escalation doesn’t necessarily speed things up. Second, the user might just be casually asking and not truly in a rush. Blind escalation could disrupt the approval pipeline.

Interpretation 4: Loan plan adjustment. The user has been waiting a long time without approval, possibly because the current loan product has strict requirements or limited quota. Rather than passively waiting, we proactively help the user evaluate whether they could switch to a different loan product, adjust the loan amount, or modify the term. We trigger the loan plan adjustment process: the user either gets a more suitable plan or learns exactly where the approval is stuck and how much longer it will take, while the agent can provide concrete next steps.

Only the fourth interpretation leaves the user satisfied.

Let’s compare the outcomes:

Interpretation	Follow-up Action	User Satisfaction
Progress inquiry	Report approval status	Unsatisfied. Days of waiting, only to hear “please be patient”
Emotional venting	Soothing language	Unsatisfied. Comforting words don’t solve the real need
Urgent escalation	Expedite approval department	Unsatisfied. May not actually speed things up, and could disrupt the process
Loan plan adjustment	Plan adjustment process	Satisfied. User gets a better plan or clear timeline expectations

This case illustrates a key point: the core of intent analysis isn’t parsing the surface text. It’s working backwards. What follow-up action would satisfy the user’s need? Define the intent around that. The loan plan adjustment process works perfectly because it addresses the user’s actual need (a more suitable plan or clear progress expectations) and enables tracing the approval bottleneck, making any follow-up communication more concrete and evidence-based.

That’s the power of the intent analysis formula: working backwards from the follow-up action to define the intent².

Two more practice scenarios:

Wealth management consultation: A user asks “Can I still buy that 4% annual return wealth management product?” If we recognize it as a simple Q&A, we check the product status and return the answer. If the product is still available, that works. But what if it’s sold out? The user might move their money to another bank. Is there a better follow-up? For instance, recognizing it as a “financial needs consultation” and proactively recommending alternative products with similar returns.

Credit card statement inquiry: A user says “Why is my bill so high this month.” On the surface, this is a statement inquiry intent. How should we handle it? Just read out the total amount? Or list the transaction details and flag any suspicious charges first? Different follow-up approaches lead to vastly different user experiences.

Intent Annotation: From Single Cases to a Global View
#

Once you can analyze individual cases, reality hits. The ML engineers won’t come to you with just one case. They’ll ask: what percentage of our traffic does this intent represent? How much business value does optimizing it unlock?

AI teams typically carry core business metrics and are under a lot of pressure. If you propose an intent optimization but it only covers 0.5% of traffic, the algorithm and business teams won’t allocate resources. You need to identify the highest-volume intents and prioritize those to leverage resources effectively.

How do you figure out the distribution? Through intent annotation. Large-scale evaluation paints a complete picture of the intent landscape.

It’s no exaggeration to say that 30 to 40 percent of an AI product person’s time is spent analyzing user intent and running evaluations. Evaluation involves random sampling, data selection, standard-setting, and a whole methodology. One wrong step and your conclusions are completely off.

How to Run a Solid Evaluation

Step 1: Define the evaluation set.

The evaluation set is the dataset you’ll annotate. This step determines the foundation of your entire evaluation. A few key points:

The sample size must be large enough. At minimum, 1,000 cases. For businesses at the scale of major commercial banks or joint-stock banks, at least 10,000. Without sufficient volume, random sampling might miss categories, and the distribution won’t reflect reality.

Pick the right time period. You must simulate the real environment as closely as possible. Exclude data from major promotions, holidays, and other special periods. During promotions, user demand is high but atypical. During holidays, inquiry volume might be unusually low. Mix this data in and your distribution becomes distorted. Failing to simulate the real environment is a fatal flaw³.
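As a rough sketch of the time-period rule, the filter below drops records that fall inside excluded windows before any sampling happens. The date ranges, the `day` field, and the function names are assumptions made for illustration; the real windows depend on your own business calendar.

```python
from datetime import date

# Hypothetical special periods to exclude (assumed dates, for illustration only).
EXCLUDED_PERIODS = [
    (date(2024, 2, 9), date(2024, 2, 17)),    # assumed holiday window
    (date(2024, 11, 1), date(2024, 11, 11)),  # assumed promotion window
]

def in_excluded_period(day, periods=EXCLUDED_PERIODS):
    """True if the day falls inside any excluded window (bounds inclusive)."""
    return any(start <= day <= end for start, end in periods)

def filter_normal_days(records, periods=EXCLUDED_PERIODS):
    """Keep only records whose 'day' lies outside every excluded window."""
    return [r for r in records if not in_excluded_period(r["day"], periods)]

records = [
    {"day": date(2024, 2, 10), "text": "holiday-period message"},
    {"day": date(2024, 3, 5), "text": "ordinary-period message"},
    {"day": date(2024, 11, 11), "text": "promotion-period message"},
]
print([r["text"] for r in filter_normal_days(records)])
```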

Sampling method matters. Random sampling without deduplication, random sampling with deduplication, and stratified random sampling each have their use case. Under normal conditions, random sampling without deduplication best represents reality. Why? Because high-frequency issues naturally appear multiple times in production. Deduplicating distorts the proportions.
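The effect of deduplication on proportions is easy to see on a toy log. The messages and counts below are made up for illustration; the point is only that deduplicating flattens every message to an equal share, while a plain random sample draws from the raw log with its repeats intact.

```python
import random
from collections import Counter

def share(messages):
    """Fraction of the list taken up by each distinct message."""
    counts = Counter(messages)
    total = len(messages)
    return {text: n / total for text, n in counts.items()}

# Toy production log (made-up counts): one issue dominates the traffic.
log = ["where is my loan"] * 80 + ["reset my password"] * 15 + ["close my account"] * 5

rng = random.Random(0)               # fixed seed so the sketch is repeatable
sample = rng.sample(log, 50)         # random sampling without deduplication
deduplicated = sorted(set(log))      # deduplicated view of the same log

print(share(log))           # true proportions in the log
print(share(sample))        # approximates the true proportions
print(share(deduplicated))  # every message collapses to an equal share
```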

Include enough conversation turns. In bank customer service scenarios, typically sample 3 to 5 messages per conversation to preserve context. A single message might lack context entirely, making it impossible to understand what the user is talking about.
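A minimal sketch of keeping context around a sampled message: take the sampled message plus the turns immediately before it, capped at a maximum window. The function name and the choice to look only backwards are assumptions; some teams may prefer to include later turns as well.

```python
def context_window(messages, target_index, max_turns=5):
    """Return the sampled message plus the turns right before it, at most max_turns in total."""
    start = max(0, target_index - max_turns + 1)
    return messages[start:target_index + 1]

conversation = ["m0", "m1", "m2", "m3", "m4", "m5", "m6"]
print(context_window(conversation, 6))  # the last five turns
print(context_window(conversation, 1))  # fewer turns are available early on
```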

Step 2: Produce the evaluation document.

The evaluation document isn’t for show. Its purpose is to ensure everyone involved knows “what case gets what intent label, and what case doesn’t.”

Each intent should include three parts:

Intent description: A clear text explanation of what the intent is. For example, “Loan plan adjustment is the intent where a user requests to switch loan products or modify loan terms due to approval delays, insufficient quota, or similar reasons.”
Representative queries: The 3 to 5 most typical phrasings of this intent, including different query patterns.
Similar queries: Less typical but still valid phrasings that should be classified under this intent.

If you have 10 intents, list all three parts for each one. Note that the document won’t be perfect at first, and that’s normal. As you do more evaluations, you’ll continuously supplement and refine it.
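If the evaluation document lives in a structured form, the three parts can be checked automatically. The sketch below assumes a hypothetical `IntentSpec` record; the example queries are invented for illustration, and only the description text comes from the example above.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class IntentSpec:
    """One entry of the evaluation document (hypothetical structure)."""
    name: str
    description: str
    representative_queries: List[str] = field(default_factory=list)
    similar_queries: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """List what is still missing from this entry."""
        issues = []
        if not self.description.strip():
            issues.append("missing description")
        if not 3 <= len(self.representative_queries) <= 5:
            issues.append("expected 3 to 5 representative queries")
        if not self.similar_queries:
            issues.append("no similar queries yet")
        return issues

spec = IntentSpec(
    name="loan_plan_adjustment",
    description=("The user requests to switch loan products or modify loan terms "
                 "due to approval delays, insufficient quota, or similar reasons."),
    # Invented example queries, for illustration only.
    representative_queries=[
        "Can I switch to a different loan product?",
        "Can I lower the loan amount to get approved faster?",
        "Can I change the loan term?",
    ],
    similar_queries=["Is there another loan that is easier to get?"],
)
print(spec.problems())
```

An empty result means the entry has all three parts; a draft entry reports what still needs to be filled in during later evaluation rounds.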

Step 3: Execute the evaluation.

For participants, have at least 2 to 3 people cross-annotate. Intent evaluation is a subjective judgment process. Different people will interpret things differently, so cross-annotation is essential. More importantly, at least one ML engineer and one business operations person should participate. This is a great opportunity to align different roles in an AI project on a shared understanding of intents⁴.
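To check how well two annotators agree, one common metric is Cohen’s Kappa, which corrects the raw agreement rate for agreement expected by chance. The sketch below is a plain-Python version for two annotators; the labels are toy data, and a library implementation may be preferable in practice.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same cases in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of cases")
    n = len(labels_a)
    # Observed agreement: fraction of cases where both chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

annotator_1 = ["A", "A", "B", "B"]
annotator_2 = ["A", "A", "B", "A"]
print(cohens_kappa(annotator_1, annotator_2))
```

A low value is a signal to revisit the evaluation document before trusting the resulting distribution.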

For calculation, after each independent case is annotated, divide the count for each category by the total to get the percentage. For example, if you evaluate 1,000 cases and 100 are loan plan adjustments, that intent represents 10%.
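The calculation above can be sketched in a few lines. The label names other than the loan plan adjustment example, and their counts, are made up for illustration.

```python
from collections import Counter

def intent_distribution(labels):
    """Return (intent, count, share) tuples, largest intent first."""
    counts = Counter(labels)
    total = len(labels)
    return [(intent, n, n / total) for intent, n in counts.most_common()]

# 1,000 annotated cases, 100 of which are loan plan adjustments.
labels = (["progress_inquiry"] * 600
          + ["other"] * 300
          + ["loan_plan_adjustment"] * 100)

for intent, count, share in intent_distribution(labels):
    print(f"{intent}: {count} cases, {share:.1%}")
```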

The final output is an intent landscape map. This map gives you clarity about who you’re really building for.

Three Major Pitfalls in Evaluation

After running many evaluations, here are the three most common traps:

Pitfall 1: Incorrect random sampling logic. This is the most insidious trap. Everyone has a different understanding of “random,” and once you add various filtering conditions, the extracted data might not represent the true distribution at all. You annotate all 1,000 cases only to realize the proportions are wrong. Total waste. Always verify the sampling logic before pulling data.

Pitfall 2: Evaluators not taking it seriously, leading to low quality. You distribute the annotation tasks, everyone works independently, and then crams on the last day. The quality is terrible. The fix? Book a meeting room for an afternoon and have everyone annotate together. Discuss questions on the spot, resolve disagreements immediately. This investment is absolutely worth it.

Pitfall 3: Inconsistent evaluation standards. The ML team thinks it should be label A, business thinks B, product thinks C, and they argue endlessly. What to do? Skip the edge cases for now. But when multiple cases reveal a disagreement, someone needs to make the final call. That’s the AI product person’s responsibility. You’re the one who understands users best and the one accountable for the end result.

Building an Intent Taxonomy: From Chaos to Structure

After one round of annotation, you face a practical problem: there are too many intents.

A moderately complex customer service business has at least 200 to 300 intents. Large-scale businesses like major commercial banks or joint-stock banks easily exceed 1,000. Without organization and management, maintenance, iteration, and model training become extremely painful.

But people often give incomplete reasons for why you need a taxonomy:

“It makes adding, deleting, and modifying easier with a hierarchy.” True, but not the core reason.
“Having primary and secondary categories helps newcomers understand.” True, but still not the core reason.
“It helps find gaps and discover missing intents.” True, but again, not the core reason.

The Real Reason: Improving Fault Tolerance in Recognition Accuracy

Let me explain with an example. Suppose you’re searching for a precise address: City, District, Street, Building Number.

Option A: The search system only recognizes building numbers. Can’t find the building number? Returns empty results. The user gets nothing.

Option B: The search system has a hierarchical structure: City, District, Street, Building Number. Can’t find the building number? Returns results at the street level. The user arrives at the street and can look around, ask someone, or narrow down further.

Neither option perfectly satisfies the “precise to building number” need, but Option B is clearly better. Why? Because it has fault tolerance.

An intent taxonomy works on the same principle:

When an intent can be precisely recognized, having a taxonomy or parent-child structure makes little difference. The model hits it directly and everyone’s happy.
But when an intent can’t be precisely recognized (which is the norm), the difference between having a taxonomy and not having one is enormous. Without a taxonomy, the model can only return a completely irrelevant result, essentially “answering the wrong question.” With a taxonomy, the model can at least identify the parent intent. The result isn’t precise, but the direction is right.

So the fundamental reason for building an intent taxonomy is to improve fault tolerance in recognition accuracy. Anything that affects recognition accuracy is worth doing.

Additionally, the taxonomy is extremely useful for building guided dialogue flows. With a hierarchical structure, you can guide users to narrow their scope when recognition is uncertain, rather than giving them a wrong answer⁵.
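The fallback behavior can be sketched as follows. This is one possible design under stated assumptions, not a prescribed algorithm: the two-level `PARENT` mapping, the intent names, the 0.6 threshold, and the choice to score a parent by summing its children’s scores are all hypothetical.

```python
# Hypothetical two-level taxonomy: leaf intent -> parent intent.
PARENT = {
    "early_repayment": "repayment",
    "overdue_catch_up": "repayment",
    "loan_progress": "loan_application",
    "loan_plan_adjustment": "loan_application",
}

def resolve_intent(leaf_scores, threshold=0.6):
    """Pick a leaf intent if confident, else fall back to a parent, else ask the user."""
    best_leaf, best_score = max(leaf_scores.items(), key=lambda kv: kv[1])
    if best_score >= threshold:
        return ("leaf", best_leaf)
    # Assumption: a parent's score is the sum of its children's scores.
    parent_scores = {}
    for leaf, score in leaf_scores.items():
        parent = PARENT[leaf]
        parent_scores[parent] = parent_scores.get(parent, 0.0) + score
    best_parent, parent_score = max(parent_scores.items(), key=lambda kv: kv[1])
    if parent_score >= threshold:
        return ("parent", best_parent)
    return ("clarify", None)  # too uncertain even at the parent level

scores = {
    "early_repayment": 0.40,
    "overdue_catch_up": 0.35,
    "loan_progress": 0.15,
    "loan_plan_adjustment": 0.10,
}
print(resolve_intent(scores))
```

In this toy case no single leaf is confident enough, but the parent is, so a guided dialogue could ask which repayment situation the user means instead of guessing.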

Three Principles for Building an Intent Taxonomy

Principle 1: Scope each intent with representative and similar queries. During intent analysis, you already identified representative and similar queries for each intent. When building the taxonomy, these become the “boundary lines” for each intent. The concept of scope is critical. The more clearly you define representative and similar queries, the sharper the intent boundaries become.

Principle 2: Keep intent boundaries clear. When you have over 100 intents, you’ll notice some are very similar with significant overlap. If even humans get confused, the model is even more likely to misclassify A as B. Once misclassified, the follow-up action for B won’t satisfy the user’s actual need, resulting in an irrelevant response. In customer service, this is a serious incident. Each intent must have a clearly defined scope with no gray areas.

Principle 3: Group different situations of the same problem at the same level. For example: early loan repayment, interest rate change repayment, overdue payment catch-up. These look like three different intents, but they’re essentially three situations of the same “abnormal repayment processing” problem. The test is simple: are they different situations of the same underlying problem? If yes, group them at the same level.

One more note: fewer levels is better. More levels mean higher management and maintenance costs. If two levels get the job done, don’t add a third.
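For Principle 2, one hedged way to find blurry boundaries is to count which pairs of labels two annotators disagree on most often. The sketch below assumes both annotators labeled the same cases in the same order; the labels are toy data.

```python
from collections import Counter

def confused_pairs(labels_a, labels_b):
    """Count disagreements per unordered label pair, most frequent first."""
    pairs = Counter()
    for a, b in zip(labels_a, labels_b):
        if a != b:
            pairs[tuple(sorted((a, b)))] += 1
    return pairs.most_common()

annotator_1 = ["x", "x", "y", "z", "x"]
annotator_2 = ["y", "x", "x", "z", "y"]
print(confused_pairs(annotator_1, annotator_2))
```

Pairs near the top of the list are candidates for a sharper scope definition, or for being treated as situations of the same underlying problem.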

A Taxonomy Grows, It Doesn’t Get Drawn

An intent taxonomy isn’t something you nail down by drawing a mind map in a conference room. It typically takes six months to a year before it stabilizes.

During iteration you’ll constantly encounter these issues: inconsistent annotation standards where the same type of case gets labeled A this week and B next week; redundant intents where two intents turn out to be the same thing; and missing intents where users ask new questions that existing intents don’t cover.

All of this is completely normal. An intent taxonomy is inherently complex and requires continuous iteration. The mindset shouldn’t be “I need to design this perfectly the first time.” It should be “I’ll build a framework first, then refine it through practice.”

The Complete Path

Stringing the three steps together:

The user said X, what’s their purpose, what action satisfies it. That’s intent analysis, solving the single-case problem.

How many intents exist, what’s the proportion of each. That’s intent annotation, solving the global understanding problem.

Which intents belong together, what does the hierarchy look like. That’s the intent taxonomy, solving the structural management problem.

Connect all three, and you get a complete intent system, which is essentially a panoramic map of user needs.

With this map in hand, whether you build a traditional QA knowledge base for matching or use a generative model for RAG, you have a solid foundation. Identify needs, satisfy needs, iterate and optimize, and the flywheel starts spinning.

Intent recognition isn’t purely a technical problem. It’s a question of “how well do you really understand your users.” The deeper your understanding, the more reliable your intent taxonomy, and the better your AI product becomes.

¹ In Natural Language Understanding (NLU), intent recognition is typically modeled as a text classification task. But from a product perspective, recognition itself isn’t the goal. Finding the right follow-up action is. This is a significant departure from the traditional accuracy-only evaluation mindset.
² This approach of “working backwards from follow-up actions to define intents” is essentially reverse-engineering user needs. A similar approach in recommendation systems is “deriving feature importance from target behaviors.”
³ Distribution shift is a classic problem in machine learning. When the evaluation set distribution doesn’t match the production environment, you get high offline metrics but poor online performance.
⁴ Agreement between cross-annotators is measured as Inter-Annotator Agreement (IAA), with common metrics including Cohen’s Kappa (for two annotators) and Fleiss’ Kappa (for more than two). A higher IAA score indicates more consistent annotation standards.
⁵ This hierarchical intent structure is known in academia as Hierarchical Intent. Related research includes hierarchical intent modeling approaches for task-oriented dialogue systems.
