Introduction #
My company has recently started planning an intelligent customer service system. While researching it, I kept running into the term RAG (Retrieval-Augmented Generation). Enterprise knowledge assistants, document Q&A, and smart chatbots all rely on it. Since we’re going to use it, I figured I should understand how it actually works under the hood. I spent some time studying the full RAG pipeline, from data preparation to final answer generation, covering chunking, embedding, vector databases, retrieval, reranking, and more. This post documents what I learned, and I hope you find it helpful too.
Why RAG #
Suppose you want to build an intelligent customer service bot that answers questions about your company’s products. The most intuitive approach is to pick a large language model, then send it the product manual along with the user’s question.
Sounds fine at first, but in practice you’ll run into several hard problems:
- Limited context window: Every model has an upper limit on how much input it can accept, and a product manual can easily run hundreds of pages. If the manual exceeds the limit, it simply cannot be sent in one request. Even when it fits, very long inputs tend to make the model lose track of parts of the content, and answer quality drops.
- High inference cost: The longer the input, the more tokens are consumed. Attaching an entire manual to every single query will produce some eye-watering bills by the end of the day.
- Slow inference speed: More input means more content for the model to digest. Users could be waiting a long time for a single answer.
The core tension is this: we want the model to see all relevant information, but we can’t feed it everything. So what’s the alternative?
RAG’s approach is: instead of sending the entire document to the model, only send the parts relevant to the question. Specifically, split the document into chunks first, then when a user asks a question, retrieve the most relevant chunks and pass them along with the question to the LLM to generate an answer.
The Overall RAG Pipeline #
The RAG workflow is divided into two main phases:
- Data preparation phase (before the user asks a question): includes chunking and indexing.
- Answering phase (after the user asks a question): includes retrieval, reranking, and generation.
Let’s break down each step.
1. Chunking #
Chunking means splitting a long document into multiple smaller segments. There are many ways to do it: by character count, by paragraph, by section, by page, or with semantic-aware splitting that tries to keep related sentences together.
Choosing the right chunk size is a tradeoff. If chunks are too large, they carry irrelevant noise. If too small, they may lose important context, and the model will see disconnected fragments. In practice, finding the right granularity often requires repeated experimentation.
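To make the simplest strategy concrete, here is a minimal sketch of chunking by character count. The function name `chunk_by_chars`, the default sizes, and the idea of overlapping neighboring chunks are assumptions added for illustration. Overlap is one common way to soften the "disconnected fragments" problem, not something the pipeline requires.

```python
def chunk_by_chars(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters, so text near a boundary
    appears in both neighboring chunks (an assumption of this sketch).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in chunk_by_chars("abcdefghij", chunk_size=4, overlap=1):
        print(chunk)
```

Tuning `chunk_size` and `overlap` is exactly the experimentation mentioned above: larger values keep more context per chunk, smaller values reduce noise.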
2. Indexing #
Indexing is the core of the data preparation phase. It does exactly two things:
- Use an Embedding model to convert each text chunk into a vector.
- Store both the original text chunk and its corresponding vector in a vector database.
Sounds simple, but to truly understand these two steps, we need to clarify three concepts first: vectors, embedding, and vector databases.
Vectors #
A vector is a fundamental concept in mathematics that represents a quantity with both magnitude and direction. In code, we represent it as an array. The number of elements in the array is the vector’s dimensionality.
One-dimensional vectors can be plotted on a single axis, 2D vectors on a plane, and 3D vectors in a 3D coordinate system. The vectors used in RAG typically have hundreds or even thousands of dimensions, so they can’t be visualized directly, but the same math applies. Higher dimensionality can capture richer semantic information, though it also costs more storage and compute and does not by itself guarantee better retrieval accuracy.
Embedding #
Embedding is the process of converting text into vectors. Its key property is: texts with similar meanings produce vectors that are close to each other.
For example, using 2D vectors for illustration:
| Text | Vector |
|---|---|
| Python is a programming language | (2, 3) |
| An introduction to the Python language | (2, 2) |
| Nice weather for a hike today | (-3, -1) |
The first two sentences have very similar vectors, while the third is far away. This is because the first two are semantically close, while the third is completely unrelated.
With this property, when a user asks “What is Python?”, we first embed the question into a vector, then use vector similarity to find nearby vectors and locate the relevant text chunks.
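As a quick check of the "close" versus "far away" claim, the sketch below computes the straight-line distance between each pair of vectors from the table. The helper name `pairwise_distances` is made up for this example; the vectors are the illustrative 2D values above, not output from a real Embedding model.

```python
import math
from itertools import combinations

# 2D vectors copied from the table above (illustrative, not real embeddings).
vectors = {
    "Python is a programming language": (2, 3),
    "An introduction to the Python language": (2, 2),
    "Nice weather for a hike today": (-3, -1),
}


def pairwise_distances(named_vectors):
    """Return {(text_a, text_b): Euclidean distance} for every pair of texts."""
    return {
        (a, b): math.dist(named_vectors[a], named_vectors[b])
        for a, b in combinations(named_vectors, 2)
    }


if __name__ == "__main__":
    for (a, b), d in pairwise_distances(vectors).items():
        print(f"{d:.2f}  {a!r} <-> {b!r}")
```

The two Python sentences are 1.0 apart, while the hiking sentence is roughly 6.40 and 5.83 away from them.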
It’s worth noting that embedding is performed by specialized Embedding models, not generative models like GPT-4 Turbo. The MTEB leaderboard[1] evaluates and ranks various Embedding models and is a good reference.
Vector Databases #
A vector database is a database designed for storing and querying vectors. Beyond storage, it provides similarity search: given a query vector, it returns the stored vectors closest to it.
It’s important to emphasize that what goes into the vector database is not just vectors, but also the original text. This is because what we ultimately feed to the LLM is the original text. Vectors are just an intermediate product for retrieval. So a vector database table typically has at least two columns: original text and corresponding vector.
Back to the indexing step: the entire process is to embed each chunk one by one, then write both the text and the vector into the vector database, until all chunks are processed.
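The indexing loop can be sketched in a few lines. Everything named here is hypothetical: `toy_embed` is a hashed bag-of-words stand-in for a real Embedding model (it captures word overlap, not meaning), and `ToyVectorStore` is an in-memory list standing in for a vector database. The point is only the data flow: embed each chunk, then store text and vector side by side.

```python
import hashlib
import math
import re
from dataclasses import dataclass, field

DIM = 64  # toy dimensionality; real models use hundreds or thousands


def toy_embed(text: str, dim: int = DIM) -> list[float]:
    """Hypothetical stand-in for an Embedding model: hashed bag of words."""
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # md5 keeps the bucket stable across runs (built-in hash() is salted).
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


@dataclass
class Record:
    text: str            # original chunk, later passed to the LLM
    vector: list[float]  # used only for retrieval


@dataclass
class ToyVectorStore:
    """In-memory stand-in for a vector database table (text + vector columns)."""
    records: list[Record] = field(default_factory=list)

    def add(self, text: str, vector: list[float]) -> None:
        self.records.append(Record(text, vector))


def build_index(chunks: list[str], embed=toy_embed) -> ToyVectorStore:
    """Embed each chunk one by one and store it with its vector."""
    store = ToyVectorStore()
    for chunk in chunks:
        store.add(chunk, embed(chunk))
    return store
```

In a real system, `embed` would call an actual Embedding model and `ToyVectorStore.add` would be a write to the vector database, but the loop keeps the same shape.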
3. Retrieval #
Retrieval is the first step after a user asks a question. Its goal is to find the most relevant chunks from the entire collection.
The process is as follows:
- Send the user’s question to an Embedding model to convert it into a vector.
- Send this vector to the vector database.
- The vector database computes similarity between this vector and all stored chunk vectors, then returns the top N most similar chunks (e.g., 10).
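The three steps above can be sketched as a brute-force search. This example assumes the question has already been embedded by the same Embedding model as the chunks, and it reuses the illustrative 2D vectors from the embedding table. The question vector `(2.5, 3.0)` and the function names are made up, and cosine similarity (explained in the next subsection) is just one possible scoring choice.

```python
import heapq
import math


def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return dot / norm if norm else 0.0


def retrieve(question_vector, records, top_n=10):
    """Return the top_n (score, text) pairs, most similar first.

    `records` is a list of (text, vector) pairs, i.e. the two columns stored
    at indexing time. This sketch scores every record.
    """
    scored = ((cosine(question_vector, vec), text) for text, vec in records)
    return heapq.nlargest(top_n, scored)


if __name__ == "__main__":
    records = [
        ("Python is a programming language", (2, 3)),
        ("An introduction to the Python language", (2, 2)),
        ("Nice weather for a hike today", (-3, -1)),
    ]
    question_vector = (2.5, 3.0)  # hypothetical embedding of "What is Python?"
    for score, text in retrieve(question_vector, records, top_n=2):
        print(f"{score:.3f}  {text}")
```

With `top_n=2`, both Python sentences come back and the hiking sentence is left out.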
Vector Similarity Computation #
How does the vector database determine which chunks are most relevant to the user’s question? By computing vector similarity. There are three common methods:
- Cosine Similarity: Computes the cosine of the angle between two vectors. The smaller the angle, the higher the similarity.
- Euclidean Distance: Computes the straight-line distance between two vectors. The smaller the distance, the higher the similarity.
- Dot Product: Considers both direction and magnitude of vectors. Same direction with larger magnitudes produces a larger dot product and higher similarity.
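Conceptually, the vector database computes similarity between the question vector and every chunk vector, sorts the results, and returns the top N. In practice, most vector databases use approximate nearest neighbor (ANN) indexes so they don’t have to compare against every stored vector.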
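Each of the three measures fits in a line or two of Python. This is a minimal sketch with made-up function names; it assumes both vectors have the same length and are non-zero.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products; grows with both alignment and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by both lengths; depends only on the angle."""
    return dot_product(a, b) / (math.hypot(*a) * math.hypot(*b))


def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.dist(a, b)


if __name__ == "__main__":
    a, b, c = (1, 0), (3, 0), (0, 1)  # b points the same way as a but is longer
    print(cosine_similarity(a, b), dot_product(a, b), euclidean_distance(a, b))
    print(cosine_similarity(a, c), dot_product(a, c), euclidean_distance(a, c))
```

For `a` and `b`, cosine similarity is 1.0 because the direction is identical, while the dot product (3) and the distance (2.0) both reflect the difference in length. When all vectors are normalized to unit length, the three measures rank candidates in the same order, which is why the choice often comes down to what the Embedding model was trained with.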
4. Reranking #
Reranking further filters the chunks returned by the retrieval stage, selecting the most relevant few (e.g., 3).
You might wonder: if we only need 3 chunks at the end, why not just take 3 directly from retrieval? Why have two separate stages?
The reason is that the two stages use different similarity computation methods, with different tradeoffs between accuracy and cost:
| Stage | Method | Cost | Latency | Accuracy | Role |
|---|---|---|---|---|---|
| Retrieval | Vector similarity (cosine, Euclidean, etc.) | Low | Fast | Lower | Coarse filter |
| Reranking | Cross Encoder model | High | Slow | High | Fine selector |
Think of it like a company hiring process: retrieval is like HR screening resumes, quickly picking 10 promising ones from thousands; reranking is like the department interview, deeply evaluating those 10 candidates and selecting the best 3. Each round serves its purpose, balancing efficiency and accuracy.
A Cross Encoder[2] is a model designed for scoring text pairs. It takes two texts as input together and outputs a relevance score directly. It is typically more accurate than plain vector similarity, but the computational cost is also higher, which is why it’s only used on a small number of candidate chunks.
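The reranking step itself is just "score each (question, chunk) pair, sort, keep the best few". The sketch below shows that shape with a pluggable scorer. `toy_pair_score` is a hypothetical stand-in based on word overlap, not a Cross Encoder; a real model would be wrapped in a function with the same `(question, chunk) -> float` signature.

```python
import re
from typing import Callable


def toy_pair_score(question: str, chunk: str) -> float:
    """Hypothetical stand-in for a Cross Encoder.

    Returns the fraction of distinct question words that also occur in the chunk.
    """
    q_words = set(re.findall(r"[a-z0-9]+", question.lower()))
    c_words = set(re.findall(r"[a-z0-9]+", chunk.lower()))
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


def rerank(
    question: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float] = toy_pair_score,
    top_k: int = 3,
) -> list[str]:
    """Score every (question, chunk) pair and keep the top_k chunks."""
    scored = [(score_pair(question, chunk), chunk) for chunk in candidates]
    # sort() is stable, so chunks with equal scores keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


if __name__ == "__main__":
    candidates = [
        "Shipping takes five days.",
        "To reset the device, hold the power button.",
        "The device has a two-year warranty.",
    ]
    print(rerank("How do I reset the device?", candidates, top_k=2))
```

Because the scorer runs once per candidate, its cost scales with the number of chunks passed in, which is the reason retrieval narrows the field first.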
5. Generation #
The final step is straightforward. We have the user’s question and the 3 most relevant chunks from reranking. We assemble them into a prompt and send it to the LLM (e.g., GPT-4 Turbo), which generates an answer based on the provided chunks.
There are also some prompt engineering techniques involved here, such as how to organize the context, how to guide the model to answer only based on the provided chunks, and how to handle cases where “no answer can be found.” But these are advanced topics beyond the scope of this article.
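Without going into those techniques, the basic assembly step might look like the sketch below. The function name `build_prompt` and the exact wording of the instructions are assumptions for illustration, not a recommended template. The resulting string is what would be sent to the LLM.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble the reranked chunks and the user's question into one prompt."""
    # Number the chunks so the model (and a human reader) can tell them apart.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, "
        "say that you could not find it.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    print(build_prompt(
        "How long is the warranty?",
        ["The device has a two-year warranty.", "Returns are free within 30 days."],
    ))
```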
Full Pipeline Recap #
Data preparation before asking:
- Split relevant documents into chunks.
- Convert each chunk into a vector using an Embedding model.
- Store the text chunks and their vectors in a vector database.
Answering pipeline after asking:
- Convert the user’s question into a vector using an Embedding model.
- Query the vector database for the 10 most similar chunks (Retrieval).
- Use a Cross Encoder to rerank these 10 chunks and select the 3 most relevant ones (Reranking).
- Send these 3 chunks along with the user’s question to the LLM to generate the final answer (Generation).
That’s the complete RAG workflow. Understanding the principles behind each stage will give you a clearer direction when it comes to choosing Embedding models, tuning chunking strategies, or optimizing retrieval quality.
[1] MTEB (Massive Text Embedding Benchmark) is a comprehensive benchmark for evaluating Embedding models across multiple dimensions including classification, clustering, and retrieval.
[2] Cross Encoder is a model architecture that jointly encodes two concatenated texts. Compared to Bi-Encoder (which encodes texts separately then computes similarity), it can capture deeper semantic relationships but cannot pre-compute vectors, making it slower.