
Advanced RAG: A Deep Dive into Document Chunking Strategies

Aaron
Author

Introduction

I previously wrote a RAG basics article that covered the full pipeline: chunking, indexing, retrieval, reranking, and generation. In that post I mentioned that chunking granularity requires trade-offs, but didn’t go deeper. Later, in actual projects, I discovered that the chunking strategy has a far bigger impact on the final result than I expected. With the same documents and the same Embedding model, changing only the chunking approach can as much as double retrieval accuracy. That’s when I realized: much of the quality of a RAG system’s answers is decided before you even start retrieving. This article dives deep into the document chunking step, comparing five mainstream strategies, their trade-offs, and when to use each one.

Why Chunking Strategy Matters So Much

The core of a RAG system is converting knowledge documents into vectors and matching them against user questions. The foundation of that matching quality is how you cut the document into chunks in the first place.

Poor chunking scatters a complete semantic unit across multiple fragments. During retrieval, you might only hit part of it while the other half of the critical information is lost. It’s like slicing a cake: you want a whole strawberry, but it gets chopped into crumbs spread across six slices. No single slice gives you the full flavor.

Chunking quality directly affects both how well the Embedding captures each chunk and how accurate retrieval is. In my experience, it is the most overlooked yet one of the most impactful parts of the entire RAG pipeline.

Five Chunking Strategies Compared

Fixed-size Chunking

The most intuitive approach: split by a preset character count, word count, or Token count. For example, cut every 500 Tokens.

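# Fixed-size chunking in LangChain
# (newer releases ship the splitters in the separate langchain_text_splitters package)
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,       # Measured by length_function: characters here, not Tokens
    chunk_overlap=50,     # Overlap between adjacent chunks, in the same unit
    length_function=len,  # Swap in a tokenizer-based counter to measure Tokens
)
chunks = splitter.split_text(document)  # document: the raw text as a str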

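The key technique is the sliding window: keeping partial overlap between adjacent chunks (e.g., 50 Tokens) to mitigate the problem of severed semantics. Note that with length_function=len, as in the snippet above, chunk_size and chunk_overlap are counted in characters rather than Tokens.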
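To make the sliding window concrete, here is a minimal sketch in plain Python, independent of LangChain. The function name `sliding_window_chunks` is my own, and the whitespace "tokens" are only a stand-in for a real tokenizer.

```python
def sliding_window_chunks(tokens, chunk_size=500, overlap=50):
    """Split a token list into fixed-size windows that overlap by `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Placeholder tokens stand in for real tokenizer output in this sketch.
words = [f"w{i}" for i in range(12)]
windows = sliding_window_chunks(words, chunk_size=5, overlap=2)
for w in windows:
    print(w)
```

Each window repeats the last two tokens of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.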

The advantages are clear: simplest to implement, uniform chunk sizes, easy to batch process. Ideal as a baseline or for quick validation.

The downsides are equally clear: it breaks sentences and paragraphs, scattering important information across different chunks. A sentence cut in half means retrieval only matches one half, losing the context.

It works, but the answer quality is mediocre. Good for getting started and quick validation, not recommended as a final solution.

Semantic Chunking

Fixed-size chunking ignores content entirely. Semantic chunking takes a different approach: split along semantic boundaries.

The process has four steps:

  1. Pre-split the document into meaningful units (sentences, paragraphs, thematic sections)
  2. Generate Embedding vectors for each unit
  3. Calculate cosine similarity between adjacent units
  4. Merge those with high similarity, iterating until similarity drops significantly (indicating a semantic shift)

It’s like listening to someone speak: when they switch from one topic to another, you can feel it. Semantic chunking gets the machine to make that same judgment.
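The four steps can be sketched roughly as follows. This is an illustration under loud assumptions: `toy_embed` is a bag-of-words stand-in I made up so the example runs without a model, and the threshold of 0.2 only makes sense for that toy vector. With a real Embedding model you would pass your own `embed` function and tune the threshold.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real Embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(text, threshold=0.2, embed=toy_embed):
    # Step 1: pre-split into sentences.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    # Step 2: embed each unit.
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        # Step 3: similarity between adjacent units.
        if cosine(prev_vec, vec) >= threshold:
            current.append(sentence)  # Step 4: same topic, keep merging
        else:
            chunks.append(" ".join(current))  # similarity dropped: new chunk
            current = [sentence]
    chunks.append(" ".join(current))
    return chunks


text = (
    "Cats sleep most of the day. Cats also sleep at night. "
    "Vector databases store embeddings. "
    "Vector databases index embeddings for search."
)
for chunk in semantic_chunks(text):
    print(chunk)
```

In this sketch a unit is merged when its similarity to the previous unit is at or above the threshold, so raising the threshold produces more, smaller chunks.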

The advantage is preserving natural semantic coherence. Chunks are informationally richer, and retrieval accuracy is higher.

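The disadvantage is the dependency on the similarity threshold. If units are merged whenever their similarity exceeds the threshold, setting it too high leaves chunks too fragmented, while setting it too low merges unrelated content into oversized chunks. (Some libraries express the threshold as a distance or breakpoint instead, in which case the direction is reversed.) This threshold requires experience, and different document types may need different values.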

Performs relatively well across most scenarios. A strong balance of quality and complexity.

Recursive Chunking

A “progressive” approach that works in layers:

  1. Pre-split by paragraphs or sections (simple separators work fine)
  2. Check whether each chunk exceeds the maximum size limit
  3. If it does, split further within the paragraph
  4. Repeat until every chunk is below the preset threshold

# Recursive chunking in LangChain
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    # Tried in order, coarsest first; the full-width marks cover Chinese text
    separators=["\n\n", "\n", ".", "!", "?", "。", "!", "?", " ", ""],
)
chunks = splitter.split_text(document)

The separators parameter defines a priority hierarchy: try double newlines (paragraphs) first, then single newlines (lines), then periods, and so on.
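The priority hierarchy is easier to see in a stripped-down version. The sketch below is not LangChain’s implementation: it is a simplified illustration that drops the separators, does not merge small neighbours back together, and adds no overlap. The name `recursive_split` and the default separator list are my own choices.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", ". ", " ")):
    """Split `text` so every piece is at most `max_len` characters long."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut every max_len characters.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        # Parts that are still too long are handed to the next, finer separator.
        pieces.extend(recursive_split(part, max_len, rest))
    return pieces


doc = "Short intro.\n\nThis paragraph is longer. It has two sentences.\n\nEnd."
pieces = recursive_split(doc, max_len=30)
for piece in pieces:
    print(repr(piece))
```

Only the middle paragraph exceeds the limit, so only it is split further at the sentence level. The short paragraphs pass through untouched.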

Recursive chunking preserves both natural semantic flow and unit completeness. It’s a compromise that improves on both fixed-size and pure semantic chunking. The downside is higher complexity and computational overhead, since you need to design proper termination conditions and splitting strategies for each level.

Best suited for documents with relatively clean structure but highly variable paragraph lengths.

Structure-based Chunking

This approach leverages the document’s existing structure (headings, sections, paragraphs) to define chunk boundaries. If a document has a clear hierarchy of “Chapter 1,” “1.1,” “1.1.1,” you split along those lines.

# Chunking by Markdown header structure in LangChain
from langchain.text_splitter import MarkdownHeaderTextSplitter

# Each pair is (heading prefix to split on, metadata key for that level)
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
# Returns Document objects: text in .page_content, heading path in .metadata
chunks = splitter.split_text(markdown_document)

Semantic completeness is the best here, since chunks naturally follow the document’s logical structure.
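To show what “following the logical structure” means mechanically, here is a small plain-Python sketch that records the heading path for every section. It is an illustration only: `split_by_headings` is a hypothetical helper, it handles just `#` to `###` headings, and it does not guard against `#` lines inside code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headings(markdown):
    """Return (heading_path, body) pairs, one per non-empty section."""
    sections, path, body = [], [], []

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append((" > ".join(path), text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the section that was open before this heading
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at the same or deeper level
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return sections


md = "# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it.\nThen check the logs."
sections = split_by_headings(md)
for heading_path, text in sections:
    print(heading_path, "|", len(text), "chars")
```

Keeping the heading path next to each chunk means a retrieved fragment still carries the context of where it sat in the document.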

But the prerequisite is that the document must have a clear hierarchical structure. In reality, many documents have messy structures: inconsistent headings, confused hierarchy, irregular formatting. And structure-based chunks vary wildly in length: one heading might have two lines under it, while another has thousands of words [1].

Excellent when the document structure is clean, but in practice that prerequisite often doesn’t hold.

LLM-based Chunking

Design a Prompt that lets a large language model automatically generate semantically independent, content-complete chunks. Essentially, you’re having AI read the document and decide how to split it.

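# Conceptual example: semantic chunking with an LLM
# The template is filled with str.replace rather than str.format,
# because the literal JSON braces below would break str.format.
prompt_template = """
Split the following text into semantic topics.
Requirements:
1. Each chunk must be a semantically complete topic unit
2. Do not cut in the middle of a sentence
3. Return JSON format: [{"chunk": "...", "topic": "..."}]

Text:
{document}
"""
prompt = prompt_template.replace("{document}", document)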

Typically the highest semantic accuracy. LLMs can take context and semantic relationships into account, producing splits closest to human judgment.

But also the highest computational cost. Every document has to go through an LLM, which is expensive and slow [2].
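Because each call is expensive, it is worth checking the model’s reply before indexing it. The sketch below shows one way to do that. Everything in it is an assumption for illustration: `call_llm` is whatever function sends a prompt to your model and returns its text reply, and `fake_llm` is a canned stand-in so the example runs offline.

```python
import json

PROMPT_TEMPLATE = (
    "Split the following text into semantic topics.\n"
    'Return JSON format: [{"chunk": "...", "topic": "..."}]\n\n'
    "Text:\n{document}"
)


def llm_chunk(document, call_llm):
    """Ask an LLM for chunks, then validate the reply before trusting it."""
    # str.replace avoids str.format choking on the literal JSON braces.
    prompt = PROMPT_TEMPLATE.replace("{document}", document)
    raw = call_llm(prompt)
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return None  # caller can fall back to a cheaper splitter
    if not isinstance(items, list):
        return None
    chunks = []
    for item in items:
        if not isinstance(item, dict) or not isinstance(item.get("chunk"), str):
            return None
        if item["chunk"] not in document:
            return None  # reject text the model paraphrased or invented
        chunks.append({"chunk": item["chunk"], "topic": str(item.get("topic", ""))})
    return chunks


def fake_llm(prompt):
    """Hypothetical stand-in for a real model call; returns a canned reply."""
    return json.dumps([
        {"chunk": "RAG retrieves documents.", "topic": "retrieval"},
        {"chunk": "Chunking happens before indexing.", "topic": "chunking"},
    ])


doc = "RAG retrieves documents. Chunking happens before indexing."
print(llm_chunk(doc, fake_llm))
```

Returning `None` on any malformed reply keeps the decision about retries or fallbacks with the caller, which matters when every retry costs money.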

The best quality at the highest price. Suitable for scenarios with strict quality requirements and manageable document volumes.

Quick Reference Table

| Strategy | Core Idea | Semantic Integrity | Implementation Complexity | Compute Cost | Best For |
| --- | --- | --- | --- | --- | --- |
| Fixed-size | Split by Token count | Low | Low | Low | Quick validation, baseline |
| Semantic | Split at semantic boundaries | High | Medium | Medium | General use, best value |
| Recursive | Progressive layered splitting | Medium-High | Medium | Medium | Docs with variable paragraph lengths |
| Structure-based | Split by document hierarchy | High | Low | Low | Well-structured docs (e.g., legal texts) |
| LLM-based | LLM understands then splits | Highest | High | Highest | High quality requirements, manageable volume |

In Practice, It’s Not a Single Choice

Five strategies covered, but which one to pick? The answer: you can combine them.

In real-world deployments, chunking strategy typically evolves like this:

Version 1: Start with fixed-size chunking to get the pipeline working and verify the basic flow. Answer quality might not be great, but at least the system runs.

Version 2: Optimize for known issues. For example, add manual rules with regex to match domain-specific terms and ensure they don’t get split. This is especially important for documents in medicine, law, and other fields with lots of specialized terminology.
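As a rough idea of what such a manual rule could look like, the sketch below does a fixed-size character split but moves any cut that would land inside a regex match. The helper name `split_protecting_terms` and the example term are hypothetical, and a real rule set would likely be a longer pattern built from a domain glossary.

```python
import re


def split_protecting_terms(text, max_len, term_pattern):
    """Fixed-size character split that never cuts inside a regex match."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    protected = [m.span() for m in re.finditer(term_pattern, text)]
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_len, len(text))
        for term_start, term_end in protected:
            if term_start < end < term_end:
                # The cut would land inside a term. Back up to the term's start,
                # or, if the term began at or before this chunk's start, extend
                # the chunk to the term's end so the term stays whole.
                end = term_start if term_start > start else term_end
                break
        chunks.append(text[start:end])
        start = end
    return chunks


text = "Patients with acute myeloid leukemia often need chemotherapy."
chunks = split_protecting_terms(text, max_len=25, term_pattern=r"acute myeloid leukemia")
for chunk in chunks:
    print(repr(chunk))
```

A chunk may end up shorter than `max_len`, or longer when a single term exceeds it. That is the price of keeping terms intact.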

Eventually, a combination of three strategy types emerges:

| Strategy Type | Use Case | Method |
| --- | --- | --- |
| General chunking | General knowledge documents | Fixed Token length splitting |
| Domain-specific chunking | Documents with specialized terminology | Custom strategies to preserve proper nouns |
| Coarse-fine combined | Complex documents | Coarse split first (e.g., 5000 Tokens), then fine-grained sub-splitting |

The coarse-fine combined approach deserves special mention. First, coarsely split by major sections to ensure each large chunk stays within the same topic. Then, within each large chunk, perform fine-grained sub-splitting. This balances topic integrity with retrieval precision [3].
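A minimal sketch of the two-level idea follows. It simplifies on purpose: the coarse pass here cuts by Token count rather than by major sections, and `window` and `coarse_fine_chunks` are names I made up for the illustration.

```python
def window(tokens, size):
    """Cut a token list into consecutive pieces of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def coarse_fine_chunks(tokens, coarse_size=5000, fine_size=500):
    """Return fine chunks, each tagged with the index of its coarse parent."""
    records = []
    for parent_id, coarse in enumerate(window(tokens, coarse_size)):
        # Fine chunks never cross a coarse boundary, so each stays on one topic.
        for fine in window(coarse, fine_size):
            records.append({"parent_id": parent_id, "tokens": fine})
    return records


# Integers stand in for real tokens; small sizes keep the output readable.
records = coarse_fine_chunks(list(range(23)), coarse_size=10, fine_size=4)
for record in records:
    print(record["parent_id"], record["tokens"])
```

One possible use of the `parent_id` is to match queries against the fine chunks and then hand the matching parent’s text to the LLM for broader context.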

The Token Size Trade-off

Regardless of strategy, there’s an inescapable parameter: the Token size of each chunk.

Too large: Semantic matching suffers. A single chunk covers too much information to precisely match the user’s specific concern. Like asking someone “What’s the weather in Beijing tomorrow?” and getting the entire week’s forecast.

Too small: You can precisely hit the most relevant sentence, but the number of chunks explodes. Similarity computation increases, system performance degrades, and fragmentation is so severe that contextual information is lost.

The core principle: Token size must balance semantic integrity against computational performance. There’s no universal optimal value; it needs tuning based on document types and usage scenarios [4].
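The “number of chunks explodes” side of the trade-off is simple arithmetic. For a sliding window, the chunk count is roughly one plus the remaining length divided by the step (chunk size minus overlap). The sketch below computes it; the one-million-Token corpus is a made-up figure for illustration, not a measurement.

```python
import math


def estimate_chunk_count(total_tokens, chunk_size, overlap=0):
    """Number of sliding-window chunks needed to cover `total_tokens`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if total_tokens <= chunk_size:
        return 1 if total_tokens > 0 else 0
    step = chunk_size - overlap  # new tokens covered by each additional chunk
    return 1 + math.ceil((total_tokens - chunk_size) / step)


corpus_tokens = 1_000_000  # hypothetical corpus size, for illustration only
for size in (128, 256, 512, 1024):
    print(size, estimate_chunk_count(corpus_tokens, size, overlap=size // 10))
```

Halving the chunk size roughly doubles the number of vectors to store and compare, which is why the smallest size that still retrieves well is not automatically the best choice.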

What to Keep in Mind When Choosing

Every approach has unique strengths and limitations. No single generic solution can address all problems. The strategies don’t conflict either; they can be combined. For example, when using recursive chunking, you might fall back to fixed-size or semantic chunking for particularly long paragraphs [5].

Technology selection should consider multiple factors: content characteristics, LLM capabilities, and computational resources. Most importantly, get it working first, then optimize. Don’t chase the perfect solution from day one. Start with a simple strategy to validate the pipeline, then iterate based on real problems.


  1. Structure-based chunking works best with Markdown, HTML, and similar formats where heading hierarchy is explicitly marked. For scanned PDFs or plain text, you need to do structure extraction first.

  2. LLM chunking costs can be reduced by using smaller models (like GPT-4o-mini), but semantic understanding capability drops accordingly. In practice, you might use a large model for critical documents and lighter approaches for everything else.

  3. The coarse-fine combined approach has a corresponding implementation in LlamaIndex’s HierarchicalNodeParser, which builds documents into a tree structure with parent nodes as coarse chunks and child nodes as fine chunks.

  4. Generally, 256–512 Tokens is a good starting point for general knowledge Q&A. Legal, medical, and other specialized domains may need larger chunks (512–1024 Tokens) to maintain context integrity.

  5. Both LangChain and LlamaIndex provide a variety of built-in Text Splitters that can be combined as needed. It’s worth reading through the documentation to understand each Splitter’s design philosophy before deciding how to mix and match.