TL;DR
In Retrieval-Augmented Generation (RAG), the quality of your LLM's answer is only as good as the context you retrieve. Document Chunking—how you split your massive PDFs and codebases into bite-sized pieces—is the most critical, yet often overlooked, step in building an enterprise RAG pipeline. This guide covers advanced techniques from semantic splitting to hierarchical chunking that will instantly boost your retrieval accuracy.
📋 Table of Contents
- Why Chunking is the Make-or-Break Step in RAG
- Basic Strategy: Fixed-Size Chunking with Overlap
- Advanced Strategy 1: Semantic Chunking
- Advanced Strategy 2: Hierarchical Chunking (Parent-Child)
- Advanced Strategy 3: Small-to-Big Retrieval
- Best Practices and Common Pitfalls
- FAQ
- Summary
✨ Key Takeaways
- Context is King: If a chunk cuts off halfway through a critical sentence, your Vector DB will fail to match it with the user's query.
- Overlap Saves Lives: Always include a 10-20% overlap between chunks to prevent losing context at the boundaries.
- Semantic over Fixed: Splitting by paragraphs or Markdown headers (##) yields vastly superior embeddings compared to splitting by a hard character count.
- Parent-Child Retrieval: Embed a small sentence for highly accurate semantic matching, but pass the entire parent paragraph to the LLM to provide full context.
💡 Quick Tool: JSON Formatter — Processing scraped data or complex API payloads for your RAG pipeline? Use our formatter to clean and validate your JSON before chunking.
Why Chunking is the Make-or-Break Step in RAG
When building a RAG system, you cannot feed an entire 500-page employee handbook into an embedding model like text-embedding-3-small. The model has a strict token limit (e.g., 8192 tokens). Even if it didn't, embedding a 500-page book into a single vector would dilute the meaning of every individual fact inside it.
To solve this, we chunk the document. We break it into smaller pieces, embed each piece, and store them in a Vector Database.
However, if you chunk poorly—say, slicing a sentence in half—the resulting vector will be mathematical garbage. When a user asks a question, the Vector DB won't find the answer, and the LLM will hallucinate.
📝 Glossary: RAG (Retrieval-Augmented Generation) — A framework that retrieves data from external databases to ground LLM generations in factual information.
Basic Strategy: Fixed-Size Chunking with Overlap
The most common, beginner-friendly approach is Fixed-Size Chunking. You decide on a set number of characters or tokens (e.g., 1000 characters) and slice the document mathematically.
To prevent cutting a crucial concept in half, you must introduce an Overlap.
```typescript
// Example using LangChain's CharacterTextSplitter
import { CharacterTextSplitter } from "langchain/text_splitter";

const splitter = new CharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200, // 20% overlap ensures boundary context is preserved
});

const docs = await splitter.createDocuments([massiveText]);
```
Pros: Extremely fast, easy to implement, and guarantees chunks will fit inside the embedding model's context window.
Cons: Blindly slices text. It might cut a chunk right in the middle of a Python function or a critical legal definition.
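If you want to see exactly what fixed-size chunking with overlap does under the hood, here is a minimal pure-Python sketch of the same idea, without any library dependency:

```python
def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Slice text into fixed-size chunks; consecutive chunks share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Toy example with tiny sizes so the overlap is visible:
chunks = fixed_size_chunks("abcdefghij", chunk_size=4, overlap=2)
# → ["abcd", "cdef", "efgh", "ghij", "ij"] — each chunk repeats the last 2 chars of the previous one
```

The overlap is what rescues a sentence that happens to straddle a chunk boundary: it appears whole in at least one of the two adjacent chunks.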
Advanced Strategy 1: Semantic Chunking
Instead of cutting blindly by character count, Semantic Chunking looks at the structure and meaning of the text.
Recursive Character Text Splitter
This is the industry standard for general text. It tries to split on double newlines (\n\n, i.e., paragraph breaks) first. If a paragraph is still too long, it falls back to single newlines (\n), then spaces (" "), and finally individual characters ("").
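The fallback logic can be sketched in a few lines of plain Python. This is a simplified illustration of the recursive idea, not LangChain's actual implementation (the real splitter also re-merges small adjacent pieces up to the chunk size, which is omitted here for brevity):

```python
def recursive_split(text: str, chunk_size: int,
                    separators=("\n\n", "\n", " ", "")) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    sep, *rest = separators
    if sep == "":
        # Last resort: hard character slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            # Piece is still too big: fall back to the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, tuple(rest)))
    return chunks
```

The payoff is that paragraph boundaries are preserved whenever possible, and character-level slicing only happens as a last resort.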
Markdown / HTML Splitters
If your source material is well-structured (like a Wiki or documentation), you should chunk based on headers (H1, H2, H3).
```python
# Python example using LangChain's MarkdownHeaderTextSplitter
from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_document = "# Chapter 1\n## Section A\nThis is the content..."

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
```
This ensures that all content under "Section A" stays together, maintaining its semantic integrity.
Advanced Strategy 2: Hierarchical Chunking (Parent-Child)
What if you need highly granular search accuracy, but the LLM needs a broad context to generate a good answer? Enter Hierarchical Chunking.
- Parent Chunk: You split the document into large chunks (e.g., 2000 tokens).
- Child Chunks: You split each Parent Chunk into multiple small chunks (e.g., 200 tokens).
- Embed the Children: You only embed and search against the Child Chunks.
- Retrieve the Parent: When a user's query matches a Child Chunk, you don't send the child to the LLM. Instead, you trace it back to its Parent Chunk and send the massive Parent Chunk to the LLM.
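The four steps above boil down to some simple bookkeeping. Here is a minimal sketch using fixed character sizes and an in-memory dict in place of a real vector store (in production, the child-to-parent mapping typically lives in the vector DB's metadata):

```python
def build_parent_child_index(document: str, parent_size: int = 2000,
                             child_size: int = 200):
    """Split into large parents, then split each parent into small children.
    Only the children would be embedded and searched."""
    parents = [document[i:i + parent_size]
               for i in range(0, len(document), parent_size)]
    children, child_to_parent = [], {}
    for p_id, parent in enumerate(parents):
        for j in range(0, len(parent), child_size):
            child_to_parent[len(children)] = p_id  # remember each child's parent
            children.append(parent[j:j + child_size])
    return parents, children, child_to_parent

def retrieve_parent(best_child_id: int, parents, child_to_parent) -> str:
    """After vector search picks the best child, hand its whole parent to the LLM."""
    return parents[child_to_parent[best_child_id]]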
Advanced Strategy 3: Small-to-Big Retrieval
Similar to Hierarchical Chunking, Small-to-Big Retrieval (often implemented as Sentence-Window Retrieval) isolates the exact sentence that answers the query.
- You chunk the document sentence by sentence.
- When a sentence matches the user's query, the system retrieves that sentence plus the 2 sentences before it and the 2 sentences after it.
- This dynamic "window" provides the LLM with the perfect amount of surrounding context without hardcoding massive chunks.
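The windowing step can be sketched as follows. Here `match_idx` stands in for the index of the sentence your vector search selected, and the naive regex sentence splitter is an assumption for illustration (real pipelines often use a proper sentence tokenizer):

```python
import re

def sentence_window(document: str, match_idx: int, window: int = 2) -> str:
    """Return the matched sentence plus `window` sentences on each side."""
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    lo = max(0, match_idx - window)
    hi = min(len(sentences), match_idx + window + 1)
    return " ".join(sentences[lo:hi])

doc = "S1. S2. S3. S4. S5. S6."
context = sentence_window(doc, match_idx=2)  # → "S1. S2. S3. S4. S5."
```

Clamping the window at the document boundaries means a match in the first sentence simply returns a smaller window rather than failing.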
🔧 Try it now: Extracting URLs or IDs from your raw RAG data? Use our free Regex Tester to quickly build patterns for cleaning your text before chunking.
Best Practices and Common Pitfalls
- Match Chunk Size to Embedding Model: If you use text-embedding-3-large, check its optimal sequence length. Don't embed 8000 tokens if the model's accuracy drops after 512 tokens.
- Always Clean Data First: Remove raw HTML tags, base64 images, and navigation menus before chunking. Garbage in, garbage out.
- Experiment with Overlap: A 10% to 20% overlap is standard. If your users frequently ask complex, multi-part questions, increase the overlap.
⚠️ Common Mistakes:
- Using standard text splitters for code → Fix: Use specialized code splitters (like LangChain's Language.PYTHON splitter), which respect language syntax boundaries and won't split a class definition in half.
- Ignoring Metadata → Fix: Always inject metadata (Document Title, Page Number, Date) into every single chunk. If a chunk just says "He signed the bill," the LLM won't know who signed what without the metadata.
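Metadata injection can be as simple as prefixing each chunk with a header line before embedding. A minimal sketch (the bracketed header format is just one convention, not a standard):

```python
def inject_metadata(chunk: str, title: str, page: int, date: str) -> str:
    """Prefix a chunk with its source metadata so it stays self-describing."""
    header = f"[Document: {title} | Page: {page} | Date: {date}]"
    return f"{header}\n{chunk}"

enriched = inject_metadata("He signed the bill.", "Senate Record", 12, "2024-03-01")
# The chunk now carries its provenance into both the embedding and the LLM prompt.
```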
FAQ
Q1: Is there a universal "best" chunk size?
No. For factual Q&A (like customer support), smaller chunks (256-512 tokens) yield higher retrieval accuracy. For summarization or complex reasoning tasks, larger chunks (1024-2048 tokens) provide the necessary context.
Q2: How does chunking affect API costs?
Embedding costs are usually negligible. However, if your chunks are too large, you will pass thousands of irrelevant tokens to the LLM during the generation phase, which will drastically inflate your LLM inference costs.
Q3: What is Semantic Router / Semantic Chunking using embeddings?
This is a cutting-edge technique where the system embeds every single sentence. It then calculates the cosine similarity between sequential sentences. When the similarity drops drastically, it assumes a topic change has occurred and makes a "cut" there.
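The breakpoint detection can be sketched as below. The toy 2-D vectors stand in for real embedding-model outputs, and the fixed threshold is an assumption (production implementations often use a percentile of the observed similarity drops instead):

```python
import math

def cosine(a, b) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def semantic_breakpoints(sentence_embeddings, threshold: float = 0.5) -> list[int]:
    """Cut wherever similarity between consecutive sentence vectors drops below threshold."""
    cuts = []
    for i in range(len(sentence_embeddings) - 1):
        if cosine(sentence_embeddings[i], sentence_embeddings[i + 1]) < threshold:
            cuts.append(i + 1)  # a new chunk starts at sentence i+1
    return cuts

# Sentences 0-1 point one way, sentences 2-3 another: the topic shift is detected at index 2.
embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
```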
Summary
Document chunking is the unsung hero of Retrieval-Augmented Generation. By moving away from naive fixed-size splitting and adopting semantic, hierarchical, or small-to-big strategies, you can drastically reduce hallucinations and build an enterprise RAG system that actually understands your data.
👉 Explore QubitTool Developer Tools — Streamline your AI data processing workflow with our free utilities.