RAG Series · Day 6 of 20

Text Splitting

Why sending raw documents directly to an LLM is a mistake — and how chunk_size and chunk_overlap fix it.

The Problem

Why You Cannot Send Raw Documents Directly

LLMs have a fixed context window: a maximum number of tokens per request. A 500-page book sent directly can easily exceed it. Even documents that fit cause problems: when a single embedding has to represent long, dense, mixed-topic text, the topics blur together and semantic search becomes less precise.

·Context window limits — LLMs can only process a fixed number of tokens at once
·Better embeddings — Small focused chunks capture semantic meaning far more accurately
·Better search quality — Semantic search is significantly more precise on smaller chunks
Two Key Parameters

chunk_size and chunk_overlap

chunk_size
·Maximum number of characters per chunk
·Typical range: 500–1500 characters depending on use case
·Smaller = more precise retrieval. Larger = more context per chunk.
chunk_overlap
·Characters shared between adjacent chunks
·Prevents context from being lost at chunk boundaries
·Good rule: set to 10–20% of your chunk_size
💡 Solid default to start with: chunk_size=1000, chunk_overlap=200. Tune from there based on retrieval quality.
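
To make the two parameters concrete, here is a minimal sketch of a plain fixed-length split in pure Python. The function name `split_fixed` is hypothetical and is not taken from any library; it only illustrates how chunk_size caps each chunk and how chunk_overlap makes neighbouring chunks share text.

```python
def split_fixed(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of at most chunk_size characters,
    each sharing chunk_overlap characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_fixed(sample, chunk_size=10, chunk_overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk starts with the last two characters of the one before it ("ij", then "qr"). With the defaults above, the window would advance 800 characters at a time and repeat 200.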
Four Strategies

Ways to Split Your Text

1. Length-based — Fastest
·Splits at a fixed character count, so it is extremely fast
·Cuts mid-word and mid-sentence, so meaning can break
2. Text structure-based — Most popular in production
·Tries paragraphs first, then sentences, then words, recursively
·Preserves linguistic meaning, which makes it best for prose and articles
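
The recursive idea can be sketched in a few lines of pure Python. This is a simplified illustration, not the implementation used by any particular library: the function name `split_recursive` and the separator list are assumptions, and chunk overlap is left out to keep the logic visible.

```python
def split_recursive(text, chunk_size=1000, separators=("\n\n", "\n", ". ", " ", "")):
    """Return chunks of at most chunk_size characters, breaking on the
    coarsest separator that works and falling back to finer ones."""
    if len(text) <= chunk_size:
        return [text.strip()] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return split_recursive(text, chunk_size, rest)

    # Keep each separator attached to the piece it ends, so no text is lost.
    parts = text.split(sep)
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]

    chunks, current = [], ""
    for piece in pieces:
        if len(current) + len(piece) <= chunk_size:
            current += piece  # keep merging small pieces into one chunk
            continue
        if current.strip():
            chunks.append(current.strip())
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too large: retry with finer separators.
            chunks.extend(split_recursive(piece, chunk_size, rest))
            current = ""
    if current.strip():
        chunks.append(current.strip())
    return chunks


doc = (
    "Chunking splits documents. It keeps each piece small.\n\n"
    "Overlap repeats a little text. That protects context at boundaries."
)
chunks_rec = split_recursive(doc, chunk_size=60)
for chunk in chunks_rec:
    print(repr(chunk))
```

The first paragraph fits within 60 characters and stays whole. The second does not, so it is split again at the sentence boundary instead of being cut mid-word.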
3. Document structure-based — For code and markup
·Splits Python on class and function boundaries, Markdown on headers
·Respects the logical structure of the document format
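
As an illustration of the Markdown case, the sketch below starts a new section at every header line. The function name `split_markdown` is hypothetical, and the sketch assumes simple `#`-style headers; it does not handle `#` characters inside fenced code blocks, which a production splitter would need to skip.

```python
import re

# Matches header lines such as "# Title" or "### Sub-section".
HEADER = re.compile(r"^#{1,6}\s+(.+)$")


def split_markdown(markdown: str) -> list[tuple[str, str]]:
    """Return (header, body) pairs, one per Markdown header.
    Text before the first header gets an empty header string."""
    sections = []
    title, body = "", []
    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            # Close the section collected so far, if it has any content.
            if title or any(l.strip() for l in body):
                sections.append((title, "\n".join(body).strip()))
            title, body = match.group(1).strip(), []
        else:
            body.append(line)
    if title or any(l.strip() for l in body):
        sections.append((title, "\n".join(body).strip()))
    return sections


md = """# Install
Run the installer.

## Options
Use --quiet to hide output.

# Usage
Call the tool with a file path.
"""
sections = split_markdown(md)
for header, text in sections:
    print(header, "->", text)
```

Keeping the header alongside each body is a deliberate choice: it can be stored as metadata so a retrieved chunk still shows which section it came from.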
4. Semantic meaning-based — Most accurate, still experimental
·Splits where the topic meaning changes — uses embeddings to detect shifts
·Slowest, because the text must be embedded before it can be split, but in principle the most accurate strategy
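
The semantic approach can be sketched as follows: embed each sentence, compare neighbours, and start a new chunk when similarity drops. In this sketch the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the names `split_semantic` and `threshold` (with its value of 0.2) are assumptions chosen for illustration only.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def split_semantic(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk whenever two adjacent sentences score below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append([cur])  # topic shift detected: open a new chunk
        else:
            chunks[-1].append(cur)
    return chunks


sentences = [
    "Cats sleep most of the day.",
    "Cats also sleep at night.",
    "Python functions start with def.",
    "Python classes start with class.",
]
groups = split_semantic(sentences)
for group in groups:
    print(group)
```

The two sentences about cats land in one chunk and the two about Python in another. With a real embedding model the same loop applies, but every sentence costs a model call, which is where the slowness comes from.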

Text splitting is a critical RAG step that is often underestimated. Chunk size and overlap directly affect retrieval quality and, through it, answer accuracy. For most production applications, the Recursive Character Text Splitter, which implements the text structure-based strategy, is the reliable default: it respects meaning while staying within size limits.