The Problem
Why You Cannot Send Raw Documents Directly
LLMs have a fixed context window: a maximum number of tokens per request. A 500-page book sent in a single request can easily exceed it. Even documents that fit cause problems: semantic search quality degrades on long, dense, mixed-topic text, because one embedding has to represent many topics at once.
·Context window limits — LLMs can only process a fixed number of tokens at once
·Better embeddings — Small focused chunks capture semantic meaning far more accurately
·Better search quality — Semantic search is significantly more precise on smaller chunks
Two Key Parameters
chunk_size and chunk_overlap
chunk_size
·Maximum number of characters per chunk
·Typical range: 500–1500 characters depending on use case
·Smaller = more precise retrieval. Larger = more context per chunk.
chunk_overlap
·Characters shared between adjacent chunks
·Prevents context from being lost at chunk boundaries
·Rule of thumb: set it to 10–20% of your chunk_size
💡 Solid default to start with: chunk_size=1000, chunk_overlap=200. Tune from there based on retrieval quality.
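To make the interaction between the two parameters concrete, here is a minimal sketch in plain Python. `chunk_spans` is a hypothetical helper written for illustration, not a function from any library. It assumes sizes are measured in characters and that each chunk starts `chunk_size - chunk_overlap` characters after the previous one.

```python
def chunk_spans(text_length: int, chunk_size: int = 1000, chunk_overlap: int = 200):
    """Return (start, end) character offsets for fixed-size chunks with overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    spans = []
    start = 0
    while start < text_length:
        end = min(start + chunk_size, text_length)
        spans.append((start, end))
        if end == text_length:  # last chunk reached the end of the text
            break
        start += step
    return spans


# A 2,600-character document with the default settings
spans = chunk_spans(2600)
print(spans)  # [(0, 1000), (800, 1800), (1600, 2600)]
```

Each chunk repeats the last 200 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.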
Four Strategies
Ways to Split Your Text
1. Length-based — Fastest
✓Splits at fixed character count — extremely fast
✕Cuts mid-word and mid-sentence — meaning can break
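The trade-off above is easy to see in a few lines of plain Python. This is only an illustrative sketch (the function name `split_by_length` is made up here, and overlap is left out to keep it short), but it shows both the speed-friendly simplicity and the mid-word cuts.

```python
def split_by_length(text: str, chunk_size: int) -> list[str]:
    """Cut the text every chunk_size characters, ignoring word boundaries."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "Text splitting is a critical RAG step."
chunks = split_by_length(sample, 10)
print(chunks)
# ['Text split', 'ting is a ', 'critical R', 'AG step.']
```

Note how "splitting" and "RAG" are each torn across two chunks, which is exactly the weakness listed above.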
2. Text structure-based — Most popular in production
✓Tries paragraphs first, then sentences, then words — recursively
✓Preserves linguistic meaning — best for prose and articles
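The recursive idea can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the implementation used by any particular library: the separator list (`"\n\n"`, `". "`, `" "`) is chosen here to mirror "paragraphs, then sentences, then words", overlap is omitted, and a separator that falls on a chunk boundary is dropped.

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple = ("\n\n", ". ", " ")) -> list[str]:
    """Split text into chunks of at most chunk_size characters,
    preferring the coarsest separator that keeps chunks small enough."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing left to try: fall back to a hard character cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbouring pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece is too large on its own: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = ("First paragraph is short.\n\n"
        "Second paragraph is a bit longer. It has two sentences.")
for chunk in recursive_split(text, chunk_size=40):
    print(repr(chunk))
```

With `chunk_size=40`, the first paragraph fits as one chunk, while the second is too long and gets split at its sentence boundary instead of mid-word.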
3. Document structure-based — For code and markup
·Splits Python on class and function boundaries, Markdown on headers
·Respects the logical structure of the document format
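As a small illustration of the Markdown case, the sketch below starts a new chunk at every header line. The function name `split_markdown_by_headers` is hypothetical, and the sketch makes simplifying assumptions: it only recognizes `#`-style headers and does not skip `#` lines inside fenced code blocks.

```python
import re


def split_markdown_by_headers(markdown: str) -> list[str]:
    """Start a new chunk at every Markdown header line (#, ##, ... ######)."""
    chunks, current = [], []
    for line in markdown.splitlines():
        # A header line closes the chunk collected so far.
        if re.match(r"#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


doc = "# Intro\nSome text.\n\n## Setup\nInstall it.\n\n## Usage\nRun it."
sections = split_markdown_by_headers(doc)
print(sections)
# ['# Intro\nSome text.', '## Setup\nInstall it.', '## Usage\nRun it.']
```

Each chunk keeps its header, so the retrieved text carries its own section title as context. Sections that are still too long can be passed through a size-based splitter afterwards.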
4. Semantic meaning-based — Most accurate, still experimental
·Splits where the topic meaning changes — uses embeddings to detect shifts
·Slowest but theoretically the most accurate split strategy
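The mechanism can be sketched as: embed each sentence, compare neighbours, and start a new chunk where similarity drops. In the sketch below, `toy_embed` is a deliberately crude stand-in (a bag-of-words count vector) for a real embedding model, and the `threshold` value of 0.2 is an arbitrary assumption chosen for this example. Real systems would call an embedding model and tune the threshold.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def semantic_split(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Group consecutive sentences; break where neighbour similarity < threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(toy_embed(prev), toy_embed(cur)) < threshold:
            groups.append([cur])    # topic shift detected: start a new chunk
        else:
            groups[-1].append(cur)  # same topic: extend the current chunk
    return groups


sentences = [
    "Cats purr when cats are happy.",
    "Happy cats purr loudly.",
    "Stock markets fell sharply today.",
    "Markets fell after the stock report.",
]
for group in semantic_split(sentences):
    print(group)
```

The two cat sentences end up in one chunk and the two market sentences in another. The extra embedding call per sentence is what makes this strategy the slowest of the four.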
✦ Text splitting is a critical RAG step that is often underestimated. Chunk size and overlap directly affect retrieval quality and answer accuracy. For most production applications, the Recursive Character Text Splitter (the text structure-based strategy) is a reliable default: it respects meaning while staying within size limits.