Day 3 — Document Parsing for RAG | 20-Day RAG Series by Unnati

⚠️

The Real Problem: Broken Input = Broken Output

🔴 Root Cause of Hallucinations

Most RAG systems don't fail at the LLM stage — they fail at document ingestion. When your parser mangles a table or misses a scanned page, the LLM hallucinates because the context it received was already broken.

Imagine someone gave you an important letter — but before you read it, a toddler scribbled all over it and tore random pieces out. Now it doesn't matter how smart you are — giving the right answer is almost impossible. That's exactly what basic PDF loaders do in RAG systems.

📊 Table Destruction

Multi-column tables get linearized into one stream. Row context is lost, numbers float without headers, merged cells collapse into noise.
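To make the failure concrete, here is a small sketch contrasting a linearized cell stream with a Markdown rendering of the same table. The table values and the helper names `linearize` and `to_markdown` are invented for illustration; real loaders differ in detail, but the loss of row context looks much like this.

```python
# Hypothetical table, used only for illustration.
header = ["Quarter", "Revenue", "Margin"]
rows = [
    ["Q1", "4.1M", "21%"],
    ["Q2", "4.7M", "23%"],
]

def linearize(header, rows):
    """Mimic a basic loader: every cell is emitted as one flat stream."""
    cells = header + [cell for row in rows for cell in row]
    return " ".join(cells)

def to_markdown(header, rows):
    """Render the same table as Markdown so each value stays under its header."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)

flat = linearize(header, rows)
markdown_table = to_markdown(header, rows)

print(flat)            # one stream; nothing says which number is revenue
print(markdown_table)  # rows and headers survive
```

In the flat version an LLM has to guess that "4.7M" belongs to Q2 revenue; in the Markdown version the structure says so.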

🔍 OCR Blindness

Scanned PDFs and image-only pages return empty strings or garbled characters with basic loaders like PyPDF2.
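One cheap way to spot this before it reaches your index is to check each page's extracted text and flag the pages that look empty or garbled. The sketch below is a heuristic, not a standard: the function name `needs_ocr` and both thresholds are assumptions you would tune on your own documents.

```python
def needs_ocr(page_text, min_chars=20, min_alnum_ratio=0.5):
    """Heuristic: flag pages whose text layer is empty or mostly garbage.

    Both thresholds are illustrative assumptions, not established values.
    """
    stripped = page_text.strip()
    if len(stripped) < min_chars:
        return True
    readable = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    return readable / len(stripped) < min_alnum_ratio

# Hypothetical per-page output from a basic loader.
pages = [
    "Annual report 2024. Revenue grew 12% year over year.",  # normal text layer
    "",  # image-only page: nothing extracted
    "\ufffd\ufffd\ufffd ~~ \ufffd\ufffd ## \ufffd\ufffd\ufffd %% \ufffd\ufffd",  # garbled
]

flagged = [i for i, text in enumerate(pages) if needs_ocr(text)]
print(flagged)  # indices of pages to route through an OCR-capable parser
```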

📐 Layout Collapse

Multi-column layouts, sidebars, footnotes — all merge into one blob. Reading order goes wrong. Context is completely destroyed.
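The reading-order problem is easy to reproduce with a toy page. The sketch below assumes a two-column page whose text blocks carry `x`/`y` coordinates; the `Block` class, the coordinates, and the fixed `column_split` are all hypothetical, and real layout models detect columns rather than hard-coding a boundary.

```python
from dataclasses import dataclass

@dataclass
class Block:
    x: float  # left edge of the text block
    y: float  # top edge (smaller means higher on the page)
    text: str

# Hypothetical two-column page.
blocks = [
    Block(40, 100, "Left column, paragraph 1."),
    Block(320, 100, "Right column, paragraph 1."),
    Block(40, 200, "Left column, paragraph 2."),
    Block(320, 200, "Right column, paragraph 2."),
]

def naive_order(blocks):
    """Top-to-bottom, then left-to-right: interleaves the two columns."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.y, b.x))]

def column_order(blocks, column_split=300):
    """Read the whole left column first, then the right column."""
    ordered = sorted(blocks, key=lambda b: (b.x >= column_split, b.y))
    return [b.text for b in ordered]

print(naive_order(blocks))   # left 1, right 1, left 2, right 2
print(column_order(blocks))  # left 1, left 2, right 1, right 2
```

The naive order is what produces the "one blob" effect: sentences from unrelated columns end up adjacent in the same chunk.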

🖼️ Image Blindness

Charts, diagrams, infographics — silently skipped. Critical information that lives only in images is simply gone.

⚡ The Cascade Effect

Bad parsing → broken chunks → poor embeddings → wrong retrieval → hallucinated answer. The damage compounds at every stage, so fixing parsing improves every downstream stage at once.

🔧

Where Parsing Lives in the RAG Pipeline

01

Document Ingestion

Raw files arrive — PDFs, DOCX, PPTX, XLSX, HTML, scanned images. This is where most pipelines make their first (and costliest) mistake.

02

🎯 Intelligent Parsing (Today's Focus)

Advanced parsers analyze layout, extract tables as structured data, run OCR on images, preserve reading order, and output clean Markdown/JSON.

03

Chunking

Clean parsed output is split into semantically meaningful pieces. Structure-aware chunking is only possible after proper parsing. Optimal: 256–512 tokens.

04

Embedding & Indexing

Each chunk is converted to a dense vector and stored in a vector database (Pinecone, Qdrant, Weaviate, ChromaDB, etc.).

05

Retrieval + Generation

The user query is embedded, the most relevant chunks are retrieved, and the LLM generates an answer grounded in them. Clean parsing improves accuracy at each of these steps.
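Steps 04 and 05 can be sketched end to end with no external services. The example below is a toy: word counts stand in for a real embedding model, a Python list stands in for the vector database, and the chunk texts are invented. It only shows the mechanics of "embed, compare, return the closest chunk".

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical chunks produced by the parsing and chunking steps.
chunks = [
    "Q3 revenue was 4.7M, up 12% from Q2.",
    "The company opened two new offices in 2024.",
    "Headcount grew to 310 employees.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]  # the "vector store"

def retrieve(query, k=1):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

top = retrieve("What was the revenue in Q3?")
print(top)
```

If parsing had scattered "Q3", "revenue", and "4.7M" across different chunks, no chunk would score well and the generator would have nothing solid to ground on.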

🛠️

The Two Tools That Actually Fix This

🦙

LlamaParse

by LlamaIndex · Cloud API · v2 (2026)

Cloud-Based · Python + TS SDK · LLM-Native

GenAI-native — uses LLMs directly for precision parsing
v2 (late 2025): 4-tier config, up to 50% cost reduction, stable LTS
Natural language parsing instructions — tell it how to read your doc
Industry-leading table extraction powered by LLM intelligence
JSON mode: full structured output, tables as CSV
Supports PDF, PPTX, DOCX, XLSX, HTML, images
LlamaSheets beta (2026): handles merged cells & complex spreadsheets
1,000 free pages/day on LlamaCloud

🔬

Docling

by IBM Research · Open Source · Apache 2.0

Open Source · Self-Hosted · 37k+ Stars

Granite-Docling-258M (Jan 2026): production VLM, Apache 2.0
DocTags: charts, tables, forms, code, equations in one pass
TableFormer model: specialist table structure recognition
Built-in Azure AI Search & Granite integration
97.9% accuracy on complex tables (Procycons 2025)
Local/CPU — data never leaves your infrastructure
Hosted under LF AI & Data Foundation — enterprise governance
Red Hat: "#1 open-source repo for document intelligence"

💡 How to Choose

LlamaParse: fast cloud integration, LlamaIndex stacks, natural-language instructions, quick prototyping.
Docling: self-hosting, privacy-sensitive data, local enterprise deployment, full pipeline control.
Both output clean Markdown/JSON that any RAG framework can consume.

📊

Parser Benchmark (2025–2026 Data)

Based on Procycons benchmark (March 2025), Firecrawl comparative analysis (April 2026), and F22 Labs testing (Feb 2026) on financial PDFs and sustainability reports.

Complex Table Extraction Accuracy

Parser	Accuracy
Docling	97.9%
LlamaParse v2	~92%
Unstructured	75%
PyPDF2 / basic	<20%

Tool	Hosting	OCR	Table Quality	Cost	Best For
LlamaParse v2	Cloud API	✅ Yes	⭐⭐⭐⭐	Free tier + paid	LlamaIndex stacks
Docling + Granite	Self-hosted	✅ Yes	⭐⭐⭐⭐⭐	Free (Apache 2.0)	Privacy, enterprise
Reducto	Cloud API	✅ Agentic	⭐⭐⭐⭐⭐	Custom pricing	Finance, legal
Firecrawl PDF	Cloud API	✅ Auto	⭐⭐⭐	Usage-based	AI agents
PyPDF2 / basic	Local lib	❌ No	⭐	Free	Simple text PDFs only

💻

Code: LlamaParse v2 (2026)

bash Installation — old llama-parse deprecated May 2026

# New package replaces deprecated llama-parse
pip install llama-cloud-services llama-index-core

# Get your free API key at cloud.llamaindex.ai
export LLAMA_CLOUD_API_KEY="your_api_key_here"

python LlamaParse v2 — Basic Usage

from llama_cloud_services import LlamaParse
import os

parser = LlamaParse(
    api_key=os.environ["LLAMA_CLOUD_API_KEY"],
    result_type="markdown",       # or "json"
    verbose=True,
    language="en",
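    # 🌟 Unique superpower: natural language parsing instructions.
    # Newer SDK releases may deprecate `parsing_instruction` in favor of
    # `user_prompt` / `system_prompt_append`; check your installed version.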
    parsing_instruction="""
        Extract all tables in full. Preserve column headers.
        Keep all numbers and units exactly as they appear.
        If a page is scanned, apply OCR carefully.
        Maintain reading order for multi-column layouts.
    """
)

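# Parse synchronously. Inside an async function or a notebook, use
# `await parser.aload_data("report.pdf")` instead; a bare top-level
# `await` is a syntax error in a regular script.
documents = parser.load_data("report.pdf")
for doc in documents:
    print(doc.text)  # Clean Markdown, ready for chunking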

python LlamaParse — Full RAG Pipeline with LlamaIndex

from llama_cloud_services import LlamaParse
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import MarkdownElementNodeParser

parser = LlamaParse(
    result_type="markdown",
    parsing_instruction="Treat all tables carefully. Preserve structure."
)

documents = SimpleDirectoryReader(
    "./docs",
    file_extractor={".pdf": parser, ".docx": parser, ".pptx": parser}
).load_data()

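# Structure-aware chunking: understands Markdown headers + tables.
# MarkdownElementNodeParser calls an LLM to summarize tables (the default
# Settings.llm, typically OpenAI, so OPENAI_API_KEY must be set).
node_parser = MarkdownElementNodeParser(num_workers=8)
nodes = node_parser.get_nodes_from_documents(documents)

# Split into plain text nodes and table objects, then index both
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)
index = VectorStoreIndex(nodes=base_nodes + objects)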
query_engine = index.as_query_engine(similarity_top_k=5)

response = query_engine.query("What was the revenue in Q3?")
print(response)

🔬

Code: Docling + Granite-Docling-258M (2026)

bash Installation

# Core install
pip install docling

# Extras for the VLM pipeline that runs Granite-Docling
# (quote the brackets so zsh does not expand them).
# A GPU is used when the installed PyTorch build supports it.
pip install "docling[vlm]"

# Granite-Docling-258M downloads automatically on first use
# Or pull manually from HuggingFace:
pip install huggingface-hub
python -c "from huggingface_hub import snapshot_download; \
snapshot_download('ibm-granite/granite-docling-258M')"

python Docling — Basic Conversion

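from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Standard PDF pipeline: layout analysis + OCR + TableFormer.
# (The Granite-Docling VLM runs through Docling's separate VLM pipeline.)
pipeline_opts = PdfPipelineOptions(
    do_ocr=True,                   # OCR for scanned pages
    do_table_structure=True,       # TableFormer model
    generate_picture_images=True,  # Keep embedded pictures as images
)

# Pipeline options are passed per input format, not directly to the converter
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_opts)
    }
)
result = converter.convert("report.pdf")
doc = result.document

markdown = doc.export_to_markdown()   # Clean RAG-ready Markdown
json_data = doc.export_to_dict()      # Full DoclingDocument structure as a dict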

# Access tables as pandas DataFrames!
for table in doc.tables:
    df = table.export_to_dataframe()
    print(df)

python Docling — Full RAG Pipeline with LangChain

from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
import glob

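converter = DocumentConverter()

# HybridChunker: structure-aware, respects headings + tables.
# Built once, outside the loop. It takes no overlap setting; chunk
# boundaries follow the document structure instead.
chunker = HybridChunker(
    tokenizer="sentence-transformers/all-MiniLM-L6-v2",
    max_tokens=512,
    merge_peers=True,
)

all_chunks = []
for pdf_path in glob.glob("./docs/*.pdf"):
    result = converter.convert(pdf_path)

    for chunk in chunker.chunk(dl_doc=result.document):
        # The first provenance entry carries the page number; guard against
        # items that have no provenance
        prov = chunk.meta.doc_items[0].prov if chunk.meta.doc_items else []
        all_chunks.append(Document(
            page_content=chunk.text,
            metadata={
                "source":   pdf_path,
                # Chroma metadata values must be scalars, so join the heading path
                "headings": " > ".join(chunk.meta.headings or []),
                "page":     prov[0].page_no if prov else -1,
            }
        ))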

vectorstore = Chroma.from_documents(all_chunks, OpenAIEmbeddings())

qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True
)
answer = qa.invoke({"query": "What are the key findings?"})
print(answer["result"])

✂️

Chunking After Clean Parsing — 2026 Best Practices

✅ 2026 Research Consensus

A January 2026 systematic analysis found a "context cliff" at ~2,500 tokens, above which response quality drops sharply, so keep chunks well under this. Recursive 512-token splitting scored highest (69% accuracy) in a benchmark over 50 academic papers (Vecta, Feb 2026).

Strategy	Chunk Size	Best For	Accuracy	Cost
Recursive / Structural	256–512 tokens	Most RAG apps	⭐⭐⭐⭐ Best balance	Low
Docling HybridChunker	512 tokens default	Complex structured docs	⭐⭐⭐⭐⭐	Low
MarkdownElementNodeParser	Heading-aware	LlamaIndex + LlamaParse	⭐⭐⭐⭐	Low
Semantic Chunking	Variable	Knowledge bases	⭐⭐⭐ (can fragment)	High compute
Hierarchical	Multi-layer	Agent systems	⭐⭐⭐⭐ complex queries	High

📐 Practical Defaults (Feb 2026)

Chunk size: 256–512 tokens · Overlap: 10–20% (50–100 tokens) · The 3% retrieval gain from semantic chunking rarely justifies 10× compute cost at scale.
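The size and overlap arithmetic is easier to see in code. This is a minimal sliding-window sketch, not the recursive or structure-aware splitting recommended above: it assumes the text is already tokenized (the `tok0`, `tok1`, ... list is a placeholder for real tokenizer output) and only shows how a 512-token window with a 64-token overlap (12.5%) walks through a document.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

# Placeholder tokens standing in for real tokenizer output.
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens, chunk_size=512, overlap=64)

print([len(c) for c in chunks])
print(chunks[0][-64:] == chunks[1][:64])  # neighbours share the overlap region
```

With these defaults each window advances 448 tokens, so a 1,200-token document yields three chunks, the last one shorter than the rest.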

❌

7 Common Mistakes to Avoid

✗ Using PyPDF2 for complex documents. Any table, scanned page, or multi-column layout will come out broken. Switch to LlamaParse or Docling.
✗ Chunking before validating parse quality. Inspect a sample of parsed output first; otherwise you won't catch broken parses until production.
✗ Using fixed-size chunking on structured documents. Splitting a table across two chunks destroys meaning. Use HybridChunker or MarkdownElementNodeParser.
✗ Ignoring reading order. Basic extractors output text in the wrong order for multi-column layouts. Advanced parsers preserve logical reading order.
✗ Skipping metadata. Always store page number, section heading, and source in chunk metadata. They are essential for tracing hallucinations and building citations.
✗ Not testing on scanned PDFs. Text-based and image-based PDFs need different strategies. Test both types from your specific domain.
✗ Over-relying on semantic chunking. 2026 benchmarks show recursive 512-token splitting beats semantic chunking on most datasets at a fraction of the compute cost.
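The "never split a table" rule can be sketched in a few lines. This is an illustration of the idea, not how HybridChunker or MarkdownElementNodeParser work internally: it assumes blocks in the parsed Markdown are separated by blank lines (so a table is one contiguous block), counts whitespace-separated words instead of real tokens, and uses the invented helpers `split_blocks` and `pack_blocks`.

```python
def split_blocks(markdown):
    """Split Markdown on blank lines; a table stays whole because its rows are contiguous."""
    return [block.strip() for block in markdown.split("\n\n") if block.strip()]

def pack_blocks(blocks, max_words=30):
    """Greedily pack whole blocks into chunks, never cutting inside a block."""
    chunks, current, count = [], [], 0
    for block in blocks:
        n = len(block.split())  # word count as a rough token proxy
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(block)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Hypothetical parser output.
markdown = """# Results

Revenue grew in every quarter of the year, driven mainly by the new subscription tier.

| Quarter | Revenue |
| --- | --- |
| Q1 | 4.1M |
| Q2 | 4.7M |

Costs stayed flat over the same period."""

chunks = pack_blocks(split_blocks(markdown), max_words=30)
for chunk in chunks:
    print(chunk, end="\n---\n")
```

A block larger than the budget is kept whole rather than cut, which is usually the right trade-off for tables; a fixed-size splitter would have sliced the table wherever the limit fell.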

✅

Production Parsing Checklist

✓ Parser handles all file types in your corpus (PDF, DOCX, PPTX, scanned images)
✓ Tables extracted cleanly, verified with a complex multi-column table from real data
✓ OCR enabled and tested on a scanned page from your specific domain
✓ Reading order correct on multi-column layouts
✓ Parsed output manually inspected for 5–10 sample documents
✓ Structure-aware chunking applied, with no tables split mid-chunk
✓ Chunk size validated: 256–512 tokens, 10–20% overlap
✓ Metadata stored: source file, page number, section heading, timestamp
✓ Error handling for corrupt or unreadable files
✓ Pipeline reproducible with pinned library versions and documented config
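Some of these checks are easy to automate before anything is indexed. The sketch below assumes each chunk is a plain dict with `text` and `metadata` keys; that record shape, the `validate_chunk` helper, and the whitespace token count are illustrative assumptions, so adapt them to whatever your framework actually produces.

```python
REQUIRED_METADATA = ("source", "page", "headings", "timestamp")

def validate_chunk(chunk, max_tokens=512):
    """Return a list of checklist violations for one chunk record (empty list = OK)."""
    problems = []
    text = chunk.get("text", "")
    if not text.strip():
        problems.append("empty text")
    n_tokens = len(text.split())  # whitespace tokens as a rough proxy
    if n_tokens > max_tokens:
        problems.append(f"too long: {n_tokens} tokens")
    metadata = chunk.get("metadata", {})
    for key in REQUIRED_METADATA:
        if key not in metadata:
            problems.append(f"missing metadata: {key}")
    return problems

# Hypothetical chunk records.
good = {
    "text": "Q3 revenue was 4.7M.",
    "metadata": {
        "source": "report.pdf",
        "page": 12,
        "headings": "Financial Results",
        "timestamp": "2026-02-01T09:00:00Z",
    },
}
bad = {"text": "   ", "metadata": {"source": "report.pdf"}}

print(validate_chunk(good))
print(validate_chunk(bad))
```

Running a check like this over a sample of chunks catches empty pages and missing metadata early, while the table and reading-order items still need a human look at real documents.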

⚡

Quick Reference — Key Facts 2026

"How hard can it be? Well, it can be very hard."

— Peter Staar, Principal Research Staff Member, IBM Research Zurich · Chair of Technical Steering, LF AI & Data Foundation

Fact	Detail
LlamaParse v2 launch	Late 2025 — 4-tier config, up to 50% cost reduction, stable LTS versions
llama-parse (PyPI)	Deprecated — replaced by `llama-cloud-services`. Maintained until May 1, 2026
Granite-Docling-258M	Released January 2026 · Apache 2.0 · Production VLM replacing SmolDocling
Docling GitHub stars	37,000+ as of early 2026 · LF AI & Data Foundation hosted
Docling table accuracy	97.9% on complex tables (Procycons benchmark, March 2025)
Optimal chunk size (2026)	256–512 tokens with 10–20% overlap
Context cliff	~2,500 tokens; response quality degrades sharply above this
DocTags format	Captures charts, tables, forms, code, equations in a single pass
LlamaSheets (beta)	LlamaIndex tool for merged cells and complex spreadsheets (2026)
LlamaSplit (beta)	AI-powered document separation for bundled PDFs (2025)
