You only get the right answer when the right thing has been read. Your documents are being mangled before the LLM even sees them, so let's fix that.
Most RAG systems don't fail at the LLM stage — they fail at document ingestion. When your parser mangles a table or misses a scanned page, the LLM hallucinates because the context it received was already broken.
Imagine someone gave you an important letter — but before you read it, a toddler scribbled all over it and tore random pieces out. Now it doesn't matter how smart you are — giving the right answer is almost impossible. That's exactly what basic PDF loaders do in RAG systems.
Multi-column tables get linearized into one stream. Row context is lost, numbers float without headers, merged cells collapse into noise.
Scanned PDFs and image-only pages return empty strings or garbled characters with basic loaders like PyPDF2.
Multi-column layouts, sidebars, footnotes — all merge into one blob. Reading order goes wrong. Context is completely destroyed.
Charts, diagrams, infographics — silently skipped. Critical information that lives only in images is simply gone.
Bad parsing → broken chunks → poor embeddings → wrong retrieval → hallucinated answer. The damage compounds at every stage. Fix parsing and you fix the entire downstream pipeline simultaneously.
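One cheap way to catch the failure modes above before they reach chunking is a sanity check on the per-page text your loader returns. The sketch below is illustrative only: the function names and the thresholds (`min_chars=50`, `min_quality=0.8`) are assumptions, not values from any benchmark, and the heuristic is deliberately crude.

```python
def text_quality(page_text: str) -> float:
    """Share of characters that are letters, digits, or whitespace."""
    if not page_text:
        return 0.0
    readable = sum(ch.isalnum() or ch.isspace() for ch in page_text)
    return readable / len(page_text)


def flag_suspect_pages(page_texts, min_chars=50, min_quality=0.8):
    """Return (page_number, reason) for pages that likely need OCR or a better parser.

    Thresholds are illustrative assumptions; tune them on your own documents.
    """
    suspects = []
    for number, text in enumerate(page_texts, start=1):
        stripped = text.strip()
        if len(stripped) < min_chars:
            suspects.append((number, "little or no text (possibly scanned)"))
        elif text_quality(stripped) < min_quality:
            suspects.append((number, "mostly symbols (possibly garbled)"))
    return suspects


# Hypothetical per-page output from a basic loader
pages = [
    "Quarterly revenue rose 12 percent year over year, driven by strong demand in Asia.",
    "",                 # scanned page: the loader returned nothing
    "#@%$!*^~" * 10,    # garbled extraction
]
for page_number, reason in flag_suspect_pages(pages):
    print(f"page {page_number}: {reason}")
```

Pages flagged this way are candidates for routing to an OCR-capable parser instead of being embedded as-is.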
Raw files arrive — PDFs, DOCX, PPTX, XLSX, HTML, scanned images. This is where most pipelines make their first (and costliest) mistake.
Advanced parsers analyze layout, extract tables as structured data, run OCR on images, preserve reading order, and output clean Markdown/JSON.
Clean parsed output is split into semantically meaningful pieces. Structure-aware chunking is only possible after proper parsing. Optimal: 256–512 tokens.
Each chunk is converted to a dense vector and stored in a vector database (Pinecone, Qdrant, Weaviate, ChromaDB, etc.).
The user query is embedded, the most relevant chunks are retrieved, and the LLM generates a grounded answer. With clean parsing, accuracy improves at every one of these steps.
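To make the chunking stage concrete, here is a minimal sketch of structure-aware chunking over parsed Markdown. It is not how LlamaParse or Docling chunk internally. The function names are hypothetical, and token counts are approximated by whitespace splitting, which is an assumption; swap in your embedding model's tokenizer for real use.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for model tokens
    return len(text.split())


def split_markdown_sections(markdown: str):
    """Return (heading_path, body) pairs, one per heading-delimited section."""
    sections, path, body = [], [], []

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append((" > ".join(path), text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()  # close the previous section under its own heading path
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return sections


def chunk_markdown(markdown: str, max_tokens: int = 512):
    """Pack blank-line-separated blocks into chunks without crossing headings.

    A table is one block, so it is never cut in half; a single block larger
    than max_tokens is kept whole rather than split.
    """
    chunks = []
    for heading, text in split_markdown_sections(markdown):
        current, used = [], 0
        for block in re.split(r"\n\s*\n", text):
            size = count_tokens(block)
            if current and used + size > max_tokens:
                chunks.append({"heading": heading, "text": "\n\n".join(current)})
                current, used = [], 0
            current.append(block)
            used += size
        if current:
            chunks.append({"heading": heading, "text": "\n\n".join(current)})
    return chunks


SAMPLE = """# Report

Intro paragraph.

## Q3 Results

| Metric | Value |
|---|---|
| Revenue | 4.2M |

Revenue grew.
"""

for chunk in chunk_markdown(SAMPLE):
    print(chunk["heading"], "|", count_tokens(chunk["text"]), "tokens")
```

Each chunk keeps its heading path as metadata, so a retrieved table still carries the section it came from. This only works because the parser emitted headings and tables as Markdown in the first place.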
LlamaParse fits fast cloud integration, LlamaIndex stacks, natural-language parsing instructions, and quick prototyping. Docling fits self-hosting, privacy-sensitive data, local enterprise deployment, and full pipeline control. Both output clean Markdown/JSON that any RAG framework can consume.
Ratings are based on the Procycons benchmark (March 2025), the Firecrawl comparative analysis (April 2026), and F22 Labs testing (Feb 2026) on financial PDFs and sustainability reports.
| Tool | Hosting | OCR | Table Quality | Cost | Best For |
|---|---|---|---|---|---|
| LlamaParse v2 | Cloud API | ✅ Yes | ⭐⭐⭐⭐ | Free tier + paid | LlamaIndex stacks |
| Docling + Granite | Self-hosted | ✅ Yes | ⭐⭐⭐⭐⭐ | Free (Apache 2.0) | Privacy, enterprise |
| Reducto | Cloud API | ✅ Agentic | ⭐⭐⭐⭐⭐ | Custom pricing | Finance, legal |
| Firecrawl PDF | Cloud API | ✅ Auto | ⭐⭐⭐ | Usage-based | AI agents |
| PyPDF2 / basic | Local lib | ❌ No | ⭐ | Free | Simple text PDFs only |
```bash
# New package replaces the deprecated llama-parse
pip install llama-cloud-services llama-index-core

# Get your free API key at cloud.llamaindex.ai
export LLAMA_CLOUD_API_KEY="your_api_key_here"
```
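```python
import asyncio
import os

from llama_cloud_services import LlamaParse

parser = LlamaParse(
    api_key=os.environ["LLAMA_CLOUD_API_KEY"],
    result_type="markdown",  # or "json"
    verbose=True,
    language="en",
    # 🌟 Unique superpower: natural language parsing instructions.
    # Some releases mark parsing_instruction as deprecated in favor of
    # prompt-style parameters; check the docs for your installed version.
    parsing_instruction="""
    Extract all tables in full. Preserve column headers.
    Keep all numbers and units exactly as they appear.
    If a page is scanned, apply OCR carefully.
    Maintain reading order for multi-column layouts.
    """,
)


async def main() -> None:
    # Parse asynchronously; `await` has to run inside a coroutine
    documents = await parser.aload_data("report.pdf")
    for doc in documents:
        print(doc.text)  # Clean Markdown, ready for chunking


# In a script. Inside a notebook, use `await main()` instead.
asyncio.run(main())
```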
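```python
from llama_cloud_services import LlamaParse
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import MarkdownElementNodeParser

# Reads LLAMA_CLOUD_API_KEY from the environment
parser = LlamaParse(
    result_type="markdown",
    parsing_instruction="Treat all tables carefully. Preserve structure.",
)

documents = SimpleDirectoryReader(
    "./docs",
    file_extractor={".pdf": parser, ".docx": parser, ".pptx": parser},
).load_data()

# Structure-aware chunking: understands Markdown headers + tables.
# Table summaries and embeddings use the LLM and embedding model configured
# in LlamaIndex Settings (the OpenAI defaults need their integration
# packages and an OPENAI_API_KEY).
node_parser = MarkdownElementNodeParser(num_workers=8)
nodes = node_parser.get_nodes_from_documents(documents)

# Separate plain text nodes from table index objects before indexing
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

index = VectorStoreIndex(nodes=base_nodes + objects)
query_engine = index.as_query_engine(similarity_top_k=5)

response = query_engine.query("What was the revenue in Q3?")
print(response)
```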
```bash
# Core install
pip install docling

# With GPU acceleration (recommended for production)
# Quoted so shells such as zsh do not expand the brackets; confirm the
# extra name against the Docling install docs for your version
pip install "docling[gpu]"

# Granite-Docling-258M downloads automatically on first use
# Or pull manually from HuggingFace:
pip install huggingface-hub
python -c "from huggingface_hub import snapshot_download; \
snapshot_download('ibm-granite/granite-docling-258M')"
```
```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Granite-Docling-258M is default VLM since January 2026
pipeline_opts = PdfPipelineOptions(
    do_ocr=True,                # OCR for scanned pages
    do_table_structure=True,    # TableFormer model
    generate_page_images=True,  # Keep rendered page images
)

# Pipeline options are passed per input format, not directly to the converter
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_opts)}
)

result = converter.convert("report.pdf")
doc = result.document

markdown = doc.export_to_markdown()  # Clean RAG-ready Markdown
json_data = doc.export_to_dict()     # Full document structure as a dict

# Access tables as pandas DataFrames!
for table in doc.tables:
    df = table.export_to_dataframe()
    print(df)
```
```python
import glob

from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

converter = DocumentConverter()

# HybridChunker: structure-aware, respects headings + tables.
# Built once and reused for every file. The original overlap_tokens argument
# is dropped: HybridChunker does not appear to expose an overlap setting, so
# verify against your installed version before relying on overlap.
# Ideally the tokenizer matches the embedding model used below.
chunker = HybridChunker(
    tokenizer="sentence-transformers/all-MiniLM-L6-v2",
    max_tokens=512,  # 2026 sweet spot
    merge_peers=True,
)

all_chunks = []
for pdf_path in glob.glob("./docs/*.pdf"):
    result = converter.convert(pdf_path)
    for chunk in chunker.chunk(dl_doc=result.document):
        # Chroma metadata values must be scalars, so headings are joined
        headings = " > ".join(chunk.meta.headings or [])
        # Provenance can be missing for some items; fall back to -1
        prov = chunk.meta.doc_items[0].prov if chunk.meta.doc_items else []
        page = prov[0].page_no if prov else -1
        all_chunks.append(Document(
            page_content=chunk.text,
            metadata={"source": pdf_path, "headings": headings, "page": page},
        ))

vectorstore = Chroma.from_documents(all_chunks, OpenAIEmbeddings())

qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True,
)

answer = qa.invoke({"query": "What are the key findings?"})
print(answer["result"])
```
A January 2026 systematic analysis found a "context cliff" at ~2,500 tokens where response quality drops sharply. Keep chunks under this. Recursive 512-token splitting leads accuracy benchmarks (69%) across 50 academic papers (Vecta, Feb 2026).
| Strategy | Chunk Size | Best For | Accuracy | Cost |
|---|---|---|---|---|
| Recursive / Structural | 256–512 tokens | Most RAG apps | ⭐⭐⭐⭐ Best balance | Low |
| Docling HybridChunker | 512 tokens default | Complex structured docs | ⭐⭐⭐⭐⭐ | Low |
| MarkdownElementParser | Heading-aware | LlamaIndex + LlamaParse | ⭐⭐⭐⭐ | Low |
| Semantic Chunking | Variable | Knowledge bases | ⭐⭐⭐ (can fragment) | High compute |
| Hierarchical | Multi-layer | Agent systems | ⭐⭐⭐⭐ complex queries | High |
Chunk size: 256–512 tokens · Overlap: 10–20% (50–100 tokens) · The 3% retrieval gain from semantic chunking rarely justifies 10× compute cost at scale.
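The size and overlap numbers are easier to reason about with the arithmetic written out. The sketch below is a plain fixed-size sliding window, not the recursive or structural splitter from the table, and `split_with_overlap` is a hypothetical helper name. It assumes the text has already been tokenized into a list.

```python
def split_with_overlap(tokens, chunk_size=512, overlap=64):
    """Fixed-size sliding window: each chunk repeats the last `overlap` tokens
    of the previous chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the document
    return chunks


# Stand-in for a tokenized 1,000-token document
tokens = list(range(1000))
chunks = split_with_overlap(tokens, chunk_size=512, overlap=64)

print([len(chunk) for chunk in chunks])
print(f"overlap ratio: {64 / 512:.1%}")
```

With a 512-token window and 64 tokens of overlap, the window advances 448 tokens per step, which puts the overlap at 12.5%, inside the 10–20% range quoted above.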
"How hard can it be? Well, it can be very hard."
| Fact | Detail |
|---|---|
| LlamaParse v2 launch | Late 2025 — 4-tier config, up to 50% cost reduction, stable LTS versions |
| llama-parse (PyPI) | Deprecated — replaced by llama-cloud-services. Maintained until May 1, 2026 |
| Granite-Docling-258M | Released January 2026 · Apache 2.0 · Production VLM replacing SmolDocling |
| Docling GitHub stars | 37,000+ as of early 2026 · LF AI & Data Foundation hosted |
| Docling table accuracy | 97.9% on complex tables (Procycons benchmark, March 2025) |
| Optimal chunk size (2026) | 256–512 tokens with 10–20% overlap |
| Context cliff | ~2,500 tokens — retrieval quality degrades sharply above this |
| DocTags format | Captures charts, tables, forms, code, equations in a single pass |
| LlamaSheets (beta) | LlamaIndex tool for merged cells and complex spreadsheets (2026) |
| LlamaSplit (beta) | AI-powered document separation for bundled PDFs (2025) |