📂
RAG Series · Day 5
Document Loaders
PDF, website, CSV — load from any source. How LangChain Document Loaders work and which ones to use for each data type.
What Are Document Loaders
📂
The Entry Point of Every RAG Pipeline
Before you can search your data, you need to load it. Data lives everywhere — PDFs, websites, CSVs, databases. Document Loaders in LangChain fetch data from any source and convert it into one standardized format your pipeline can work with.
💡 Every loader outputs a Document Object. One standard format for everything. Learn one loader, understand the pattern for all of them.
Document Object
📄
The Standard Output Format
No matter which loader you use, the output is always a Python list of Document Objects. Each Document Object has exactly two fields.
·page_content — The actual text content extracted from the source
·metadata — Source path, author, creation date, page number, and more
💡 Use metadata for filtering — find only documents from a specific author or search only certain page ranges.
4 Essential Loaders
🔧
Loaders You Will Use in 90% of Projects
Text Loader — Simplest
·Loads plain .txt files — one Document Object for the entire file
·Specify an encoding (e.g. utf-8) so special characters are decoded correctly
PyPDF Loader — Most common for documents
·Loads PDF page by page — 25 pages = 25 Document Objects
·Metadata includes page number, title, total pages
·For scanned PDFs use UnstructuredPDFLoader instead
Web Base Loader — For websites
·Fetches any webpage using HTTP and BeautifulSoup under the hood
·Best for static sites — news articles, blogs, documentation
·For JavaScript-heavy sites use SeleniumURLLoader
CSV Loader — For tabular data
·Each row becomes one Document Object — 400 rows = 400 Documents
·Column names and values are formatted as key-value pairs in page_content
Memory Efficiency
⚡
load() vs lazy_load()
·load() — Everything into memory at once. Use for small to medium collections.
·lazy_load() — Returns a generator, one document at a time. Use for large collections to avoid exhausting memory.
💡 If you have hundreds of PDFs or a massive CSV, always use lazy_load(). It keeps memory usage flat regardless of collection size.
✦
Document Loaders are the entry point of every RAG pipeline. They abstract away the complexity of different data formats and give you a consistent Document Object — page_content plus metadata — regardless of the source. Master one loader and you understand the pattern for all of them.