📂
RAG Series · Day 5 of 20

Document Loaders

PDF, website, CSV — load from any source. How LangChain Document Loaders work and which ones to use for each data type.

What Are Document Loaders
📂

The Entry Point of Every RAG Pipeline

Before you can search your data, you need to load it. Data lives everywhere — PDFs, websites, CSVs, databases. Document Loaders in LangChain fetch data from any source and convert it into one standardized format your pipeline can work with.

💡 Every loader outputs a Document Object. One standard format for everything. Learn one loader, understand the pattern for all of them.
Document Object
📄

The Standard Output Format

No matter which loader you use, the output is always a Python list of Document Objects. Each Document Object has exactly two fields.

·page_content — The actual text content extracted from the source
·metadata — Source path, author, creation date, page number, and more
💡 Use metadata for filtering — find only documents from a specific author or search only certain page ranges.
4 Essential Loaders
🔧

Loaders You Will Use in 90% of Projects

Text Loader — Simplest
·Loads plain .txt files — one Document Object for the entire file
·Specify encoding (utf-8) to handle special characters
PyPDF Loader — Most common for documents
·Loads PDF page by page — 25 pages = 25 Document Objects
·Metadata includes page number, title, total pages
·For scanned PDFs, use UnstructuredPDFLoader instead
Web Base Loader — For websites
·Fetches webpages over HTTP and parses the HTML with BeautifulSoup under the hood
·Best for static sites — news articles, blogs, documentation
·For JavaScript-heavy sites, use SeleniumURLLoader
CSV Loader — For tabular data
·Each row becomes one Document Object — 400 rows = 400 Documents
·Column names and values formatted as key-value pairs in page_content
Memory Efficiency

load() vs lazy_load()

·load() — Everything into memory at once. Use for small to medium collections.
·lazy_load() — Returns a generator, one document at a time. Use for large collections to avoid memory overflow.
💡 If you have hundreds of PDFs or a massive CSV, always use lazy_load(). It keeps memory usage flat regardless of collection size.

Document Loaders are the entry point of every RAG pipeline. They abstract away the complexity of different data formats and give you a consistent Document Object — page_content plus metadata — regardless of the source. Master one loader and you understand the pattern for all of them.