📂
RAG Series · Day 5
Document Loaders
PDF, website, CSV — load from any source. How LangChain Document Loaders work and which ones to use for each data type.
What Are Document Loaders
📂
The Entry Point of Every RAG Pipeline
Before you can search your data, you need to load it. Data lives everywhere — PDFs, websites, CSVs, databases. Document Loaders in LangChain fetch data from any source and convert it into one standardized format your pipeline can work with.
💡 Every loader outputs a Document Object. One standard format for everything. Learn one loader, understand the pattern for all of them.
Document Object
📄
The Standard Output Format
No matter which loader you use, the output is always a Python list of Document Objects. Each Document Object has exactly two fields.
·page_content — The actual text content extracted from the source
·metadata — Source path, author, creation date, page number, and more
💡 Use metadata for filtering — find only documents from a specific author or search only certain page ranges.
4 Essential Loaders
🔧
Loaders You Will Use in 90% of Projects
Text Loader — Simplest
·Loads plain .txt files — one Document Object for the entire file
·Specify an encoding (e.g. utf-8) so special characters are decoded correctly
PyPDF Loader — Most common for documents
·Loads PDF page by page — 25 pages = 25 Document Objects
·Metadata includes page number, title, total pages
·For scanned PDFs use UnstructuredPDFLoader instead
Web Base Loader — For websites
·Fetches any webpage using HTTP and BeautifulSoup under the hood
·Best for static sites — news articles, blogs, documentation
·For JavaScript-heavy sites use SeleniumURLLoader
CSV Loader — For tabular data
·Each row becomes one Document Object — 400 rows = 400 Documents
·Column names and values are formatted as key-value pairs in page_content
Memory Efficiency
⚡
load() vs lazy_load()
·load() — Everything into memory at once. Use for small to medium collections.
·lazy_load() — Returns a generator, one document at a time. Use for large collections to avoid exhausting memory.
💡 If you have hundreds of PDFs or a massive CSV, always use lazy_load(). It keeps memory usage flat regardless of collection size.
✦
Document Loaders are the entry point of every RAG pipeline. They abstract away the complexity of different data formats and give you a consistent Document Object — page_content plus metadata — regardless of the source. Master one loader and you understand the pattern for all of them.