
How to Use The Drive AI with LangChain — Document Loading, Extraction, and RAG

LangChain is one of the most popular frameworks for building applications with large language models. At some point, nearly every LangChain project needs to load documents (PDFs, websites, spreadsheets, scanned images) and turn them into something an LLM can work with.

The built-in LangChain document loaders get you started, but they each have limitations that surface fast in production. PyPDFLoader drops tables and misreads multi-column layouts. WebBaseLoader fetches raw HTML without rendering JavaScript, missing content on modern single-page apps. UnstructuredLoader requires heavy local dependencies and still struggles with OCR.

The Drive AI solves this with a single API that handles 107+ file types, renders JavaScript, runs OCR with vision model proofreading, and returns clean markdown. This tutorial shows you how to integrate it into LangChain as a custom document loader and build a complete RAG pipeline.

Why Use The Drive AI with LangChain

Here is what typically goes wrong with the standard LangChain loaders:

PyPDFLoader extracts text line by line. A two-column academic paper becomes garbled text with sentences from column A interleaved with column B. Tables lose their structure entirely. Headers and footers repeat on every page.

WebBaseLoader uses requests to fetch HTML, then strips tags with BeautifulSoup. Any website that loads content via JavaScript — React apps, dynamic dashboards, paginated listings — returns an empty or partial document.

UnstructuredLoader is more capable but requires installing poppler, tesseract, libmagic, and other system libraries. Deploying it in a Docker container or serverless function adds complexity. Even then, OCR quality on scanned documents is inconsistent.

The Drive AI Markdown API handles all of these cases through a single endpoint. You send a URL or file, and you get back clean, structured markdown with tables preserved, JavaScript rendered, and OCR text proofread by a vision model. One loader, one dependency, every file type.

Setup

Install the required packages:

pip install langchain langchain-community langchain-openai langchain-text-splitters thedriveai chromadb requests

Set your API keys. You can get a Drive AI API key from dev.thedrive.ai — the free tier includes 100 credits per month.

import os

os.environ["OPENAI_API_KEY"] = "sk-..."
DRIVE_AI_API_KEY = "tda_live_..."

Basic Usage — Load a URL as a LangChain Document

The simplest integration uses the Markdown API to convert any URL into a LangChain Document:

import requests
from langchain_core.documents import Document

def load_url_as_document(url: str, api_key: str) -> Document:
    """Load any URL as a LangChain Document using The Drive AI Markdown API."""
    response = requests.get(
        f"https://dev.thedrive.ai/md/{url}",
        headers={"X-API-Key": api_key},
    )
    response.raise_for_status()
    data = response.json()

    return Document(
        page_content=data["markdown"],
        metadata={
            "source": url,
            "title": data.get("title", ""),
            "content_type": data.get("content_type", ""),
        },
    )

# Load a research paper PDF
doc = load_url_as_document(
    "https://arxiv.org/pdf/2005.11401",
    DRIVE_AI_API_KEY,
)

print(doc.page_content[:500])
print(doc.metadata)

This works for any URL — PDFs, web pages, Google Docs links, public Notion pages, and more. The API handles rendering, extraction, and cleanup.
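
One caveat with pasting the target URL straight into the request path: a target that contains `?` or `#` becomes ambiguous, because an HTTP client treats everything after `?` as the query string of the Drive AI request and never sends the `#fragment` to the server. A minimal sketch of a helper that percent-encodes the target first is below. Whether the Markdown API expects an encoded or a raw target is an assumption here, not something this post confirms, so check the API documentation before relying on it. `build_md_endpoint` is a hypothetical name.

```python
from urllib.parse import quote

BASE_URL = "https://dev.thedrive.ai"  # same host used in the examples above


def build_md_endpoint(target_url: str, *, encode: bool = True) -> str:
    """Build the /md/ request URL for a target document.

    ASSUMPTION: the API accepts a percent-encoded target. Pass encode=False
    to reproduce the raw concatenation used in the examples above.
    """
    if encode:
        # safe="" also encodes ":" and "/", so the target becomes one path segment
        target_url = quote(target_url, safe="")
    return f"{BASE_URL}/md/{target_url}"


print(build_md_endpoint("https://example.com/report?id=7#summary"))
```

If the encoded form is rejected, `encode=False` falls back to the behavior shown earlier.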

Custom LangChain DocumentLoader Class

For production use, wrap this into a proper LangChain BaseLoader. This gives you compatibility with LangChain's loader interface, including lazy loading and integration with chains.

from typing import Iterator, List
from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document
import requests


class DriveAILoader(BaseLoader):
    """LangChain document loader that uses The Drive AI Markdown API.

    Loads PDFs, websites, DOCX, XLSX, images, and 107+ file types
    as clean markdown Documents.
    """

    def __init__(
        self,
        urls: List[str],
        api_key: str,
        *,
        base_url: str = "https://dev.thedrive.ai",
    ):
        self.urls = urls
        self.api_key = api_key
        self.base_url = base_url

    def lazy_load(self) -> Iterator[Document]:
        """Load documents one at a time, yielding each as a Document."""
        for url in self.urls:
            response = requests.get(
                f"{self.base_url}/md/{url}",
                headers={"X-API-Key": self.api_key},
                timeout=60,  # seconds; illustrative value so a stalled request cannot hang the loader
            )
            response.raise_for_status()
            data = response.json()

            yield Document(
                page_content=data["markdown"],
                metadata={
                    "source": url,
                    "title": data.get("title", ""),
                    "content_type": data.get("content_type", ""),
                },
            )

Use it like any other LangChain loader:

loader = DriveAILoader(
    urls=[
        "https://arxiv.org/pdf/2005.11401",
        "https://example.com/annual-report.pdf",
        "https://docs.google.com/document/d/1abc.../edit",
    ],
    api_key=DRIVE_AI_API_KEY,
)

# Lazy load — memory efficient for large document sets
for doc in loader.lazy_load():
    print(f"Loaded: {doc.metadata['source']} ({len(doc.page_content)} chars)")

# Or load all at once
documents = loader.load()
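
Because `lazy_load` calls `raise_for_status()`, one failing URL stops the whole iteration. For batch jobs it can help to retry transient failures and collect the URLs that still fail. The sketch below is one way to do that, not part of The Drive AI or LangChain. `load_with_retries` and its `fetch` parameter are hypothetical names, and the attempt count and delays are illustrative values.

```python
import time
from typing import Callable, Dict, List, Tuple, TypeVar

T = TypeVar("T")


def load_with_retries(
    urls: List[str],
    fetch: Callable[[str], T],
    *,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[List[T], Dict[str, str]]:
    """Call fetch(url) for each URL, retrying failures with exponential backoff.

    Returns (results, failures), where failures maps each URL that never
    succeeded to the message of its last error.
    """
    results: List[T] = []
    failures: Dict[str, str] = {}
    for url in urls:
        for attempt in range(1, max_attempts + 1):
            try:
                results.append(fetch(url))
                break
            except Exception as exc:  # in real code, narrow this to requests.RequestException
                if attempt == max_attempts:
                    failures[url] = str(exc)
                else:
                    # wait base_delay, then 2 * base_delay, then 4 * base_delay, ...
                    sleep(base_delay * 2 ** (attempt - 1))
    return results, failures
```

With the loader above, `fetch` could be something like `lambda u: DriveAILoader([u], api_key=DRIVE_AI_API_KEY).load()[0]`, so each URL is loaded and retried independently.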

Loading Multiple Document Types Through One Loader

One of the practical advantages over built-in loaders: you do not need a different loader class for each file type. The same DriveAILoader handles PDFs, websites, Word documents, Excel spreadsheets, PowerPoint files, images with text, and scanned documents.

# One loader, every file type
loader = DriveAILoader(
    urls=[
        # PDF research paper
        "https://arxiv.org/pdf/2005.11401",
        # Live website with JavaScript rendering
        "https://openai.com/index/gpt-4-research",
        # Word document
        "https://example.com/contract.docx",
        # Excel spreadsheet
        "https://example.com/financials.xlsx",
        # Scanned document (OCR + vision model proofreading)
        "https://example.com/handwritten-notes.png",
    ],
    api_key=DRIVE_AI_API_KEY,
)

documents = loader.load()

Compare this to the equivalent with built-in loaders, where you would need PyPDFLoader, WebBaseLoader, Docx2txtLoader, UnstructuredExcelLoader, and a separate OCR pipeline — each with its own dependencies and quirks.

Chunking Strategies with RecursiveCharacterTextSplitter

Once documents are loaded as markdown, you need to split them into chunks for embedding. LangChain's RecursiveCharacterTextSplitter works well here because the Drive AI output is structured markdown with headers, lists, and tables.

from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,
)

# Option 1: Split by markdown headers first, then by size
# This preserves document structure in your chunks
headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]

md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
)

# Split each document by headers
header_splits = []
for doc in documents:
    splits = md_splitter.split_text(doc.page_content)
    for split in splits:
        # Preserve the original source metadata
        split.metadata.update(doc.metadata)
    header_splits.extend(splits)

# Then split any large sections by character count
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)

final_chunks = text_splitter.split_documents(header_splits)
print(f"Split {len(documents)} documents into {len(final_chunks)} chunks")

# Option 2: Simple recursive splitting (faster, works well for most cases)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)

chunks = text_splitter.split_documents(documents)

Because the Drive AI output is clean markdown rather than raw extracted text, the splitter can use structural boundaries (headers, paragraph breaks) effectively. This produces higher-quality chunks compared to splitting garbled PDF text.
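
To make the header-first idea concrete, here is a simplified, standard-library illustration of what splitting on markdown headers does. It is not LangChain's `MarkdownHeaderTextSplitter` implementation (for example, it does not skip `#` lines inside code fences). `split_on_headers` is a hypothetical helper and the sample markdown is made up.

```python
import re
from typing import Dict, List, Tuple

HEADER_RE = re.compile(r"^(#{1,3})\s+(.*)$")


def split_on_headers(markdown: str) -> List[Tuple[Dict[str, str], str]]:
    """Split markdown into (header_path, body) sections at #, ##, ### lines."""
    sections: List[Tuple[Dict[str, str], str]] = []
    path: Dict[str, str] = {}
    body: List[str] = []

    def flush() -> None:
        # Emit the text collected so far under the current header path
        text = "\n".join(body).strip()
        if text:
            sections.append((dict(path), text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header replaces its own level and drops deeper levels
            for key in [k for k in path if int(k[1]) >= level]:
                del path[key]
            path[f"h{level}"] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections


sample = (
    "# Report\nIntro text.\n"
    "## Revenue\n| Q1 | Q2 |\n|----|----|\n| 10 | 12 |\n"
    "## Risks\nSupply chain."
)
for meta, text in split_on_headers(sample):
    print(meta, "->", text.splitlines()[0])
```

Note that the table stays inside a single section, which is the property that makes header-first splitting attractive for structured markdown.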

Full RAG Pipeline

Here is a complete retrieval-augmented generation pipeline: load documents with Drive AI, chunk them, embed them into a vector store, and query with a retrieval chain.

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Load documents
loader = DriveAILoader(
    urls=[
        "https://arxiv.org/pdf/2005.11401",  # RAG paper
        "https://arxiv.org/pdf/2312.10997",  # RAG survey paper
    ],
    api_key=DRIVE_AI_API_KEY,
)
documents = loader.load()

# 2. Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = text_splitter.split_documents(documents)

# 3. Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# 4. Build retrieval chain
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5},
)

llm = ChatOpenAI(model="gpt-4o", temperature=0)

system_prompt = (
    "You are a research assistant. Use the following retrieved context "
    "to answer questions. Cite the source document when possible. "
    "If the context does not contain the answer, say so.\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}"),
])

question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

# 5. Query
result = rag_chain.invoke({
    "input": "How does RAG compare to fine-tuning for knowledge-intensive tasks?"
})

print(result["answer"])
print(f"\nSources: {[doc.metadata['source'] for doc in result['context']]}")

This pipeline takes about 20 lines of code beyond the standard LangChain boilerplate, and it handles any document type you throw at it.
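
Since several of the five retrieved chunks can come from the same document, the `Sources` list printed above may repeat a URL. A small order-preserving de-duplication step tidies that up. This is a sketch: `unique_sources` is a hypothetical helper, and it only assumes each retrieved item has a `metadata` dict with a `source` key, as the loader above sets.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def unique_sources(docs: Iterable[Any]) -> List[str]:
    """Return each document's metadata['source'] once, in first-seen order."""
    seen = dict.fromkeys(
        doc.metadata["source"] for doc in docs if "source" in doc.metadata
    )
    return list(seen)


# Stand-ins for retrieved LangChain Documents
retrieved = [
    SimpleNamespace(metadata={"source": "https://arxiv.org/pdf/2005.11401"}),
    SimpleNamespace(metadata={"source": "https://arxiv.org/pdf/2312.10997"}),
    SimpleNamespace(metadata={"source": "https://arxiv.org/pdf/2005.11401"}),
]
print(unique_sources(retrieved))
```

In the pipeline above, the call would be `unique_sources(result["context"])`.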

Advanced: Extract API as a Structured Output Tool

The Drive AI Extract API goes beyond raw text extraction. It takes a schema and returns structured data with confidence scores and citations — useful as a LangChain tool for agents that need to pull specific fields from documents.

from langchain_core.tools import tool
import requests


@tool
def extract_from_document(url: str, fields: str) -> str:
    """Extract structured data from a document using The Drive AI.

    Args:
        url: URL of the document to extract from.
        fields: Comma-separated list of fields to extract
                (e.g., "company_name, revenue, date").
    """
    # Build schema from field names
    field_list = [f.strip() for f in fields.split(",")]
    schema = {
        field: {"type": "string", "description": f"The {field} from the document"}
        for field in field_list
    }

    response = requests.post(
        "https://dev.thedrive.ai/api/v1/extract",
        headers={
            "X-API-Key": DRIVE_AI_API_KEY,
            "Content-Type": "application/json",
        },
        json={
            "url": url,
            "schema": schema,
        },
    )
    response.raise_for_status()
    result = response.json()

    # Format results with confidence scores
    output_lines = []
    for field, value in result.get("data", {}).items():
        confidence = result.get("confidence", {}).get(field, "N/A")
        citation = result.get("citations", {}).get(field, "")
        output_lines.append(
            f"{field}: {value} (confidence: {confidence})"
        )
        if citation:
            output_lines.append(f"  Source: {citation}")

    return "\n".join(output_lines)

Use this tool in a LangChain agent:

from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate

tools = [extract_from_document]

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a document analysis assistant. Use the extract tool "
               "to pull structured data from documents when asked."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

result = executor.invoke({
    "input": "Extract the company name, total revenue, and fiscal year "
             "from this annual report: https://example.com/annual-report.pdf"
})

print(result["output"])

The Extract API returns confidence scores for each field, so your agent can flag low-confidence extractions for human review rather than silently passing bad data downstream.
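
One way to act on those scores is a small gate in front of downstream code. The sketch below assumes the response shape that the tool above reads (`data` and `confidence` dicts keyed by field name) and further assumes confidence is a number between 0 and 1. The post does not state the scale, and the 0.8 threshold is arbitrary. `split_by_confidence` is a hypothetical helper.

```python
from typing import Any, Dict, Tuple


def split_by_confidence(
    result: Dict[str, Any], threshold: float = 0.8
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split extracted fields into (accepted, needs_review).

    A field needs review when its confidence is missing, non-numeric,
    or below the threshold.
    """
    accepted: Dict[str, Any] = {}
    needs_review: Dict[str, Any] = {}
    scores = result.get("confidence", {})
    for field, value in result.get("data", {}).items():
        score = scores.get(field)
        if isinstance(score, (int, float)) and score >= threshold:
            accepted[field] = value
        else:
            needs_review[field] = value
    return accepted, needs_review


# Made-up response, for illustration only
example = {
    "data": {"company_name": "Acme Corp", "revenue": "4.2M", "fiscal_year": "2023"},
    "confidence": {"company_name": 0.97, "revenue": 0.55},
}
ok, review = split_by_confidence(example)
print("accepted:", ok)
print("needs review:", review)
```

Treating a missing score as "needs review" is a deliberately conservative choice: a field without a score is not assumed to be correct.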

Advanced: Analyze API as a LangChain Tool for Document Q&A

The Analyze API goes further — it performs multi-step reasoning over documents, including Python code execution for calculations and data analysis. This is useful when a question requires more than retrieval; it requires computation.

@tool
def analyze_document(url: str, question: str) -> str:
    """Analyze a document with multi-step reasoning using The Drive AI.

    Handles questions that require calculations, comparisons,
    or multi-step reasoning over document content.

    Args:
        url: URL of the document to analyze.
        question: The question to answer about the document.
    """
    response = requests.post(
        "https://dev.thedrive.ai/api/v1/analyze",
        headers={
            "X-API-Key": DRIVE_AI_API_KEY,
            "Content-Type": "application/json",
        },
        json={
            "url": url,
            "question": question,
        },
    )
    response.raise_for_status()
    result = response.json()

    return result.get("answer", "No answer returned.")

Combine both tools in a single agent for a document assistant that can both extract structured data and answer complex analytical questions:

tools = [extract_from_document, analyze_document]

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a document analysis assistant with two capabilities:\n"
     "1. extract_from_document: Pull specific fields from documents\n"
     "2. analyze_document: Answer complex questions that require "
     "reasoning or calculations\n\n"
     "Choose the right tool based on the user's question."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

# Extraction question
executor.invoke({
    "input": "What are the key terms in this contract? "
             "https://example.com/contract.pdf"
})

# Analytical question
executor.invoke({
    "input": "What was the year-over-year revenue growth rate in this report? "
             "https://example.com/annual-report.pdf"
})

When to Use Drive AI Loader vs Built-in Loaders

Not every project needs The Drive AI loader. Here is a practical decision guide:

Use built-in LangChain loaders when:

  • You are loading plain text or simple markdown files (use TextLoader)
  • You are loading CSVs for tabular analysis (use CSVLoader)
  • You are working offline with no API access
  • You are loading from LangChain-supported data stores like S3, GCS, or databases that have dedicated loaders

Use The Drive AI loader when:

  • You are loading PDFs with tables, multi-column layouts, or mixed content
  • You need to load web pages that rely on JavaScript rendering
  • You are building a pipeline that handles multiple file types (PDF, DOCX, XLSX, images) and you want one loader instead of five
  • You need OCR on scanned documents or images with text
  • You want clean markdown output that chunks well for RAG
  • You are in production and need reliable, consistent output across document types

Use The Drive AI Extract API when:

  • You need structured fields from documents (names, dates, amounts) rather than full text
  • You want confidence scores on extracted values
  • You are building an agent that needs to pull specific data points from arbitrary documents

Use The Drive AI Analyze API when:

  • The question requires multi-step reasoning over document content
  • You need calculations or comparisons that go beyond simple retrieval
  • You want the API to handle the reasoning chain rather than building it in LangChain
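
The guide above can be condensed into a small routing function. This is only a sketch of the decision order as this post describes it. `choose_tool`, its parameters, and the returned labels are hypothetical, and the extension table covers just the built-in loader cases named above.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# File types the guide says built-in loaders handle well
SIMPLE_EXTENSIONS = {".txt": "TextLoader", ".md": "TextLoader", ".csv": "CSVLoader"}


def choose_tool(
    source: str,
    *,
    offline: bool = False,
    needs_fields: bool = False,
    needs_reasoning: bool = False,
) -> str:
    """Pick a loader or API for a source, following the decision guide."""
    suffix = PurePosixPath(urlparse(source).path).suffix.lower()
    if offline:
        # No API access, so only built-in loaders are available
        return SIMPLE_EXTENSIONS.get(suffix, "built-in loader")
    if needs_reasoning:
        return "Analyze API"
    if needs_fields:
        return "Extract API"
    if suffix in SIMPLE_EXTENSIONS:
        return SIMPLE_EXTENSIONS[suffix]
    return "DriveAILoader"


print(choose_tool("https://example.com/financials.xlsx"))
print(choose_tool("https://example.com/annual-report.pdf", needs_fields=True))
```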

The free tier of 100 credits per month is enough to prototype and test. For production workloads, the Pro plan at $0.01 per credit keeps costs predictable: a typical document load costs 1 credit.
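
As a back-of-the-envelope check using the figures above (1 credit per typical document load, $0.01 per credit on Pro, 100 free credits per month), the arithmetic looks like this. Treat it as a rough sketch: the helper names are hypothetical, and real usage may differ if some operations cost more than 1 credit.

```python
FREE_CREDITS_PER_MONTH = 100    # free tier, per the figures above
PRO_PRICE_CENTS_PER_CREDIT = 1  # $0.01 per credit, kept in cents to avoid float drift


def fits_free_tier(documents_per_month: int, credits_per_document: int = 1) -> bool:
    """True if the monthly volume stays within the free credit allowance."""
    return documents_per_month * credits_per_document <= FREE_CREDITS_PER_MONTH


def estimate_pro_cost_usd(documents_per_month: int, credits_per_document: int = 1) -> float:
    """Estimated monthly Pro cost in dollars, assuming every credit is billed."""
    credits = documents_per_month * credits_per_document
    return credits * PRO_PRICE_CENTS_PER_CREDIT / 100


for docs in (80, 5_000):
    print(docs, fits_free_tier(docs), f"${estimate_pro_cost_usd(docs):.2f}")
```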

Complete Example: Multi-Source Research Assistant

Here is a full working example that combines everything — loading documents from multiple sources, building a vector store, and running queries with source attribution:

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load documents from mixed sources
loader = DriveAILoader(
    urls=[
        "https://arxiv.org/pdf/2005.11401",
        "https://lilianweng.github.io/posts/2023-06-23-agent/",
        "https://example.com/internal-research-notes.docx",
    ],
    api_key=DRIVE_AI_API_KEY,
)

documents = loader.load()

# Chunk
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = splitter.split_documents(documents)

# Embed and store
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Build chain
llm = ChatOpenAI(model="gpt-4o", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Answer the question based on the provided context. "
     "Cite sources by their URL when referencing specific information.\n\n"
     "{context}"),
    ("human", "{input}"),
])

chain = create_retrieval_chain(
    retriever,
    create_stuff_documents_chain(llm, prompt),
)

# Query
response = chain.invoke({
    "input": "Compare the approaches to retrieval augmentation "
             "discussed across these sources."
})

print(response["answer"])

The Drive AI loader handles the hard part — turning heterogeneous sources into uniform, high-quality markdown — so you can focus on the LangChain application logic.

Have questions? Reach out at contact@thedrive.ai.
