How to Use The Drive AI with LlamaIndex — Document Ingestion and RAG

LlamaIndex is one of the most popular frameworks for building RAG (Retrieval-Augmented Generation) applications. It handles chunking, indexing, and querying. But the quality of your RAG pipeline depends heavily on the quality of your input documents. Feed it messy HTML or broken PDF text, and your answers will reflect that.

The Drive AI converts any URL or file into clean, structured markdown. Pair it with LlamaIndex, and you get a document ingestion pipeline that handles 107+ file types — PDFs, spreadsheets, web pages with JavaScript rendering, scanned documents with OCR — without writing a different loader for each format.

This tutorial walks through building a custom LlamaIndex document loader with The Drive AI, creating a VectorStoreIndex, and querying it with natural language.

Why Use The Drive AI Instead of Built-in LlamaIndex Readers

LlamaIndex ships with several built-in readers. SimpleDirectoryReader handles local files. PDFReader extracts text from PDFs. SimpleWebPageReader fetches web pages. They work for simple cases, but they struggle with complex layouts, dynamically rendered pages, and mixed file formats.

PDFReader loses structure. Tables become jumbled text. Multi-column layouts merge into nonsense. Headers and footers repeat on every page. If your PDF has charts or images with embedded text, PDFReader skips them entirely.

SimpleWebPageReader does not render JavaScript. Modern web pages load content dynamically. SPAs, dashboards, interactive docs — SimpleWebPageReader fetches the raw HTML before any JavaScript executes. You get a shell with no content.

Each file type needs its own reader. Excel files need one reader, PowerPoint needs another, Google Docs needs an integration. You end up maintaining a patchwork of loaders with different output formats and failure modes.

The Drive AI Markdown API solves all of these with a single endpoint. It renders JavaScript, preserves table structure, runs OCR with vision model proofreading on scanned documents, and returns clean markdown regardless of input format. One reader handles everything.

Setup

Install LlamaIndex, The Drive AI Python SDK, and requests (the examples below call the HTTP API directly with requests):

pip install llama-index thedriveai requests

Get your API key from dev.thedrive.ai. The free tier includes 100 credits per month. Set it as an environment variable:

export THEDRIVE_API_KEY="tda_live_your_key_here"

Basic Usage: Load a URL as a LlamaIndex Document

The simplest integration uses the Markdown API to convert a URL into a LlamaIndex Document:

import os
import requests
from llama_index.core import Document

THEDRIVE_API_KEY = os.environ["THEDRIVE_API_KEY"]

def load_url_as_document(url: str) -> Document:
    """Convert any URL to a LlamaIndex Document via The Drive AI."""
    response = requests.get(
        f"https://dev.thedrive.ai/md/{url}",
        headers={"X-API-Key": THEDRIVE_API_KEY},
        timeout=60,  # avoid hanging indefinitely; tune for large documents
    )
    response.raise_for_status()
    markdown = response.text

    return Document(
        text=markdown,
        metadata={"source": url},
    )

# Load a research paper
doc = load_url_as_document("https://arxiv.org/abs/2310.06825")
print(doc.text[:500])

The API handles the heavy lifting: rendering JavaScript, extracting text from embedded PDFs, preserving table structure, and returning clean markdown. The Document object is ready for LlamaIndex indexing.
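
A single failed request aborts load_url_as_document, and calls to any remote API can fail transiently. The sketch below shows one way to retry with exponential backoff. The name with_retries and its parameters are hypothetical, chosen for illustration; they are not part of The Drive AI SDK or LlamaIndex, and which errors are worth retrying is an assumption to adjust for your workload.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on failure with exponential backoff.

    Hypothetical helper: names and defaults are assumptions,
    not part of any library.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            # Re-raise once the final attempt has failed.
            if attempt == attempts - 1:
                raise
            # Wait 1x, 2x, 4x, ... the base delay before retrying.
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

For example, with_retries(lambda: load_url_as_document(url), retry_on=(requests.ConnectionError, requests.Timeout)) retries only connection problems and timeouts, so an HTTP error raised by raise_for_status still fails immediately.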

Custom Reader Class: DriveAIReader

For production use, wrap this in a proper LlamaIndex reader that extends BaseReader. This gives you a reusable component that integrates with LlamaIndex's standard loading patterns.

import os
from typing import List, Optional
import requests
from llama_index.core import Document
from llama_index.core.readers.base import BaseReader


class DriveAIReader(BaseReader):
    """LlamaIndex reader that uses The Drive AI Markdown API.

    Converts URLs, PDFs, web pages, and 107+ file types
    into clean markdown Documents.
    """

    def __init__(self, api_key: Optional[str] = None):
        self.api_key = api_key or os.environ.get("THEDRIVE_API_KEY")
        if not self.api_key:
            raise ValueError(
                "API key required. Pass api_key or set THEDRIVE_API_KEY."
            )
        self.base_url = "https://dev.thedrive.ai/md"

    def load_data(self, urls: List[str]) -> List[Document]:
        """Load documents from a list of URLs.

        Args:
            urls: List of URLs to convert to Documents.
                  Can be web pages, PDFs, spreadsheets, or
                  any of 107+ supported file types.

        Returns:
            List of LlamaIndex Document objects with
            clean markdown text and source metadata.
        """
        documents = []
        for url in urls:
            response = requests.get(
                f"{self.base_url}/{url}",
                headers={"X-API-Key": self.api_key},
                timeout=60,  # avoid hanging indefinitely on one URL
            )
            response.raise_for_status()

            doc = Document(
                text=response.text,
                metadata={
                    "source": url,
                    "loader": "DriveAIReader",
                },
            )
            documents.append(doc)

        return documents

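This follows LlamaIndex's BaseReader interface: load_data returns a list of Document objects, so anything downstream that consumes the output of SimpleDirectoryReader or PDFReader (node parsers, indexes, query engines) works unchanged. Only the arguments to load_data differ, since this reader takes URLs.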
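
One thing to notice about the loop in load_data is that raise_for_status stops the whole batch at the first bad URL. If you would rather keep the documents that did load, one option is to collect failures separately. This is a minimal sketch; load_all is a hypothetical helper, and fetch stands for any callable that returns markdown for a URL and raises on error.

```python
from typing import Callable, Dict, List, Tuple


def load_all(
    urls: List[str],
    fetch: Callable[[str], str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Fetch every URL, separating successes from failures.

    Returns (loaded, failed): loaded maps URL to markdown,
    failed maps URL to a short error description.
    """
    loaded: Dict[str, str] = {}
    failed: Dict[str, str] = {}
    for url in urls:
        try:
            loaded[url] = fetch(url)
        except Exception as exc:
            # Record the error and continue with the remaining URLs.
            failed[url] = f"{type(exc).__name__}: {exc}"
    return loaded, failed
```

With the reader above, fetch could be lambda u: reader.load_data([u])[0].text, and the failed dictionary tells you which sources to retry or inspect.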

Loading Multiple Sources Through One Reader

The real advantage shows when you load different file types through a single reader. No need to pick the right loader for each format:

reader = DriveAIReader()

# Mix URLs, PDFs, spreadsheets, docs — all through one reader
documents = reader.load_data([
    "https://example.com/annual-report.pdf",
    "https://example.com/pricing",
    "https://docs.google.com/spreadsheets/d/1abc.../edit",
    "https://example.com/blog/product-launch",
    "https://example.com/whitepaper.docx",
])

print(f"Loaded {len(documents)} documents")
for doc in documents:
    print(f"  - {doc.metadata['source']}: {len(doc.text)} chars")

Each document comes back as clean markdown with consistent formatting. Tables are proper markdown tables. Code blocks are fenced. Headers use standard markdown levels. This consistency matters for chunking — LlamaIndex's node parsers work better with well-structured input.

Building a VectorStoreIndex from Drive AI Documents

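With documents loaded, create a VectorStoreIndex for semantic search. This example uses OpenAI models for generation and embeddings, so OPENAI_API_KEY must also be set in your environment: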

from llama_index.core import VectorStoreIndex, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Configure LlamaIndex settings
Settings.llm = OpenAI(model="gpt-4o")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Load documents
reader = DriveAIReader()
documents = reader.load_data([
    "https://example.com/docs/getting-started",
    "https://example.com/docs/api-reference",
    "https://example.com/docs/faq",
])

# Build the index
index = VectorStoreIndex.from_documents(documents)

LlamaIndex automatically chunks the documents into nodes, generates embeddings, and stores them in the vector index. Because The Drive AI returns well-structured markdown, the default SentenceSplitter tends to produce cleaner chunks: it breaks on paragraph and sentence boundaries, which are intact in clean markdown but unreliable in mangled HTML. If you want chunks that follow the heading hierarchy, LlamaIndex also provides MarkdownNodeParser.
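
To make the link between markdown structure and chunk quality concrete, here is a minimal sketch of header-aware splitting in plain Python. It is not LlamaIndex's implementation, and split_by_headers is a hypothetical helper; in a real pipeline you would use a LlamaIndex node parser instead.

```python
import re
from typing import List, Tuple

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # marker that opens and closes a fenced code block


def split_by_headers(markdown: str) -> List[Tuple[str, str]]:
    """Split markdown into (heading, body) sections.

    Text before the first heading is returned under an empty heading.
    Lines inside fenced code blocks are never treated as headings.
    """
    sections: List[Tuple[str, List[str]]] = [("", [])]
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADER_RE.match(line)
        if match:
            # Start a new section at every heading outside a fence.
            sections.append((match.group(2).strip(), []))
        else:
            sections[-1][1].append(line)
    result = []
    for heading, body_lines in sections:
        body = "\n".join(body_lines).strip()
        # Drop sections that have neither a heading nor any text.
        if heading or body:
            result.append((heading, body))
    return result
```

Each section keeps its heading alongside its text, which is the property that makes chunks from clean markdown easier to retrieve than chunks cut at arbitrary character offsets.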

Querying with a QueryEngine

Build a QueryEngine on top of the index and ask questions in natural language:

query_engine = index.as_query_engine(similarity_top_k=3)

# Ask questions across all loaded documents
response = query_engine.query(
    "What authentication methods does the API support?"
)
print(response)

# Follow up
response = query_engine.query(
    "What are the rate limits for the free tier?"
)
print(response)

The query engine embeds the question, retrieves the most similar chunks from the index (three here, per similarity_top_k=3), and passes them to the LLM to generate an answer. Because The Drive AI preserved the original document structure (tables, lists, code blocks), the LLM has clean context to work with.

You can customize retrieval parameters:

from llama_index.core.postprocessor import SimilarityPostprocessor

query_engine = index.as_query_engine(
    similarity_top_k=5,
    node_postprocessors=[
        SimilarityPostprocessor(similarity_cutoff=0.7)
    ],
)
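
What these two parameters do can be pictured with a small plain-Python sketch: rank chunks by cosine similarity to the query, keep the top k, then drop anything below the cutoff. This is a conceptual illustration with toy vectors, not LlamaIndex's internals; cosine and retrieve are hypothetical names.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_vec: Sequence[float],
    chunk_vecs: Dict[str, Sequence[float]],
    top_k: int = 5,
    cutoff: float = 0.7,
) -> List[Tuple[str, float]]:
    """Return up to top_k (chunk_id, score) pairs scoring at least cutoff."""
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Take the top_k nearest first, then drop weak matches, mirroring
    # the retrieve-then-postprocess order of the query engine.
    return [(cid, score) for cid, score in scored[:top_k] if score >= cutoff]
```

A high cutoff can leave the list empty, in which case the LLM receives no context at all, so it is worth checking typical scores for your embedding model before settling on a value.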

Advanced: Structured Metadata Extraction with the Extract API

The Drive AI Extract API pulls structured data from documents using a schema you define. This is useful for attaching rich metadata to LlamaIndex nodes — metadata that improves filtering and retrieval.

import json
import os

import requests
from llama_index.core import Document
from llama_index.core.schema import TextNode


def extract_metadata(url: str, schema: dict) -> dict:
    """Extract structured data from a URL using The Drive AI."""
    response = requests.post(
        "https://dev.thedrive.ai/api/v1/extract",
        headers={
            "X-API-Key": os.environ["THEDRIVE_API_KEY"],
            "Content-Type": "application/json",
        },
        json={
            "url": url,
            "schema": schema,
        },
        timeout=60,  # avoid hanging indefinitely; tune as needed
    )
    response.raise_for_status()
    return response.json()

Define a schema for the data you want to extract. The API returns structured results with confidence scores and citations:

schema = {
    "company_name": "string",
    "product_name": "string",
    "pricing_tiers": [
        {
            "tier_name": "string",
            "price": "string",
            "features": ["string"],
        }
    ],
    "last_updated": "string",
}

result = extract_metadata(
    "https://example.com/pricing",
    schema,
)

Attach the extracted metadata to your LlamaIndex nodes for filtered retrieval:

reader = DriveAIReader()
docs = reader.load_data(["https://example.com/pricing"])

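# Create nodes with extracted metadata.
# Assumes the schema fields sit at the top level of the response JSON;
# adjust the lookups below if your responses nest them.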
nodes = []
for doc in docs:
    extracted = extract_metadata(
        doc.metadata["source"], schema
    )
    node = TextNode(
        text=doc.text,
        metadata={
            **doc.metadata,
            "company": extracted.get("company_name", ""),
            "product": extracted.get("product_name", ""),
            "pricing_tiers": json.dumps(
                extracted.get("pricing_tiers", [])
            ),
        },
    )
    nodes.append(node)

# Build index from nodes with metadata
index = VectorStoreIndex(nodes)

# Query with metadata filters
from llama_index.core.vector_stores.types import (
    MetadataFilter,
    MetadataFilters,
)

query_engine = index.as_query_engine(
    filters=MetadataFilters(
        filters=[
            MetadataFilter(
                key="company", value="Acme Corp"
            )
        ]
    )
)

This combines The Drive AI's structured extraction, including confidence scores and source citations, with LlamaIndex's metadata filtering. You get precise retrieval without relying solely on semantic similarity.
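
The example above JSON-encodes pricing_tiers before attaching it to the node. That is a reasonable habit, because exact-match filters work on scalar values and many vector stores accept only flat metadata. A minimal sketch that applies the same rule to a whole extraction result follows; flatten_metadata is a hypothetical helper, not a LlamaIndex or Drive AI function.

```python
import json
from typing import Any, Dict


def flatten_metadata(extracted: Dict[str, Any]) -> Dict[str, Any]:
    """Keep scalar values as-is and JSON-encode lists and dicts."""
    flat: Dict[str, Any] = {}
    for key, value in extracted.items():
        if value is None or isinstance(value, (str, int, float, bool)):
            flat[key] = value
        else:
            # Nested structures become a stable JSON string.
            flat[key] = json.dumps(value, sort_keys=True)
    return flat
```

Scalar fields such as company_name stay filterable with MetadataFilter, while nested fields remain available to the LLM as text.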

Advanced: Using the Analyze API as a LlamaIndex Tool

The Drive AI Analyze API performs multi-step reasoning with Python code execution. You can expose it as a LlamaIndex tool for an agent that needs to analyze documents on the fly:

import os

import requests
from llama_index.core.tools import FunctionTool


def analyze_document(url: str, question: str) -> str:
    """Analyze a document using The Drive AI.

    Performs multi-step reasoning with Python execution
    to answer complex questions about any document.
    """
    response = requests.post(
        "https://dev.thedrive.ai/api/v1/analyze",
        headers={
            "X-API-Key": os.environ["THEDRIVE_API_KEY"],
            "Content-Type": "application/json",
        },
        json={
            "url": url,
            "question": question,
        },
        timeout=120,  # multi-step analysis may be slow; tune as needed
    )
    response.raise_for_status()
    return response.json().get("answer", "")


# Create a LlamaIndex tool
analyze_tool = FunctionTool.from_defaults(
    fn=analyze_document,
    name="analyze_document",
    description=(
        "Analyze a document at a given URL. Use this for "
        "complex questions that require calculations, "
        "comparisons, or multi-step reasoning over the "
        "document content. Supports PDFs, spreadsheets, "
        "web pages, and 107+ file types."
    ),
)

Use the tool with a LlamaIndex agent:

from llama_index.core.agent import ReActAgent

agent = ReActAgent.from_tools(
    [analyze_tool],
    llm=OpenAI(model="gpt-4o"),
    verbose=True,
)

response = agent.chat(
    "What is the year-over-year revenue growth rate "
    "in https://example.com/financial-report.pdf?"
)
print(response)

The agent decides when to call the Analyze API based on the question. As written, analyze_document is the only tool it has. To have simple lookups go through the vector index while complex analytical questions (comparisons, calculations, trend analysis) route to the Analyze API, which can execute Python code to compute the answer, also register the query engine as a tool, for example with QueryEngineTool.
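
Because analyze_document calls raise_for_status, an HTTP error surfaces as an exception inside the tool. One option is to hand the error back to the agent as text so it can react to it. The decorator below is a minimal sketch of that idea; errors_as_text is a hypothetical name, and whether your agent setup already handles tool exceptions is worth checking first.

```python
import functools
from typing import Callable


def errors_as_text(fn: Callable[..., str]) -> Callable[..., str]:
    """Wrap a tool function so exceptions come back as a message."""

    @functools.wraps(fn)  # keep the name, docstring, and signature
    def wrapper(*args, **kwargs) -> str:
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            return f"Tool error ({type(exc).__name__}): {exc}"

    return wrapper
```

Passing fn=errors_as_text(analyze_document) to FunctionTool.from_defaults keeps the original function name and docstring, since functools.wraps copies them onto the wrapper.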

When to Use DriveAIReader vs Built-in Readers

Use the built-in LlamaIndex readers when:

You have local plain text or simple markdown files. SimpleDirectoryReader handles these fine.
You are working with a well-structured CSV. LlamaIndex's CSVReader parses these correctly.
You need zero external dependencies and your documents are simple.

Use DriveAIReader when:

PDFs have tables, charts, or complex layouts. The Drive AI preserves structure that PDFReader destroys.
Web pages use JavaScript rendering. SPAs, dashboards, interactive documentation — SimpleWebPageReader cannot handle these.
You need one reader for multiple file types. Instead of configuring PDFReader, DocxReader, CSVReader, and SimpleWebPageReader separately, use one reader.
Scanned documents or images contain text. The Drive AI runs OCR with vision model proofreading, which the basic built-in readers do not do.
You need consistent markdown output. Different built-in readers produce different output formats. The Drive AI normalizes everything to clean markdown, which makes chunking and retrieval more predictable.
You are ingesting documents from URLs. If your source material lives on the web — documentation sites, research papers, financial filings — DriveAIReader handles the fetching, rendering, and conversion in one step.

Full Working Example

Here is a complete script that loads multiple document types, builds an index, and runs queries:

import os
from llama_index.core import VectorStoreIndex, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Assuming DriveAIReader class is defined as above
# from drive_ai_reader import DriveAIReader

Settings.llm = OpenAI(model="gpt-4o")
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small"
)

reader = DriveAIReader()

# Load a mix of sources
documents = reader.load_data([
    "https://example.com/docs/quickstart",
    "https://example.com/api-reference.pdf",
    "https://example.com/changelog",
])

# Build index and query engine
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=3)

# Query
questions = [
    "How do I authenticate with the API?",
    "What changed in the latest release?",
    "What are the request size limits?",
]

for q in questions:
    print(f"Q: {q}")
    print(f"A: {query_engine.query(q)}\n")

The Drive AI handles the document conversion. LlamaIndex handles the indexing and retrieval. Each tool does what it is best at.
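
If you rebuild the index often, converting the same URLs on every run is wasted work, and possibly wasted credits if repeat conversions count against your quota. A small on-disk cache is one way around that. This is a sketch under assumptions: cached_markdown is a hypothetical helper, fetch is any callable that returns markdown for a URL, and entries never expire.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_markdown(
    url: str,
    fetch: Callable[[str], str],
    cache_dir: Path,
) -> str:
    """Return markdown for url, calling fetch only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so it becomes a safe, fixed-length file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.md"
    if path.exists():
        return path.read_text(encoding="utf-8")
    markdown = fetch(url)
    path.write_text(markdown, encoding="utf-8")
    return markdown
```

Delete the cache directory, or add a timestamp check, when the source pages change and you need fresh conversions.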

Have questions? Reach out at contact@thedrive.ai.