How to Convert PDF to Markdown for LLMs

LLMs work best with structured text. PDFs are the opposite — binary containers full of layout instructions, font encodings, and coordinate-based text placement. When you feed raw PDF text into an LLM, you get garbled tables, broken paragraphs, and wasted tokens on formatting artifacts.

Converting PDF to markdown solves this. Markdown preserves document structure (headings, lists, tables) while stripping everything the model does not need. The result: fewer tokens, better comprehension, more accurate answers.

Why markdown matters for LLM pipelines

Raw PDF extraction produces flat text with no structure. A heading looks the same as body text. Table columns merge into a single line. Lists lose their hierarchy.

Markdown fixes each of these problems:

Token efficiency — clean markdown uses 30-50% fewer tokens than raw PDF text for the same content
Structure preservation — headings, lists, and tables map directly to markdown syntax
Consistent formatting — every document follows the same conventions, making downstream parsing predictable
Context window optimization — less noise means you can fit more meaningful content into each LLM call
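The size of the token saving depends on the document, so it is worth measuring on your own files. The sketch below is a rough estimator, not a tokenizer: it assumes about 4 characters per token, a common rule of thumb for English text, and the function names and sample strings are hypothetical. For real numbers, count with the tokenizer of the model you actually use.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. The 4-characters-per-token ratio is an assumed
    rule of thumb for English text, not a real tokenizer."""
    return round(len(text) / chars_per_token)


def token_savings(raw_text: str, markdown_text: str) -> float:
    """Fraction of estimated tokens saved by the markdown version.
    Negative if the markdown is longer than the raw text."""
    raw_tokens = estimate_tokens(raw_text)
    if raw_tokens == 0:
        return 0.0
    return 1 - estimate_tokens(markdown_text) / raw_tokens


# Hypothetical inputs: the same content as padded raw extraction and as markdown.
raw = "Quarterly   Report      Page 1 of 12\n\n\nRevenue      grew     23%   \n"
md = "# Quarterly Report\n\nRevenue grew 23%\n"
print(f"estimated savings: {token_savings(raw, md):.0%}")
```

Running the same comparison over a representative sample of your documents gives a more honest picture than any single headline percentage.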

The hard way: local libraries

Several open-source libraries handle PDF-to-text extraction. Each has trade-offs.

PyMuPDF (fitz) extracts text quickly but produces flat output with no markdown structure. Tables come out as space-separated text. Scanned PDFs return nothing without a separate OCR step.

pdfplumber offers better table detection but still outputs raw text. You need to write custom logic to convert extracted data into markdown. It cannot handle scanned documents at all.
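As an illustration of that custom logic, here is a minimal sketch that turns extracted rows into a markdown table. It assumes the rows arrive as a list of lists with None for empty cells, which is the shape pdfplumber's extract_table() returns. The function name rows_to_markdown and the sample data are hypothetical, and the first row is assumed to be the header.

```python
def rows_to_markdown(rows):
    """Render a list of rows (first row treated as the header) as a markdown table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell):
        # Empty cells may arrive as None; newlines and pipes would break the table.
        text = "" if cell is None else str(cell)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    def render(row):
        # Pad short rows so every line has the same number of columns.
        cells = [clean(c) for c in row] + [""] * (width - len(row))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [render(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [render(row) for row in body]
    return "\n".join(lines)


# Hypothetical rows, shaped like the output of a table-extraction call.
rows = [
    ["Name", "Revenue", "Growth"],
    ["Acme Corp", "$12.4M", "23%"],
    ["Beta Inc", "$8.7M", None],
]
table_md = rows_to_markdown(rows)
print(table_md)
```

This covers only the simple case. Merged cells, multi-line headers, and tables that span pages still need handling of their own.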

pdf2md and MarkItDown attempt markdown conversion but struggle with complex layouts, multi-column pages, and embedded images. Neither handles OCR for scanned documents.

Unstructured provides a more complete pipeline but requires significant local dependencies (Tesseract, poppler, libmagic) and configuration. Table extraction quality varies.

The common problems across all local approaches:

Scanned PDFs require a separate OCR setup (Tesseract, plus language packs)
Table structure is frequently lost or mangled
Large documents (1000+ pages) cause memory issues
Each library has different failure modes for different PDF types
Maintaining the pipeline means managing system-level dependencies across environments

The API way: one GET request

The Drive AI Markdown API converts any URL-accessible document to clean markdown with a single request:

GET https://dev.thedrive.ai/md/{url}

Pass the URL of any PDF (or web page, or document), and the API returns structured markdown. No local dependencies, no OCR configuration, no memory management.

curl -H "X-API-Key: tda_live_..." \
  "https://dev.thedrive.ai/md/https://example.com/report.pdf"

The API handles the complexity behind the scenes: native text extraction for digital PDFs, OCR with vision model proofreading for scanned documents, and table-aware parsing that preserves structure.

Handling different PDF types

Not all PDFs are the same, and the API adapts to each type automatically.

Native text PDFs — standard digital documents. The API extracts text directly and converts the structure to markdown headings, lists, and paragraphs.

Scanned PDFs — image-based documents from scanners or cameras. The API runs OCR and then uses a vision model to proofread the output, catching errors that traditional OCR misses.

PDFs with tables — financial reports, invoices, research papers. The API detects table boundaries and converts them to proper markdown table syntax with aligned columns and headers.

Large documents — contracts, manuals, regulatory filings with 1000+ pages. The API uses progressive reading to process these without hitting timeouts or memory limits.

Table preservation

Tables are where most PDF-to-text tools fail. Here is what raw extraction typically produces:

Name Revenue Growth
Acme Corp $12.4M 23%
Beta Inc $8.7M 15%
Gamma LLC $45.2M 31%

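The column boundaries are gone. Nothing marks where a company name ends and its figures begin, and in wider tables or tables with empty cells there is no reliable way to tell which value belongs to which column without the original layout.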

The API preserves table structure as markdown:

| Name       | Revenue | Growth |
|------------|---------|--------|
| Acme Corp  | $12.4M  | 23%    |
| Beta Inc   | $8.7M   | 15%    |
| Gamma LLC  | $45.2M  | 31%    |

This format is unambiguous for LLMs and renders correctly in any markdown renderer that supports pipe tables (a GitHub Flavored Markdown extension that most renderers implement).
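Because the structure is explicit, the same table is also easy to read back into data. The following is a minimal sketch of a parser for simple pipe tables. The function name parse_markdown_table is hypothetical, and the code assumes one header row, one separator row, and no escaped pipes inside cells.

```python
def parse_markdown_table(table: str):
    """Parse a simple pipe table into a list of dicts keyed by column header."""

    def split_row(line):
        # Drop the outer pipes, then split the remaining text on the inner ones.
        return [cell.strip() for cell in line.strip().strip("|").split("|")]

    lines = [ln for ln in table.strip().splitlines() if ln.strip()]
    if len(lines) < 2:
        return []
    header = split_row(lines[0])
    # lines[1] is the |---|---| separator row and carries no data.
    return [dict(zip(header, split_row(ln))) for ln in lines[2:]]


table = """
| Name       | Revenue | Growth |
|------------|---------|--------|
| Acme Corp  | $12.4M  | 23%    |
| Beta Inc   | $8.7M   | 15%    |
| Gamma LLC  | $45.2M  | 31%    |
"""
records = parse_markdown_table(table)
print(records[0])
```

Each record maps a header to its cell value, so downstream code can refer to a column by name instead of guessing at positions in a flat line of text.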

Code examples

Python

from thedriveai import TheDriveAI

client = TheDriveAI(api_key="tda_live_...")

markdown = client.md("https://example.com/report.pdf")
print(markdown)

Or with plain requests:

import requests

url = "https://dev.thedrive.ai/md/https://example.com/report.pdf"
headers = {"X-API-Key": "tda_live_..."}

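# Set a timeout and raise on HTTP errors so an error body is not mistaken for markdown.
response = requests.get(url, headers=headers, timeout=60)
response.raise_for_status()
markdown = response.text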

Node.js

import { TheDriveAI } from "@thedriveai/sdk";

const client = new TheDriveAI({ apiKey: "tda_live_..." });

const markdown = await client.md("https://example.com/report.pdf");
console.log(markdown);

cURL

curl -H "X-API-Key: tda_live_..." \
  "https://dev.thedrive.ai/md/https://example.com/report.pdf"

Batch conversion

To convert multiple PDFs, such as a set of quarterly reports, loop through their URLs and collect the markdown:

import requests

api_key = "tda_live_..."
headers = {"X-API-Key": api_key}

pdf_urls = [
    "https://example.com/q1-report.pdf",
    "https://example.com/q2-report.pdf",
    "https://example.com/q3-report.pdf",
]

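results = {}
for url in pdf_urls:
    # The document URL is appended to the endpoint path, as in the single-file example.
    response = requests.get(
        f"https://dev.thedrive.ai/md/{url}",
        headers=headers,
        timeout=60,
    )
    response.raise_for_status()  # stop on a failed conversion instead of storing an error body
    results[url] = response.text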

Each conversion uses 1 credit. The free tier includes 100 conversions per month. Beyond that, credits cost $0.01 each on the Pro plan.
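The pricing above reduces to simple arithmetic. Here is a small sketch for estimating a monthly bill. It assumes that the 100 free conversions still count before paid credits start and that the plan carries no base fee; neither is confirmed here, so check current pricing before budgeting. The function name monthly_cost is hypothetical.

```python
def monthly_cost(conversions: int, free_quota: int = 100,
                 price_per_credit: float = 0.01) -> float:
    """Estimated monthly cost in dollars at 1 credit per conversion.

    Assumes the free quota is used up first and no base fee applies
    (both are assumptions, not confirmed pricing rules).
    """
    if conversions < 0:
        raise ValueError("conversions must be non-negative")
    billable = max(0, conversions - free_quota)
    return round(billable * price_per_credit, 2)


for n in (80, 100, 2500):
    print(n, "conversions ->", f"${monthly_cost(n):.2f}")
```

Under those assumptions, 2,500 conversions in a month leave 2,400 billable credits, or $24.00.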

Local tools vs API: when to use which

| Factor           | Local libraries                          | Markdown API                       |
|------------------|------------------------------------------|------------------------------------|
| Setup            | Install system deps (Tesseract, poppler) | None (one HTTP call)               |
| Scanned PDFs     | Requires separate OCR config             | Built-in OCR + vision proofreading |
| Tables           | Often broken or lost                     | Preserved as markdown tables       |
| Large files      | Memory limits, crashes                   | Progressive reading, no limits     |
| Maintenance      | Version conflicts, OS-specific issues    | Managed service                    |
| Cost             | Free (your compute)                      | Free 100/month, then $0.01/credit  |
| Offline use      | Yes                                      | No                                 |
| Data sensitivity | Stays local                              | Sent to API                        |

Use local tools when you need offline processing or cannot send documents to an external service. Use the API for everything else — it handles edge cases that would take weeks to solve locally.

Getting started

Get a free API key at thedrive.ai
Install the SDK: pip install thedriveai or npm install @thedriveai/sdk
Convert your first PDF with a single call
Integrate into your LLM pipeline — feed the markdown directly into your prompt context

The free tier gives you 100 conversions per month with no credit card required.
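For the last step, long documents often need to be split before they fit into a prompt. The sketch below shows one possible approach, assuming the converted markdown uses ATX headings (lines starting with #): it splits at headings and packs sections into chunks up to a size limit. The function name split_by_headings and the 4000-character default are hypothetical choices. A single section longer than the limit is kept whole, and a # line inside a code block would be mistaken for a heading.

```python
import re


def split_by_headings(markdown: str, max_chars: int = 4000):
    """Split markdown at headings, then pack whole sections into chunks
    of at most max_chars characters where possible."""
    # Zero-width split just before each heading line (Python 3.7+).
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks, current = [], ""
    for section in sections:
        if not section.strip():
            continue
        # Start a new chunk when adding this section would exceed the limit.
        if current and len(current) + len(section) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += section
    if current.strip():
        chunks.append(current.strip())
    return chunks


doc = "# Report\nIntro text.\n## Q1\nRevenue up.\n## Q2\nRevenue flat.\n"
for chunk in split_by_headings(doc, max_chars=30):
    print(repr(chunk))
```

Splitting on headings keeps each chunk aligned with a section of the original document, which plain extracted text cannot offer because its headings look like body text.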

Have questions? Reach out at contact@thedrive.ai.
