Extract Structured Data from Any Document with One API Call

The Real Cost of Unstructured Documents

Every business runs on documents — invoices, contracts, receipts, medical forms, property listings. The data inside them is structured, but the format is not. Extracting that data means regex, templates, manual entry, or expensive platforms that charge per page and still break when the layout changes.

Traditional extraction tools rely on template matching. They work until a vendor sends an invoice in a slightly different format. Then they fail silently, returning wrong data with high confidence — the worst kind of bug.

LLM-grounded extraction solves this. Instead of matching templates, it reads the document the way a human would, uses context to interpret each field, and adapts to layout variations. Studies show template-based extractors drop 12-28 F1 points on new document formats, while LLM-based tools drop only 2-6.

We built an extraction API around this approach.

How It Works

The Drive AI Extract API takes a document and a schema, and returns typed JSON with confidence scores and source citations.

POST https://dev.thedrive.ai/api/v1/extract

Define What You Want

Send a schema describing the fields you need. Each field has a type, usually a description, and an optional required flag. Enum fields also list their allowed values in options:

{
  "url": "https://example.com/invoice-2024-003.pdf",
  "schema": {
    "vendor_name": {
      "type": "string",
      "description": "Company or person who issued the invoice",
      "required": true
    },
    "invoice_date": {
      "type": "string",
      "description": "Date the invoice was issued (ISO 8601)"
    },
    "total_amount": {
      "type": "number",
      "description": "Total amount due including tax"
    },
    "line_items": {
      "type": "array",
      "description": "List of items with description, quantity, and unit price"
    },
    "payment_terms": {
      "type": "enum",
      "description": "Payment terms",
      "options": ["net_15", "net_30", "net_60", "due_on_receipt"]
    }
  }
}
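
Because a malformed schema is easy to send by accident, it can help to check it locally first. The sketch below is a hypothetical client-side helper, not part of the API: the name validate_schema is made up for illustration, and the allowed type list is only the set of types that appear in this article (the API may accept others).

```python
# Types seen in this article's examples; the real API may support more (assumption).
ALLOWED_TYPES = {"string", "number", "boolean", "array", "enum"}


def validate_schema(schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the schema looks well-formed."""
    problems = []
    for name, spec in schema.items():
        field_type = spec.get("type")
        if field_type not in ALLOWED_TYPES:
            problems.append(f"{name}: unknown type {field_type!r}")
        if field_type == "enum" and not spec.get("options"):
            problems.append(f"{name}: enum fields need a non-empty 'options' list")
        if "required" in spec and not isinstance(spec["required"], bool):
            problems.append(f"{name}: 'required' must be true or false")
    return problems


schema = {
    "vendor_name": {
        "type": "string",
        "description": "Company or person who issued the invoice",
        "required": True,
    },
    "invoice_date": {"type": "string", "description": "Date the invoice was issued (ISO 8601)"},
    "total_amount": {"type": "number", "description": "Total amount due including tax"},
    "line_items": {"type": "array", "description": "List of items with description, quantity, and unit price"},
    "payment_terms": {
        "type": "enum",
        "description": "Payment terms",
        "options": ["net_15", "net_30", "net_60", "due_on_receipt"],
    },
}

print(validate_schema(schema))
```

Running the check before each request catches typos such as a missing options list without spending credits.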

Get Structured Data Back

The response includes the extracted data, a confidence score for each field, and citations pointing to the exact text the data was pulled from:

{
  "data": {
    "vendor_name": "Acme Corp",
    "invoice_date": "2024-11-15",
    "total_amount": 4750.00,
    "line_items": [
      { "description": "Consulting — November", "quantity": 40, "unit_price": 100 },
      { "description": "Travel expenses", "quantity": 1, "unit_price": 750 }
    ],
    "payment_terms": "net_30"
  },
  "confidence": {
    "vendor_name": "high",
    "invoice_date": "high",
    "total_amount": "high",
    "line_items": "high",
    "payment_terms": "medium"
  },
  "citations": {
    "vendor_name": "Acme Corp, Inc. — 123 Business Ave, Suite 400",
    "invoice_date": "Invoice Date: November 15, 2024",
    "total_amount": "Total Due: $4,750.00",
    "payment_terms": "Terms: Net 30 days from invoice date"
  }
}

Every field is traceable. If the confidence is medium or low, you can inspect the citation to understand why and decide how to handle it in your pipeline.
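
One way to act on this is to collect every field below a chosen confidence level together with its citation, so a reviewer sees the value and the source text side by side. This is a minimal sketch that assumes the response shape shown above; the function name fields_needing_review is hypothetical.

```python
def fields_needing_review(response: dict, accept=("high",)) -> list[dict]:
    """List fields whose confidence is not in `accept`, with value and citation."""
    flagged = []
    for field, level in response["confidence"].items():
        if level not in accept:
            flagged.append({
                "field": field,
                "value": response["data"].get(field),
                "confidence": level,
                # Not every field carries a citation in the sample, so use .get().
                "citation": response["citations"].get(field),
            })
    return flagged


# Trimmed copy of the sample response above.
response = {
    "data": {"vendor_name": "Acme Corp", "total_amount": 4750.00, "payment_terms": "net_30"},
    "confidence": {"vendor_name": "high", "total_amount": "high", "payment_terms": "medium"},
    "citations": {
        "vendor_name": "Acme Corp, Inc. — 123 Business Ave, Suite 400",
        "total_amount": "Total Due: $4,750.00",
        "payment_terms": "Terms: Net 30 days from invoice date",
    },
}

for item in fields_needing_review(response):
    print(item)
```

Here only payment_terms is flagged, and its citation ("Terms: Net 30 days from invoice date") shows the reviewer exactly what the value was derived from.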

Authentication

Send your API key in the X-API-Key header on each request:

curl -X POST https://dev.thedrive.ai/api/v1/extract \
  -H "X-API-Key: tda_live_..." \
  -H "Content-Type: application/json" \
  -d '{ "url": "...", "schema": { ... } }'
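
The same request can be built without the SDK using only the Python standard library. This is a sketch under assumptions: the helper name build_extract_request and the THEDRIVE_API_KEY environment variable are hypothetical, and the request is only constructed here, not sent.

```python
import json
import os
import urllib.request

EXTRACT_URL = "https://dev.thedrive.ai/api/v1/extract"


def build_extract_request(api_key: str, document_url: str, schema: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request with the headers used in the cURL example."""
    body = json.dumps({"url": document_url, "schema": schema}).encode("utf-8")
    return urllib.request.Request(
        EXTRACT_URL,
        data=body,
        method="POST",
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
    )


# THEDRIVE_API_KEY is a hypothetical variable name; keeping keys out of source code is the point.
api_key = os.environ.get("THEDRIVE_API_KEY", "tda_live_...")
request = build_extract_request(
    api_key,
    "https://example.com/document.pdf",
    {"title": {"type": "string"}},
)

# To actually send it:
# with urllib.request.urlopen(request) as resp:
#     result = json.load(resp)
```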

What You Can Extract From

The API handles documents, spreadsheets, presentations, images, and websites through the same endpoint.

Source Type	Examples	Credits
Documents	PDF, DOCX, DOC, ODT, RTF, EPUB, TXT	1 credit/page
Spreadsheets	XLSX, XLS, ODS, CSV, TSV	1 credit/page
Presentations	PPTX, PPT, ODP, KEY	1 credit/page
Images	JPG, PNG, TIFF, HEIC (via OCR)	1 credit/page
Websites	Any public URL	5 credits/site

Scanned documents and images go through OCR with vision model proofreading — so even a photo of a handwritten form or a stamped receipt gets extracted accurately.

Real-World Extraction Patterns

Invoice Processing

Automate accounts payable by extracting vendor, amounts, dates, and line items from invoices in any format:

from thedriveai import TheDriveAI

client = TheDriveAI(api_key="tda_live_...")

result = client.extract(
    url="https://storage.example.com/invoices/INV-2024-003.pdf",
    schema={
        "vendor": {"type": "string", "required": True},
        "invoice_number": {"type": "string", "required": True},
        "date": {"type": "string", "description": "ISO 8601 date"},
        "total": {"type": "number"},
        "line_items": {"type": "array", "description": "description, qty, unit_price, amount"},
        "tax_amount": {"type": "number"},
    }
)

# Route based on confidence
if all(c == "high" for c in result.confidence.values()):
    auto_process(result.data)
else:
    flag_for_review(result.data, result.confidence)
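
Confidence routing can be paired with an arithmetic cross-check: if the line items do not add up to the stated total, the invoice probably deserves review regardless of confidence. The sketch below uses the key names from the sample response earlier (quantity, unit_price, total_amount); those names depend on your schema, and the helper itself is hypothetical.

```python
import math


def line_items_match_total(data: dict, tolerance: float = 0.01) -> bool:
    """Check that sum(quantity * unit_price), plus tax if present, equals the total."""
    subtotal = sum(item["quantity"] * item["unit_price"] for item in data["line_items"])
    tax = data.get("tax_amount") or 0  # optional field; treated as 0 when absent
    return math.isclose(subtotal + tax, data["total_amount"], abs_tol=tolerance)


# Data from the sample response: 40 * 100 + 1 * 750 = 4750
sample = {
    "total_amount": 4750.00,
    "line_items": [
        {"description": "Consulting — November", "quantity": 40, "unit_price": 100},
        {"description": "Travel expenses", "quantity": 1, "unit_price": 750},
    ],
}

print(line_items_match_total(sample))
```

For the sample invoice the check passes, since 4,000 + 750 matches the extracted total of 4,750.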

Contract Review

Extract key terms from legal documents — parties, dates, obligations, and risk clauses:

import { TheDriveAI } from '@thedriveai/sdk';

const client = new TheDriveAI({ apiKey: 'tda_live_...' });

const result = await client.extract({
  url: 'https://storage.example.com/contracts/vendor-agreement.pdf',
  schema: {
    parties: { type: 'array', description: 'Names of all parties to the agreement' },
    effective_date: { type: 'string', description: 'When the contract takes effect' },
    termination_date: { type: 'string', description: 'When the contract expires' },
    liability_cap: { type: 'number', description: 'Maximum liability amount in USD' },
    governing_law: { type: 'string', description: 'Jurisdiction governing the contract' },
    auto_renewal: { type: 'boolean', description: 'Whether the contract auto-renews' },
    notice_period_days: { type: 'number', description: 'Days of notice required for termination' },
  },
});

Lead Enrichment from Websites

Extract company details from any website — no scraping logic needed:

from thedriveai import TheDriveAI

client = TheDriveAI(api_key="tda_live_...")

result = client.extract(
    url="https://stripe.com",
    schema={
        "company_name": {"type": "string"},
        "tagline": {"type": "string", "description": "Main value proposition or tagline"},
        "industry": {"type": "string"},
        "products": {"type": "array", "description": "Main products or services"},
        "social_links": {"type": "array", "description": "Twitter, LinkedIn, GitHub URLs"},
        "pricing_model": {"type": "enum", "options": ["free", "freemium", "paid", "enterprise", "usage_based"]},
    }
)

The API renders JavaScript, parses the DOM, follows links — it reads websites the way a person would, not the way a scraper does.

Receipt and Expense Processing

Process expense receipts from photos or scans:

from thedriveai import TheDriveAI

client = TheDriveAI(api_key="tda_live_...")

result = client.extract(
    url="https://storage.example.com/receipts/IMG_4521.jpg",
    schema={
        "merchant": {"type": "string", "required": True},
        "date": {"type": "string"},
        "total": {"type": "number", "required": True},
        "tax": {"type": "number"},
        "payment_method": {"type": "enum", "options": ["cash", "credit", "debit", "other"]},
        "category": {"type": "enum", "options": ["meals", "travel", "supplies", "software", "other"]},
    }
)
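
Before writing enum fields such as payment_method or category into an expense system, a defensive check that each returned value is one of the declared options can be worthwhile. This article does not state whether the API guarantees that, so the sketch below assumes it might not; the helper name and the sample data are hypothetical.

```python
def enum_violations(data: dict, schema: dict) -> dict:
    """Map field name -> returned value for enum fields outside their declared options."""
    violations = {}
    for name, spec in schema.items():
        if spec.get("type") != "enum":
            continue
        value = data.get(name)
        if value is not None and value not in spec.get("options", []):
            violations[name] = value
    return violations


receipt_schema = {
    "merchant": {"type": "string", "required": True},
    "total": {"type": "number", "required": True},
    "payment_method": {"type": "enum", "options": ["cash", "credit", "debit", "other"]},
    "category": {"type": "enum", "options": ["meals", "travel", "supplies", "software", "other"]},
}

# Made-up extraction result, for illustration only.
extracted = {"merchant": "Corner Cafe", "total": 18.5, "payment_method": "credit", "category": "coffee"}

print(enum_violations(extracted, receipt_schema))
```

In this made-up result, category is reported because "coffee" is not among the declared options, while payment_method passes.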

How It Compares to Alternatives

Feature	Drive AI Extract	Adobe PDF Extract	AWS Textract	Google Document AI
Schema-based extraction	Yes	No (fixed output)	Limited	Yes (custom processors)
Confidence scores	Per-field	Per-element	Per-word	Per-entity
Source citations	Yes	No	No	No
Website extraction	Yes	No	No	No
OCR with AI proofreading	Yes	Yes	Yes	Yes
File formats	107+	PDF only	PDF, images	PDF, images
Free tier	100 credits/month	500 transactions/month	1,000 pages/month	1,000 pages/month
Setup	One API call	SDK + credentials	AWS IAM + SDK	GCP project + SDK

The key difference: you define what you want with a schema, and the API adapts to any document format. No template training, no processor configuration, no format-specific handling. The same schema works on an invoice PDF, a photo of a receipt, and a pricing page on a website.

Pricing

Plan	Credits	Cost
Free	100/month	$0
Pro	Pay as you go	$0.01/credit
Enterprise	Custom volume	Contact us

Documents cost 1 credit per page. Websites cost 5 credits per site. A 10-page contract costs 10 credits. The free tier lets you process roughly 100 pages per month — enough to build and validate your pipeline.
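
The arithmetic is simple enough to script when estimating a monthly budget. This sketch uses only the rates quoted above; the function names are hypothetical, and it does not assume anything about whether free credits carry over to the Pro plan.

```python
CREDITS_PER_PAGE = 1
CREDITS_PER_SITE = 5
FREE_CREDITS_PER_MONTH = 100
PRO_PRICE_PER_CREDIT_USD = 0.01


def credits_needed(pages: int = 0, sites: int = 0) -> int:
    """Credits consumed by a workload of document pages and websites."""
    return pages * CREDITS_PER_PAGE + sites * CREDITS_PER_SITE


def fits_free_tier(pages: int = 0, sites: int = 0) -> bool:
    """True if the workload stays within the monthly free allowance."""
    return credits_needed(pages, sites) <= FREE_CREDITS_PER_MONTH


def pro_cost_usd(pages: int = 0, sites: int = 0) -> float:
    """Pay-as-you-go cost at the Pro rate, rounded to cents."""
    return round(credits_needed(pages, sites) * PRO_PRICE_PER_CREDIT_USD, 2)


print(credits_needed(pages=10))             # the 10-page contract from the text
print(fits_free_tier(pages=60, sites=5))    # 60 + 25 = 85 credits
print(pro_cost_usd(pages=2000, sites=50))   # 2000 + 250 = 2250 credits
```

For example, 2,000 pages plus 50 websites come to 2,250 credits, or $22.50 at the Pro rate.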

Get Started

Install the SDK:

npm install @thedriveai/sdk

pip install thedriveai

Or use cURL directly:

curl -X POST https://dev.thedrive.ai/api/v1/extract \
  -H "X-API-Key: tda_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/document.pdf",
    "schema": {
      "title": { "type": "string" },
      "date": { "type": "string" },
      "amount": { "type": "number" }
    }
  }'

Get your API key at dev.thedrive.ai and start extracting structured data in minutes.

Have questions or need help with a specific extraction use case? Reach out at contact@thedrive.ai.
