Extract Structured Data from Any Document with One API Call — PDF, DOCX, Websites, and More
The Real Cost of Unstructured Documents
Every business runs on documents — invoices, contracts, receipts, medical forms, property listings. The data inside them is structured, but the format is not. Extracting that data means regex, templates, manual entry, or expensive platforms that charge per page and still break when the layout changes.
Traditional extraction tools rely on template matching. They work until a vendor sends an invoice in a slightly different format. Then they fail silently, returning wrong data with high confidence — the worst kind of bug.
LLM-grounded extraction addresses this. Instead of matching templates, it reads the document the way a human would, uses context, and adapts to layout variations. Studies show template-based extractors drop 12-28 F1 points on new document formats, while LLM-based tools drop only 2-6 points.
We built an extraction API around this approach.
How It Works
The Drive AI Extract API takes a document and a schema, and returns typed JSON with confidence scores and source citations.
POST https://dev.thedrive.ai/api/v1/extract
Define What You Want
Send a schema describing the fields you need. Each field has a type, a description, and an optional required flag. Enum fields also list their allowed values under options:
{
  "url": "https://example.com/invoice-2024-003.pdf",
  "schema": {
    "vendor_name": {
      "type": "string",
      "description": "Company or person who issued the invoice",
      "required": true
    },
    "invoice_date": {
      "type": "string",
      "description": "Date the invoice was issued (ISO 8601)"
    },
    "total_amount": {
      "type": "number",
      "description": "Total amount due including tax"
    },
    "line_items": {
      "type": "array",
      "description": "List of items with description, quantity, and unit price"
    },
    "payment_terms": {
      "type": "enum",
      "description": "Payment terms",
      "options": ["net_15", "net_30", "net_60", "due_on_receipt"]
    }
  }
}
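Before sending a schema, it can help to catch obvious mistakes locally. The following is a minimal sketch, not part of the SDK: `check_schema` and `KNOWN_TYPES` are hypothetical names, and the list of types is only what appears in this article's examples, so the API may accept more.

```python
# Field types seen in this article's examples; the API may accept others (assumption).
KNOWN_TYPES = {"string", "number", "boolean", "array", "enum"}


def check_schema(schema: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    for name, spec in schema.items():
        field_type = spec.get("type")
        if field_type not in KNOWN_TYPES:
            problems.append(f"{name}: unknown type {field_type!r}")
        if field_type == "enum" and not spec.get("options"):
            problems.append(f"{name}: enum fields need a non-empty 'options' list")
        if "required" in spec and not isinstance(spec["required"], bool):
            problems.append(f"{name}: 'required' should be true or false")
    return problems


schema = {
    "vendor_name": {"type": "string", "required": True},
    "payment_terms": {"type": "enum", "options": ["net_15", "net_30"]},
    "status": {"type": "enum"},  # missing options, so it is reported
}
print(check_schema(schema))
```

A check like this runs in your own code, so a malformed schema fails fast instead of costing a request.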
Get Structured Data Back
The response includes the extracted data, a confidence score for each field, and citations pointing to the exact text the data was pulled from:
{
  "data": {
    "vendor_name": "Acme Corp",
    "invoice_date": "2024-11-15",
    "total_amount": 4750.00,
    "line_items": [
      { "description": "Consulting — November", "quantity": 40, "unit_price": 100 },
      { "description": "Travel expenses", "quantity": 1, "unit_price": 750 }
    ],
    "payment_terms": "net_30"
  },
  "confidence": {
    "vendor_name": "high",
    "invoice_date": "high",
    "total_amount": "high",
    "line_items": "high",
    "payment_terms": "medium"
  },
  "citations": {
    "vendor_name": "Acme Corp, Inc. — 123 Business Ave, Suite 400",
    "invoice_date": "Invoice Date: November 15, 2024",
    "total_amount": "Total Due: $4,750.00",
    "payment_terms": "Terms: Net 30 days from invoice date"
  }
}
Every field is traceable. If the confidence is medium or low, you can inspect the citation to understand why and decide how to handle it in your pipeline.
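One way to act on this is to collect every field that is not high confidence together with its citation. This is a sketch that works on the response shape shown above as a plain dictionary; `fields_needing_review` is a hypothetical helper, not an SDK function.

```python
def fields_needing_review(response: dict, accept=("high",)) -> dict:
    """Map each field whose confidence is not accepted to its value and citation."""
    data = response.get("data", {})
    citations = response.get("citations", {})
    flagged = {}
    for field, level in response.get("confidence", {}).items():
        if level not in accept:
            flagged[field] = {
                "confidence": level,
                "value": data.get(field),
                "citation": citations.get(field),  # None when no citation came back
            }
    return flagged


# Trimmed copy of the sample response above.
response = {
    "data": {"vendor_name": "Acme Corp", "payment_terms": "net_30"},
    "confidence": {"vendor_name": "high", "payment_terms": "medium"},
    "citations": {
        "vendor_name": "Acme Corp, Inc.",
        "payment_terms": "Terms: Net 30 days from invoice date",
    },
}
print(fields_needing_review(response))
```

Here only payment_terms is returned, with the cited text next to the extracted value so a reviewer can compare the two.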
Authentication
Authenticate each request by passing your API key in the X-API-Key header:
curl -X POST https://dev.thedrive.ai/api/v1/extract \
  -H "X-API-Key: tda_live_..." \
  -H "Content-Type: application/json" \
  -d '{ "url": "...", "schema": { ... } }'
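The same request can be assembled without the SDK. The sketch below uses only Python's standard library and builds the request without sending it. `build_extract_request` is a hypothetical helper, and the key shown is a placeholder.

```python
import json
import urllib.request

EXTRACT_URL = "https://dev.thedrive.ai/api/v1/extract"


def build_extract_request(api_key: str, document_url: str, schema: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that carries the API key header."""
    body = json.dumps({"url": document_url, "schema": schema}).encode("utf-8")
    return urllib.request.Request(
        EXTRACT_URL,
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_extract_request(
    api_key="tda_live_example",  # placeholder, not a real key
    document_url="https://example.com/document.pdf",
    schema={"title": {"type": "string"}},
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request), then json.load() the response body.
```

In practice the key would come from an environment variable or a secrets manager rather than from source code.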
What You Can Extract From
The API handles documents, images, and websites through the same endpoint.
| Source Type | Examples | Credits |
|---|---|---|
| Documents | PDF, DOCX, DOC, ODT, RTF, EPUB, TXT | 1 credit/page |
| Spreadsheets | XLSX, XLS, ODS, CSV, TSV | 1 credit/page |
| Presentations | PPTX, PPT, ODP, KEY | 1 credit/page |
| Images | JPG, PNG, TIFF, HEIC (via OCR) | 1 credit/page |
| Websites | Any public URL | 5 credits/site |
Scanned documents and images go through OCR with vision model proofreading — so even a photo of a handwritten form or a stamped receipt gets extracted accurately.
Real-World Extraction Patterns
Invoice Processing
Automate accounts payable by extracting vendor, amounts, dates, and line items from invoices in any format:
from thedriveai import TheDriveAI

client = TheDriveAI(api_key="tda_live_...")

result = client.extract(
    url="https://storage.example.com/invoices/INV-2024-003.pdf",
    schema={
        "vendor": {"type": "string", "required": True},
        "invoice_number": {"type": "string", "required": True},
        "date": {"type": "string", "description": "ISO 8601 date"},
        "total": {"type": "number"},
        "line_items": {"type": "array", "description": "description, qty, unit_price, amount"},
        "tax_amount": {"type": "number"},
    },
)

# Route based on confidence.
# auto_process and flag_for_review are placeholders for your own handlers.
if all(c == "high" for c in result.confidence.values()):
    auto_process(result.data)
else:
    flag_for_review(result.data, result.confidence)
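Confidence is one signal. An arithmetic cross-check is another. The sketch below assumes the line items come back with the qty and unit_price keys requested in the schema description, and that total equals the line item sum plus tax_amount. Both are assumptions about the document, and `totals_are_consistent` is a hypothetical helper.

```python
def totals_are_consistent(data: dict, tolerance: float = 0.01) -> bool:
    """Check that the line item sum plus tax matches the extracted total."""
    subtotal = sum(item["qty"] * item["unit_price"] for item in data.get("line_items", []))
    expected = subtotal + (data.get("tax_amount") or 0)  # treat missing tax as zero
    return abs(expected - data["total"]) <= tolerance


# Illustrative data, not real API output.
invoice = {
    "total": 4750.00,
    "tax_amount": 0,
    "line_items": [
        {"description": "Consulting", "qty": 40, "unit_price": 100},
        {"description": "Travel expenses", "qty": 1, "unit_price": 750},
    ],
}
print(totals_are_consistent(invoice))  # True: 40*100 + 1*750 + 0 == 4750
```

An invoice that fails this check is a good candidate for manual review even when every field reports high confidence.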
Contract Review
Extract key terms from legal documents — parties, dates, obligations, and risk clauses:
import { TheDriveAI } from '@thedriveai/sdk';

const client = new TheDriveAI({ apiKey: 'tda_live_...' });

const result = await client.extract({
  url: 'https://storage.example.com/contracts/vendor-agreement.pdf',
  schema: {
    parties: { type: 'array', description: 'Names of all parties to the agreement' },
    effective_date: { type: 'string', description: 'When the contract takes effect' },
    termination_date: { type: 'string', description: 'When the contract expires' },
    liability_cap: { type: 'number', description: 'Maximum liability amount in USD' },
    governing_law: { type: 'string', description: 'Jurisdiction governing the contract' },
    auto_renewal: { type: 'boolean', description: 'Whether the contract auto-renews' },
    notice_period_days: { type: 'number', description: 'Days of notice required for termination' },
  },
});
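Extracted contract terms become more useful once they feed a calculation. The Python sketch below derives the last day on which termination notice can be given. It assumes termination_date comes back as an ISO 8601 string, which the schema above does not request explicitly, so in practice you would say so in the field description. `last_day_to_give_notice` is a hypothetical helper, and how notice periods are counted depends on the contract itself.

```python
from datetime import date, timedelta


def last_day_to_give_notice(termination_date: str, notice_period_days: int) -> date:
    """Subtract the notice period from the termination date (ISO 8601 string assumed)."""
    return date.fromisoformat(termination_date) - timedelta(days=notice_period_days)


# Illustrative values, not real API output.
contract = {"termination_date": "2025-12-31", "notice_period_days": 60, "auto_renewal": True}
deadline = last_day_to_give_notice(contract["termination_date"], contract["notice_period_days"])
print(deadline.isoformat())  # 2025-11-01
```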
Lead Enrichment from Websites
Extract company details from any website — no scraping logic needed:
# Reuses the client created in the invoice example.
result = client.extract(
    url="https://stripe.com",
    schema={
        "company_name": {"type": "string"},
        "tagline": {"type": "string", "description": "Main value proposition or tagline"},
        "industry": {"type": "string"},
        "products": {"type": "array", "description": "Main products or services"},
        "social_links": {"type": "array", "description": "Twitter, LinkedIn, GitHub URLs"},
        "pricing_model": {"type": "enum", "options": ["free", "freemium", "paid", "enterprise", "usage_based"]},
    },
)
The API renders JavaScript, parses the DOM, follows links — it reads websites the way a person would, not the way a scraper does.
Receipt and Expense Processing
Process expense receipts from photos or scans:
# Reuses the client created in the invoice example.
result = client.extract(
    url="https://storage.example.com/receipts/IMG_4521.jpg",
    schema={
        "merchant": {"type": "string", "required": True},
        "date": {"type": "string"},
        "total": {"type": "number", "required": True},
        "tax": {"type": "number"},
        "payment_method": {"type": "enum", "options": ["cash", "credit", "debit", "other"]},
        "category": {"type": "enum", "options": ["meals", "travel", "supplies", "software", "other"]},
    },
)
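Because category is an enum, the extracted receipts can be grouped directly for an expense report. A minimal sketch, assuming each receipt is the data dictionary from one extraction. `totals_by_category` is a hypothetical helper and the receipts are illustrative.

```python
from collections import defaultdict


def totals_by_category(receipts: list[dict]) -> dict:
    """Sum receipt totals per category, filing uncategorized receipts under 'other'."""
    sums = defaultdict(float)
    for receipt in receipts:
        sums[receipt.get("category") or "other"] += receipt["total"]
    return {category: round(amount, 2) for category, amount in sums.items()}


receipts = [
    {"merchant": "Cafe Uno", "total": 18.50, "category": "meals"},
    {"merchant": "City Cab", "total": 32.00, "category": "travel"},
    {"merchant": "Deli 24", "total": 11.25, "category": "meals"},
]
print(totals_by_category(receipts))  # {'meals': 29.75, 'travel': 32.0}
```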
How It Compares to Alternatives
| Feature | Drive AI Extract | Adobe PDF Extract | AWS Textract | Google Document AI |
|---|---|---|---|---|
| Schema-based extraction | Yes | No (fixed output) | Limited | Yes (custom processors) |
| Confidence scores | Per-field | Per-element | Per-word | Per-entity |
| Source citations | Yes | No | No | No |
| Website extraction | Yes | No | No | No |
| OCR with AI proofreading | Yes | Yes | Yes | Yes |
| File formats | 107+ | PDF only | PDF, images | PDF, images |
| Free tier | 100 credits/month | 500 transactions/month | 1,000 pages/month | 1,000 pages/month |
| Setup | One API call | SDK + credentials | AWS IAM + SDK | GCP project + SDK |
The key difference: you define what you want with a schema, and the API adapts to any document format. No template training, no processor configuration, no format-specific handling. The same schema works on an invoice PDF, a photo of a receipt, and a pricing page on a website.
Pricing
| Plan | Credits | Cost |
|---|---|---|
| Free | 100/month | $0 |
| Pro | Pay as you go | $0.01/credit |
| Enterprise | Custom volume | Contact us |
Documents cost 1 credit per page. Websites cost 5 credits per site. A 10-page contract costs 10 credits. The free tier lets you process roughly 100 pages per month — enough to build and validate your pipeline.
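The arithmetic is simple enough to put in a helper when sizing a workload. This sketch uses the rates quoted above. The function names are hypothetical, and it does not assume that the free credits carry over to the Pro plan, since that is not stated here.

```python
CREDITS_PER_PAGE = 1
CREDITS_PER_SITE = 5
FREE_CREDITS_PER_MONTH = 100
USD_PER_CREDIT = 0.01  # Pro plan rate


def estimate_credits(pages: int = 0, sites: int = 0) -> int:
    """Credits consumed by a number of document pages and websites."""
    return pages * CREDITS_PER_PAGE + sites * CREDITS_PER_SITE


def estimate_pro_cost_usd(credits: int) -> float:
    """Pay-as-you-go cost for a number of credits, ignoring any free allowance."""
    return round(credits * USD_PER_CREDIT, 2)


monthly = estimate_credits(pages=2_000, sites=50)
print(monthly)                            # 2250
print(estimate_pro_cost_usd(monthly))     # 22.5
print(monthly <= FREE_CREDITS_PER_MONTH)  # False: this volume exceeds the free tier
```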
Get Started
Install the SDK:
npm install @thedriveai/sdk
pip install thedriveai
Or use cURL directly:
curl -X POST https://dev.thedrive.ai/api/v1/extract \
  -H "X-API-Key: tda_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/document.pdf",
    "schema": {
      "title": { "type": "string" },
      "date": { "type": "string" },
      "amount": { "type": "number" }
    }
  }'
Get your API key at dev.thedrive.ai and start extracting structured data in minutes.
Have questions or need help with a specific extraction use case? Reach out at contact@thedrive.ai.
