How to Extract Data from Excel Spreadsheets via API

Why Extracting Data from Spreadsheets Is Harder Than It Looks

Spreadsheets look simple. Rows, columns, values. But programmatic extraction is full of edge cases that break naive parsers.

Merged cells span multiple rows or columns and disrupt positional logic. Formula-dependent values require evaluation — the cell shows $14,200 but the underlying data is =SUM(B2:B47). Multiple sheets mean the data you need might be on "Q3 Summary" while the raw data lives on "Transactions." Mixed types in a single column — dates, currencies, percentages, free text — require per-cell type inference.
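To make the mixed-type problem concrete, here is a minimal sketch of what hand-written per-cell inference tends to look like. The function name `infer_cell` and its rules are illustrative assumptions, not part of any library, and it covers only a few of the formats a real column can contain.

```python
from datetime import date, datetime
from decimal import Decimal, InvalidOperation


def infer_cell(raw: str):
    """Guess the type of one cell's displayed text (illustrative rules only)."""
    text = raw.strip()
    if not text:
        return None
    # Percentages: "12.5%" becomes Decimal("0.125").
    if text.endswith("%"):
        try:
            return Decimal(text[:-1].strip()) / 100
        except InvalidOperation:
            return text
    # US-style currency or plain numbers: "$14,200" becomes Decimal("14200").
    cleaned = text.lstrip("$").replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        pass
    # ISO dates only; real spreadsheets use many more date formats.
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return text  # fall back to free text


column = ["$14,200", "12.5%", "2025-09-30", "Pending"]
typed = [infer_cell(value) for value in column]
```

Every additional currency symbol, locale, or date format means another branch, which is where the maintenance cost comes from.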

Libraries like openpyxl (Python) and SheetJS (Node.js) give you raw cell access, but you still have to write the logic to handle all of this. For every new spreadsheet layout, you write new parsing code. Financial reports look different from inventory lists. Survey exports look different from CRM dumps. The maintenance cost compounds fast.

An extraction API takes that per-layout parsing off your hands. You describe the data you want as a schema, send the file, and get typed JSON back, whatever the spreadsheet's layout, format, or quirks.

The API Approach: One Endpoint for All Spreadsheet Formats

The Drive AI Extract API accepts six spreadsheet formats (XLSX, XLS, ODS, CSV, TSV, and Apple Numbers) and returns structured data based on the schema you define.

POST https://dev.thedrive.ai/api/v1/extract

Authentication uses an API key passed via header:

X-API-Key: tda_live_...

You define a schema with field names, types, and descriptions. The API reads the spreadsheet, understands the layout, and extracts exactly what you asked for.
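A small helper can keep schema definitions consistent before they are sent. The sketch below is a client-side convenience only: `field`, `build_extract_body`, and `KNOWN_TYPES` are hypothetical names, and the type list is an assumption based on the types that appear in this article's examples, so the API may well accept others.

```python
import json

# Type names seen in this article's examples; the API may accept more (assumption).
KNOWN_TYPES = {"string", "number", "array"}


def field(type_name: str, description: str) -> dict:
    """Build one schema entry, rejecting typos before the request is sent."""
    if type_name not in KNOWN_TYPES:
        raise ValueError(f"unexpected type: {type_name!r}")
    if not description.strip():
        raise ValueError("description must not be empty")
    return {"type": type_name, "description": description}


def build_extract_body(file_url: str, schema: dict) -> str:
    """Serialize a request body in the url-plus-schema shape this article uses."""
    return json.dumps({"url": file_url, "schema": schema})


schema = {
    "quarter": field("string", "Fiscal quarter (e.g. Q3 2025)"),
    "net_income": field("number", "Net income after expenses"),
}
body = build_extract_body("https://example.com/q3-financial-report.xlsx", schema)
```

Specific descriptions matter here because the description is the only hint about which value in the spreadsheet you mean.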

Basic Extraction: Financial Report

Suppose you receive quarterly financial reports as XLSX files from multiple departments. Each has slightly different formatting. Here is how you extract the data you need.

Python

import requests

response = requests.post(
    "https://dev.thedrive.ai/api/v1/extract",
    headers={"X-API-Key": "tda_live_..."},
    json={
        "url": "https://example.com/q3-financial-report.xlsx",
        "schema": {
            "quarter": {
                "type": "string",
                "description": "Fiscal quarter (e.g. Q3 2025)"
            },
            "total_revenue": {
                "type": "number",
                "description": "Total revenue for the quarter"
            },
            "total_expenses": {
                "type": "number",
                "description": "Total operating expenses"
            },
            "net_income": {
                "type": "number",
                "description": "Net income after expenses"
            },
            "department_breakdown": {
                "type": "array",
                "description": "Revenue and expenses per department"
            }
        }
    }
)

response.raise_for_status()  # fail early on 4xx/5xx instead of parsing an error body
data = response.json()
print(data["data"]["net_income"])

Node.js

const response = await fetch("https://dev.thedrive.ai/api/v1/extract", {
  method: "POST",
  headers: {
    "X-API-Key": "tda_live_...",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://example.com/q3-financial-report.xlsx",
    schema: {
      quarter: { type: "string", description: "Fiscal quarter" },
      total_revenue: { type: "number", description: "Total revenue" },
      total_expenses: { type: "number", description: "Total operating expenses" },
      net_income: { type: "number", description: "Net income after expenses" },
      department_breakdown: {
        type: "array",
        description: "Revenue and expenses per department",
      },
    },
  }),
});

if (!response.ok) {
  throw new Error(`Extract request failed: ${response.status}`);
}

const { data } = await response.json();
console.log(data.net_income);

The API handles merged cells, evaluates formulas, and adapts to different layouts automatically. The same schema works whether the revenue figure is in cell B12 or D47.
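Even with typed output, it can be worth checking the response against the schema before passing values downstream. The following is a minimal sketch, assuming the extracted fields arrive under the `data` key as in the examples above. The helper name `find_mismatches` is hypothetical, and how the API represents a field it could not find is not documented here, so the sketch treats both an absent key and a null value as missing.

```python
from numbers import Number

# Map the schema type names used in this article to Python checks.
# bool is excluded from "number" because bool is a subclass of int in Python.
CHECKS = {
    "string": lambda v: isinstance(v, str),
    "number": lambda v: isinstance(v, Number) and not isinstance(v, bool),
    "array": lambda v: isinstance(v, list),
}


def find_mismatches(schema: dict, data: dict) -> list[str]:
    """Return a human-readable problem for each missing or mistyped field."""
    problems = []
    for name, spec in schema.items():
        if data.get(name) is None:
            problems.append(f"{name}: missing")
            continue
        check = CHECKS.get(spec["type"])
        if check is not None and not check(data[name]):
            actual = type(data[name]).__name__
            problems.append(f"{name}: expected {spec['type']}, got {actual}")
    return problems


schema = {
    "quarter": {"type": "string", "description": "Fiscal quarter"},
    "net_income": {"type": "number", "description": "Net income after expenses"},
    "department_breakdown": {"type": "array", "description": "Per department"},
}
# Made-up response payload for illustration.
extracted = {"quarter": "Q3 2025", "net_income": "14,200"}
problems = find_mismatches(schema, extracted)
```

In a pipeline that handles reports from several departments, a non-empty `problems` list is a reasonable trigger for manual review of that file.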

Handling Multi-Sheet Workbooks

Many business spreadsheets use multiple sheets — one for raw data, one for summaries, one for charts. The Extract API reads across all sheets by default, so your schema can reference data from any sheet without specifying where it lives.

For an inventory workbook with sheets named "Products," "Suppliers," and "Reorder Alerts":

{
  "url": "https://example.com/inventory-q3.xlsx",
  "schema": {
    "low_stock_items": {
      "type": "array",
      "description": "Products below reorder threshold with SKU, current stock, and reorder quantity"
    },
    "top_suppliers": {
      "type": "array",
      "description": "Top 5 suppliers by total order volume"
    },
    "total_inventory_value": {
      "type": "number",
      "description": "Total value of current inventory in USD"
    }
  }
}

The API finds the relevant data across sheets and returns a unified JSON response.

Extracting from CSV and TSV Files

CSV and TSV files are simpler in structure but often messier in practice — inconsistent quoting, mixed delimiters, encoding issues, header rows that aren't on line 1. The same Extract API endpoint handles these formats without any configuration changes.

This works well for survey data exports, log files, and database dumps where you need specific fields extracted and typed correctly.

import requests

response = requests.post(
    "https://dev.thedrive.ai/api/v1/extract",
    headers={"X-API-Key": "tda_live_..."},
    json={
        "url": "https://example.com/survey-results.csv",
        "schema": {
            "total_responses": {
                "type": "number",
                "description": "Total number of survey responses"
            },
            "average_satisfaction": {
                "type": "number",
                "description": "Average satisfaction score (1-10)"
            },
            "top_complaints": {
                "type": "array",
                "description": "Most common complaints with frequency count"
            }
        }
    }
)
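For comparison, here is roughly what handling just one of those quirks by hand involves: a header row that is not on line 1. The sketch uses only the standard library; `read_table` is a hypothetical helper, and its heuristic (the most common row width is the real table) is an assumption that will fail on files with ragged rows.

```python
import csv
import io
from collections import Counter


def read_table(text: str, delimiter: str = ",") -> list[dict]:
    """Skip preamble lines and return the data rows keyed by header name."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
    if not rows:
        return []
    # Assume the real table is whatever row width occurs most often.
    width = Counter(len(r) for r in rows).most_common(1)[0][0]
    table = [r for r in rows if len(r) == width]
    header, body = table[0], table[1:]
    return [dict(zip(header, r)) for r in body]


# Made-up export with two preamble lines and a blank line before the header.
sample = (
    "Survey export\n"
    "Generated: 2025-10-01\n"
    "\n"
    "respondent,score,comment\n"
    '1,8,"Fast, friendly support"\n'
    "2,4,Slow checkout\n"
)
records = read_table(sample)
```

Note that every value comes back as a string, so type conversion is still a separate step, and mixed delimiters or encoding problems would each need their own handling.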

Using the Analyze API for Spreadsheet Calculations

When you need computed answers rather than raw extraction — trends, comparisons, statistical analysis — the Analyze API runs multi-step reasoning with Python execution on your spreadsheet data.

POST https://dev.thedrive.ai/api/v1/analyze

Ask a natural-language question and get a computed answer:

{
  "url": "https://example.com/sales-data-2025.xlsx",
  "query": "What is the month-over-month growth rate for each product category, and which category has the most consistent growth?"
}

The API loads the spreadsheet, writes and executes Python code to compute the answer, and returns both the result and the code it ran. This is useful for ad-hoc analysis on spreadsheets where you do not know the exact fields in advance.
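The same request can be built in Python. This is a sketch using the standard library's `urllib.request`. It assumes the Analyze endpoint takes the same `X-API-Key` header as Extract, and it does not parse the response, because the response field names are not specified here. `build_analyze_request` is a hypothetical helper name.

```python
import json
import urllib.request

ANALYZE_URL = "https://dev.thedrive.ai/api/v1/analyze"


def build_analyze_request(file_url: str, query: str, api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST request with a url-plus-query JSON body."""
    body = json.dumps({"url": file_url, "query": query}).encode("utf-8")
    return urllib.request.Request(
        ANALYZE_URL,
        data=body,
        method="POST",
        headers={
            # Assumption: same authentication header as the Extract endpoint.
            "X-API-Key": api_key,
            "Content-Type": "application/json",
        },
    )


request = build_analyze_request(
    "https://example.com/sales-data-2025.xlsx",
    "Which product category has the most consistent month-over-month growth?",
    "tda_live_...",
)
# To send it: urllib.request.urlopen(request, timeout=60), then json.loads() the body.
# The 60-second timeout is an arbitrary choice for a multi-step analysis call.
```

Building the request separately from sending it makes the payload easy to log or unit test without spending credits.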

Converting Spreadsheets to Markdown

If you need spreadsheet content as context for an LLM pipeline, the Markdown API converts any supported format to clean markdown tables.

GET https://dev.thedrive.ai/md/{url}

import urllib.parse

import requests

file_url = urllib.parse.quote("https://example.com/report.xlsx", safe="")
response = requests.get(
    f"https://dev.thedrive.ai/md/{file_url}",
    headers={"X-API-Key": "tda_live_..."}
)

markdown_content = response.text

This returns each sheet as a markdown table with headers preserved. Useful for feeding spreadsheet data into RAG systems, summarization pipelines, or chat interfaces.
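If a downstream step needs rows rather than raw text, the tables can be split back out. The sketch below assumes the output uses standard pipe-delimited markdown tables with a dash separator row. The sample string is made up, `parse_markdown_tables` is a hypothetical helper, and escaped pipes inside cells are not handled.

```python
def parse_markdown_tables(markdown: str) -> list[list[dict]]:
    """Split pipe-delimited markdown tables into lists of row dicts."""
    tables, current = [], []
    # The trailing empty line flushes a table that ends the document.
    for line in markdown.splitlines() + [""]:
        stripped = line.strip()
        if stripped.startswith("|"):
            cells = [c.strip() for c in stripped.strip("|").split("|")]
            current.append(cells)
            continue
        if current:
            header, rows = current[0], current[1:]
            # Drop the separator row such as |---|---|.
            rows = [r for r in rows if not all(c and set(c) <= set("-:") for c in r)]
            tables.append([dict(zip(header, r)) for r in rows])
            current = []
    return tables


# Made-up markdown resembling a two-sheet workbook.
sample = (
    "| SKU | Stock |\n"
    "|---|---|\n"
    "| A-100 | 12 |\n"
    "| B-200 | 0 |\n"
    "\n"
    "| Name |\n"
    "|---|\n"
    "| Acme |\n"
)
tables = parse_markdown_tables(sample)
```

Each inner list corresponds to one table, which makes it straightforward to chunk a workbook per sheet before indexing it.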

Supported Formats

The Extract API supports these spreadsheet formats with no configuration changes:

XLSX — Microsoft Excel (2007+)
XLS — Microsoft Excel (legacy)
ODS — OpenDocument Spreadsheet (LibreOffice, Google Sheets export)
CSV — Comma-separated values
TSV — Tab-separated values
NUMBERS — Apple Numbers

All formats go through the same endpoint. No format-specific flags, no separate parsers.

Getting Started

1. Get a free API key at dev.thedrive.ai — 100 credits per month, no credit card required.
2. Install the SDK: pip install thedrive-ai or npm install thedrive-ai.
3. Send your first extraction request using the examples above.

Each page of a spreadsheet costs 1 credit. Pro plans run at $0.01 per credit for higher volumes.
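As a quick budgeting aid, the arithmetic can be written down directly. This sketch uses only the two figures quoted above (1 credit per page, $0.01 per credit on Pro). It does not assume that free-tier credits offset Pro usage, since that is not stated, and `pro_cost` is a hypothetical helper name.

```python
from decimal import Decimal

PRICE_PER_CREDIT = Decimal("0.01")  # Pro rate quoted above
CREDITS_PER_PAGE = 1                # per-page cost quoted above


def pro_cost(pages: int) -> Decimal:
    """Cost in USD of processing the given number of pages at the Pro rate."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * CREDITS_PER_PAGE * PRICE_PER_CREDIT


monthly_cost = pro_cost(2500)  # 2,500 pages at $0.01 each
```

Using `Decimal` rather than floats avoids rounding surprises when the totals feed into invoices or reports.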

Have questions? Reach out at contact@thedrive.ai.