Introduction

Overview

The Nanonets Document Extraction API uses advanced AI models to extract structured content from documents. Convert PDFs, images, Word documents, Excel spreadsheets, and more into clean Markdown, HTML, JSON, or CSV formats.

Key Features

Multiple Output Formats

Extract content as Markdown, HTML, JSON, or CSV. Request multiple formats in a single API call.

Real-Time Streaming

Stream extraction results via Server-Sent Events (SSE) for real-time UI updates as content is generated.
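
As a rough illustration of what consuming such a stream involves, the sketch below parses standard SSE framing (events separated by blank lines, payload carried in `data:` fields) from an iterable of text lines. It is not tied to this API's payloads: the sample strings are invented, the helper name `iter_sse_data` is hypothetical, and the shape of real events from the streaming endpoint is not described in this section.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each event in an SSE text stream.

    Follows the basic SSE framing rules: "data:" lines accumulate,
    a blank line dispatches the event, and lines starting with ":"
    are comments. Other fields (event, id, retry) are ignored here.
    """
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Blank line: end of one event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            continue  # comment / keep-alive
        field, _, value = line.partition(":")
        if field == "data":
            # SSE strips a single leading space from the field value.
            buffer.append(value[1:] if value.startswith(" ") else value)


# Hypothetical payloads, for illustration only.
sample = ["data: # Invoice\n", "\n", ": ping\n", "data: line 1\n", "data: line 2\n", "\n"]
chunks = list(iter_sse_data(sample))
```

In a UI, each yielded chunk would typically be appended to the view as it arrives rather than collected into a list.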

Batch Processing

Process up to 50 documents in a single batch request with shared extraction options.

Custom Instructions

Guide the extraction with custom instructions for formatting, field focus, and output structure.

Quick Start

Authentication

All API requests require authentication via Bearer token. Include your API key in the Authorization header:

Authorization: Bearer YOUR_API_KEY

You can get your API key from the top-right menu on docstrange.nanonets.com.

Basic Extraction

Extract content from a document using the synchronous endpoint:

curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
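
For callers who prefer Python over curl, the same request can be approximated with the standard library alone. The sketch below only builds the multipart/form-data request and does not send it. The helper name `build_extract_request` and the `application/octet-stream` part type are choices made for this example, and the structure of the response body is not described in this section.

```python
import io
import urllib.request
import uuid

API_URL = "https://extraction-api.nanonets.com/api/v1/extract/sync"


def build_extract_request(api_key, filename, file_bytes, output_format="markdown"):
    """Build (but do not send) a multipart/form-data POST for the sync endpoint."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()

    def write_part(headers, payload):
        # Each part: boundary line, part headers, blank line, payload.
        body.write(f"--{boundary}\r\n".encode())
        body.write(headers.encode())
        body.write(b"\r\n\r\n")
        body.write(payload)
        body.write(b"\r\n")

    write_part(
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream",
        file_bytes,
    )
    write_part(
        'Content-Disposition: form-data; name="output_format"',
        output_format.encode(),
    )
    body.write(f"--{boundary}--\r\n".encode())  # closing boundary

    return urllib.request.Request(
        API_URL,
        data=body.getvalue(),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Placeholder bytes stand in for a real PDF read from disk.
request = build_extract_request("YOUR_API_KEY", "document.pdf", b"%PDF-1.4 ...")
# To send it: urllib.request.urlopen(request).read()
```

Keeping request construction separate from sending makes it easy to inspect the headers and body before any network call is made.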

API Endpoints

Document Extraction

Endpoint	Method	Description
`/api/v1/extract/sync`	POST	Synchronous extraction - returns results immediately
`/api/v1/extract/async`	POST	Asynchronous extraction - returns job ID for polling
`/api/v1/extract/stream`	POST	Streaming extraction - real-time results via SSE
`/api/v1/extract/batch`	POST	Batch processing - process multiple files at once

Results

Endpoint	Method	Description
`/api/v1/extract/results/{record_id}`	GET	Get the extraction result for a job ID (passed as `record_id`)
`/api/v1/extract/results`	GET	List all extraction results (paginated)
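
The asynchronous flow implied by these tables (submit a job, then poll the results endpoint with its ID) might look like the sketch below. The `status` field and its values `processing`, `completed`, and `failed` are assumptions made for illustration, not documented behaviour, and `fetch` stands in for whatever authenticated GET helper you use.

```python
import time
from typing import Callable

BASE_URL = "https://extraction-api.nanonets.com"


def result_url(record_id: str) -> str:
    """URL of the results endpoint for one job."""
    return f"{BASE_URL}/api/v1/extract/results/{record_id}"


def poll_until_done(
    fetch: Callable[[str], dict],
    record_id: str,
    interval_s: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch(url) repeatedly until the job reaches a terminal state.

    ASSUMPTION: the result JSON carries a "status" field whose terminal
    values are "completed" and "failed". Check the API reference for the
    real field names before relying on this.
    """
    url = result_url(record_id)
    for _ in range(max_attempts):
        result = fetch(url)
        if result.get("status") in ("completed", "failed"):
            return result
        sleep(interval_s)
    raise TimeoutError(f"job {record_id} still pending after {max_attempts} polls")


# Stand-in for an authenticated GET returning parsed JSON (hypothetical data).
responses = iter([{"status": "processing"}, {"status": "completed"}])
final = poll_until_done(lambda url: next(responses), "abc123", sleep=lambda s: None)
```

Passing `fetch` and `sleep` in as parameters keeps the loop testable without network access or real delays.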

Output Formats

Markdown

Clean, formatted Markdown text preserving document structure including headings, tables, lists, and emphasis.

Markdown Options (markdown_options):

financial-docs: Optimized extraction for financial documents with enhanced table and number formatting

HTML

Full HTML representation of the document with semantic tags and structure.

JSON

Structured JSON extraction with options for:

Flat key-value: Simple field extraction (default)
Specified fields: Extract specific fields you define
Custom schema: Define exact JSON output structure
Hierarchy output: Tree-structured nested data from document (json_options=hierarchy_output)

CSV

Tabular data extraction for documents containing tables.

CSV Options (csv_options):

table: Extract structured table data from the document

Input Methods

Provide your document using one of these methods:

Parameter	Description
`file`	Direct file upload (multipart/form-data)
`file_url`	URL to download the file from
`file_base64`	Base64-encoded file content

Supported file types: PDF, Word (.docx), Excel (.xlsx, .xls), PowerPoint (.pptx), Images (PNG, JPG, TIFF, WebP)
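
When the document is already in memory, the `file_base64` option avoids writing a temporary file. A minimal sketch of preparing that field follows. It assumes the API accepts standard padded Base64 without a data-URI prefix, which this section does not confirm, and the helper name `to_file_base64` is hypothetical.

```python
import base64


def to_file_base64(file_bytes: bytes) -> str:
    """Encode raw file bytes as an ASCII Base64 string."""
    return base64.b64encode(file_bytes).decode("ascii")


# Form fields for a request that sends the document inline.
# b"hello" stands in for real file content.
fields = {
    "file_base64": to_file_base64(b"hello"),
    "output_format": "markdown",
}
```

Base64 inflates the payload by roughly a third, so direct upload or `file_url` is usually the better fit for large files.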

Language Support

The API supports multilingual document extraction across 29+ languages. Language support is tiered based on extraction quality.

Supported Scripts

Script	Languages
Latin	English, French, Spanish, Portuguese, German, Italian, Dutch, Polish, Czech, Romanian
Cyrillic	Russian, Ukrainian
Chinese Characters	Simplified Chinese, Traditional Chinese
Japanese (Kanji, Hiragana, Katakana)	Japanese
Korean Hangul	Korean
Arabic Script	Arabic, Persian, Urdu
Devanagari	Hindi, Sanskrit
Bengali	Bengali

Language Tiers

Tier	Languages	Performance
Tier 1	Chinese (Simplified & Traditional), English, Japanese, Korean	Exceptional
Tier 2	Spanish, French, German, Italian, Portuguese, Russian, Arabic, Hindi, Thai, Vietnamese	Strong
Tier 3	Indonesian, Malay, Turkish, Polish, Dutch, Czech, Romanian, Ukrainian, Greek, Hebrew, Swahili	Good

The model automatically detects the language of your document and processes it accordingly; no configuration is required.

Optional Features

Custom Instructions

Guide the extraction with custom instructions:

curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@invoice.pdf" \
  -F "output_format=json" \
  -F "custom_instructions=Format all dates as YYYY-MM-DD. Extract amounts without currency symbols."

Metadata Extraction

Include additional metadata like bounding boxes or confidence scores:

curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf" \
  -F "output_format=markdown" \
  -F "include_metadata=bounding_boxes,confidence_score"

Rate Limits & Quotas

Sync processing: Use for documents under 5 pages
Async processing: Recommended for larger documents (>5 pages)
Batch processing: Maximum 50 files per request
Rate limits by plan:
- Free: 20 pages/min
- Pay as you go: 300 pages/min
- Enterprise: 100 pages/sec
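
These limits translate into two small client-side decisions: how to split a large file list into batch requests, and which endpoint to use for a single document. The sketch below shows one way to encode them. The helper names are hypothetical, and sending exactly 5 pages to the async endpoint is this example's own choice, since the guidance above only covers "under 5" and ">5" pages.

```python
from typing import Iterator, List, Sequence

MAX_BATCH_FILES = 50  # batch limit stated above
SYNC_PAGE_LIMIT = 5   # sync is suggested for documents under 5 pages


def chunk_files(paths: Sequence[str], size: int = MAX_BATCH_FILES) -> Iterator[List[str]]:
    """Split a list of file paths into batches of at most `size` files."""
    for start in range(0, len(paths), size):
        yield list(paths[start:start + size])


def pick_endpoint(page_count: int) -> str:
    """Choose sync for short documents, async otherwise.

    Exactly 5 pages is routed to async here; that boundary is an
    assumption of this sketch, not documented behaviour.
    """
    if page_count < SYNC_PAGE_LIMIT:
        return "/api/v1/extract/sync"
    return "/api/v1/extract/async"


batches = list(chunk_files([f"doc_{i}.pdf" for i in range(120)]))
batch_sizes = [len(b) for b in batches]
```

With 120 files, this produces three batch requests of 50, 50, and 20 files.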

Infrastructure Requirements (On-Premise)

For on-premise deployments:

Resource	Recommendation
CPU	32 cores
RAM	128 GB
GPU	NVIDIA A100
