
Overview

The Nanonets Document Extraction API uses advanced AI models to extract structured content from documents. Convert PDFs, images, Word documents, Excel spreadsheets, and more into clean Markdown, HTML, JSON, or CSV formats.

Key Features

Multiple Output Formats

Extract content as Markdown, HTML, JSON, or CSV. Request multiple formats in a single API call.

Real-Time Streaming

Stream extraction results over Server-Sent Events (SSE) for real-time UI updates as content is generated.

Batch Processing

Process up to 50 documents in a single batch request with shared extraction options.

Custom Instructions

Guide the extraction with custom instructions for formatting, field focus, and output structure.

Quick Start

Authentication

All API requests require authentication via Bearer token. Include your API key in the Authorization header:
Authorization: Bearer YOUR_API_KEY
You can get your API key from the top-right menu on docstrange.nanonets.com.

Basic Extraction

Extract content from a document using the synchronous endpoint:
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
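For callers working in Python rather than curl, the same request can be sketched with the standard library alone. This is a minimal sketch, not an official client: the helper name `build_extract_request` is hypothetical, it only builds the request without sending it, and the shape of the response body is not described on this page.

```python
import uuid
import urllib.request

API_URL = "https://extraction-api.nanonets.com/api/v1/extract/sync"


def build_extract_request(api_key, filename, file_bytes, output_format="markdown"):
    """Build (but do not send) a multipart/form-data POST for the sync endpoint.

    Hypothetical helper: mirrors the curl example's `file` and
    `output_format` form fields and the Bearer Authorization header.
    """
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    parts = [
        # Plain form field: output_format
        delimiter,
        b'Content-Disposition: form-data; name="output_format"',
        b"",
        output_format.encode("utf-8"),
        # File field: file (filename is assumed to contain no quote characters)
        delimiter,
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode("utf-8"),
        b"Content-Type: application/octet-stream",
        b"",
        file_bytes,
        # Closing delimiter
        delimiter + b"--",
        b"",
    ]
    body = b"\r\n".join(parts)
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

To send it, pass the returned object to `urllib.request.urlopen` and read the response body; a production client would more likely use an HTTP library that handles multipart encoding itself.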

API Endpoints

Document Extraction

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/v1/extract/sync | POST | Synchronous extraction: returns results immediately |
| /api/v1/extract/async | POST | Asynchronous extraction: returns a job ID for polling |
| /api/v1/extract/stream | POST | Streaming extraction: real-time results via SSE |
| /api/v1/extract/batch | POST | Batch processing: process multiple files at once |
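The streaming endpoint delivers results as Server-Sent Events. The sketch below covers only the generic SSE framing (events separated by blank lines, payload carried in `data:` lines); the function name `iter_sse_data` is hypothetical, and the content of each event for this API is not documented on this page, so the payload is returned as an uninterpreted string.

```python
def iter_sse_data(lines):
    """Yield the data payload of each complete SSE event.

    `lines` is any iterable of text lines, for example a decoded HTTP
    response body read line by line. Only `data:` fields are handled;
    other fields (event, id, retry) are ignored in this simplified sketch.
    """
    data_lines = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield "\n".join(data_lines)
                data_lines = []
        elif line.startswith(":"):
            # Comment line, often used as a keep-alive.
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # One optional leading space after the colon is not part of the value.
            data_lines.append(value[1:] if value.startswith(" ") else value)
    # An event that was never terminated by a blank line is dropped,
    # which matches how SSE treats an incomplete trailing event.
```

Multiple `data:` lines within one event are joined with newlines, so a payload split across lines arrives as a single string.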

Results

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/v1/extract/results/{record_id} | GET | Get extraction result by job ID |
| /api/v1/extract/results | GET | List all extraction results (paginated) |
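An asynchronous job is typically followed by polling the results endpoint until the job finishes. The following is a sketch under stated assumptions: `result_url` and `poll_until` are hypothetical helpers, and because the response schema is not shown on this page, the caller supplies both the fetch function and the "is it done?" check rather than the code guessing at field names.

```python
import time
from urllib.parse import quote

BASE_URL = "https://extraction-api.nanonets.com"


def result_url(record_id):
    """URL of the per-job results endpoint for a given record/job ID."""
    return f"{BASE_URL}/api/v1/extract/results/{quote(str(record_id), safe='')}"


def poll_until(fetch, is_done, interval_s=2.0, timeout_s=120.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call fetch() repeatedly until is_done(result) is true.

    fetch    -- zero-argument callable that performs the GET and returns the parsed result
    is_done  -- callable deciding whether a result is final (schema-specific, caller-defined)
    Raises TimeoutError if the job is still unfinished once timeout_s has elapsed.
    """
    deadline = clock() + timeout_s
    while True:
        result = fetch()
        if is_done(result):
            return result
        if clock() >= deadline:
            raise TimeoutError("extraction did not finish before the timeout")
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised in tests without real waiting; the interval and timeout defaults are arbitrary and should be tuned to document size.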

Output Formats

Markdown

Clean, formatted Markdown text that preserves document structure, including headings, tables, lists, and emphasis.
Markdown Options (markdown_options):
  • financial-docs: Optimized extraction for financial documents with enhanced table and number formatting

HTML

Full HTML representation of the document with semantic tags and structure.

JSON

Structured JSON extraction with options for:
  • Flat key-value: Simple field extraction (default)
  • Specified fields: Extract specific fields you define
  • Custom schema: Define exact JSON output structure
  • Hierarchy output: Tree-structured nested data from document (json_options=hierarchy_output)

CSV

Tabular data extraction for documents containing tables.
CSV Options (csv_options):
  • table: Extract structured table data from the document
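The format-specific options above map onto form fields alongside `output_format`. A minimal sketch of assembling those fields follows; the helper name `format_fields` is hypothetical, and the allowed values are limited to the ones named on this page. The JSON modes for specified fields and custom schemas presumably accept other values that are not shown here, so this check should be relaxed for those cases.

```python
# Option values listed on this page, keyed by output format.
# Each entry is (form field name, set of option values shown in the docs).
FORMAT_OPTIONS = {
    "markdown": ("markdown_options", {"financial-docs"}),
    "html": (None, set()),
    "json": ("json_options", {"hierarchy_output"}),
    "csv": ("csv_options", {"table"}),
}


def format_fields(output_format, option=None):
    """Return the form fields for one output format and an optional format option."""
    if output_format not in FORMAT_OPTIONS:
        raise ValueError(f"unknown output format: {output_format!r}")
    field_name, allowed = FORMAT_OPTIONS[output_format]
    fields = {"output_format": output_format}
    if option is not None:
        if option not in allowed:
            raise ValueError(f"{option!r} is not a listed option for {output_format}")
        fields[field_name] = option
    return fields
```

For example, `format_fields("json", "hierarchy_output")` produces the two fields needed for tree-structured JSON output, ready to be merged into a multipart form.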

Input Methods

Provide your document using one of these methods:
| Parameter | Description |
| --- | --- |
| file | Direct file upload (multipart/form-data) |
| file_url | URL to download the file from |
| file_base64 | Base64-encoded file content |

Supported file types: PDF, Word (.docx), Excel (.xlsx, .xls), PowerPoint (.pptx), Images (PNG, JPG, TIFF, WebP)
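Choosing between the three input parameters can be sketched as a small dispatch on the kind of source you have. The helper name `input_field` is hypothetical, and the extension set is an assumption that contains only the spellings listed above (variants such as .jpeg or .tif may also be accepted, but this page does not say).

```python
import base64
from pathlib import Path
from urllib.parse import urlparse

# Extensions taken from the supported file types listed on this page.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".xlsx", ".xls", ".pptx",
    ".png", ".jpg", ".tiff", ".webp",
}


def input_field(source):
    """Map a document source to exactly one input parameter.

    bytes            -> file_base64 (Base64-encoded content)
    http(s) URL      -> file_url
    anything else    -> file (treated as a local path to upload)
    """
    if isinstance(source, (bytes, bytearray)):
        return {"file_base64": base64.b64encode(bytes(source)).decode("ascii")}
    if urlparse(str(source)).scheme in ("http", "https"):
        return {"file_url": str(source)}
    path = Path(source)
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {path.suffix or '(none)'}")
    return {"file": path}
```

The extension check applies only to local paths, since a URL or raw bytes may not carry a reliable file name.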

Language Support

The API supports multilingual document extraction across 29+ languages. Language support is tiered based on extraction quality.

Supported Scripts

| Script | Languages |
| --- | --- |
| Latin | English, French, Spanish, Portuguese, German, Italian, Dutch, Polish, Czech, Romanian |
| Cyrillic | Russian, Ukrainian |
| Chinese Characters | Simplified Chinese, Traditional Chinese |
| Japanese | Kanji, Hiragana, Katakana |
| Korean Hangul | Korean |
| Arabic Script | Arabic, Persian, Urdu |
| Indic (Devanagari, Bengali) | Hindi, Bengali, Sanskrit |

Language Tiers

| Tier | Languages | Performance |
| --- | --- | --- |
| Tier 1 | Chinese (Simplified & Traditional), English, Japanese, Korean | Exceptional |
| Tier 2 | Spanish, French, German, Italian, Portuguese, Russian, Arabic, Hindi, Thai, Vietnamese | Strong |
| Tier 3 | Indonesian, Malaysian, Turkish, Polish, Dutch, Czech, Romanian, Ukrainian, Greek, Hebrew, Swahili | Good |

The model automatically detects the language of your document and processes it accordingly. No configuration is required.

Optional Features

Custom Instructions

Guide the extraction with custom instructions:
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@invoice.pdf" \
  -F "output_format=json" \
  -F "custom_instructions=Format all dates as YYYY-MM-DD. Extract amounts without currency symbols."

Metadata Extraction

Include additional metadata like bounding boxes or confidence scores:
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf" \
  -F "output_format=markdown" \
  -F "include_metadata=bounding_boxes,confidence_score"
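The two optional fields shown in these curl examples can be assembled programmatically. This is a small sketch with a hypothetical helper name, `optional_fields`; it assumes only what the examples show, namely that `custom_instructions` is free text and `include_metadata` is a comma-separated list.

```python
def optional_fields(custom_instructions=None, include_metadata=()):
    """Build the optional form fields, omitting any that are not set."""
    fields = {}
    if custom_instructions and custom_instructions.strip():
        fields["custom_instructions"] = custom_instructions.strip()
    if include_metadata:
        # The curl example passes metadata kinds as one comma-separated value.
        fields["include_metadata"] = ",".join(include_metadata)
    return fields
```

The returned dictionary can be merged with the `file` and `output_format` fields before the request is encoded.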

Rate Limits & Quotas

  • Sync processing: Use for documents under 5 pages
  • Async processing: Recommended for larger documents (>5 pages)
  • Batch processing: Maximum 50 files per request
  • Rate limits by plan:
    • Free: 20 pages/min
    • Pay as you go: 300 pages/min
    • Enterprise: 100 pages/sec
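The limits above translate into two simple client-side calculations: splitting a file list into batches of at most 50, and estimating the minimum time a given page count needs at a plan's rate. The sketch below is illustrative only; the plan keys and function names are hypothetical, and it assumes the per-minute limits are enforced as a steady average rate, which this page does not specify.

```python
# Rate limits from this page, normalized to pages per minute.
PAGES_PER_MINUTE = {
    "free": 20,
    "pay_as_you_go": 300,
    "enterprise": 100 * 60,  # listed as 100 pages/sec
}
MAX_BATCH_FILES = 50


def batches(files, size=MAX_BATCH_FILES):
    """Split a list of files into consecutive chunks of at most `size` items."""
    for start in range(0, len(files), size):
        yield files[start:start + size]


def min_seconds_for(pages, plan):
    """Minimum wall-clock seconds needed to stay within the plan's page rate."""
    return pages * 60.0 / PAGES_PER_MINUTE[plan]
```

For instance, 120 files would be sent as three batch requests (50, 50, and 20 files), and a 10-page document on the Free plan uses 30 seconds of rate budget.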

Infrastructure Requirements (On-Premise)

For on-premise deployments:
| Resource | Recommendation |
| --- | --- |
| CPU | 32 cores |
| RAM | 128GB |
| GPU | NVIDIA A100 |

Performance Benchmarks (200-page PDF)

  • P50 response time: 9.5 seconds
  • P90 response time: 12.6 seconds
  • P99 response time: 20.3 seconds
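These latencies can be converted into approximate throughput. The arithmetic below assumes each figure is the end-to-end time for the whole 200-page file; the page does not state the plan or hardware behind the benchmark, so treat the results as rough guidance for setting client timeouts rather than guaranteed rates.

```python
PAGES = 200
LATENCY_SECONDS = {"p50": 9.5, "p90": 12.6, "p99": 20.3}

# Effective pages per second at each percentile: pages / latency.
pages_per_second = {
    name: PAGES / seconds for name, seconds in LATENCY_SECONDS.items()
}

for name, rate in pages_per_second.items():
    print(f"{name}: {rate:.1f} pages/s")
```

Under that assumption the median case works out to roughly 21 pages per second, and the P99 case to just under 10.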

Need Help?

API Reference

Explore the complete API documentation with interactive examples.