The API supports four output formats: Markdown, HTML, JSON, and CSV. To request one or more formats in a single API call, pass them as a comma-separated list (e.g., output_format=markdown,json).
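As a minimal sketch of how a client might assemble the comma-separated value, the Python helper below joins format names and rejects anything outside the four formats listed. The helper name build_output_format is hypothetical, and the lowercase value html is an assumption, since this documentation only shows markdown, json, and csv as literal parameter values.

```python
# Format names follow the four formats described in this section.
# "html" as a literal parameter value is an assumption.
SUPPORTED_FORMATS = ("markdown", "html", "json", "csv")


def build_output_format(*formats: str) -> str:
    """Join one or more format names into a comma-separated output_format value."""
    if not formats:
        raise ValueError("at least one output format is required")
    unknown = [f for f in formats if f not in SUPPORTED_FORMATS]
    if unknown:
        raise ValueError(f"unsupported output format(s): {unknown}")
    return ",".join(formats)


# Value to send as the output_format form field.
print(build_output_format("markdown", "json"))
```

Validating on the client side is a design choice, not an API requirement; it simply surfaces typos before a request is sent.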
Markdown
Clean, formatted Markdown text preserving document structure including headings, tables, lists, and emphasis.
-F "output_format=markdown"
Markdown Options (markdown_options):
| Option | Description |
|---|---|
| financial-docs | Optimized for financial documents with enhanced table and number formatting |
-F "output_format=markdown" -F "markdown_options=financial-docs"
HTML
Full HTML representation of the document with semantic tags and structure.
JSON
Structured JSON extraction with multiple modes:
| Mode | json_options value | Description |
|---|---|---|
| Flat key-value | (omit json_options) | Auto-detected field extraction (default) |
| Specified fields | ["field1", "field2"] | Extract specific fields you define |
| Custom schema | {"type": "object", "properties": {...}} | Define exact JSON output structure |
| Hierarchy output | hierarchy_output | Tree-structured nested data from document |
| Table of contents | table-of-contents | Extract document heading structure |
Use specified fields when you know exactly what data you need. Use custom schema when you need precise control over the output structure, types, and nesting.
Example: Specified Fields
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "file=@invoice.pdf" \
-F "output_format=json" \
-F 'json_options=["invoice_number", "date", "total_amount", "vendor"]'
Example: Custom Schema
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "file=@invoice.pdf" \
-F "output_format=json" \
-F 'json_options={"type": "object", "properties": {"invoice_number": {"type": "string"}, "total_amount": {"type": "number"}}}'
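The curl examples above pass json_options as a JSON-encoded string. The Python sketch below shows one way to produce that string for each mode in the table, using only the standard library. The function name build_json_fields and its arguments are hypothetical, and the sketch only builds the form fields; sending them with an HTTP client is left out.

```python
import json


def build_json_fields(fields=None, schema=None, mode=None):
    """Return form fields for a JSON extraction request.

    Pass at most one of:
      fields - list of field names (specified-fields mode)
      schema - dict holding a JSON Schema (custom-schema mode)
      mode   - named mode string, e.g. "hierarchy_output"
    With no arguments, json_options is omitted (flat key-value default).
    """
    chosen = [x for x in (fields, schema, mode) if x is not None]
    if len(chosen) > 1:
        raise ValueError("pass only one of fields, schema, or mode")

    form = {"output_format": "json"}
    if fields is not None:
        form["json_options"] = json.dumps(list(fields))
    elif schema is not None:
        form["json_options"] = json.dumps(schema)
    elif mode is not None:
        form["json_options"] = mode
    return form


invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
}

print(build_json_fields(fields=["invoice_number", "date"]))
print(build_json_fields(schema=invoice_schema))
```

Serializing with json.dumps avoids hand-written quoting mistakes, which are easy to make when a schema is typed directly into a shell command.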
CSV
Tabular data extraction for documents containing tables.
CSV Options (csv_options):
| Option | Description |
|---|---|
| table | Extract structured table data from the document |
-F "output_format=csv" -F "csv_options=table"
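Once CSV output is in hand, it can be loaded with Python's standard csv module. This section does not describe the response envelope, so the sketch below assumes the CSV text has already been pulled out of the response as a string; the sample data and the function name parse_csv_text are made up for illustration.

```python
import csv
import io


def parse_csv_text(csv_text: str) -> list[dict[str, str]]:
    """Parse CSV text into a list of row dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return list(reader)


# Made-up sample standing in for extracted table data.
sample = "item,quantity,unit_price\nWidget,2,9.50\nGadget,1,24.00\n"

rows = parse_csv_text(sample)
for row in rows:
    print(row["item"], row["quantity"], row["unit_price"])
```

All values come back as strings; convert numeric columns explicitly if downstream code needs numbers.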
Custom Instructions
Guide the extraction with natural language instructions using the custom_instructions parameter:
-F "custom_instructions=Format all dates as YYYY-MM-DD. Extract amounts without currency symbols."
The prompt_mode parameter controls how instructions are applied:
| Mode | Description |
|---|---|
| append | Add your instructions to the base prompt (default) |
| replace | Use only your custom instructions |
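To illustrate how the two parameters travel together, here is a small Python sketch that builds the custom_instructions and prompt_mode form fields and rejects any mode other than the two listed. The helper name build_instruction_fields is hypothetical, and the client-side checks are a convenience, not documented API behavior.

```python
PROMPT_MODES = ("append", "replace")


def build_instruction_fields(instructions: str, prompt_mode: str = "append") -> dict:
    """Return form fields for custom instructions and how they are applied."""
    text = instructions.strip()
    if not text:
        raise ValueError("custom instructions must not be empty")
    if prompt_mode not in PROMPT_MODES:
        raise ValueError(f"prompt_mode must be one of {PROMPT_MODES}")
    return {"custom_instructions": text, "prompt_mode": prompt_mode}


fields = build_instruction_fields(
    "Format all dates as YYYY-MM-DD. Extract amounts without currency symbols.",
    prompt_mode="append",
)
print(fields)
```

Because append is the documented default, sending prompt_mode explicitly is optional in that case; the sketch includes it so the chosen behavior is visible in the request.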
Include additional metadata, such as bounding boxes or confidence scores, with the include_metadata parameter:
-F "include_metadata=bounding_boxes,confidence_score"
| Value | Description |
|---|---|
| bounding_boxes | Block-level bounding boxes per paragraph/region from layout detection |
| bounding_boxes_word | Word-level bounding boxes per word using advanced OCR |
| confidence_score | Overall confidence (0-100) for all formats, plus granular per-field/per-cell scores for JSON and CSV |
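The include_metadata value is a comma-separated list of the names in the table above. A minimal Python sketch for assembling it follows; the helper name build_include_metadata is hypothetical, and dropping duplicates is a client-side convenience rather than something the API is known to require.

```python
METADATA_VALUES = ("bounding_boxes", "bounding_boxes_word", "confidence_score")


def build_include_metadata(*values: str) -> str:
    """Join metadata names into a comma-separated include_metadata value."""
    unknown = [v for v in values if v not in METADATA_VALUES]
    if unknown:
        raise ValueError(f"unsupported metadata value(s): {unknown}")
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return ",".join(dict.fromkeys(values))


print(build_include_metadata("bounding_boxes", "confidence_score"))
```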
Language Support
The API supports multilingual extraction across 29+ languages. The model detects the language automatically, so no configuration is required.
Supported Scripts
| Script | Languages |
|---|---|
| Latin | English, French, Spanish, Portuguese, German, Italian, Dutch, Polish, Czech, Romanian |
| Cyrillic | Russian, Ukrainian |
| Chinese Characters | Simplified Chinese, Traditional Chinese |
| Japanese (Kanji, Hiragana, Katakana) | Japanese |
| Korean Hangul | Korean |
| Arabic Script | Arabic, Persian, Urdu |
| Devanagari | Hindi, Sanskrit |
| Bengali | Bengali |
Language Tiers
| Tier | Languages | Performance |
|---|---|---|
| Tier 1 | Chinese (Simplified & Traditional), English, Japanese, Korean | Exceptional |
| Tier 2 | Spanish, French, German, Italian, Portuguese, Russian, Arabic, Hindi, Thai, Vietnamese | Strong |
| Tier 3 | Indonesian, Malaysian, Turkish, Polish, Dutch, Czech, Romanian, Ukrainian, Greek, Hebrew, Swahili | Good |