JSON Extraction

When extracting structured JSON data from documents, each extracted value is assigned a confidence score (0-100) indicating the model’s certainty.

Calculation Methodology

The confidence score is derived from token log-probabilities returned by the LLM during generation. For each primitive value (string, number, boolean, null) in the JSON output:
  1. The system identifies the character span of the value in the raw response
  2. All tokens falling within that span are collected
  3. Log-probabilities are converted to probabilities using exp(logprob)
  4. The final confidence is the average probability of all tokens in the span, scaled to 0-100
The algorithm traverses the JSON structure recursively, building a parallel confidence_map that mirrors the original JSON hierarchy: dictionaries map to dictionaries of scores, arrays map to arrays of scores, and primitives map to integer confidence values.
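The steps above can be sketched in Python as follows. This is a minimal illustration, not the service's actual implementation: the token representation (character offsets plus a log-probability), the `spans` lookup keyed by JSON path, and the function names are all assumptions made for the example. It also assumes that a token "falls within" a span only when it is fully contained in it.

```python
import json
import math

def span_confidence(tokens, start, end):
    """Average probability of tokens inside [start, end), scaled to 0-100.

    tokens: list of (char_start, char_end, logprob) tuples (assumed format).
    """
    probs = [math.exp(lp) for s, e, lp in tokens if s >= start and e <= end]
    if not probs:
        return 0  # assumption: no tokens in the span means no confidence
    return round(100 * sum(probs) / len(probs))

def build_confidence_map(node, spans, tokens, path=()):
    """Mirror the JSON hierarchy, replacing each primitive with its score.

    spans: hypothetical dict mapping a path tuple to the (start, end)
    character span of that value in the raw response.
    """
    if isinstance(node, dict):
        return {k: build_confidence_map(v, spans, tokens, path + (k,))
                for k, v in node.items()}
    if isinstance(node, list):
        return [build_confidence_map(v, spans, tokens, path + (i,))
                for i, v in enumerate(node)]
    start, end = spans[path]
    return span_confidence(tokens, start, end)

# Illustrative data only: offsets refer to positions in `raw`.
raw = '{"id": "A1", "n": [7]}'
tokens = [
    (0, 7, -0.01),            # structural token: {"id":
    (7, 9, math.log(0.9)),    # "A
    (9, 11, math.log(0.7)),   # 1"
    (11, 19, -0.01),          # structural token: , "n": [
    (19, 20, math.log(0.5)),  # 7
    (20, 22, -0.01),          # structural token: ]}
]
spans = {("id",): (7, 11), ("n", 0): (19, 20)}
confidence_map = build_confidence_map(json.loads(raw), spans, tokens)
print(confidence_map)
```

Structural tokens (braces, keys, separators) lie outside every value span, so they do not affect any score in this sketch.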

Output Structure

The confidence scores are returned under metadata.confidence_score, in an object that mirrors the structure of content.
JSON Output
{
  "content": {
    "invoice_number": "INV-2023-001",
    "total_amount": 1500.00
  },
  "metadata": {
    "confidence_score": {
      "invoice_number": 98,
      "total_amount": 95
    }
  }
}

Use Cases

Quality Filtering

Reject or flag extractions where key fields fall below a specific threshold (e.g., < 70) to ensure high data integrity.
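As a rough sketch of such a filter, the function below walks a confidence_score object and lists every field whose score falls below a threshold. The function name and the dotted path notation are illustrative choices, not part of the API; the threshold of 70 is taken from the example above.

```python
def low_confidence_fields(conf_map, threshold=70, path=""):
    """Return (path, score) pairs for every score below the threshold."""
    flagged = []
    if isinstance(conf_map, dict):
        for key, value in conf_map.items():
            child = f"{path}.{key}" if path else key
            flagged += low_confidence_fields(value, threshold, child)
    elif isinstance(conf_map, list):
        for i, value in enumerate(conf_map):
            flagged += low_confidence_fields(value, threshold, f"{path}[{i}]")
    elif conf_map < threshold:
        flagged.append((path, conf_map))
    return flagged

# Scores from the sample response in this section.
sample_scores = {"invoice_number": 98, "total_amount": 95}
print(low_confidence_fields(sample_scores))
```

An empty result means every field met the threshold; a caller could reject the extraction or flag only the listed fields otherwise.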

Human Review Routing

Automatically route low-confidence documents to manual review queues for human verification.
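One simple routing rule is to send a document to review whenever its weakest field falls below a threshold. The sketch below assumes the response shape shown under Output Structure; the queue names `manual_review` and `auto_accept` and the default threshold are hypothetical.

```python
def iter_scores(node):
    """Yield every integer score in a nested confidence_score object."""
    if isinstance(node, dict):
        for value in node.values():
            yield from iter_scores(value)
    elif isinstance(node, list):
        for value in node:
            yield from iter_scores(value)
    else:
        yield node

def route_document(response, threshold=70):
    """Pick a queue name based on the lowest score in the response."""
    scores = list(iter_scores(response["metadata"]["confidence_score"]))
    # Assumption: a response with no scores is treated as needing review.
    if not scores or min(scores) < threshold:
        return "manual_review"
    return "auto_accept"

response = {
    "content": {"invoice_number": "INV-2023-001", "total_amount": 1500.00},
    "metadata": {"confidence_score": {"invoice_number": 98, "total_amount": 95}},
}
print(route_document(response))
```

Routing on the minimum is deliberately conservative; routing on only a set of key fields would be a reasonable alternative.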

Data Validation

Cross-reference low-confidence fields against external sources or databases for validation.

Analytics

Track extraction quality trends over time and across different document types to monitor performance.
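A minimal aggregation might look like the following. It assumes each record is a (document type, confidence_score) pair collected by the caller and that the score objects are flat; both the record format and the function name are illustrative. The grouping key could equally be a date or batch identifier to follow trends over time.

```python
from collections import defaultdict
from statistics import mean

def mean_confidence_by_type(records):
    """Average each field's score per document type.

    records: iterable of (doc_type, flat confidence_score dict) pairs.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for doc_type, confidence_score in records:
        for field, score in confidence_score.items():
            grouped[doc_type][field].append(score)
    return {
        doc_type: {field: mean(scores) for field, scores in fields.items()}
        for doc_type, fields in grouped.items()
    }

# Illustrative records, not real measurements.
records = [
    ("invoice", {"total_amount": 90}),
    ("invoice", {"total_amount": 70}),
    ("receipt", {"total_amount": 50}),
]
print(mean_confidence_by_type(records))
```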

Markdown with Bounding Boxes

For the markdown with bounding_boxes output type, confidence scores are provided per detected region rather than per extracted value.

Calculation Methodology

This flow uses a layout detection model to segment the document into regions (text blocks, tables, figures, etc.). Each detected region carries a confidence score from the layout detection model itself, representing how certain the model is about:
  1. The region’s classification (type: paragraph, table, heading, etc.)
  2. The region’s bounding box accuracy
The confidence value is a float between 0.0 and 1.0, embedded directly in each element’s bounding_box object.

Output Structure

JSON Output
{
  "content": "extracted markdown content",
  "metadata": {
    "bounding_boxes": {
      "success": true,
      "elements": [
        {
          "content": "extracted markdown content",
          "bounding_box": {
            "x": 0.05,
            "y": 0.10,
            "width": 0.90,
            "height": 0.15,
            "confidence": 0.95,
            "type": "paragraph",
            "page": 1
          }
        }
      ]
    }
  }
}

Use Cases

Region Filtering

Exclude low-confidence regions from downstream processing to remove noise.
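A filter of this kind can be a single list comprehension over the elements array. The sketch assumes the response shape shown under Output Structure; the cutoff of 0.5 is an arbitrary illustrative value, not a recommended setting.

```python
def filter_regions(response, min_confidence=0.5):
    """Keep only elements whose region confidence meets the cutoff."""
    elements = response["metadata"]["bounding_boxes"]["elements"]
    return [
        el for el in elements
        if el["bounding_box"]["confidence"] >= min_confidence
    ]

# Illustrative response with one confident and one doubtful region.
bbox_response = {
    "metadata": {
        "bounding_boxes": {
            "success": True,
            "elements": [
                {"content": "Heading", "bounding_box": {"confidence": 0.95, "type": "heading"}},
                {"content": "smudge", "bounding_box": {"confidence": 0.20, "type": "paragraph"}},
            ],
        }
    }
}
kept = filter_regions(bbox_response)
print([el["content"] for el in kept])
```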

Layout Quality Assessment

Identify documents with complex or ambiguous layouts that may require special handling.
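One way to approximate this is to summarize the region scores of a document and look at the share of weak regions. The summary fields and the 0.6 cutoff below are assumptions made for the sketch; what counts as a "complex" layout depends on the document set.

```python
def layout_quality(elements, low=0.6):
    """Summarize region confidences for one document."""
    scores = [el["bounding_box"]["confidence"] for el in elements]
    if not scores:
        return {"regions": 0, "mean": None, "low_share": None}
    low_count = sum(1 for s in scores if s < low)
    return {
        "regions": len(scores),
        "mean": sum(scores) / len(scores),
        "low_share": low_count / len(scores),  # fraction of weak regions
    }

# Illustrative elements only.
page_elements = [
    {"bounding_box": {"confidence": 0.9}},
    {"bounding_box": {"confidence": 0.5}},
]
summary = layout_quality(page_elements)
print(summary)
```

A document with a high low_share could then be sent down a slower, more careful processing path.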

Overlay Visualization

Display confidence scores as color-coded bounding boxes in a document viewer for visual inspection.
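The helpers below sketch the two pieces a viewer would need: a color for each score and a pixel rectangle for each box. They assume that x, y, width, and height are fractions of the page size with the origin at the top-left corner, which the sample values suggest but this page does not state. The red-to-green ramp and the function names are illustrative.

```python
def confidence_color(confidence):
    """Map 0.0..1.0 to an RGB tuple from red (low) to green (high)."""
    c = min(max(confidence, 0.0), 1.0)  # clamp out-of-range values
    return (round(255 * (1 - c)), round(255 * c), 0)

def to_pixel_rect(box, page_width, page_height):
    """Convert a normalized bounding_box to (left, top, right, bottom) pixels."""
    left = box["x"] * page_width
    top = box["y"] * page_height
    right = left + box["width"] * page_width
    bottom = top + box["height"] * page_height
    return (round(left), round(top), round(right), round(bottom))

# Bounding box values from the sample response in this section.
sample_box = {"x": 0.05, "y": 0.10, "width": 0.90, "height": 0.15,
              "confidence": 0.95, "type": "paragraph", "page": 1}
rect = to_pixel_rect(sample_box, page_width=1000, page_height=2000)
color = confidence_color(sample_box["confidence"])
print(rect, color)
```

The resulting rectangle and color can be passed to whatever drawing API the viewer uses.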

Table Extraction

Prioritize high-confidence table regions for more intensive structured extraction pipelines.
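As an illustration, the function below selects table regions above a cutoff and orders them from most to least confident. It assumes that table regions carry the type value "table", as listed under Calculation Methodology; the 0.8 cutoff and the function name are hypothetical.

```python
def prioritized_tables(elements, min_confidence=0.8):
    """Return table elements above the cutoff, most confident first."""
    tables = [
        el for el in elements
        if el["bounding_box"]["type"] == "table"
        and el["bounding_box"]["confidence"] >= min_confidence
    ]
    return sorted(tables, key=lambda el: el["bounding_box"]["confidence"],
                  reverse=True)

# Illustrative elements only.
doc_elements = [
    {"content": "t1", "bounding_box": {"type": "table", "confidence": 0.85}},
    {"content": "p1", "bounding_box": {"type": "paragraph", "confidence": 0.99}},
    {"content": "t2", "bounding_box": {"type": "table", "confidence": 0.97}},
    {"content": "t3", "bounding_box": {"type": "table", "confidence": 0.40}},
]
queue = prioritized_tables(doc_elements)
print([el["content"] for el in queue])
```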