JSON Extraction
When extracting structured JSON data from documents, each extracted value is assigned a confidence score (0-100) indicating the model’s certainty.
Calculation Methodology
The confidence score is derived from token log-probabilities returned by the LLM during generation. For each primitive value (string, number, boolean, null) in the JSON output:
- The system identifies the character span of the value in the raw response
- All tokens falling within that span are collected
- Log-probabilities are converted to probabilities using exp(logprob)
- The final confidence is the average probability of all tokens in the span, scaled to 0-100
The result is a confidence_map that mirrors the original JSON hierarchy: dictionaries map to dictionaries of scores, arrays map to arrays of scores, and primitives map to integer confidence values.
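The averaging step described above can be illustrated with a minimal Python sketch. It is not the service's actual implementation: the function name span_confidence, the (text, logprob) token shape, and the choice to count any token that overlaps the span are all assumptions made for illustration.

```python
import math


def span_confidence(tokens, start, end):
    """Average token probability over the character span [start, end), scaled to 0-100.

    tokens: list of (text, logprob) pairs in generation order whose texts
    concatenate to the raw response (an assumed input shape).
    Returns None when no token overlaps the span.
    """
    probs = []
    offset = 0
    for text, logprob in tokens:
        tok_start, tok_end = offset, offset + len(text)
        offset = tok_end
        # Assumption: a token counts if it overlaps the value's span at all.
        if tok_start < end and tok_end > start:
            probs.append(math.exp(logprob))  # log-probability -> probability
    if not probs:
        return None
    return round(100 * sum(probs) / len(probs))


# Hypothetical tokenization of the raw response '{"total": 42.50}'.
tokens = [
    ('{"', math.log(0.99)),
    ("total", math.log(0.98)),
    ('":', math.log(0.99)),
    (" ", math.log(0.99)),
    ("42", math.log(0.9)),
    (".", math.log(0.6)),
    ("50", math.log(0.6)),
    ("}", math.log(0.99)),
]
raw = "".join(text for text, _ in tokens)
value = "42.50"
start = raw.index(value)
end = start + len(value)

score = span_confidence(tokens, start, end)
print(score)  # 70: the mean of 0.9, 0.6 and 0.6, scaled to 0-100
```

Because the score is an arithmetic mean, one very uncertain token in a long value can be masked by many confident ones; taking the minimum or the product instead would be stricter, but that differs from the averaging described here.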
Output Structure
The confidence scores are returned in a metadata object parallel to the content.
JSON Output
Use Cases
Quality Filtering
Reject or flag extractions where key fields fall below a specific threshold (e.g., < 70) to ensure high data integrity.
Human Review Routing
Automatically route low-confidence documents to manual review queues for human verification.
Data Validation
Cross-reference low-confidence fields against external sources or databases for validation.
Analytics
Track extraction quality trends over time and across different document types to monitor performance.
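As a rough illustration of the quality filtering and human review routing use cases, the sketch below walks a confidence map and collects every field that scores below a threshold. The helper names, the example field names, and the routing labels are hypothetical; only the 0-100 integer scores and the mirrored hierarchy come from the description above.

```python
def low_confidence_fields(confidence_map, threshold=70, path=""):
    """Return (path, score) pairs for every primitive score below threshold."""
    flagged = []
    if isinstance(confidence_map, dict):
        for key, value in confidence_map.items():
            child = f"{path}.{key}" if path else key
            flagged.extend(low_confidence_fields(value, threshold, child))
    elif isinstance(confidence_map, list):
        for index, value in enumerate(confidence_map):
            child = f"{path}[{index}]"
            flagged.extend(low_confidence_fields(value, threshold, child))
    elif confidence_map is not None and confidence_map < threshold:
        flagged.append((path, confidence_map))
    return flagged


def route(confidence_map, threshold=70):
    """Pick a (hypothetical) queue: manual review if any field is below threshold."""
    if low_confidence_fields(confidence_map, threshold):
        return "manual_review"
    return "auto_accept"


# Hypothetical confidence map for an invoice extraction.
example_map = {
    "invoice_number": 98,
    "total": 64,
    "line_items": [{"description": 91, "amount": 55}],
}

print(low_confidence_fields(example_map))
# [('total', 64), ('line_items[0].amount', 55)]
print(route(example_map))  # manual_review
```

Returning the paths rather than a single pass/fail flag lets a reviewer jump straight to the uncertain fields, and the same list can be logged per document type to follow quality trends over time.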
Markdown with Bounding Boxes
For the markdown with bounding_boxes output type, confidence scores are provided at the region level rather than per-token.
Calculation Methodology
This flow uses a layout detection model to segment the document into regions (text blocks, tables, figures, etc.). Each detected region carries a confidence score from the layout detection model itself, representing how certain the model is about:
- The region’s classification (type: paragraph, table, heading, etc.)
- The region’s bounding box accuracy
The confidence value is a float between 0.0 and 1.0, embedded directly in each element’s bounding_box object.
Output Structure
JSON Output
Use Cases
Region Filtering
Exclude low-confidence regions from downstream processing to remove noise.
Layout Quality Assessment
Identify documents with complex or ambiguous layouts that may require special handling.
Overlay Visualization
Display confidence scores as color-coded bounding boxes in a document viewer for visual inspection.
Table Extraction
Prioritize high-confidence table regions for more intensive structured extraction pipelines.
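The region filtering and overlay visualization use cases might look like the following sketch. The element shape is an assumption: the text above only states that the confidence lives inside each element's bounding_box object, so the key names "type" and "confidence", the thresholds, and the color names are all illustrative.

```python
def split_regions(elements, min_confidence=0.5):
    """Split elements into (kept, dropped) by bounding-box confidence."""
    kept, dropped = [], []
    for element in elements:
        # Assumed key name: the score is read from bounding_box["confidence"].
        confidence = element["bounding_box"]["confidence"]
        if confidence >= min_confidence:
            kept.append(element)
        else:
            dropped.append(element)
    return kept, dropped


def confidence_color(confidence):
    """Map a 0.0-1.0 confidence to an overlay color (arbitrary cut-offs)."""
    if confidence >= 0.9:
        return "green"
    if confidence >= 0.6:
        return "orange"
    return "red"


# Hypothetical elements; real output carries more fields than shown here.
elements = [
    {"type": "heading", "bounding_box": {"confidence": 0.97}},
    {"type": "table", "bounding_box": {"confidence": 0.58}},
    {"type": "figure", "bounding_box": {"confidence": 0.31}},
]

kept, dropped = split_regions(elements)
print([e["type"] for e in kept])     # ['heading', 'table']
print([e["type"] for e in dropped])  # ['figure']
print([confidence_color(e["bounding_box"]["confidence"]) for e in elements])
# ['green', 'red', 'red']
```

A sensible cut-off depends on the document set, so the thresholds are best tuned against a sample of reviewed pages rather than fixed in advance.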