Structured Data Extraction
Structured data extraction is the core value proposition of document intelligence platforms: taking a document that exists as flowing text, tables, and formatting, with no inherent database schema, and producing clean, typed, queryable data records. The output is not just raw text but parsed values: dates in ISO 8601 format, currency amounts as numbers paired with their currency symbol, boolean fields as true/false, and lists as arrays. Each extracted value carries metadata including the source page, a confidence score, and the model that produced it.
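To make the idea of parsed values concrete, here is a minimal Python sketch of turning raw strings into typed values and wrapping the result with its metadata. The names ExtractedValue, normalize, and parse_currency, the accepted date layouts, and the model name are all assumptions made for illustration, not DocumentIQ's actual API.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import Any, Tuple

# Hypothetical date layouts to try, in order; a real system would need more.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %B %Y")
TRUE_WORDS = {"true", "yes", "y", "1"}
FALSE_WORDS = {"false", "no", "n", "0"}


@dataclass
class ExtractedValue:
    """A parsed value plus the metadata described in the text."""
    field_name: str
    value: Any
    source_page: int
    confidence: float  # 0.0 to 1.0
    model: str


def parse_currency(raw: str) -> Tuple[Decimal, str]:
    """Split a string such as '$1,234.50' into (amount, symbol)."""
    text = raw.strip()
    idx = 0
    # Everything before the first digit, sign, or decimal point is the symbol.
    while idx < len(text) and not (text[idx].isdigit() or text[idx] in "-."):
        idx += 1
    symbol = text[:idx].strip()
    amount = Decimal(text[idx:].replace(",", ""))
    return amount, symbol


def normalize(raw: str, data_type: str) -> Any:
    """Convert raw extracted text into a typed value (illustrative only)."""
    text = raw.strip()
    if data_type == "date":
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(text, fmt).date().isoformat()
            except ValueError:
                continue
        raise ValueError(f"unrecognized date: {raw!r}")
    if data_type == "number":
        return Decimal(text.replace(",", ""))
    if data_type == "boolean":
        lowered = text.lower()
        if lowered in TRUE_WORDS:
            return True
        if lowered in FALSE_WORDS:
            return False
        raise ValueError(f"unrecognized boolean: {raw!r}")
    if data_type == "list":
        return [item.strip() for item in text.split(",") if item.strip()]
    return text  # "text" fields pass through unchanged


# Example record; the page, score, and model name are made-up values.
record = ExtractedValue(
    field_name="invoice_date",
    value=normalize("03/05/2024", "date"),
    source_page=1,
    confidence=0.97,
    model="example-model-v1",
)
```

Amounts are parsed into Decimal rather than float so that currency values do not pick up binary rounding error; that is a design choice of this sketch, not something the platform is known to do.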
The extraction process begins with defining a field schema for a project. Users specify what to extract — field name, data type (text, number, date, boolean, list), a description of the field, and optionally custom extraction instructions. This schema acts as a contract: the system knows exactly what structured output to produce, regardless of how the source documents are formatted. DocumentIQ stores these schemas in the extraction_fields table and applies them consistently across all documents in a project.
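As a rough illustration of the schema acting as a contract, the Python sketch below defines a field schema and checks an extracted record against it. FieldSchema, check_record, and the sample field names are hypothetical; the actual columns of the extraction_fields table are not described here, so nothing below should be read as its real layout.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Any, Dict, List, Optional

# Map each schema data type named in the text to the Python types accepted for it.
PYTHON_TYPES = {
    "text": str,
    "number": (int, float, Decimal),
    "date": str,  # assumed to be carried as an ISO 8601 string
    "boolean": bool,
    "list": list,
}


@dataclass(frozen=True)
class FieldSchema:
    """One field definition: name, type, description, optional instructions."""
    name: str
    data_type: str
    description: str
    instructions: Optional[str] = None

    def __post_init__(self) -> None:
        if self.data_type not in PYTHON_TYPES:
            raise ValueError(f"unsupported data type: {self.data_type!r}")


def check_record(schema: List[FieldSchema], record: Dict[str, Any]) -> List[str]:
    """Return a list of contract violations; an empty list means the record conforms."""
    problems = []
    for field in schema:
        if field.name not in record:
            problems.append(f"{field.name}: missing")
            continue
        value = record[field.name]
        # bool is a subclass of int, so reject it explicitly for number fields.
        if field.data_type == "number" and isinstance(value, bool):
            problems.append(f"{field.name}: expected number")
        elif not isinstance(value, PYTHON_TYPES[field.data_type]):
            problems.append(f"{field.name}: expected {field.data_type}")
    return problems


# Hypothetical project schema for invoices.
invoice_schema = [
    FieldSchema("invoice_date", "date", "Date the invoice was issued"),
    FieldSchema("total", "number", "Total amount due",
                instructions="Use the grand total, not the subtotal."),
    FieldSchema("paid", "boolean", "Whether the invoice is marked as paid"),
]
```

Calling check_record(invoice_schema, record) on every document's output gives the same pass/fail criteria no matter how the source document was laid out, which is the consistency the schema is meant to provide.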
Confidence scores determine how much human attention each extraction receives. Every extracted value comes with a float between 0.0 and 1.0 indicating the model's certainty. Low-confidence extractions can be routed to human review, while high-confidence values flow straight into downstream systems. Combined with the feedback mechanism, in which reviewers mark a value as correct or incorrect or supply a correction, the extraction pipeline becomes a human-in-the-loop system that balances automation speed against data quality requirements.
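The routing and feedback loop can be sketched in a few lines of Python. The 0.85 threshold, the Extraction class, and the route and apply_feedback functions are assumptions chosen for illustration; the text does not state what cutoff or data model the platform uses.

```python
from dataclasses import dataclass
from typing import Any, Optional

REVIEW_THRESHOLD = 0.85  # hypothetical cutoff; tune per project and field


@dataclass
class Extraction:
    field_name: str
    value: Any
    confidence: float  # model certainty, 0.0 to 1.0


def route(extraction: Extraction, threshold: float = REVIEW_THRESHOLD) -> str:
    """Send low-confidence values to review and let the rest pass through."""
    if not 0.0 <= extraction.confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if extraction.confidence >= threshold:
        return "auto_accept"
    return "human_review"


def apply_feedback(
    extraction: Extraction, verdict: str, correction: Optional[Any] = None
) -> Optional[Any]:
    """Resolve the final value from a reviewer's verdict.

    'correct' keeps the extracted value, 'corrected' replaces it with the
    reviewer's value, and 'incorrect' discards it (returns None).
    """
    if verdict == "correct":
        return extraction.value
    if verdict == "corrected":
        if correction is None:
            raise ValueError("a corrected verdict needs a correction value")
        return correction
    if verdict == "incorrect":
        return None
    raise ValueError(f"unknown verdict: {verdict!r}")


# Example: one value passes straight through, the other goes to a reviewer.
confident = Extraction("total", 1200, 0.96)
uncertain = Extraction("due_date", "2024-13-01", 0.41)
decisions = [route(confident), route(uncertain)]
final_due_date = apply_feedback(uncertain, "corrected", "2024-12-01")
```

Raising the threshold sends more values to reviewers and improves data quality at the cost of throughput; lowering it does the opposite, which is the trade-off described above.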