Sign every extracted field. Prove where it came from.
OCR gives you a value and a confidence score. Confidence is the model grading itself. AFiR-OCR binds each extracted field to its source region, the extracting model, and that confidence into one tamper-evident receipt. Move a bounding box by a single pixel and verification fails. It wraps any extraction engine, and verification runs offline with no shared secret.
OCR output is an unverifiable assertion. Then it moves real money.
Invoices, contracts, claims, intake forms, IDs, titles. The value an extraction model reads off a page drives a decision and a payment. There is no portable proof of what was read, from where, by which model — or whether it was altered after extraction.
The model grades its own homework
A confidence score is a self-assessment, not evidence. It cannot tell you whether the field was altered before it reached your system, and the score itself is not tamper-evident.
No spatial grounding
Signing the extracted text alone proves some text was produced. It says nothing about where on the page it came from, or by which model. The binding to the source region is gone.
Alteration stays invisible
When bounding boxes sit in a database with no tamper-evident accumulator over the per-field bindings, a silent edit to a value, a position, or a confidence is undetectable. There is no audit trail.
Field, source region, model, confidence — bound into one root, signed once.
The extraction engine extracts. AFiR-OCR proves. Each field becomes a per-field leaf binding its value commitment, its page and bounding box, its block class, the model, and its confidence. Leaves accumulate into a Merkle root. One ML-DSA-65 signature seals the whole document — independent of field count.
    EXTRACTION ENGINE        PER-FIELD BINDING            SIGN + VERIFY
    -----------------        -----------------            -------------
    any OCR / IDP      -->   field + bbox --+
    ( Mistral OCR,                          |
      Textract, ... )                       v
                       +-----------------------+      +----------------------+
                       | per-field leaf        | ---> | Merkle root          |
                       |   value_commit        |      |   ( fields_root )    |
                       |   source_region(bbox) |      +----------------------+
                       |   block_class         |                 |
                       |   model_ref           |                 v
                       |   confidence          |      +----------------------+
                       +-----------------------+      | ML-DSA-65 signature  | <-- one per doc
                                  |                   | PQ-anchored receipt  |
                                  v                   +----------------------+
                       sensitive field?                          |
                       value_commit = salted,                    v
                       off-receipt ( redaction-       offline, zero-secret verify
                       compatible )                   ( recompute root, check sig )
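To make the accumulation step concrete, here is a minimal sketch of folding per-field leaves into one root. The text does not specify AFiR-OCR's actual tree construction, so everything here is an assumption: the `merkleRoot` helper, SHA-256 as the hash, the `0x00`/`0x01` domain-separation prefixes, and the rule that an unpaired node is promoted unchanged. The leaves are placeholder bytes, not real field encodings.

```typescript
import { createHash } from "node:crypto";

// Hash the concatenation of the given byte strings with SHA-256.
const sha256 = (...parts: Buffer[]): Buffer => {
  const h = createHash("sha256");
  for (const p of parts) h.update(p);
  return h.digest();
};

// Assumed domain-separation prefixes: a leaf can never be confused with an inner node.
const LEAF = Buffer.from([0x00]);
const NODE = Buffer.from([0x01]);

// Fold leaf encodings into a single 32-byte root.
// Assumption: an unpaired node at the end of a level is promoted unchanged.
function merkleRoot(leaves: Buffer[]): Buffer {
  if (leaves.length === 0) throw new Error("at least one leaf required");
  let level = leaves.map((leaf) => sha256(LEAF, leaf));
  while (level.length > 1) {
    const next: Buffer[] = [];
    for (let i = 0; i < level.length; i += 2) {
      next.push(i + 1 < level.length ? sha256(NODE, level[i], level[i + 1]) : level[i]);
    }
    level = next;
  }
  return level[0];
}

// Nine placeholder leaves, mirroring a nine-field document.
const leaves = Array.from({ length: 9 }, (_, i) => Buffer.from(`leaf-${i}`));
const root = merkleRoot(leaves);

// Changing any single leaf yields a different root.
const tamperedLeaves = leaves.map((l, i) => (i === 4 ? Buffer.from("leaf-4x") : l));
const tamperedRoot = merkleRoot(tamperedLeaves);

console.log(root.toString("hex"));
console.log("roots equal after tamper:", root.equals(tamperedRoot));
```

The root is always 32 bytes, so whatever signs it signs the same amount of data for nine fields or nine hundred. That is one way to read "independent of field count".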
Four properties. Each one tested. Each one breaks verify when attacked.
All four bind into the same signature. Any tamper — to a value, a position, a class, or a confidence — breaks the recomputed root and fails offline verify.
Every field is bound to the exact region it was read from.
The page index and bounding box are inputs to the per-field leaf. The binding of a field to its source region is itself tamper-evident — a one-pixel move changes the leaf, changes the root, and fails verify.
- source_region: { page, bbox: [x0, y0, x1, y1] }
- block_class bound too: header | field | table_cell
- tamper test: bbox moved 1px on one field fails offline verify
- each field provably tied to the page coordinates it came from
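The binding described above can be sketched as a leaf hash whose input includes the page and bounding box. This is an illustration under stated assumptions, not the product's wire format: the `FieldLeaf` shape follows the names in the text, but `encodeLeaf`, `leafHash`, the fixed-order JSON array encoding, and every sample value are hypothetical.

```typescript
import { createHash } from "node:crypto";

type BlockClass = "header" | "field" | "table_cell";

interface FieldLeaf {
  value_commit: string; // hex commitment to the extracted value
  source_region: { page: number; bbox: [number, number, number, number] };
  block_class: BlockClass;
  model_ref: string;
  confidence: number;
}

// Assumed canonical encoding: a JSON array in fixed order,
// so the same leaf always serializes to the same bytes.
function encodeLeaf(f: FieldLeaf): Buffer {
  return Buffer.from(
    JSON.stringify([
      f.value_commit,
      f.source_region.page,
      f.source_region.bbox,
      f.block_class,
      f.model_ref,
      f.confidence,
    ]),
    "utf8",
  );
}

const leafHash = (f: FieldLeaf): string =>
  createHash("sha256").update(encodeLeaf(f)).digest("hex");

// Made-up sample field; coordinates and value are illustrative only.
const total: FieldLeaf = {
  value_commit: createHash("sha256").update("1234.50").digest("hex"),
  source_region: { page: 2, bbox: [412, 700, 540, 722] },
  block_class: "field",
  model_ref: "mir:mistral-ocr-4",
  confidence: 0.98,
};

// Shift x0 by a single pixel, leaving everything else untouched.
const moved: FieldLeaf = {
  ...total,
  source_region: { page: 2, bbox: [413, 700, 540, 722] },
};

// Reclassify the block, leaving position and value untouched.
const reclassified: FieldLeaf = { ...total, block_class: "table_cell" };

console.log("1px move changes leaf:", leafHash(total) !== leafHash(moved));
console.log("class change changes leaf:", leafHash(total) !== leafHash(reclassified));
```

Because the leaf hash feeds the root, either change would surface as a root mismatch at verification time.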
Change the number after extraction and the receipt rejects it.
The value commitment for each field feeds its leaf. Altering a bound value — an invoice total from one amount to a larger one — changes the leaf, changes the fields_root, and the recomputed root no longer matches the signed root.
- value_commit bound per field into the Merkle leaf
- tamper test: total changed post-sign fails verify
- no shared secret required to detect the change
- detection is deterministic, not probabilistic
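A rough sketch of the verifier's side of this check follows. Two substitutions are made on purpose and should not be read as the product's behavior: Ed25519 from Node's built-in `crypto` module stands in for ML-DSA-65, and `rootOf` is a flat hash over leaf digests standing in for the Merkle root. The `issue` and `verifyOffline` helpers, the `Receipt` shape, and the sample values are all hypothetical.

```typescript
import { createHash, generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

// Simplified stand-in for fields_root: one hash over the ordered leaf digests.
function rootOf(values: string[]): Buffer {
  const h = createHash("sha256");
  for (const v of values) h.update(createHash("sha256").update(v, "utf8").digest());
  return h.digest();
}

interface Receipt {
  fields_root: Buffer;
  sig: Buffer;
}

// Issuer side: sign the root once, whatever the number of fields.
function issue(values: string[], privateKey: KeyObject): Receipt {
  const fields_root = rootOf(values);
  return { fields_root, sig: sign(null, fields_root, privateKey) };
}

// Verifier side: needs only the public key and the presented field values.
function verifyOffline(values: string[], r: Receipt, publicKey: KeyObject): { ok: boolean; why: string } {
  if (!verify(null, r.fields_root, publicKey, r.sig)) return { ok: false, why: "bad signature" };
  if (!rootOf(values).equals(r.fields_root)) return { ok: false, why: "fields_root mismatch" };
  return { ok: true, why: "" };
}

const { publicKey, privateKey } = generateKeyPairSync("ed25519");
const fields = ["ACME Ltd", "INV-0042", "1000.00"]; // made-up values
const receipt = issue(fields, privateKey);

// Untouched fields verify; an inflated total does not.
const clean = verifyOffline(fields, receipt, publicKey);
const inflated = ["ACME Ltd", "INV-0042", "9000.00"];
const altered = verifyOffline(inflated, receipt, publicKey);

// Swapping in a matching root does not help: the old signature no longer covers it.
const forged = verifyOffline(inflated, { fields_root: rootOf(inflated), sig: receipt.sig }, publicKey);

console.log(clean, altered, forged);
```

The check involves no randomness and no secret on the verifier's side: the same inputs always give the same verdict.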
The tax ID never lands in the receipt. The receipt still proves it was there.
For a sensitive field, the value commitment is a non-revealing commitment computed with a salt held off the receipt. Presence, position, class, confidence, and the fact of extraction stay provable. The cleartext never enters the chain.
- sensitive value_commit = salted, off-receipt
- tax_id / SSN / PAN / PHI provably absent yet provably extracted
- holder with the salt re-proves the specific value on demand
- wrong value fails to match the commitment
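One common way to build such a non-revealing commitment is a salted hash, sketched below. The text does not say which commitment scheme AFiR-OCR uses, so SHA-256 over `salt || value`, the 32-byte salt length, and the `commit` and `opens` helpers are assumptions for illustration. The tax ID is a made-up value.

```typescript
import { createHash, randomBytes, timingSafeEqual } from "node:crypto";

const SALT_BYTES = 32;

// Commit to a value under a random salt. Only the commitment goes on the receipt;
// the salt stays with the holder.
function commit(value: string, salt: Buffer): Buffer {
  if (salt.length !== SALT_BYTES) throw new Error("salt must be 32 bytes");
  return createHash("sha256").update(salt).update(value, "utf8").digest();
}

// Re-prove a specific value: recompute the commitment and compare in constant time.
function opens(commitment: Buffer, value: string, salt: Buffer): boolean {
  const candidate = commit(value, salt);
  return candidate.length === commitment.length && timingSafeEqual(candidate, commitment);
}

const salt = randomBytes(SALT_BYTES); // held off the receipt
const taxId = "12-3456789";           // made-up sensitive value
const valueCommit = commit(taxId, salt);

console.log("right value opens:", opens(valueCommit, taxId, salt));
console.log("wrong value opens:", opens(valueCommit, "12-3456780", salt));
```

The fixed-length salt keeps the `salt || value` concatenation unambiguous. Without the salt, short identifiers such as tax IDs could be recovered from a bare hash by brute force, which is why the salt has to stay off the receipt.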
The receipt asserts what was read — never that it is right.
The asserts field is set to extraction_provenance_only. It proves the named model read the named region and produced the bound value at the bound confidence. It does not claim the value is correct. The receipt never poses as a truth oracle.
- asserts: "extraction_provenance_only"
- correctness validation, where done, is a separate signed attestation
- that attestation references this receipt by its root
- the limit is declared on the face of the artifact
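The separation between the two artifacts might look like the following type sketch. Only `asserts: "extraction_provenance_only"` and `fields_root` come from the text. The `CorrectnessAttestation` shape, its `"field_correctness"` label, the `attestCorrectness` and `refersTo` helpers, and the placeholder root are hypothetical. Signing the attestation is omitted to keep the example short.

```typescript
// The provenance receipt: claims what was read, never that it is right.
interface ProvenanceReceipt {
  asserts: "extraction_provenance_only";
  fields_root: string;
}

// A separate artifact for correctness claims (hypothetical shape).
interface CorrectnessAttestation {
  asserts: "field_correctness"; // assumed label, not from the source
  receipt_root: string;         // points back at the receipt by its root
  field: string;
  verdict: "match" | "mismatch";
  checked_by: string;
}

function attestCorrectness(
  receipt: ProvenanceReceipt,
  field: string,
  verdict: "match" | "mismatch",
  checkedBy: string,
): CorrectnessAttestation {
  return {
    asserts: "field_correctness",
    receipt_root: receipt.fields_root,
    field,
    verdict,
    checked_by: checkedBy,
  };
}

// True when the attestation references exactly this receipt.
const refersTo = (a: CorrectnessAttestation, r: ProvenanceReceipt): boolean =>
  a.receipt_root === r.fields_root;

const provenance: ProvenanceReceipt = {
  asserts: "extraction_provenance_only",
  fields_root: "0xplaceholder-root", // placeholder, not a real root
};
const attestation = attestCorrectness(provenance, "total", "match", "ap-reviewer-1");

console.log(attestation, refersTo(attestation, provenance));
```

Keeping the literal types distinct means a correctness claim cannot be mistaken for a provenance receipt at compile time.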
Reduction to practice. Real values. Tamper rejected.
A two-page invoice, nine fields, run from a fresh checkout. Six smoke criteria, six green. The hashes below are the actual outputs of that test run.
Smoke test: 6 of 6 pass
- 1 ✓ Independent, zero-secret, offline verify of the whole receipt
- 2 ✓ Altering a bound value breaks the recomputed fields_root
- 3 ✓ Moving a bounding box one pixel breaks verify
- 4 ✓ Sensitive field provably absent yet provably extracted
- 5 ✓ Holder re-proves the redacted value; wrong value fails
- 6 ✓ One ML-DSA-65 signature per document, field-count independent
Signed test-run outputs
Nine fields: vendor, invoice no., date, bill-to tax ID (redacted), line item, amount, subtotal, tax, total. Adapters tested against two distinct extraction-engine output formats. The engine extracts; the receipt proves.
Wrap your OCR. Issue a receipt. Verify offline.
Point AFiR-OCR at your extraction output — Mistral OCR, AWS Textract, any engine that emits regions and confidences. You get back one PQ-anchored receipt per document. Anyone with the public key verifies the whole thing offline.
    import { buildReceipt, verify } from "@hive/afir-ocr"

    // 1. run your OCR engine as usual -- AFiR-OCR does not change it
    const ocr = await mistralOCR(invoicePdf)   // fields + bbox + confidence

    // 2. bind every field to its source region, model, and confidence
    const receipt = await buildReceipt({
      doc: invoicePdf,
      model_ref: "mir:mistral-ocr-4",
      fields: ocr.fields,                      // value, bbox, block_class, confidence
      sensitive: ["bill_to_tax_id"],           // redaction-compatible commitment
      asserts: "extraction_provenance_only",   // provenance, not correctness
    })
    // receipt.fields_root -> 0x64c489...a8a34a1
    // receipt.sig         -> one ML-DSA-65 signature over the whole doc

    // 3. anyone verifies offline -- no shared secret, no call home
    const { ok, why } = await verify(receipt, PUBLIC_KEY)
    // move one bbox by a pixel, change one total -> ok === false
Built for document pipelines that have to be auditable.
Anywhere a model reads a regulated document and the value drives a decision or a payment.
Invoices & AP
Extracted totals and line items feed payment runs. AFiR-OCR proves the total on the receipt is the total read off the page, untampered.
Contracts & claims
Clause and figure extraction with a provable tie to the source region. Redaction-compatible for privileged or sensitive spans.
Clinical intake forms
PHI fields stay out of the receipt while remaining provably extracted. The data class lands on the receipt; the data does not.
OCR & IDP vendors
If you sell extraction to regulated buyers, AFiR-OCR is the proof layer their auditors will ask for. Your engine becomes the distribution surface.