AFiR · OCR · DocProof Patent Pending

Sign every extracted field. Prove where it came from.

OCR gives you a value and a confidence score. Confidence is the model grading itself. AFiR-OCR binds each extracted field to its source region, the extracting model, and that confidence into one tamper-evident receipt. Move a bounding box by a single pixel and verification fails. It wraps any extraction engine, and verification runs offline with no shared secret.


9 of 9: fields bound + verified

1px: bbox move breaks verify

ML-DSA-65: PQ signature per doc

6 of 6: smoke criteria PASS

The gap we closed

OCR output is an unverifiable assertion. Then it moves real money.

Invoices, contracts, claims, intake forms, IDs, titles. The value an extraction model reads off a page drives a decision and a payment. There is no portable proof of what was read, from where, by which model — or whether it was altered after extraction.

× Confidence only

The model grades its own homework

A confidence score is a self-assessment, not evidence. It cannot tell you the field was not altered before it reached your system, and it is not tamper-evident.

× Sign the text

No spatial grounding

Signing the extracted text alone proves some text was produced. It says nothing about where on the page it came from, or by which model. The binding to the source region is gone.

× Store the boxes

Alteration stays invisible

Storing bounding boxes in a database, with no tamper-evident accumulator over the per-field bindings, means a silent edit to a value, a position, or a confidence goes undetected. There is no audit trail.

How it works

Field, source region, model, confidence — bound into one root, signed once.

The extraction engine extracts. AFiR-OCR proves. Each field becomes a per-field leaf binding its value commitment, its page and bounding box, its block class, the model, and its confidence. Leaves accumulate into a Merkle root. One ML-DSA-65 signature seals the whole document — independent of field count.

EXTRACTION ENGINE                 PER-FIELD BINDING                 SIGN + VERIFY
-----------------                 -----------------                 -------------
any OCR / IDP  -->  field + bbox  --+
( Mistral OCR, Textract, ... )      |
                                    v
                          +-----------------------+        +----------------------+
                          |  per-field leaf       |  --->  |  Merkle root         |
                          |  value_commit         |        |  ( fields_root )     |
                          |  source_region(bbox)  |        +----------------------+
                          |  block_class          |                  |
                          |  model_ref            |                  v
                          |  confidence           |        +----------------------+
                          +-----------------------+        |  ML-DSA-65 signature | <-- one per doc
                                    |                       |  PQ-anchored receipt |
                                    v                       +----------------------+
                          sensitive field?                            |
                          value_commit = salted,                      v
                          off-receipt ( redaction-          offline, zero-secret verify
                          compatible )                      ( recompute root, check sig )

extraction: any engine, unmodified
binding: field to source region
sign + verify: one PQ seal, offline audit
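The binding step in the diagram can be sketched in a few lines. This is a minimal illustration, not the AFiR-OCR implementation: the leaf encoding (canonical JSON plus a domain-separation byte), the choice of SHA-256, the odd-node promotion rule, and the sample field values are all assumptions made for the example.

```javascript
const { createHash } = require("node:crypto");

// SHA-256 over the concatenation of its arguments (strings or Buffers).
function sha256(...parts) {
  const h = createHash("sha256");
  for (const p of parts) h.update(p);
  return h.digest();
}

// Hypothetical leaf encoding: a 0x00 domain-separation byte followed by
// canonical JSON of the inputs named in the diagram.
function leafHash(f) {
  const canonical = JSON.stringify([
    f.value_commit, f.source_region.page, f.source_region.bbox,
    f.block_class, f.model_ref, f.confidence,
  ]);
  return sha256(Buffer.from([0x00]), canonical);
}

// Fold leaves pairwise into one root; an unpaired node is promoted unchanged.
function merkleRoot(leaves) {
  if (leaves.length === 0) throw new Error("no fields to bind");
  let level = leaves;
  while (level.length > 1) {
    const next = [];
    for (let i = 0; i < level.length; i += 2) {
      next.push(i + 1 < level.length
        ? sha256(Buffer.from([0x01]), level[i], level[i + 1])
        : level[i]);
    }
    level = next;
  }
  return level[0];
}

// Illustrative fields only; value_commit strings are placeholders.
const MODEL = "mir:mistral-ocr-4";
const fields = [
  { value_commit: "commit-vendor", source_region: { page: 1, bbox: [40, 32, 310, 58] }, block_class: "header", model_ref: MODEL, confidence: 0.99 },
  { value_commit: "commit-total", source_region: { page: 2, bbox: [420, 700, 540, 724] }, block_class: "field", model_ref: MODEL, confidence: 0.97 },
  { value_commit: "commit-line-item", source_region: { page: 1, bbox: [40, 300, 540, 324] }, block_class: "table_cell", model_ref: MODEL, confidence: 0.95 },
];

const fieldsRoot = merkleRoot(fields.map(leafHash));
console.log("fields_root: 0x" + fieldsRoot.toString("hex"));
```

Because only the 32-byte root is signed, the signing cost stays the same whether a document has three fields or three hundred.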

What the receipt guarantees

Four properties. Each one tested. Each one breaks verify when attacked.

All four bind into the same signature. Any tamper — to a value, a position, a class, or a confidence — breaks the recomputed root and fails offline verify.

A Spatial grounding

Every field is bound to the exact region it was read from.

The page index and bounding box are inputs to the per-field leaf. The binding of a field to its source region is itself tamper-evident — a one-pixel move changes the leaf, changes the root, and fails verify.

source_region: { page, bbox: [x0, y0, x1, y1] }
block_class bound too: header | field | table_cell
tamper test: bbox moved 1px on one field fails offline verify
each field provably tied to the page coordinates it came from

Auditor sees: for every extracted value, the precise region of the document it was read from — provably unaltered.
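To see why a one-pixel move is detectable, consider a sketch of the leaf computation. The encoding below (SHA-256 over canonical JSON) and the sample coordinates are assumptions for illustration, not the product's actual format.

```javascript
const { createHash } = require("node:crypto");

// Hypothetical leaf encoding: SHA-256 over canonical JSON of the bound inputs.
function leafHash(f) {
  const canonical = JSON.stringify([
    f.value_commit, f.source_region.page, f.source_region.bbox,
    f.block_class, f.model_ref, f.confidence,
  ]);
  return createHash("sha256").update(canonical).digest("hex");
}

// Illustrative field; value_commit is a placeholder string.
const total = {
  value_commit: "commit-total",
  source_region: { page: 2, bbox: [420, 700, 540, 724] },
  block_class: "field",
  model_ref: "mir:mistral-ocr-4",
  confidence: 0.97,
};

// The same field with x0 shifted right by one pixel.
const moved = {
  ...total,
  source_region: { ...total.source_region, bbox: [421, 700, 540, 724] },
};

const signedLeaf = leafHash(total);
const movedLeaf = leafHash(moved);
console.log(signedLeaf === movedLeaf); // false: the leaf, and so the root, changes
```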

B Tamper-evidence on value

Change the number after extraction and the receipt rejects it.

The value commitment for each field feeds its leaf. Altering a bound value — an invoice total from one amount to a larger one — changes the leaf, changes the fields_root, and the recomputed root no longer matches the signed root.

value_commit bound per field into the Merkle leaf
tamper test: total changed post-sign fails verify
no shared secret required to detect the change
detection is deterministic, not probabilistic

Auditor sees: a downstream edit to any extracted value is caught on verify, offline, with no call back to the vendor.

C Redaction-compatible provenance

The tax ID never lands in the receipt. The receipt still proves it was there.

For a sensitive field, the value commitment is a non-revealing commitment computed with a salt held off the receipt. Presence, position, class, confidence, and the fact of extraction stay provable. The cleartext never enters the chain.

sensitive value_commit = salted, off-receipt
tax_id / SSN / PAN / PHI provably absent yet provably extracted
holder with the salt re-proves the specific value on demand
wrong value fails to match the commitment

Auditor sees: the protected value existed, its class, its position — without the value itself ever being disclosed.
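A salted commitment of this kind can be sketched as follows. The construction shown (SHA-256 over a random 32-byte salt followed by the value) is an assumption chosen for illustration; the page does not specify the actual commitment scheme. The function names and the sample tax ID are hypothetical.

```javascript
const { createHash, randomBytes, timingSafeEqual } = require("node:crypto");

// Hypothetical commitment: SHA-256(salt || value), with a fixed 32-byte salt.
function commit(value, salt) {
  return createHash("sha256").update(salt).update(Buffer.from(value, "utf8")).digest();
}

// Issuer side: only value_commit goes on the receipt; the salt stays with the holder.
function makeSensitiveCommit(value) {
  const salt = randomBytes(32);
  return { value_commit: commit(value, salt).toString("hex"), salt };
}

// Holder side: re-prove a specific value against the commitment on the receipt.
function reprove(valueCommitHex, claimedValue, salt) {
  const expected = Buffer.from(valueCommitHex, "hex");
  const actual = commit(claimedValue, salt);
  return expected.length === actual.length && timingSafeEqual(expected, actual);
}

// Placeholder tax ID, not real data.
const { value_commit, salt } = makeSensitiveCommit("00-0000000");
console.log(reprove(value_commit, "00-0000000", salt)); // true
console.log(reprove(value_commit, "00-0000001", salt)); // false
```

Without the salt, a verifier cannot test guesses against the commitment, which matters for low-entropy values such as tax IDs.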

D Provenance, not correctness

The receipt asserts what was read — never that it is right.

The asserts field is set to extraction_provenance_only. It proves the named model read the named region and produced the bound value at the bound confidence. It does not claim the value is correct. No truth-oracle overclaim.

asserts: "extraction_provenance_only"
correctness validation, where done, is a separate signed attestation
that attestation references this receipt by its root
the limit is declared on the face of the artifact

Auditor sees: an honest claim — provenance proven, correctness scoped out — which is exactly what survives scrutiny.
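The split between provenance and correctness might look like the following pair of objects. The attestation shape, its field names, and the "value_correctness" label are hypothetical; only the receipt's asserts value, the model ref, and the fields root come from this page. Signing of the attestation is omitted.

```javascript
// Receipt face: the scope of the claim is declared on the artifact itself.
const receipt = {
  asserts: "extraction_provenance_only",
  model_ref: "mir:mistral-ocr-4",
  fields_root: "0x64c489a554ee2f8fa380cb5b2ff3c16465cecbc26ec8f76824f514dcea8a34a1",
};

// Hypothetical correctness attestation that points back at the receipt by its root.
function buildCorrectnessAttestation(rcpt, check) {
  if (rcpt.asserts !== "extraction_provenance_only") {
    throw new Error("unexpected receipt scope");
  }
  return {
    asserts: "value_correctness",   // hypothetical label, not from the product
    receipt_root: rcpt.fields_root, // reference by root, as described above
    field: check.field,
    validated_by: check.validated_by,
    result: check.result,
  };
}

// True when the attestation refers to exactly this receipt.
const refersTo = (att, rcpt) => att.receipt_root === rcpt.fields_root;

const attestation = buildCorrectnessAttestation(receipt, {
  field: "total",
  validated_by: "ap-reviewer",
  result: "matches_purchase_order",
});
console.log(refersTo(attestation, receipt)); // true
```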

The receipts on the receipts

Reduction to practice. Real values. Tamper rejected.

A two-page invoice, nine fields, run from a fresh checkout. Six smoke criteria, six green. The hashes below are the actual outputs of that test run.

Smoke test 6 of 6 pass

1 ✓ Independent, zero-secret, offline verify of the whole receipt
2 ✓ Altering a bound value breaks the recomputed fields_root
3 ✓ Moving a bounding box one pixel breaks verify
4 ✓ Sensitive field provably absent yet provably extracted
5 ✓ Holder re-proves the redacted value; wrong value fails
6 ✓ One ML-DSA-65 signature per document, field-count independent

Signed test-run outputs

document commitment

0x4eb93cd5ffb6eedc57d0d1e358a47bab8d223bbcb5d8f4e70d7e27e2b7af6ab4

accumulated fields root

0x64c489a554ee2f8fa380cb5b2ff3c16465cecbc26ec8f76824f514dcea8a34a1

model ref · asserts

mir:mistral-ocr-4 · extraction_provenance_only

Nine fields: vendor, invoice no., date, bill-to tax ID (redacted), line item, amount, subtotal, tax, total. Adapters tested against two distinct extraction-engine output formats. The engine extracts; the receipt proves.

Hook in

Wrap your OCR. Issue a receipt. Verify offline.

Point AFiR-OCR at your extraction output — Mistral OCR, AWS Textract, any engine that emits regions and confidences. You get back one PQ-anchored receipt per document. Anyone with the public key verifies the whole thing offline.

extract_and_prove.js

import { buildReceipt, verify } from "@hive/afir-ocr"

// 1. run your OCR engine as usual -- AFiR-OCR does not change it
//    (mistralOCR and invoicePdf stand in for your own engine call and input)
const ocr = await mistralOCR(invoicePdf)   // fields + bbox + confidence

// 2. bind every field to its source region, model, and confidence
const receipt = await buildReceipt({
  doc: invoicePdf,
  model_ref: "mir:mistral-ocr-4",
  fields: ocr.fields,                       // value, bbox, block_class, confidence
  sensitive: ["bill_to_tax_id"],            // redaction-compatible commitment
  asserts: "extraction_provenance_only",    // provenance, not correctness
})
// receipt.fields_root  -> 0x64c489...a8a34a1
// receipt.sig          -> one ML-DSA-65 signature over the whole doc

// 3. anyone verifies offline -- no shared secret, no call home
//    (PUBLIC_KEY is the issuer's public verification key)
const { ok, why } = await verify(receipt, PUBLIC_KEY)
// move one bbox by a pixel or change one total -> ok === false
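What an offline verifier has to do (recompute the root, then check the signature) can be sketched with Node's built-in crypto. Two loud assumptions: Ed25519 is used here purely as a stand-in for ML-DSA-65, since ML-DSA support depends on the runtime, and the leaf and root encodings are hypothetical. The name verifyReceipt and the receipt layout are illustrative, not the @hive/afir-ocr API.

```javascript
const { createHash, generateKeyPairSync, sign, verify: checkSig } = require("node:crypto");

const sha256 = (...parts) => {
  const h = createHash("sha256");
  for (const p of parts) h.update(p);
  return h.digest();
};

// Hypothetical leaf and root encodings.
const leafHash = (f) =>
  sha256(JSON.stringify([f.value_commit, f.page, f.bbox, f.block_class, f.model_ref, f.confidence]));

function merkleRoot(leaves) {
  let level = leaves;
  while (level.length > 1) {
    const next = [];
    for (let i = 0; i < level.length; i += 2) {
      next.push(i + 1 < level.length ? sha256(level[i], level[i + 1]) : level[i]);
    }
    level = next;
  }
  return level[0];
}

// Offline verify: recompute the root from the receipt, then check the signature.
function verifyReceipt(receipt, publicKey) {
  if (receipt.fields.length === 0) return { ok: false, why: "no fields" };
  const root = merkleRoot(receipt.fields.map(leafHash));
  if (root.toString("hex") !== receipt.fields_root) {
    return { ok: false, why: "fields_root mismatch" };
  }
  const sigOk = checkSig(null, root, publicKey, Buffer.from(receipt.sig, "hex"));
  return sigOk ? { ok: true, why: null } : { ok: false, why: "bad signature" };
}

// Issuer side, with a stand-in Ed25519 key pair and placeholder field values.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");
const fields = [
  { value_commit: "total=1200.00", page: 2, bbox: [420, 700, 540, 724], block_class: "field", model_ref: "mir:mistral-ocr-4", confidence: 0.97 },
  { value_commit: "vendor=ACME", page: 1, bbox: [40, 32, 310, 58], block_class: "header", model_ref: "mir:mistral-ocr-4", confidence: 0.99 },
];
const root = merkleRoot(fields.map(leafHash));
const receipt = { fields, fields_root: root.toString("hex"), sig: sign(null, root, privateKey).toString("hex") };

// Tamper 1: change the total after signing.
const edited = JSON.parse(JSON.stringify(receipt));
edited.fields[0].value_commit = "total=9200.00";

// Tamper 2: also rewrite fields_root to match; the signature check now fails.
const reRooted = JSON.parse(JSON.stringify(edited));
reRooted.fields_root = merkleRoot(reRooted.fields.map(leafHash)).toString("hex");

console.log(verifyReceipt(receipt, publicKey));  // { ok: true, why: null }
console.log(verifyReceipt(edited, publicKey));   // { ok: false, why: 'fields_root mismatch' }
console.log(verifyReceipt(reRooted, publicKey)); // { ok: false, why: 'bad signature' }
```

The verifier needs only the receipt and the public key, which is why no shared secret or network call is involved.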

Who it's for

Built for document pipelines that have to be auditable.

Anywhere a model reads a regulated document and the value drives a decision or a payment.

Finance

Invoices & AP

Extracted totals and line items feed payment runs. AFiR-OCR proves the total on the receipt is the total read off the page, untampered.

Legal

Contracts & claims

Clause and figure extraction with a provable tie to the source region. Redaction-compatible for privileged or sensitive spans.

Healthcare

Clinical intake forms

PHI fields stay out of the receipt while remaining provably extracted. The data class lands on the receipt; the data does not.

Platform

OCR & IDP vendors

If you sell extraction to regulated buyers, AFiR-OCR is the proof layer their auditors will ask for. Your engine becomes the distribution surface.

AFiR-OCR (DocProof) is a patent-pending Hive Civilization primitive. It composes on the live signer — each document is sealed with one ML-DSA-65 (NIST FIPS 204) signature. Smoke-test values above are the actual outputs of the reduction-to-practice run: a nine-field invoice with a redacted bill-to tax ID, adapters verified against two distinct extraction-engine output formats. The receipt asserts extraction provenance, not value correctness. Request access from the mint page. AFiR-Stream lives at /real-time/; the full AFiR lineup at /afir/.