Comparing 5 models on structured extraction from printed business documents

Hi there! This is Sakasegawa-chan (@gyakuse)!

In my previous article I compared 19 models on Japanese handwritten note OCR. As a follow-up, this time I'd like to compare 5 models on extracting structured data from printed business documents (invoices, receipts, and business cards).

I generated 30 synthetic documents like the ones below and asked each model to extract structured data conforming to a JSON Schema, then measured accuracy.

Invoice sample

Receipt sample

Business card sample

How this differs from last time

Last time I measured character recognition accuracy: "can the model read handwriting?". This time I'm measuring "can the model put the right values into the right fields, given a printed document?".

Concretely, I hand the model an image of an invoice and ask: "Who is the vendor_name (issuer)? What is the total_amount? How are the line_items structured?" The model has to return structured data following a JSON Schema.

This isn't pure OCR. It's a 3-step task: read the characters → understand the document structure → assign values to schema fields. I leveraged each provider's structured output API to evaluate this.

Provider	Structured output method
Claude	tool_use
Gemini	response_schema
OpenAI	json_schema (strict)

Claude also supports json_schema mode, but it has a limit of 16 nullable (union-typed) parameters. The invoice schema has 20 nullable fields, which exceeds that limit, so I'm using tool_use instead.
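To check up front whether a schema fits under a limit like this, you can count its nullable fields mechanically. The following is a minimal sketch, not the repository's code: the schema fragment is hypothetical (only field names such as vendor_name and line_items come from this article), and whether the provider counts nested fields the same way is an assumption.

```python
from typing import Any

# Hypothetical fragment of an invoice schema, for illustration only.
INVOICE_SCHEMA_FRAGMENT: dict[str, Any] = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": "string"},
        "vendor_phone": {"type": ["string", "null"]},
        "due_date": {"anyOf": [{"type": "string"}, {"type": "null"}]},
        "total_amount": {"type": "integer"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "unit": {"type": ["string", "null"]},
                },
            },
        },
    },
}


def is_nullable(node: dict[str, Any]) -> bool:
    """True if this schema node is a union type that admits null."""
    declared = node.get("type")
    if isinstance(declared, list) and "null" in declared:
        return True
    return any(alt.get("type") == "null" for alt in node.get("anyOf", []))


def count_nullable(node: dict[str, Any]) -> int:
    """Count nullable properties, descending into objects and array items."""
    total = 0
    for child in node.get("properties", {}).values():
        total += int(is_nullable(child)) + count_nullable(child)
    if isinstance(node.get("items"), dict):
        total += count_nullable(node["items"])
    return total


print(count_nullable(INVOICE_SCHEMA_FRAGMENT))  # vendor_phone, due_date, unit
```

Running a count like this against the real invoice schema is one way to decide between json_schema mode and tool_use before sending any request.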

How the synthetic dataset is built

Automating data generation with an Agent Skill

I didn't build the evaluation dataset by hand. It's auto-generated by a generate-business-doc skill I implemented as a Claude Code Agent Skill. Just typing /generate-business-doc invoice 10 produces a complete set of data (JSON + HTML + PNG) for 10 invoices.

The skill is structured as an Orchestration: it launches two subagents in sequence.

/generate-business-doc invoice 10
  │
  ├─ Step 1: Check manifest
  │   └─ Look at existing data coverage and identify uncovered combinations
  │
  ├─ Step 2: content-generator subagent
  │   └─ Generate JSON Schema + 10 ground truth JSON files
  │
  ├─ Step 3: renderer subagent
  │   └─ Generate unique HTML/CSS from each JSON → screenshot via Playwright
  │
  └─ Step 4: Update manifest

content-generator: realistic Japanese business data

The content-generator subagent produces realistic Japanese business data spread across industries, regions, and sizes. It looks at the manifest's coverage and makes calls like "we have a lot of IT industry, so let's prioritize medical and construction next."

The data has these axes of diversity.

Axis	Variations
Industry	IT, manufacturing, F&B, construction, retail, medical, real estate, education
Region	Hokkaido, Tohoku, Kanto, Chubu, Kinki, Chugoku, Shikoku, Kyushu
Size	small (1-2 lines, <¥10,000), medium (3-5 lines), large (6+ lines, >¥500,000)
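The coverage check over these axes can be sketched as follows. This is an illustration only: the manifest format (a list of dicts with industry, region, and size keys) and both function names are assumptions, not the skill's actual implementation.

```python
from collections import Counter
from itertools import product

INDUSTRIES = ["IT", "manufacturing", "F&B", "construction",
              "retail", "medical", "real estate", "education"]
REGIONS = ["Hokkaido", "Tohoku", "Kanto", "Chubu",
           "Kinki", "Chugoku", "Shikoku", "Kyushu"]
SIZES = ["small", "medium", "large"]


def uncovered(manifest: list[dict]) -> list[tuple[str, str, str]]:
    """Return every (industry, region, size) combination not yet generated."""
    seen = {(d["industry"], d["region"], d["size"]) for d in manifest}
    return [combo for combo in product(INDUSTRIES, REGIONS, SIZES)
            if combo not in seen]


def least_covered_industries(manifest: list[dict], k: int = 2) -> list[str]:
    """Return the k industries with the fewest existing documents."""
    counts = Counter({name: 0 for name in INDUSTRIES})
    counts.update(d["industry"] for d in manifest)
    # sorted() is stable, so ties keep the order of INDUSTRIES.
    return [name for name, _ in sorted(counts.items(), key=lambda kv: kv[1])[:k]]


# Hypothetical manifest entries.
manifest = [
    {"industry": "IT", "region": "Kanto", "size": "small"},
    {"industry": "IT", "region": "Kanto", "size": "small"},
    {"industry": "IT", "region": "Kinki", "size": "medium"},
    {"industry": "manufacturing", "region": "Chubu", "size": "large"},
]
print(least_covered_industries(manifest))
print(len(uncovered(manifest)))
```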

Numerical consistency (subtotal + tax = total) is also part of the generation rules, and that becomes the ground truth as is.
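A consistency rule like this is easy to verify after generation. Here is a minimal sketch using the figures from invoice_004 in the appendix. The function name is hypothetical, and the line-item sum check is my assumption; the only rule stated above is subtotal + tax = total.

```python
def check_totals(doc: dict) -> list[str]:
    """Return a list of consistency problems (empty if the document is consistent)."""
    problems = []
    # Assumed extra rule: line-item amounts should add up to the subtotal.
    items_sum = sum(item["amount"] for item in doc["line_items"])
    if items_sum != doc["subtotal"]:
        problems.append(f"line items sum to {items_sum}, subtotal is {doc['subtotal']}")
    # Rule stated in the article: subtotal + tax = total.
    if doc["subtotal"] + doc["tax_amount"] != doc["total_amount"]:
        problems.append("subtotal + tax_amount != total_amount")
    return problems


invoice_004 = {
    "line_items": [
        {"amount": 24000},
        {"amount": 33000},
        {"amount": 18500},
        {"amount": 36000},
    ],
    "subtotal": 111500,
    "tax_amount": 11150,
    "total_amount": 122650,
}
print(check_totals(invoice_004))  # → []
```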

renderer: HTML generation without templates

This is the most fun part. The renderer subagent doesn't use fixed templates. The LLM generates fresh HTML/CSS every time. So even for the same "invoice", you get a monochrome minimalist design one time, a navy-blue header with a striped line-item table the next time, and so on. Every output looks different.

I put a lot of care into the receipts in particular — they faithfully reproduce the look of POS thermal printer output. The skill instructions are pretty detailed.

Color is black text on white only (no color accents)
Font is M PLUS 1 Code (monospace) to mimic thermal print
Separator lines are not CSS borders — they're repeated text characters like ━━━━━━
<table> elements are forbidden. Use flexbox or text-align for layout
Total emphasis is via inverted display (black background + white text), large font, or bold + letter-spacing

Paper widths are 58mm (220px) and 80mm (300px). Separator characters span 5 variations: ━, ─, ＝, *, -. Line-item display has 3 patterns: 1-line, 2-line, and quantity inline. All mixed.

For Playwright screenshots I use device_scale_factor=2. Invoices capture as A4 (794x1123), business cards as 91mm x 55mm (346x210), and receipts use full_page.
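The screenshot step can be sketched roughly like this with Playwright's Python sync API. The function names, the kind keys, using the paper width as the receipt viewport width, and the placeholder receipt height are all assumptions; only the pixel sizes, device_scale_factor=2, and full_page for receipts come from the text above.

```python
from pathlib import Path

# Viewport sizes in CSS pixels. Receipt heights are placeholders,
# because full_page captures the whole page regardless of viewport height.
VIEWPORTS = {
    "invoice": {"width": 794, "height": 1123},       # A4
    "business_card": {"width": 346, "height": 210},  # 91mm x 55mm
    "receipt_58mm": {"width": 220, "height": 600},
    "receipt_80mm": {"width": 300, "height": 600},
}


def uses_full_page(kind: str) -> bool:
    """Receipts vary in length, so they are captured as full-page screenshots."""
    return kind.startswith("receipt")


def render_png(html: str, kind: str, out_path: Path) -> None:
    """Render an HTML string to a PNG at 2x device scale."""
    # Imported lazily so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context(
            viewport=VIEWPORTS[kind],
            device_scale_factor=2,
        )
        page = context.new_page()
        # Wait for network idle so web fonts have a chance to load.
        page.set_content(html, wait_until="networkidle")
        page.screenshot(path=str(out_path), full_page=uses_full_page(kind))
        browser.close()
```

With device_scale_factor=2, an A4 invoice captured at 794x1123 CSS pixels comes out as a 1588x2246 pixel image.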

This kind of "auto-generate diverse-layout synthetic data" task plays directly into Agent Skill's strengths. Template-based approaches hit a diversity ceiling, but if you ask an LLM to "make a different layout each time," you really do get something different every time.

The 5 models compared

Model	Provider	Structured output method
claude-4.6-opus	Anthropic	tool_use
claude-4.5-sonnet	Anthropic	tool_use
gemini-3.1-pro-preview	Google	response_schema
gemini-3-flash-preview	Google	response_schema
gpt-5.4	OpenAI	json_schema (strict)

The previous article compared 19 models including 11 OSS models, but this time it's API models only. Structured output (output that conforms to a JSON Schema) requires API-side schema enforcement, so OSS models without tool_use or response_schema are out of scope.

Evaluation methodology

Field-level accuracy

Evaluation is per-field. For each field, I compare the prediction against ground truth and assign a score from 0.0 to 1.0.

String fields: NFKC normalized + whitespace stripped, then compared via Normalized Levenshtein Similarity
Numeric fields: 1.0 on exact match; otherwise the score is reduced according to the size of the difference
Date fields: normalize Japanese era / slash notations, then exact-match
Array fields (line_items): optimal matching via Hungarian algorithm, then per-element comparison

The mean of all field scores is the document's accuracy.

parse / schema success rate

Since I'm using structured output APIs, parse basically succeeds 100% of the time (the JSON is always valid). Schema compliance verifies presence of required fields and the structure of nested objects.
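A schema compliance check of this kind can be sketched with a small recursive function. This is a hand-rolled illustration rather than the evaluation code itself; the function name, the path notation, and the example schema are assumptions.

```python
def missing_required(schema: dict, value, path: str = "$") -> list[str]:
    """Report missing required fields and wrongly shaped nested values."""
    problems: list[str] = []
    if schema.get("type") == "object":
        if not isinstance(value, dict):
            return [f"{path}: expected object"]
        for key in schema.get("required", []):
            if key not in value:
                problems.append(f"{path}.{key}: missing")
        # Recurse into the properties that are present.
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                problems += missing_required(sub_schema, value[key], f"{path}.{key}")
    elif schema.get("type") == "array":
        if not isinstance(value, list):
            return [f"{path}: expected array"]
        for index, item in enumerate(value):
            problems += missing_required(schema.get("items", {}), item, f"{path}[{index}]")
    return problems


# Hypothetical receipt schema fragment.
schema = {
    "type": "object",
    "required": ["store_name", "line_items"],
    "properties": {
        "store_name": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {"type": "object", "required": ["description", "amount"]},
        },
    },
}
output = {
    "store_name": "味噌家 名古屋栄店",
    "line_items": [
        {"description": "味噌カツ定食", "amount": 950},
        {"description": "生ビール(中)"},
    ],
}
print(missing_required(schema, output))  # → ['$.line_items[1].amount: missing']
```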

Results

Evaluation results across 30 printed business documents.

Rank	Model	Accuracy	Parse	Schema	Avg Time
1	claude-4.6-opus	0.9931	100%	100%	10.4s
2	gemini-3-flash-preview	0.9925	100%	100%	9.9s
3	gemini-3.1-pro-preview	0.9909	100%	100%	19.4s
4	gpt-5.4	0.9900	100%	100%	6.9s
5	claude-4.5-sonnet	0.9733	100%	100%	10.0s

All models hit 100% parse/schema, and the top 4 are within about 0.3 percentage points of each other (0.9900 to 0.9931). In the previous handwriting OCR test, Gemini 3.1 Pro topped the chart at 0.924 with GPT-5.4 down at 10th with 0.714, a huge gap. For printed-document structured extraction, they're basically tied. Reading printed text is taken for granted; the differentiator is structural understanding and field assignment accuracy.

Per-document-type accuracy

Model	Invoice	Receipt	Business card
claude-4.6-opus	0.9886	0.9906	1.0000
gemini-3-flash-preview	0.9888	0.9887	1.0000
gemini-3.1-pro-preview	0.9901	0.9825	1.0000
gpt-5.4	0.9884	0.9874	0.9941
claude-4.5-sonnet	0.9605	0.9601	0.9991

Claude Opus and both Gemini models scored a perfect 1.0 on business cards. Cards have few fields and a fairly fixed layout, so it's an easy task for the top models.

Invoices and receipts get harder as line-item count grows. Receipts in particular use a monospace text layout. It's readable for humans, but slightly different from a typical table layout, so the models tend to lose a bit of accuracy.

Hard fields

Looking at the fields with low accuracy across all models, a pattern emerges.

Field	All-model average accuracy	Cause
line_items	0.888	Array matching is strict. Item-name notation variation hurts
vendor_address	0.898	Address notation variation (「三丁目」↔「3-」, with/without postal code)
client_address	0.909	Same as above
bank_account_holder	0.977	Katakana account name variation

The address notation variation is partially an evaluation-logic issue. "愛知県名古屋市中区栄三丁目5番12号" and "愛知県名古屋市中区栄3-5-12" mean the same thing semantically, but Levenshtein-based string comparison gives them about 0.78. All models share the same conditions, so this doesn't affect inter-model comparison, but the absolute accuracy values look slightly low.

line_items is the lowest because of the strictness of array comparison. When matching items via the Hungarian algorithm, subtle differences in item names — like full-width vs half-width parentheses in "クラウドサーバー利用料(AWSホスティング)" — start to matter.
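The array matching step can be sketched as an assignment problem: pair each ground-truth item with a predicted item so that total similarity is maximized. To stay dependency-free, the sketch below brute-forces all permutations instead of running the Hungarian algorithm. It finds the same optimum but is only practical for short arrays, and a real implementation would more likely use a polynomial-time solver such as scipy.optimize.linear_sum_assignment. The similarity function (difflib's ratio after NFKC normalization) is also a stand-in, not the metric used in the evaluation.

```python
import unicodedata
from difflib import SequenceMatcher
from itertools import permutations


def similarity(a: str, b: str) -> float:
    """Stand-in similarity: difflib ratio on NFKC-normalized strings."""
    na, nb = unicodedata.normalize("NFKC", a), unicodedata.normalize("NFKC", b)
    return SequenceMatcher(None, na, nb).ratio()


def best_assignment(preds: list[str], truths: list[str]) -> list[tuple[int, int]]:
    """Return (truth_index, pred_index) pairs that maximize total similarity."""
    n_truth, n_pred = len(truths), len(preds)
    size = max(n_truth, n_pred)
    # Pad to a square matrix; padded cells have zero similarity.
    sim = [[similarity(preds[j], truths[i]) if i < n_truth and j < n_pred else 0.0
            for j in range(size)]
           for i in range(size)]
    best = max(permutations(range(size)),
               key=lambda perm: sum(sim[i][perm[i]] for i in range(size)))
    # Drop pairs that involve padding.
    return [(i, best[i]) for i in range(n_truth) if best[i] < n_pred]


truths = ["食材仕入(野菜・果物類)", "食材仕入(精肉・鮮魚類)", "調理器具消耗品一式"]
preds = ["調理器具消耗品一式", "食材仕入（精肉・鮮魚類）", "食材仕入(野菜・果物類)"]
print(best_assignment(preds, truths))  # → [(0, 2), (1, 1), (2, 0)]
```

Once the items are paired, each pair is scored field by field, which is where small notation differences in the description pull the line_items score down.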

Handwriting vs printed: rankings shuffle

When I line this up next to the previous handwriting OCR results, an interesting pattern emerges.

Model	Handwriting OCR (NLS)	Printed structured (Accuracy)
claude-4.6-opus	0.897 (4th)	0.9931 (1st)
gemini-3-flash-preview	0.918 (2nd)	0.9925 (2nd)
gemini-3.1-pro-preview	0.924 (1st)	0.9909 (3rd)
gpt-5.4	0.714 (10th)	0.9900 (4th)
claude-4.5-sonnet	0.640 (12th)	0.9733 (5th)

GPT-5.4 improved dramatically. It struggled at 10th out of 19 on handwriting OCR, but on printed structured extraction it's 4th, only about 0.3 points behind the top model. So GPT-5.4 is bad at "reading" handwriting but good at "reading and structuring" printed text.

Conversely, Gemini 3.1 Pro was 1st on handwriting OCR, but drops to 3rd on printed structured. The gap to 1st is only about 0.2 points, so it's basically noise, but Gemini Flash edging out Gemini Pro is a bit surprising.

Claude 4.5 Sonnet is last among these 5 models on both tasks (12th of 19 on handwriting), but printed structured (0.9733) is dramatically higher than handwriting OCR (0.640). This shows the generational pattern of "can read print, weak on handwriting."

Speed

GPT-5.4's 6.9s average is fastest. On the previous handwriting OCR it was 123.4s — overwhelmingly slow — but for structured extraction tasks, reasoning time appears to be much shorter. Gemini Flash is also fast at 9.9s. Gemini Pro is on the slower side at 19.4s.

Summary

Printed business document structured extraction is a tight race among the top 4 models, all within about 0.3 points of each other. Any of them gives sufficient accuracy.
GPT-5.4, which sat at 10th on handwriting OCR, jumped to 4th on printed structured extraction. Tasks really do have model-specific strengths and weaknesses.
The evaluation code and dataset are published in the ocr-comparison repository under structured_eval/.

Appendix: per-image extraction examples

Let me show actual document images alongside each model's extraction results.

Invoice: invoice_004 (monochrome, compact)

invoice_004

A monochrome, compact F&B-industry invoice with 4 line items. Claude 4.6 Opus, both Gemini models, and GPT-5.4 extract it almost perfectly. Here are just the fields where they diverged.

Field	Ground truth	claude-4.6-opus	claude-4.5-sonnet	gemini-3.1-pro	gemini-3-flash	gpt-5.4
vendor_name	株式会社なにわフードサービス	OK	NG: 有限会社心斎橋キッチン御中	OK	OK	OK
client_name	有限会社心斎橋キッチン	OK	NG: 株式会社なにわフードサービス	OK	OK	OK
vendor_phone	06-6213-4567	OK	NG: null	OK	OK	OK
vendor_address	...道頓堀二丁目3番8号...	OK	NG: ...心斎橋筋一丁目7番5号	OK (with 〒)	OK	OK
bank_account_holder	カ）ナニワフードサービス	OK	OK	OK	OK	OK (with a stray space)

Claude 4.5 Sonnet swapped the vendor (issuer) and client (recipient). Looking at the image, "有限会社心斎橋キッチン御中" is displayed prominently at the top, and Sonnet apparently mistook it for the issuer. Japanese invoices put the recipient's name in a visually prominent position, so a model with shallow document-structure understanding can mix them up.
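A swap like this leaves a visible trace: the extracted vendor_name ends with the honorific 御中, which is attached to the addressee, not the issuer. As a hedged illustration (this check is not part of the evaluation, and the function name is hypothetical), a post-extraction sanity check could flag such outputs.

```python
# Honorifics that follow the addressee's name on Japanese business documents.
RECIPIENT_HONORIFICS = ("御中", "様", "殿")


def vendor_looks_like_recipient(doc: dict) -> bool:
    """Flag outputs whose vendor_name carries a recipient honorific."""
    vendor = (doc.get("vendor_name") or "").strip()
    return vendor.endswith(RECIPIENT_HONORIFICS)


sonnet_output = {
    "vendor_name": "有限会社心斎橋キッチン御中",
    "client_name": "株式会社なにわフードサービス",
}
opus_output = {
    "vendor_name": "株式会社なにわフードサービス",
    "client_name": "有限会社心斎橋キッチン",
}
print(vendor_looks_like_recipient(sonnet_output))  # → True
print(vendor_looks_like_recipient(opus_output))    # → False
```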

The other 4 models extracted every field correctly. GPT-5.4 has a stray space in the account holder name, but it's a harmless, minor difference.

Claude 4.6 Opus's full output looks like this.

{
  "vendor_name": "株式会社なにわフードサービス",
  "client_name": "有限会社心斎橋キッチン",
  "invoice_number": "INV-2026-0317-04",
  "issue_date": "2026-03-17",
  "due_date": "2026-04-20",
  "line_items": [
    {"description": "食材仕入(野菜・果物類)", "quantity": 30, "unit": "kg", "unit_price": 800, "amount": 24000},
    {"description": "食材仕入(精肉・鮮魚類)", "quantity": 15, "unit": "kg", "unit_price": 2200, "amount": 33000},
    {"description": "調理器具消耗品一式", "quantity": 1, "unit": "式", "unit_price": 18500, "amount": 18500},
    {"description": "店舗清掃サービス(月次)", "quantity": 4, "unit": "回", "unit_price": 9000, "amount": 36000}
  ],
  "subtotal": 111500, "tax_rate": 0.1, "tax_amount": 11150, "total_amount": 122650,
  "bank_name": "大阪シティ信用金庫", "bank_branch": "道頓堀支店",
  "bank_account_type": "普通", "bank_account_number": "2345678",
  "bank_account_holder": "カ）ナニワフードサービス"
}

Receipt: receipt_008 (58mm wide, F&B)

receipt_008

A 58mm-wide thermal-print-style receipt. Separator lines use ━, and the total is emphasized with a large font. The date on this receipt is shown as R8.03.17 (Reiwa year 8) format.

Field	Ground truth	claude-4.6-opus	claude-4.5-sonnet	gemini-3.1-pro	gemini-3-flash	gpt-5.4
store_name	味噌家名古屋栄店	OK	NG: 味噌蔵	OK	OK	OK
issue_date	2026-03-17	OK	NG: 2028-03-17	OK	OK	OK
Other fields		OK	OK	OK	OK	OK

Claude 4.5 Sonnet has 2 mistakes. Misreading "味噌家" as "味噌蔵" is an OCR accuracy problem. The other one, issue_date: 2028-03-17, is a Japanese-era conversion mistake. "R8.03.17" on the receipt means Reiwa year 8 = 2026, but Sonnet computed Reiwa 8 as 2028 (Reiwa 1 = 2019, so 2019 + 8 - 1 = 2026 is the correct answer).
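The era arithmetic is simple enough to write down. Here is a minimal sketch of converting the abbreviated era notation to an ISO date. The function name and the accepted format (a single era letter followed by dot-separated numbers) are assumptions for illustration; the evaluation's own date normalizer may differ.

```python
import re
from datetime import date

# Gregorian year = base + era year (Reiwa 1 = 2019, Heisei 1 = 1989, Showa 1 = 1926).
ERA_BASE_YEAR = {"R": 2018, "H": 1988, "S": 1925}


def era_to_iso(text: str) -> str:
    """Convert a date like 'R8.03.17' to 'YYYY-MM-DD'."""
    match = re.fullmatch(r"([RHS])(\d{1,2})\.(\d{1,2})\.(\d{1,2})", text.strip())
    if match is None:
        raise ValueError(f"unsupported date format: {text!r}")
    era = match.group(1)
    era_year, month, day = (int(g) for g in match.group(2, 3, 4))
    # date() also validates the month and day ranges.
    return date(ERA_BASE_YEAR[era] + era_year, month, day).isoformat()


print(era_to_iso("R8.03.17"))  # → 2026-03-17
```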

The 4 other models produced identical output. Here's Claude 4.6 Opus's output.

{
  "store_name": "味噌家 名古屋栄店",
  "store_address": "愛知県名古屋市中区栄3丁目15-22",
  "store_phone": "052-263-7841",
  "store_registration_number": "T4920163857402",
  "receipt_number": "R-20260317-0391",
  "issue_date": "2026-03-17",
  "client_name": null,
  "line_items": [
    {"description": "味噌カツ定食", "quantity": 1, "unit_price": 950, "amount": 950},
    {"description": "生ビール(中)", "quantity": 1, "unit_price": 580, "amount": 580}
  ],
  "subtotal": 1530,
  "tax_rate_8": 950, "tax_amount_8": 76,
  "tax_rate_10": 580, "tax_amount_10": 58,
  "total_amount": 1664,
  "payment_method": "現金",
  "notes": null
}

Business card: business_card_005 (real estate)

business_card_005

A real estate industry business card. All 5 models produced identical output.

{
  "person_name": "中村 陽介",
  "person_name_reading": "なかむら ようすけ",
  "company_name": "株式会社四国ハウジング",
  "company_name_en": "Shikoku Housing Co., Ltd.",
  "department": "開発企画部",
  "title": "部長",
  "postal_code": "760-0033",
  "address": "香川県高松市丸の内1丁目3-2 高松センタービル10F",
  "phone": "087-822-5670",
  "fax": "087-822-5671",
  "mobile": "080-2241-3388",
  "email": "[email protected]",
  "website": "https://www.shikoku-housing.co.jp"
}

All 13 fields match exactly. Business cards have a fixed layout and few fields, so for the top models this is essentially impossible to get wrong. Claude Opus and both Gemini models hit 1.0 across all 10 business cards.

References

ocr-comparison (GitHub)
Previous article
- 日本語の手書きメモを書き起こせるOCRを探すために19モデルを片っ端から試した話
Agent Skills
- skill-creatorから学ぶSkill設計と、Orchestration Skillの作り方
Structured outputs