I built a document extraction pipeline that combines OCR, bounding boxes, and targeted vision verification.
The goal was not just to extract text from PDFs. The goal was to make every extracted value auditable: what text was found, where it came from on the page, how confident the system was, and whether a vision model had to verify it.
The naive approach is to just throw an LLM at it. Feed the PDF to GPT-5 or Mistral, ask it to extract everything, hope for the best. That works for demos. It does not work when you need to process hundreds of pages reliably. LLMs hallucinate numbers, they miss fields in dense tables, they cannot tell you where on the page they found something, and they cost real money per call.
I needed something better. What I ended up with is a three-layer pipeline: PaddleOCR for fast text extraction with bounding boxes, a vision model (MiniMax M3) that verifies anything the OCR is not sure about, and an LLM downstream that structures the verified data based on spatial relationships. The OCR stage runs on Modal with T4 GPUs. The verification stage calls MiniMax M3 through OpenRouter. Results are cached so repeated document views do not trigger new OCR or vision calls.
Here is the live component showing the output from a sample bank statement I ran through the pipeline:
Why vision verification matters
Look at the "Average Balance" field in the component above. PaddleOCR read it as $045.24 with 93.3% confidence. That is a real mistake. The actual value on the document is $643.24. The OCR got confused by the surrounding layout and misread the digits.
This is exactly where the vision model earns its place. When I passed that page to MiniMax M3 with the list of low-confidence fields and their positions, it looked at the image, found the "Average Balance" row, read the correct value, and returned $643.24 with 95% confidence. It caught the error because it had the full page context (the column headers, the row alignment, the surrounding values) to disambiguate what the OCR got wrong.
Here is the thing though. You cannot just throw the whole page at an LLM and expect it to extract everything perfectly. I tried that. The LLM misses fields. It skips rows in dense tables. It hallucinates values when the text is small or the background is noisy. On this two-page sample, PaddleOCR surfaced 226 text blocks, enough to capture the full visible structure of the statement. An LLM pass over the same document typically catches significantly fewer fields, and the ones it misses tend to be the small, dense, or repetitive rows that matter most.
The vision model does not replace OCR. It verifies candidate fields already surfaced by OCR. If OCR completely misses a region of the page, this pipeline does not recover it. That would need a separate page-level detection or document-type-specific pass.
So the architecture is deliberate: OCR for breadth, vision model for depth, LLM for structuring. The OCR extracts everything it can see. The vision model verifies the uncertain parts using full page context. Later, once you have all the verified fields with their bounding box positions, you can pass that structured data into an LLM to organize it however you need: group transactions, calculate totals, fill in forms, whatever the next step is. The bounding boxes tell the LLM exactly where each piece of data lives on the page, which gives it the spatial context to make good decisions about relationships between fields.
That three-layer approach is the core idea. Each layer does what it is best at, and the handoff between them is clean.
The core insight
The key realization was that OCR engines are actually very good at what they do. What they lack is a way to recover when they are unsure. PaddleOCR gives you a confidence score for every text block it detects. Most of the time, that score is above 95% and the text is correct. But when it drops below that threshold (noisy backgrounds, small text, ambiguous characters), you need a second opinion.
So instead of asking an expensive vision model to re-read every page, I only ask it about the fields that need verification, on the pages that contain them. The OCR handles the majority of the work cheaply and fast. The vision model only gets called for the hard cases.
Stage 1: PaddleOCR bounding boxes
I run PaddleOCR on a Modal-hosted T4 GPU. Each text block comes back with polygon coordinates, the four corners of the detected text region, and a confidence score. The worker flattens these into axis-aligned rectangles:
import base64
import io

import modal
import numpy as np
from PIL import Image

# `app` (a modal.App) and `paddle_image` (the container image with PaddleOCR
# installed) are assumed to be defined elsewhere in this module.

@app.cls(image=paddle_image, gpu="T4", timeout=180)
class PaddleOCRWorker:
    @modal.enter()
    def setup(self):
        # Imported here so the model only loads inside the container
        from paddleocr import PaddleOCR
        self.ocr = PaddleOCR(
            use_doc_orientation_classify=False,
            use_doc_unwarping=False,
            use_textline_orientation=False,
            lang="en",
            device="gpu",
        )

    @modal.method()
    def ocr_pages(self, pages_b64):
        results = []
        for idx, page_b64 in enumerate(pages_b64):
            # Force RGB so PNGs with an alpha channel or palette still decode to 3 channels
            img = Image.open(io.BytesIO(base64.b64decode(page_b64))).convert("RGB")
            ocr_result = self.ocr.predict(np.array(img))
            text_blocks = []
            for item in ocr_result:
                texts = item.get("rec_texts", [])
                polys = item.get("rec_polys", [])
                scores = item.get("rec_scores", [])
                for text, poly, score in zip(texts, polys, scores):
                    # Flatten the four-corner polygon into an axis-aligned rectangle
                    xs = [p[0] for p in poly]
                    ys = [p[1] for p in poly]
                    text_blocks.append({
                        "text": text,
                        "x": round(float(min(xs)), 1),
                        "y": round(float(min(ys)), 1),
                        "width": round(float(max(xs) - min(xs)), 1),
                        "height": round(float(max(ys) - min(ys)), 1),
                        "confidence": round(float(score), 4),
                    })
            results.append({"pageNumber": idx + 1, "textBlocks": text_blocks})
        return results
The polygon coordinates are in the pixel space of the rendered PNG. Before they can be used as overlay positions on the PDF viewer, I scale them to PDF point space (612 x 792 for a letter page) using the actual image dimensions reported by sips (the macOS command-line image tool):
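// imgW and imgH are the pixel dimensions of the rendered page PNG
const scaleX = 612 / imgW;
const scaleY = 792 / imgH;
const scaledTextBlocks = ocrResults.flatMap(r =>
  r.textBlocks.map(b => ({
    ...b,
    // The OCR worker stores pageNumber on the page result, not on each block,
    // so carry it over here for the later per-page lookups
    pageNumber: r.pageNumber,
    x: Math.round(b.x * scaleX * 10) / 10,
    y: Math.round(b.y * scaleY * 10) / 10,
    width: Math.round(b.width * scaleX * 10) / 10,
    height: Math.round(b.height * scaleY * 10) / 10,
  }))
);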
This is the part that makes the bounding box overlays line up with the rendered page images in the viewer. If your coordinates are off by even a few pixels, the highlights look wrong.
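To make the scaling concrete, here is a minimal sketch with worked numbers. The `Box` type and the `scaleBox` helper are illustrative names, not part of the pipeline code, and the page size defaults to the letter dimensions used above.

```typescript
// Illustrative types and names; only the 612 x 792 page size comes from the post.
interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Round to one decimal place, matching the snippet above.
const round1 = (v: number): number => Math.round(v * 10) / 10;

// Convert a box from image pixel space to PDF point space.
function scaleBox(box: Box, imgW: number, imgH: number, pageW = 612, pageH = 792): Box {
  const scaleX = pageW / imgW;
  const scaleY = pageH / imgH;
  return {
    x: round1(box.x * scaleX),
    y: round1(box.y * scaleY),
    width: round1(box.width * scaleX),
    height: round1(box.height * scaleY),
  };
}

// A letter page rendered at 144 DPI is 1224 x 1584 pixels (8.5in x 11in),
// so both scale factors are exactly 0.5.
const scaled = scaleBox({ x: 200, y: 300, width: 150, height: 24 }, 1224, 1584);
// scaled is { x: 100, y: 150, width: 75, height: 12 }
```

Deriving the factors from the real image dimensions, rather than assuming a fixed render resolution, is what keeps the overlays aligned if the rasterizer's output size changes.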
Stage 2: Parsing fields from OCR blocks
Once I have scaled text blocks, I parse them into structured fields. The first parser is intentionally simple. It works well on clean label/value regions where the OCR block contains both the label and value separated by a colon or double-space:
function ocrFieldsFromBlocks(blocks: TextBlock[]): ExtractedField[] {
  const fields: ExtractedField[] = [];
  for (const block of blocks) {
    const text = block.text.trim();
    // Default: no separator found, so the whole block is both name and value
    let name = text, value = text;
    // Case 1: "Label: value"
    const colonMatch = text.match(/^(.+?):\s*(.+)$/);
    if (colonMatch) {
      name = colonMatch[1].trim();
      value = colonMatch[2].trim();
    } else {
      // Case 2: "Label  value" separated by two or more spaces
      const spaceMatch = text.match(/^(.+?)\s{2,}(.+)$/);
      if (spaceMatch) {
        name = spaceMatch[1].trim();
        value = spaceMatch[2].trim();
      }
    }
    fields.push({
      name, value,
      confidence: block.confidence ?? 1.0,
      boundingBox: { x: block.x, y: block.y, width: block.width, height: block.height, pageNumber: block.pageNumber },
    });
  }
  return fields;
}
This parser is not the final extraction layer. Dense tables, repeated columns, multi-page layouts, and documents with implicit labels need document-specific parsing or a later layout-aware structuring step. The parser gets you structured fields from the easy cases. The rest relies on the bounding box positions and the downstream LLM.
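As one illustration of what a layout-aware step could look like, the sketch below groups OCR blocks into visual rows using only their bounding boxes. It is not the pipeline's actual structuring code: `groupIntoRows`, the `Block` type, and the tolerance value are assumptions made for the example.

```typescript
// Illustrative sketch; names and the tolerance heuristic are assumptions.
interface Block {
  text: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

// Group blocks whose vertical centers are close into rows,
// then order each row left to right.
function groupIntoRows(blocks: Block[], tolerance = 0.5): Block[][] {
  const centerY = (b: Block): number => b.y + b.height / 2;
  const sorted = [...blocks].sort((a, b) => centerY(a) - centerY(b));

  const rows: Block[][] = [];
  let current: Block[] = [];
  let anchor: Block | undefined; // first block of the current row

  for (const block of sorted) {
    // Same row if the center is within `tolerance` block-heights of the row's anchor
    if (anchor && Math.abs(centerY(block) - centerY(anchor)) <= tolerance * anchor.height) {
      current.push(block);
    } else {
      current = [block];
      anchor = block;
      rows.push(current);
    }
  }

  for (const row of rows) row.sort((a, b) => a.x - b.x);
  return rows;
}

const demoRows = groupIntoRows([
  { text: "$643.24", x: 400, y: 101, width: 50, height: 12 },
  { text: "Date", x: 50, y: 130, width: 30, height: 12 },
  { text: "Average Balance", x: 50, y: 100, width: 90, height: 12 },
]).map(row => row.map(b => b.text).join(" | "));
// demoRows is ["Average Balance | $643.24", "Date"]
```

Rows like these give a downstream parser or LLM a label and its value side by side, even when OCR returned them as separate blocks.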
The 95% confidence threshold
Every extracted field gets a confidence score from PaddleOCR. I set three tiers:
- 95% and above. High confidence. Pass directly.
- 90% to 95%. Medium confidence. Flagged for optional review.
- Below 90%. Low confidence. Escalated for vision model verification.
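The tiers above can be sketched as a small classifier. The function and constant names are illustrative. The routing rule in `needsVisionCheck` is taken from the verification code in this post, which filters on confidence below 0.95.

```typescript
// Illustrative names; thresholds are the ones listed above.
type Tier = "high" | "medium" | "low";

const HIGH_THRESHOLD = 0.95; // at or above: pass directly
const MEDIUM_THRESHOLD = 0.9; // at or above, but below HIGH: medium tier

function classifyConfidence(confidence: number): Tier {
  if (confidence >= HIGH_THRESHOLD) return "high";
  if (confidence >= MEDIUM_THRESHOLD) return "medium";
  return "low";
}

// Anything under 0.95 is sent to the vision model, matching the
// `f.confidence < 0.95` filter used in the verification step.
function needsVisionCheck(confidence: number): boolean {
  return confidence < HIGH_THRESHOLD;
}
```

For example, the "Average Balance" block at 93.3% lands in the medium tier and still qualifies for a vision check.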
In practice, everything below 95% (both the medium and low tiers) gets routed to MiniMax M3, a multimodal vision model that receives the rendered page image along with the list of low-confidence fields and their approximate positions:
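import { readFile } from "node:fs/promises";

async function verifyLowConfidenceFields(
  fields: ExtractedField[],
  pagePaths: string[],
): Promise<Map<number, VerifiedField>> {
  const lowConf = fields.filter(f => f.confidence < 0.95 && f.boundingBox);
  if (lowConf.length === 0) return new Map();

  // Group by page, send each page's image + field list to MiniMax M3
  const byPage = new Map<number, ExtractedField[]>();
  for (const f of lowConf) {
    const pg = f.boundingBox.pageNumber;
    if (!byPage.has(pg)) byPage.set(pg, []);
    byPage.get(pg)!.push(f);
  }

  // Assumes pagePaths[n - 1] is the rendered PNG for page n
  const pageRequests = await Promise.all(
    [...byPage].map(async ([pageNum, pageFields]) => ({
      pageNum,
      pageFields,
      pageBase64: (await readFile(pagePaths[pageNum - 1])).toString("base64"),
    })),
  );

  const verified = new Map<number, VerifiedField>();
  // allSettled: one failed page does not abort the others
  const results = await Promise.allSettled(
    pageRequests.map(async ({ pageNum, pageFields, pageBase64 }) => {
      // Prompt construction is not shown here: it lists pageFields with their positions
      const prompt = buildVerificationPrompt(pageFields);
      const resp = await fetch("https://openrouter.ai/api/v1/chat/completions", {
        method: "POST",
        headers: {
          // The env var name is an assumption; OpenRouter expects a bearer token
          Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({
          model: "minimax/minimax-m3",
          temperature: 0.0,
          messages: [
            { role: "system", content: "You are a document verification assistant..." },
            {
              role: "user",
              content: [
                { type: "text", text: prompt },
                { type: "image_url", image_url: { url: `data:image/png;base64,${pageBase64}` } },
              ],
            },
          ],
        }),
      });
      if (!resp.ok) throw new Error(`Verification failed for page ${pageNum}: ${resp.status}`);
      // Parse verified values into `verified`, merge back keeping original bounding boxes
    })
  );
  return verified;
}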
The key design choice: I do not crop sections and send them individually. I send the full rendered page image and let the vision model locate the fields using the position hints I provide. This gives the model full context (surrounding labels, column headers, row alignment), which improves accuracy over isolated crops.
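The actual prompt is not shown in this post, so the following is only a sketch of how position hints might be turned into prompt text. The `FieldHint` type, the wording of the prompt, and the requested reply format are all assumptions made for illustration.

```typescript
// Hypothetical sketch: prompt wording and reply format are assumptions.
interface FieldHint {
  name: string;
  value: string;
  confidence: number;
  boundingBox: { x: number; y: number; width: number; height: number; pageNumber: number };
}

// Build one prompt per page, describing each field and roughly where it sits.
function buildVerificationPrompt(fields: FieldHint[], pageW = 612, pageH = 792): string {
  const lines = fields.map((f, i) => {
    const b = f.boundingBox;
    // Express the box center as a percentage of the page so hints do not
    // depend on the resolution of the image the model receives
    const cx = Math.round(((b.x + b.width / 2) / pageW) * 100);
    const cy = Math.round(((b.y + b.height / 2) / pageH) * 100);
    const pct = (f.confidence * 100).toFixed(1);
    return `${i}. "${f.name}" read as "${f.value}" (OCR confidence ${pct}%), ` +
      `about ${cx}% from the left and ${cy}% from the top`;
  });
  return [
    "Check each field below against the page image.",
    "Reply with a JSON array of {index, value, confidence} objects.",
    ...lines,
  ].join("\n");
}

const demoPrompt = buildVerificationPrompt([
  {
    name: "Average Balance",
    value: "$045.24",
    confidence: 0.933,
    boundingBox: { x: 281, y: 386, width: 50, height: 20, pageNumber: 1 },
  },
]);
```

Including the OCR reading alongside the position gives the model something to confirm or correct, rather than asking it to extract the value from scratch.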
Why MiniMax M3 for verification
I chose MiniMax M3 because it handles the cases where OCR struggles:
- Noisy backgrounds. Watermarks, logos, and background patterns.
- Small text. Line items, fine print, footnote data.
- Ambiguous characters. 0 vs O, 1 vs l.
- Multi-column layouts. Where OCR might merge adjacent columns.
The verification prompt tells the model exactly which fields to check and where to look on the page. It returns corrected values with new confidence scores, and I merge those back into the original fields while preserving the bounding box associations.
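The merge step could look roughly like this. It is a sketch, not the pipeline's code: it assumes the verified map is keyed by the field's index in the original array, and the `ocrValue` and `verifiedByVision` properties are hypothetical additions for auditability.

```typescript
// Sketch under assumptions: map keyed by field index; audit properties are hypothetical.
interface BoundingBox {
  x: number;
  y: number;
  width: number;
  height: number;
  pageNumber: number;
}

interface ExtractedField {
  name: string;
  value: string;
  confidence: number;
  boundingBox?: BoundingBox;
  ocrValue?: string; // hypothetical: original OCR reading, kept for the audit trail
  verifiedByVision?: boolean; // hypothetical: marks fields the vision model checked
}

interface VerifiedField {
  value: string;
  confidence: number;
}

function mergeVerified(
  fields: ExtractedField[],
  verified: Map<number, VerifiedField>,
): ExtractedField[] {
  return fields.map((field, index) => {
    const v = verified.get(index);
    if (!v) return field; // not sent for verification, or the call failed
    // Only value and confidence change; the bounding box is left untouched
    return {
      ...field,
      value: v.value,
      confidence: v.confidence,
      ocrValue: field.value,
      verifiedByVision: true,
    };
  });
}
```

Keeping the original OCR reading next to the verified value lets a reviewer see both what the OCR found and what the vision model changed.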
Coordinate matching
After MiniMax M3 returns corrected values, I need to match them back to the original bounding boxes so the PDF viewer overlays still work. I use a three-tier matching strategy:
- Exact text match. The verified value matches an OCR block exactly.
- Case-insensitive match. Lowercase comparison as fallback.
- Label proximity. Find the label block, grab the adjacent value block.
function matchFieldsToOCR(fields, ocrResults, scaleX, scaleY) {
  // The OCR output stores pageNumber per page, so tag each block with it
  const allBlocks = ocrResults.flatMap(r =>
    r.textBlocks.map(b => ({ ...b, pageNumber: r.pageNumber }))
  );
  return fields.map(field => {
    const pageBlocks = allBlocks.filter(b => b.pageNumber === field.boundingBox.pageNumber);
    const valueStr = String(field.value).trim();
    // Tier 1: exact match. Tier 2: case-insensitive. Tier 3: label proximity.
    const bestBlock = pageBlocks.find(b => b.text.trim() === valueStr)
      ?? pageBlocks.find(b => b.text.trim().toLowerCase() === valueStr.toLowerCase())
      ?? findLabelProximity(pageBlocks, field.name);
    // Fall back to the field's existing box when nothing matches
    return bestBlock
      ? { ...field, boundingBox: scaleOCRBlock(bestBlock, scaleX, scaleY) }
      : field;
  });
}
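The `findLabelProximity` helper is not shown in this post. A minimal sketch of the idea, assuming "adjacent" means the nearest block to the right of the label on the same visual row, might look like this:

```typescript
// Hypothetical implementation; the same-row and to-the-right rules are assumptions.
interface Block {
  text: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

function findLabelProximity(pageBlocks: Block[], label: string): Block | undefined {
  const wanted = label.trim().toLowerCase();
  if (!wanted) return undefined;

  // Step 1: find the block that holds the label text
  const labelBlock = pageBlocks.find(b => b.text.trim().toLowerCase().startsWith(wanted));
  if (!labelBlock) return undefined;

  const labelCenterY = labelBlock.y + labelBlock.height / 2;
  const labelRight = labelBlock.x + labelBlock.width;

  // Step 2: pick the closest block to the right whose center is on the same row
  let best: Block | undefined;
  for (const b of pageBlocks) {
    if (b === labelBlock) continue;
    const centerY = b.y + b.height / 2;
    const sameRow = Math.abs(centerY - labelCenterY) <= labelBlock.height / 2;
    const toTheRight = b.x >= labelRight;
    if (sameRow && toTheRight && (!best || b.x < best.x)) best = b;
  }
  return best;
}
```

This tier matters for corrected values: once the vision model changes $045.24 to $643.24, the text no longer matches any OCR block, so the label is the only remaining way to find the right box.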
What this gets you
On the sample bank statement in the component above:
- PaddleOCR surfaced 226 text blocks across 2 pages.
- 6 blocks had confidence below 95%, all verified by MiniMax M3.
- One real correction: $045.24 became $643.24 (Average Balance, OCR misread digits).
- After verification, all 226 blocks had either high OCR confidence or a verified replacement value from MiniMax M3.
Cost-wise, the OCR ran once on Modal's T4 GPU for the 2 rendered pages. MiniMax M3 was only called for pages containing sub-95% fields, which kept the expensive vision pass limited to a small subset of the document instead of every field. The bulk of the work happens in the cheap OCR stage. The vision model is the safety net, not the workhorse.
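The caching mentioned in the introduction is what keeps the OCR and vision calls to a single run per document. Its implementation is not shown in this post. One way it could work, assuming results are keyed by something stable such as a hash of the PDF bytes, is a small cache of in-flight promises:

```typescript
// Hypothetical sketch: an in-memory cache keyed by a caller-supplied document key.
class PromiseCache<T> {
  private readonly entries = new Map<string, Promise<T>>();

  // Return the cached promise for `key`, or start `compute` and cache it.
  getOrCompute(key: string, compute: () => Promise<T>): Promise<T> {
    const existing = this.entries.get(key);
    if (existing) return existing;

    const pending = compute().catch((err: unknown) => {
      // Do not cache failures, so a later view can retry
      this.entries.delete(key);
      throw err;
    });
    this.entries.set(key, pending);
    return pending;
  }
}
```

Caching the promise rather than the resolved value means two viewers opening the same document at the same moment share one OCR run instead of starting two.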
The speed comes from only calling the vision model for the small share of blocks that need it (here, 6 of 226, about 2.7%). The traceability comes from bounding boxes that let reviewers see exactly where each value was found on the page. The structuring comes from passing verified fields with positions into an LLM for downstream organization.
That is the pipeline. Fast OCR for the easy stuff, targeted vision verification for the hard stuff, cached results so you only pay once, and bounding boxes that make every field auditable.