AI4RA Workshop

Output we can trust: evaluation strategies for accuracy and reproducibility

Led by Barrie Robison

AI across the data lifecycle

AI can add value at every phase of the research analytics pipeline — but each phase has different accuracy risks and verification needs.

Creation & extraction

OCR, document parsing, structured extraction from PDFs. Turning unstructured documents into data.

In Session 1, we used Vandalizer to extract structured data from a research document.

It felt easy — but how do we know the extracted data were correct?

That question is the starting point for everything in this session.

Cleaning & janitorial work

Deduplication, normalization, schema validation, fixing messy records. Making data consistent and trustworthy.

In Session 2, we explored the Unified Data Model — a shared schema that standardizes messy institutional data.

The UDM defines what clean looks like. But how do we verify that AI-assisted mapping and normalization got it right?

Searching & selecting

Natural language filtering, semantic search, embeddings, RAG. Finding what you need without writing queries.

In Session 2, we used Data Crawler Carl to search and filter research data using natural language.

It found results fast — but did it find the right ones? Did it miss anything?

Analysis & synthesis

Summarization, cross-document comparison, trend identification, reporting. Turning data into decisions.

Session 2 brought this together: the data lakehouse organizes data for analysis, and Data Crawler Carl synthesizes across it.

When AI summarizes or compares, how do we know the synthesis is complete and accurate?

Evaluation

Can you trust the output?

  • Every lifecycle phase has different failure modes — extraction hallucinates fields, cleaning drops valid records, search misses relevant documents, synthesis fabricates trends
  • AI sounds confident even when wrong — your job is to verify at every phase
  • Check results against authoritative sources, not against what sounds right
  • Design verification strategies proportional to the stakes

Human verification of every record doesn't scale.

Why it doesn't scale

Where human verification breaks down

Three shapes the bottleneck takes in real research-analytics workflows, plus an open slot for the room. Different axes, same conclusion.

Batch extraction at volume

An overnight pipeline runs thousands of documents through OCR, extraction, and classification. Every proposal, award letter, or subaward agreement produces dozens of structured fields. Reading each one before anyone acts would take a full-time reviewer, and that reviewer would never catch up.

Autonomous agents on live data

Agents crawl the data lakehouse making thousands of classification, join, and lookup decisions per second. Each decision looks reasonable in isolation. Humans see only the summary, never the intermediate steps, so a subtle drift compounds invisibly across millions of operations before anyone notices.

Continuous compliance triage

Every COI disclosure, cost transfer, and effort variance is auto-routed into clear, escalate, or review. A silent rule drift — a prompt tweak, a schema change, a new sponsor edge case — can redirect hundreds of cases into the wrong bucket before audit season catches it.

Your turn

Where else at your institution does AI-assisted work outpace any reasonable human review? What's the fourth shape?

Call them out — we'll add the best ones to the card.

So the job stops being "review every output" and starts being "design the verification pipeline" — sampling, automated consistency checks, and reproducibility tests, scaled to match the throughput.
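
One way to picture the sampling part of such a pipeline is the sketch below. The tier names, review rates, and record fields are hypothetical placeholders rather than values from the workshop. A fixed seed keeps the review sample itself reproducible.

```python
import random

# Hypothetical review rates per stakes tier (assumed for illustration).
REVIEW_RATES = {"low": 0.02, "medium": 0.10, "high": 1.00}


def select_for_review(records, rates=REVIEW_RATES, seed=0):
    """Return the records a human should check, sampled by stakes tier."""
    rng = random.Random(seed)  # fixed seed: the same sample on every rerun
    selected = []
    for record in records:
        # random() is in [0, 1), so a rate of 1.00 selects every record.
        if rng.random() < rates[record["stakes"]]:
            selected.append(record)
    return selected


# Synthetic batch: 1,000 low-stakes, 100 medium, 10 high-stakes records.
batch = (
    [{"id": f"L{i}", "stakes": "low"} for i in range(1000)]
    + [{"id": f"M{i}", "stakes": "medium"} for i in range(100)]
    + [{"id": f"H{i}", "stakes": "high"} for i in range(10)]
)
sample = select_for_review(batch)
high_checked = [r for r in sample if r["stakes"] == "high"]
print(len(sample), "of", len(batch), "records routed to human review")
print(len(high_checked), "of 10 high-stakes records included")
```

Every high-stakes record is reviewed, while low-stakes records are spot-checked. The rates are the design decision to argue about, not the code.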

Hands-on exercise

Defining and measuring accuracy

  • In Session 1 we used the Vandalizer to extract structured data from documents — now we ask: how good was it?
  • We'll define accuracy two ways:
    • Ground truth comparison — measure extracted values against known correct answers
    • Replication — run the same extraction multiple times and measure consistency
  • These evaluation strategies can be applied to many types of analytics pipelines

Try it with any tool

We'll use the Vandalizer today, but this experiment works with any AI tool — ChatGPT, Claude, Gemini, Copilot, or whatever your institution provides. The evaluation method is what matters, not the tool.
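
Ground truth comparison can be sketched in a few lines of Python. This is a minimal illustration, not the workshop's scoring script: the field names and values are synthetic, and the loose matching rule (ignore case and surrounding whitespace) is an assumption you would tune per field.

```python
def normalize(value):
    """Compare values loosely: ignore case and surrounding whitespace."""
    return str(value).strip().lower() if value is not None else None


def score_against_truth(extracted, truth):
    """Return per-field verdicts and overall accuracy over ground-truth fields."""
    verdicts = {}
    for field, expected in truth.items():
        if field not in extracted or extracted[field] is None:
            verdicts[field] = "missing"
        elif normalize(extracted[field]) == normalize(expected):
            verdicts[field] = "correct"
        else:
            verdicts[field] = "wrong"
    correct = sum(1 for v in verdicts.values() if v == "correct")
    return verdicts, correct / len(truth)


# Synthetic example values, not taken from any workshop document.
truth = {"award_id": "2401234", "pi_name": "Jane Doe",
         "start_date": "2024-09-01", "total_amount": "499999"}
extracted = {"award_id": "2401234", "pi_name": "jane doe ",
             "start_date": "2024-09-10", "total_amount": None}

verdicts, accuracy = score_against_truth(extracted, truth)
print(verdicts)
print(f"accuracy = {accuracy:.2f}")  # 2 of 4 fields correct -> 0.50
```

Keeping "missing" separate from "wrong" matters: a blank field is visible to a reviewer, while a plausible wrong value is not.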

Critical components that affect extraction

  • Optical character recognition (OCR) is your ceiling. If the PDF is scanned, extraction quality can never exceed OCR quality. Dedicated OCR engines and multimodal vision models handle layouts that basic text extraction misses.
  • Garbage in, garbage out — OCR errors propagate silently through every downstream step
  • This is a problem you solve before you choose an LLM

What varies in OCR
  • Scan quality — resolution, skew, noise
  • Document complexity — tables, multi-column layouts, handwriting
  • Engine choice — traditional OCR vs. multimodal vision models
  • A clean digital PDF sidesteps all of this — but not every document is digital

The data model is your guardrail

  • Building on Session 2 — defining the expected schema before extraction means structural errors are caught automatically
  • If the model expects a date and gets a dollar amount, that's a flag, not a guess
  • Without a schema, the model invents its own structure — and it will be different every time
  • Structured outputs. Some models support a mode that forces output to conform to a JSON schema. This eliminates parsing failures and guarantees the shape of extracted data matches your data model.

Schema as guardrail
  • Defines expected fields, types, and constraints
  • Makes validation automatic — wrong types and missing fields surface immediately
  • The UDM from Session 2 is exactly this kind of shared schema
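
A schema check of this kind might look like the following sketch. The field names and rules are hypothetical and are not the UDM; the point is only that a date field holding a dollar amount gets flagged automatically instead of being passed along.

```python
from datetime import date


def is_iso_date(value):
    """True for strings such as '2024-09-01'."""
    try:
        date.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False


def is_dollar_amount(value):
    """True for non-negative numbers (booleans excluded)."""
    return (isinstance(value, (int, float))
            and not isinstance(value, bool) and value >= 0)


def is_nonempty_text(value):
    return isinstance(value, str) and value.strip() != ""


# Hypothetical schema: field name -> (required, validator).
SCHEMA = {
    "award_id": (True, is_nonempty_text),
    "start_date": (True, is_iso_date),
    "total_amount": (True, is_dollar_amount),
    "cost_share": (False, is_dollar_amount),
}


def validate(record, schema=SCHEMA):
    """Return a list of human-readable problems; empty means the record fits."""
    problems = []
    for field, (required, check) in schema.items():
        if field not in record or record[field] is None:
            if required:
                problems.append(f"{field}: missing")
        elif not check(record[field]):
            problems.append(f"{field}: unexpected value {record[field]!r}")
    for field in record:
        if field not in schema:
            problems.append(f"{field}: not in schema")
    return problems


# Synthetic record: a dollar amount landed in the date field.
record = {"award_id": "2401234", "start_date": "$250,000",
          "total_amount": 250000, "sponsor": "NSF"}
problems = validate(record)
for problem in problems:
    print(problem)
```

Note what this does not catch: a well-formed but incorrect date passes validation. The schema guards structure, not truth.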

Extraction reproducibility

Consistency through replication

  • Recall the token-by-token demo from Session 1 — LLMs sample from a probability distribution, so the same prompt can produce different output each time. Replication is how we make that variability visible.
  • Multiple-pass consensus. Run the same extraction multiple times and compare results. Agreement across replicates increases confidence; disagreement flags items for human review.
  • Borrow from bench science. Replication is how researchers separate signal from noise, and it works the same way here. If three runs converge on the same answer, that's evidence. If they don't, you've identified an item that needs scrutiny.

What replication tells you
  • If three passes agree on a deadline, trust it
  • If three passes disagree on a budget figure, route it for human review
  • Consistency across replicates is a direct measure of reproducibility
  • Three runs is the floor for a meaningful signal. Scale up when the stakes are higher.
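
Multiple-pass consensus can be sketched as a per-field majority vote. The example below assumes each run is a flat dictionary of hashable values (strings, numbers, or None). The field names, values, and the `min_agreement` threshold are illustrative assumptions.

```python
from collections import Counter


def consensus(runs, min_agreement=1.0):
    """Majority value per field plus the share of runs that produced it."""
    fields = sorted({field for run in runs for field in run})
    result = {}
    for field in fields:
        # A field absent from a run counts as None for that run.
        values = [run.get(field) for run in runs]
        value, count = Counter(values).most_common(1)[0]
        agreement = count / len(runs)
        result[field] = {
            "value": value,
            "agreement": agreement,
            "needs_review": agreement < min_agreement,
        }
    return result


# Three synthetic replicate runs of the same extraction.
runs = [
    {"deadline": "2025-03-14", "budget_total": "1200000"},
    {"deadline": "2025-03-14", "budget_total": "1200000"},
    {"deadline": "2025-03-14", "budget_total": "120000"},
]
result = consensus(runs)
for field, info in result.items():
    print(field, info)
```

With the default threshold, anything short of unanimous agreement is routed for review. Lowering `min_agreement` trades reviewer time for risk, which is a stakes decision rather than a technical one.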

The experiment

Measuring what matters

Vary one thing at a time and measure the effect, just like bench science.

1. OCR quality

Extract from a clean digital PDF, then from a scanned version of the same document. Compare the results. How much accuracy did you lose?

2. Model comparison

Run the same extraction with different LLMs. Do they agree? Where do they diverge? Does a bigger model always win?

3. Structured outputs

Run with and without JSON schema enforcement. Does constraining the output shape reduce errors or introduce new ones?

4. Replicate consensus

Run the same extraction 3 times. Where do replicates agree? Where do they disagree? Consensus across replicates is a direct measure of reproducibility.

Each exercise gives you a menu of synthetic documents and a canonical extraction prompt. Pick a document, hold it constant across steps, and isolate the one variable for that exercise. The next four slides provide the materials and directions.

Exercise 1 setup

OCR quality is a property of the tool

Question. "Works for me" in one tool is not an argument that OCR is solved. How far does the same scanned document travel depending on which pipeline touches it first?

  1. Pick one of the tier 3 PDFs. Run it through the Vandalizer with the extraction prompt and schema. Notice how much dotsocr recovers.
  2. Now take the same PDF to a tool you'd actually use day to day: paste or upload it to ChatGPT, Claude, Copilot, a generic "PDF to text" converter, or the text-only extract from your browser.
  3. Score each tool's output against the ground truth JSON. Mark missing, garbled, and silently collapsed fields.
  4. The budget table and the stamped/highlighted header are the usual breaking points. Predict which tool will cope before you run it.

Materials

Documents (synthetic; pick any)

Ground truth and prompt

The prompt is tuned for NSF Award Notices; using it on an NIH Notice of Award (NoA) or a subaward agreement is itself a test of how well it generalizes. Source: AI4RA/prompt-library

OCR is a tool-level property, not something the model handles for you. Know which of your tools have a dedicated OCR stage and which silently pass through whatever text they can reach.
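
For step 3, a rough way to mark fields as exact, garbled, or missing is a string-similarity check against the ground truth. The sketch below uses Python's standard `difflib`. The field names and values are synthetic, and detecting silently collapsed fields (for example, merged table rows) would need extra logic that is not shown.

```python
from difflib import SequenceMatcher


def classify_field(expected, got):
    """Return (status, similarity) for one field of one tool's output."""
    if got is None or str(got).strip() == "":
        return "missing", 0.0
    expected_s, got_s = str(expected).strip(), str(got).strip()
    if expected_s == got_s:
        return "exact", 1.0
    similarity = SequenceMatcher(None, expected_s, got_s).ratio()
    return "garbled", round(similarity, 2)


def score_tool(truth, output):
    """Classify every ground-truth field for one tool."""
    return {field: classify_field(value, output.get(field))
            for field, value in truth.items()}


# Synthetic ground truth and one hypothetical tool's output on a scanned PDF.
truth = {"award_number": "2412345",
         "recipient": "Example State University",
         "year1_budget": "150000"}
tool_output = {"award_number": "2412345",
               "recipient": "Examp1e State Univers1ty",  # OCR confused l/i with 1
               "year1_budget": ""}                        # table cell lost

results = score_tool(truth, tool_output)
for field, (status, similarity) in results.items():
    print(f"{field:15s} {status:8s} similarity={similarity}")
```

Running the same scorer over each tool's output gives a comparable per-tool tally for the same scanned document.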

Exercise 2 setup

Cross-tool comparison: where do they diverge?

Question. Same prompt, same document, different AI tools in different browser tabs. Where do they agree, and where does disagreement flag fields that need review?

  1. Pick two or three AI tools you actually use: ChatGPT, Claude, Copilot, Gemini, Vandalizer, a local model. Open each in its own tab.
  2. Paste the canonical extraction prompt (prompt.md) into each tool. Upload the clean PDF. Clean input isolates the tool variable so OCR is not in the mix.
  3. Save every tool's JSON output. Score each against the ground truth field by field.
  4. Build a per-tool agreement matrix. Fields where every tool agrees are cheap to trust. Fields where tools diverge are where human review earns its keep.

Materials

Documents (synthetic; pick any)

Ground truth and prompt

The prompt is deliberately LLM-agnostic so it runs in any tool. Source: AI4RA/prompt-library

Bigger is not always better. The disagreement pattern across tools is itself a verification signal. If every tool agrees on a field, trust it. If they split, look.
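
The agreement matrix from step 4 can be sketched as pairwise agreement rates plus a split of fields into unanimous and divergent. The tool names and outputs below are hypothetical placeholders, not results from any real tool.

```python
from itertools import combinations


def all_fields(outputs):
    return sorted({field for out in outputs.values() for field in out})


def pairwise_agreement(outputs):
    """Share of fields on which each pair of tools gives the same value."""
    fields = all_fields(outputs)
    matrix = {}
    for a, b in combinations(sorted(outputs), 2):
        # A field missing from both tools compares as None == None (agreement).
        same = sum(outputs[a].get(f) == outputs[b].get(f) for f in fields)
        matrix[(a, b)] = same / len(fields)
    return matrix


def split_fields(outputs):
    """Separate fields every tool agrees on from fields where they diverge."""
    unanimous, divergent = [], []
    for field in all_fields(outputs):
        values = {out.get(field) for out in outputs.values()}
        (unanimous if len(values) == 1 else divergent).append(field)
    return unanimous, divergent


# Synthetic outputs from three hypothetical tools on the same clean PDF.
outputs = {
    "tool_a": {"award_id": "2401234", "pi_name": "Jane Doe", "end_date": "2027-08-31"},
    "tool_b": {"award_id": "2401234", "pi_name": "Jane Doe", "end_date": "2027-08-30"},
    "tool_c": {"award_id": "2401234", "pi_name": "Jane Doe", "end_date": "2027-08-31"},
}
matrix = pairwise_agreement(outputs)
unanimous, divergent = split_fields(outputs)
print(matrix)
print("unanimous:", unanimous)
print("review:", divergent)
```

Unanimous agreement is still worth scoring against the ground truth during the exercise, since tools can share the same blind spot.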

Exercise 3 setup

Structured outputs: does the schema help?

Question. Does constraining the output to a JSON schema reduce errors, or does it introduce new failure modes like silent null-filling?

  1. Run the extraction prompt on the clean PDF in the Vandalizer with schema enforcement on (structured output mode).
  2. Run the same prompt and PDF with schema enforcement off. Ask for JSON in the prompt but don't constrain it.
  3. Parse both outputs. Record parse successes, field completeness, and any fields the model silently filled or skipped.
  4. Score against the ground truth. Separate "parse failure" from "wrong value" in your tally; they are different failure modes.

Materials

Documents (synthetic; pick any)

Ground truth and prompt

Prompt source: AI4RA/prompt-library

Schemas buy you valid parses. They also make null-filling look confident. Separate "did it parse?" from "is the value right?" in your measurements.
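
A tally that keeps those two questions apart might look like this sketch. The raw responses are synthetic strings standing in for model output, and the field names are assumptions.

```python
import json


def tally(raw_output, truth):
    """Classify one raw response: parse outcome first, then field values."""
    counts = {"parsed": False, "correct": 0, "wrong": 0, "null": 0}
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return counts  # parse failure: no values to score
    if not isinstance(data, dict):
        return counts  # valid JSON, but not the object we asked for
    counts["parsed"] = True
    for field, expected in truth.items():
        got = data.get(field)
        if got is None:
            counts["null"] += 1  # absent or null: possible silent null-fill
        elif got == expected:
            counts["correct"] += 1
        else:
            counts["wrong"] += 1
    return counts


truth = {"award_id": "2401234", "pi_name": "Jane Doe", "cost_share": 25000}

# Schema-enforced response: parses cleanly, but cost_share came back null.
constrained = '{"award_id": "2401234", "pi_name": "Jane Doe", "cost_share": null}'
# Unconstrained response: the model wrapped the JSON in chatty prose.
unconstrained = 'Here is the JSON you asked for:\n{"award_id": "2401234"}'

with_schema = tally(constrained, truth)
without_schema = tally(unconstrained, truth)
print("with schema:   ", with_schema)
print("without schema:", without_schema)
```

In this synthetic case the schema-enforced run parses but hides a missing value behind a null, while the unconstrained run fails loudly at the parse step.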

Exercise 4 setup

Replicate consensus: what is reproducible?

Question. Same prompt, same document, same model, three runs. Which fields land on the same answer every time, and which shift between runs?

  1. In the Vandalizer, lock the model, prompt, and schema. Run the extraction three times on the tier 3 PDF. Clean PDF works too if the room is tight on time; tier 3 just exposes more variance.
  2. Collect all three JSON outputs.
  3. Compare field by field across the three runs. Mark each field as concordant (all three match) or discordant.
  4. Score the concordant set against ground truth. Route the discordant set to human review.

Materials

Documents (synthetic; pick any)

Ground truth and prompt

Prompt source: AI4RA/prompt-library

Concordance is cheap confidence. Discordance is your routing rule for human review. Three runs is the floor for a meaningful signal.
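
Steps 3 and 4 can be sketched together: partition fields by concordance, then check the concordant set against ground truth. The values below are synthetic and chosen to show one case worth watching for, a field that all three runs get wrong in the same way.

```python
def concordance_report(runs, truth):
    """Split fields into concordant/discordant and score the concordant set."""
    concordant, discordant = [], []
    for field in truth:
        values = [run.get(field) for run in runs]
        if all(value == values[0] for value in values):
            concordant.append(field)
        else:
            discordant.append(field)
    # Concordant fields share one value, so the first run represents them all.
    concordant_but_wrong = [f for f in concordant if runs[0].get(f) != truth[f]]
    return {
        "concordant": concordant,
        "discordant": discordant,
        "concordant_but_wrong": concordant_but_wrong,
    }


truth = {"award_id": "2401234", "end_date": "2027-08-31", "total_amount": "499999"}
runs = [
    {"award_id": "2401234", "end_date": "2027-03-31", "total_amount": "499999"},
    {"award_id": "2401234", "end_date": "2027-03-31", "total_amount": "49999"},
    {"award_id": "2401234", "end_date": "2027-03-31", "total_amount": "499999"},
]
report = concordance_report(runs, truth)
print(report)
```

Here `end_date` is reproducible and wrong: the runs agree with each other but not with the document. That is why the concordant set is still scored against ground truth.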

Auditability and provenance

Where did this answer come from?

  • Accuracy is one half of trust. Provenance is the other — for every extracted value, can we point back to where it came from in the source document?
  • Our next Vandalizer workflow extracts structured requirements from a real NSF solicitation and populates a data model — deadlines, eligibility, page limits, special conditions, budget rules
  • This solicitation was chosen on purpose: it has a lot of special requirements that are easy to miss and easy to misread
  • Before we run the extraction, take a few minutes to skim the solicitation yourself. Notice what would be tedious to catalog by hand.

And yes — to get the PDF, NSF tells you to print the page to PDF in your browser. Yes, really.

What to look for
  • Eligibility constraints and PI requirements
  • Submission deadlines and required components
  • Page and format limits
  • Special conditions and unusual requirements
  • Budget rules, cost-share, and indirect cost guidance
Prompt — checklist
Prompt — structured JSON
Validated output — NSF 26-508

Prompts and validated case from the public AI4RA/prompt-library. NSF 26-508 (TechAccess) is a separate validated eval case that lets you preview what the extractor produces before we run it live on NSF 25-541.
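
One simple provenance check is to ask the extractor for a supporting quote with each value and then confirm that the quote really appears in the source. The sketch below assumes a hypothetical output shape with `field`, `value`, and `evidence` keys. The source text is synthetic and is not taken from NSF 25-541 or NSF 26-508.

```python
import re


def normalize_ws(text):
    """Collapse whitespace and lowercase so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def check_provenance(items, source_text):
    """Map each field to True if its evidence quote is found in the source."""
    source = normalize_ws(source_text)
    report = {}
    for item in items:
        quote = normalize_ws(item.get("evidence", ""))
        report[item["field"]] = bool(quote) and quote in source
    return report


# Synthetic solicitation text and hypothetical extractor output.
source_text = """Proposals are due by 5 p.m. submitter's local time.
The Project Description is limited
to 15 pages."""
items = [
    {"field": "page_limit", "value": "15",
     "evidence": "Project Description is limited to 15 pages"},
    {"field": "cost_share", "value": "required",
     "evidence": "Cost sharing is required"},
]
report = check_provenance(items, source_text)
print(report)
```

A matched quote shows that the text exists in the document, not that the value was interpreted correctly, so it narrows human review rather than replacing it.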

Before you send it

What context makes a summary safer?

  • Define the reporting period and metric rules so the model does not invent them.
  • Give the model the actual source table or query output, not a vague request for a narrative.
  • Use a template that separates directly supported facts from interpretation.
  • Keep a human review step for attribution, causality, anomalies, and edge cases.

What context should do
  • Reduce ambiguity about definitions
  • Expose the source behind each number
  • Make each claim easier to check
  • Set clear escalation boundaries

Decision rule

Use the thinnest context that preserves traceability

  • If a claim is directly visible in a trusted table, AI can help draft the wording.
  • If a claim depends on joins, business rules, or interpretation, show the sources and keep a human in the loop.
  • If a claim cannot be traced to a source, it is not ready to send.
  • Context is useful only when it makes the output more defensible.

That is the test in the next exercise: which claims survive once you ask for evidence?

Final demo

Can you defend this number?

Mini reporting table

Prompt: "Write a 3-sentence leadership summary from this table."

Metric                          FY2024   FY2025
Proposals submitted             11       6
Awards received                 1        9
Award dollars obligated         $825K    $5.45M
Publications linked to awards   7        13

Source: AI4RA synthetic sponsored-research corpus. Every cell traces to a field in records/proposals.json, awards.json, or publications.json.

Audience prompt

Which claims would you feel safe sending to leadership right now?

  1. Award dollars obligated grew more than six-fold, from $825K to $5.45M. Supported: both values are visible in the table.
  2. Publications linked to awards rose from 7 to 13. Supported: both values are visible in the table.
  3. A growing proposal pipeline drove the surge in awards. Contradicted: proposals submitted actually fell from 11 to 6. Awards lag submissions by a review cycle.
  4. Industry sponsors powered most of the growth. No provenance: the aggregate hides sponsor mix. Answer lives in awards.json · sponsor_code, not the summary.

Only claims 1 and 2 are defensible from the visible evidence. The rest need either correction or additional source traceability.
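
The arithmetic behind those verdicts is easy to check directly from the table. The sketch below encodes the four rows and tests claims 1 through 3. The dictionary keys are illustrative names, not the field names used in the corpus files.

```python
# Values copied from the mini reporting table (dollars written out in full).
table = {
    "proposals_submitted": {"FY2024": 11, "FY2025": 6},
    "awards_received": {"FY2024": 1, "FY2025": 9},
    "award_dollars_obligated": {"FY2024": 825_000, "FY2025": 5_450_000},
    "publications_linked": {"FY2024": 7, "FY2025": 13},
}


def fold_change(metric):
    """FY2025 value divided by FY2024 value."""
    row = table[metric]
    return row["FY2025"] / row["FY2024"]


dollars_fold = fold_change("award_dollars_obligated")
claim_1 = dollars_fold > 6                          # "more than six-fold"
claim_2 = fold_change("publications_linked") > 1    # publications rose
claim_3 = fold_change("proposals_submitted") > 1    # "growing proposal pipeline"
claim_4 = None  # sponsor mix is not in the table, so it cannot be checked here

print(f"award dollars grew {dollars_fold:.2f}x -> claim 1 supported: {claim_1}")
print(f"claim 2 supported: {claim_2}")
print(f"claim 3 supported: {claim_3}")
print(f"claim 4 checkable from this table: {claim_4 is not None}")
```

Claim 3 fails on the table's own numbers, and claim 4 cannot even be evaluated without going back to the underlying records.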

Decision framework

Adoption and institutional fit

  • Automate: repetitive, low-stakes, output-driven tasks with clear validation
  • Augment: human-in-the-loop tasks where AI prepares and humans decide
  • Leave alone: process-critical work that requires full auditability

Discussion

What do you need at your institution?

  • Which dashboard claims at your institution would you send to leadership without manual checking?
  • What counts as ground truth in one of your real analytics workflows?
  • Where would reruns, sampling, or automated consistency checks add confidence?
  • Which outputs need source traceability before staff can act on them?
  • Which tasks are best framed as automate, augment, or leave alone?

Continuous learning

This landscape changes monthly

  • Today's best practices may be outdated by next quarter — build a habit, not just a checklist
  • Sponsor AI policies are updated with each funding cycle — check before every submission
  • New tools launch constantly — evaluate them against the governance criteria from Module 2
  • Your institution's AI policy is probably being rewritten right now; stay involved in that process

Trusted sources

Monday morning
  • Pick one analytics claim you're about to send to leadership. Write down the source rows or fields a skeptic would need to see to verify it.
  • Pick one workflow currently owned by humans. Draft its verification plan: ground truth, sampling rate, trigger for full review.
  • Run the same extraction through two tools you use (Vandalizer + whatever you normally paste into). Compare where they disagree.
  • Take the "defend this number" exercise to your team. Pass around a dashboard screenshot and vote on which claims would survive a request for evidence.

3:50–4:00 PM

Wrap-up and Q&A

Workshop takeaway

What to take home

  • The most important model is the data model
  • Reproducibility comes from organization, not from AI
  • AI is a tool, not an authority — someone must always own the decision
  • Start with the thinnest layer that solves the problem reliably

Resources

Keep going

Scan to visit ai4ra.uidaho.edu