AI4RA Workshop

Output we can trust: evaluation strategies for accuracy and reproducibility

Led by Barrie Robison

AI across the data lifecycle

AI can add value at every phase of the research analytics pipeline — but each phase has different accuracy risks and verification needs.

Creation & extraction

OCR, document parsing, structured extraction from PDFs. Turning unstructured documents into data.

In Session 1, we used Vandalizer to extract structured data from a research document.

It felt easy — but how do we know the extracted data were correct?

That question is the starting point for everything in this session.

Cleaning & janitorial work

Deduplication, normalization, schema validation, fixing messy records. Making data consistent and trustworthy.

In Session 2, we explored the Unified Data Model — a shared schema that standardizes messy institutional data.

The UDM defines what clean looks like. But how do we verify that AI-assisted mapping and normalization got it right?

Searching & selecting

Natural language filtering, semantic search, embeddings, RAG. Finding what you need without writing queries.

In Session 2, we used Data Crawler Carl to search and filter research data using natural language.

It found results fast — but did it find the right ones? Did it miss anything?

Analysis & synthesis

Summarization, cross-document comparison, trend identification, reporting. Turning data into decisions.

Session 2 brought this together: the data lakehouse organizes data for analysis, and Data Crawler Carl synthesizes across it.

When AI summarizes or compares, how do we know the synthesis is complete and accurate?

Evaluation

Can you trust the output?

Every lifecycle phase has different failure modes — extraction hallucinates fields, cleaning drops valid records, search misses relevant documents, synthesis fabricates trends
AI sounds confident even when wrong — your job is to verify at every phase
Check results against authoritative sources, not against what sounds right
Design verification strategies proportional to the stakes

Human verification of every record doesn't scale.

Why it doesn't scale

Where human verification breaks down

Three shapes the bottleneck takes in real research-analytics workflows, plus an open slot for the room. Different axes, same conclusion.

Batch extraction at volume

An overnight pipeline runs thousands of documents through OCR, extraction, and classification. Every proposal, award letter, or subaward agreement produces dozens of structured fields. Reading each one before anyone acts is a full-time position's worth of work, and that reviewer never catches up.

Autonomous agents on live data

Agents crawl the data lakehouse making thousands of classification, join, and lookup decisions per second. Each decision looks reasonable in isolation. Humans never see the intermediate steps, only the summary, so a subtle drift compounds invisibly across millions of operations before anyone notices.

Continuous compliance triage

Every COI disclosure, cost transfer, and effort variance is auto-routed into clear, escalate, or review. A silent rule drift — a prompt tweak, a schema change, a new sponsor edge case — can redirect hundreds of cases into the wrong bucket before audit season catches it.

Your turn

Where else at your institution does AI-assisted work outpace any reasonable human review? What's the fourth shape?

Call them out — we'll add the best ones to the card.

So the job stops being "review every output" and starts being "design the verification pipeline" — sampling, automated consistency checks, and reproducibility tests, scaled to match the throughput.
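The sampling piece of that pipeline can be sketched in a few lines. This is a minimal illustration, not a workshop tool: the function names, the 5% sampling rate, the error threshold, and the document IDs are all assumptions chosen for the example.

```python
import random

def draw_review_sample(record_ids, rate, seed=0):
    """Pick a reproducible random subset of records for human review."""
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    ids = sorted(record_ids)      # stable order, so the seed fully determines the sample
    k = max(1, round(len(ids) * rate)) if ids else 0
    rng = random.Random(seed)     # local generator; leaves global random state alone
    return sorted(rng.sample(ids, k))

def needs_full_review(errors_found, sample_size, max_error_rate):
    """Escalate from sampling to full review when the sampled error rate is too high."""
    if sample_size == 0:
        return False
    return errors_found / sample_size > max_error_rate

# Hypothetical overnight batch of 2,000 documents.
batch = [f"doc-{i:04d}" for i in range(2000)]
sample = draw_review_sample(batch, rate=0.05, seed=42)
print(len(sample))  # 100 documents go to a reviewer

# Suppose the reviewer finds 7 errors in the sample (an invented number).
print(needs_full_review(errors_found=7, sample_size=len(sample), max_error_rate=0.05))
```

Fixing the seed matters: a second reviewer can redraw exactly the same sample, which keeps the verification step itself reproducible.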

The previous slide ended on a declarative punchline: human verification doesn't scale. This slide gives that punchline a body.

Why three examples rather than one: each is a different axis of "doesn't scale."
- Batch extraction = throughput (volume over time, one-off outputs you never revisit).
- Agents = speed and opacity (humans can't even observe the intermediate steps at line rate).
- Compliance triage = drift (a continuous stream where the rule itself can silently shift).

All three are live in research-admin workflows today. You probably know staff who do one of these by hand now and are being asked to automate.

Closing move: pivot to the hands-on exercise. The next slide introduces the Vandalizer evaluation experiment, the concrete version of "design the verification pipeline" for extraction.

Hands-on exercise

Defining and measuring accuracy

In Session 1 we used the Vandalizer to extract structured data from documents — now we ask: how good was it?
We'll define accuracy two ways:
- Ground truth comparison — measure extracted values against known correct answers
- Replication — run the same extraction multiple times and measure consistency
These evaluation strategies can be applied to many types of analytics pipelines
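Ground truth comparison can be as simple as a field-by-field check. The sketch below assumes the extraction and the answer key are both flat dictionaries. The field names and values are invented for illustration, and the loose string normalization is one possible choice, not a rule from the workshop.

```python
def normalize(value):
    """Compare loosely: strings are whitespace-collapsed and case-folded."""
    if isinstance(value, str):
        return " ".join(value.split()).casefold()
    return value

def score_against_truth(extracted, truth):
    """Return per-field verdicts and overall accuracy over the ground-truth fields."""
    verdicts = {}
    for field, expected in truth.items():
        if field not in extracted or extracted[field] is None:
            verdicts[field] = "missing"
        elif normalize(extracted[field]) == normalize(expected):
            verdicts[field] = "correct"
        else:
            verdicts[field] = "wrong"
    correct = sum(v == "correct" for v in verdicts.values())
    accuracy = correct / len(truth) if truth else 0.0
    return verdicts, accuracy

# Hypothetical values; not taken from the workshop documents.
truth = {"award_id": "ABC-123", "pi_name": "Jane Doe",
         "total_amount": 500000, "end_date": "2027-08-31"}
extracted = {"award_id": "abc-123 ", "pi_name": "Jane Doe",
             "total_amount": 50000, "end_date": None}

verdicts, accuracy = score_against_truth(extracted, truth)
print(verdicts)
print(f"accuracy = {accuracy:.2f}")  # 2 of 4 fields correct
```

Keeping "missing" separate from "wrong" is deliberate: a dropped field and a confidently wrong dollar amount call for different fixes.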

vandalizer.uidaho.edu

Try it with any tool

We'll use the Vandalizer today, but this experiment works with any AI tool — ChatGPT, Claude, Gemini, Copilot, or whatever your institution provides. The evaluation method is what matters, not the tool.

Critical components that affect extraction

Optical character recognition (OCR) is your ceiling. If the PDF is scanned, extraction quality can never exceed OCR quality. Dedicated OCR engines and multimodal vision models handle layouts that basic text extraction misses.
Garbage in, garbage out — OCR errors propagate silently through every downstream step
This is a problem you solve before you choose an LLM

What varies in OCR

Scan quality — resolution, skew, noise
Document complexity — tables, multi-column layouts, handwriting
Engine choice — traditional OCR vs. multimodal vision models
A clean digital PDF sidesteps all of this — but not every document is digital

The data model is your guardrail

Building on Session 2 — defining the expected schema before extraction means structural errors are caught automatically
If the model expects a date and gets a dollar amount, that's a flag, not a guess
Without a schema, the model invents its own structure — and it will be different every time
Structured outputs. Some models support a mode that forces output to conform to a JSON schema. This eliminates parsing failures and guarantees the shape of extracted data matches your data model.

Schema as guardrail

Defines expected fields, types, and constraints
Makes validation automatic — wrong types and missing fields surface immediately
The UDM from Session 2 is exactly this kind of shared schema
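As a rough sketch of what "validation becomes automatic" means, the check below uses only the Python standard library. The schema is hypothetical (it is not the actual UDM), and the rule that fields ending in `_date` must be ISO dates is an assumption made for the example.

```python
import re

# Hypothetical schema: field name -> (expected type, required?). Not the real UDM.
SCHEMA = {
    "award_id": (str, True),
    "start_date": (str, True),
    "total_amount": (float, True),
    "cost_share": (float, False),
}
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def validate(record, schema=SCHEMA):
    """Return a list of problems; an empty list means the record has the expected shape."""
    problems = []
    for field, (expected_type, required) in schema.items():
        value = record.get(field)
        if value is None:
            if required:
                problems.append(f"{field}: missing required field")
            continue
        # Accept ints where floats are expected, but never bools.
        if expected_type is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, got {type(value).__name__}")
        elif field.endswith("_date") and not ISO_DATE.match(value):
            problems.append(f"{field}: not an ISO date (YYYY-MM-DD)")
    for field in sorted(record.keys() - schema.keys()):
        problems.append(f"{field}: unexpected field")
    return problems

good = {"award_id": "A-1", "start_date": "2025-01-15", "total_amount": 250000}
bad = {"award_id": "A-1", "start_date": "$250,000", "notes": "n/a"}
print(validate(good))  # no problems
print(validate(bad))   # a dollar amount where a date belongs is flagged, not guessed at
```

A check like this catches structural errors only. A well-formed date can still be the wrong date, which is why ground truth comparison remains a separate step.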

Extraction reproducibility

Consistency through replication

Recall the token-by-token demo from Session 1 — LLMs sample from a probability distribution, so the same prompt can produce different output each time. Replication is how we make that variability visible.
Multiple-pass consensus. Run the same extraction multiple times and compare results. Agreement across replicates increases confidence; disagreement flags items for human review.
Borrow from bench science. Replication is how researchers separate signal from noise — and it works the same way here. If three runs converge on the same answer, that's evidence. If they don't, you've identified an item that needs scrutiny.

What replication tells you

If three passes agree on a deadline, trust it
If three passes disagree on a budget figure, route it for human review
Consistency across replicates is a direct measure of reproducibility
Three runs is the floor for a meaningful signal. Scale up when the stakes are higher.

The experiment

Measuring what matters

Vary one thing at a time and measure the effect, just like bench science. Each card below has a matching setup slide that shows how to control it in the Vandalizer.

1. OCR quality

Extract from a clean digital PDF, then from a scanned version of the same document. Compare the results. How much accuracy did you lose?

2. Model comparison

Run the same extraction with different LLMs. Do they agree? Where do they diverge? Does a bigger model always win?

3. Structured outputs

Run with and without JSON schema enforcement. Does constraining the output shape reduce errors or introduce new ones?

4. Replicate consensus

Run the same extraction 3 times. Where do replicates agree? Where do they disagree? Consensus across replicates is a direct measure of reproducibility.

Each exercise gives you a menu of synthetic documents and a canonical extraction prompt. Pick a document, hold it constant across steps, and isolate the one variable for that exercise. The next four slides provide the materials and directions.

Exercise 1 setup

OCR quality is a property of the tool

Question. "Works for me" in one tool is not an argument that OCR is solved. How far does the same scanned document travel depending on which pipeline touches it first?

Pick one of the tier 3 PDFs. Run it through the Vandalizer with the extraction prompt and schema. Notice how much dotsocr recovers.
Now take the same PDF to a tool you'd actually use day to day: paste or upload it to ChatGPT, Claude, Copilot, a generic "PDF to text" converter, or the text-only extract from your browser.
Score each tool's output against the ground truth JSON. Mark missing, garbled, and silently collapsed fields.
The budget table and the stamped/highlighted header are the usual breaking points. Predict which tool will cope before you run it.
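One way to mark missing and garbled fields consistently across tools is a character-level similarity score. The sketch below uses `difflib.SequenceMatcher` from the Python standard library. The field names and values are invented, and the "match / garbled / missing" labels are an assumed scoring scheme, not one prescribed by the exercise.

```python
from difflib import SequenceMatcher

def similarity(expected, got):
    """Character-level similarity in [0, 1], ignoring differences in whitespace."""
    a = " ".join(str(expected).split())
    b = " ".join(str(got).split())
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def field_report(truth, extracted):
    """Label each ground-truth field as match, garbled, or missing."""
    report = {}
    for field, expected in truth.items():
        got = extracted.get(field)
        if got is None or str(got).strip() == "":
            report[field] = ("missing", 0.0)
            continue
        score = similarity(expected, got)
        # A high score short of 1.0 suggests an OCR slip; a low score suggests a wrong value.
        report[field] = ("match" if score == 1.0 else "garbled", round(score, 2))
    return report

# Hypothetical values; the amount imitates a common OCR confusion (0 read as O).
truth = {"sponsor": "National Science Foundation",
         "total_amount": "$500,000", "pi_name": "Jane Doe"}
tool_output = {"sponsor": "National  Science Foundation",
               "total_amount": "$5OO,OOO", "pi_name": ""}

report = field_report(truth, tool_output)
for field, (status, score) in report.items():
    print(f"{field}: {status} ({score})")
```

Running the same report on each tool's output gives a like-for-like comparison. Silently collapsed fields, such as merged table rows, still need a human eye because they can score well on similarity while losing structure.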

Materials

Documents (synthetic; pick any)

NSF Award OAC-2415678 · clean / tier 3
NIH NoA 1R01AI248002 · clean / tier 3
Subaward SUB-24-001 · clean / tier 3

Ground truth and prompt

The prompt is tuned for NSF Award Notices; using it on the NIH NoA or the subaward is itself a test of how well it generalizes. Source: AI4RA/prompt-library

OCR is a tool-level property, not something the model handles for you. Know which of your tools have a dedicated OCR stage and which silently pass through whatever text they can reach.

Exercise 2 setup

Cross-tool comparison: where do they diverge?

Question. Same prompt, same document, different AI tools in different browser tabs. Where do they agree, and where does disagreement flag fields that need review?

Pick two or three AI tools you actually use: ChatGPT, Claude, Copilot, Gemini, Vandalizer, a local model. Open each in its own tab.
Paste the canonical extraction prompt (prompt.md) into each tool. Upload the clean PDF. Clean input isolates the tool variable so OCR is not in the mix.
Save every tool's JSON output. Score each against the ground truth field by field.
Build a per-tool agreement matrix. Fields where every tool agrees are cheap to trust. Fields where tools diverge are where human review earns its keep.
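The agreement matrix in the last step might look like the sketch below. It assumes each tool's output has already been parsed into a flat dictionary. The tool names and values are placeholders, and a field missing from two tools counts as agreement between them, which is a simplification.

```python
from itertools import combinations

def all_fields(outputs):
    """Every field name that appears in at least one tool's output."""
    return sorted(set().union(*(o.keys() for o in outputs.values())))

def pairwise_agreement(outputs):
    """outputs: {tool: {field: value}} -> {(tool_a, tool_b): fraction of fields that agree}."""
    fields = all_fields(outputs)
    matrix = {}
    for a, b in combinations(sorted(outputs), 2):
        same = sum(outputs[a].get(f) == outputs[b].get(f) for f in fields)
        matrix[(a, b)] = same / len(fields) if fields else 1.0
    return matrix

def split_fields(outputs):
    """Separate fields where every tool agrees from fields where at least one diverges."""
    unanimous, divergent = [], []
    for f in all_fields(outputs):
        values = {repr(o.get(f)) for o in outputs.values()}
        (unanimous if len(values) == 1 else divergent).append(f)
    return unanimous, divergent

# Placeholder tools and invented values.
outputs = {
    "tool_a": {"award_id": "A-1", "amount": 500000, "end_date": "2027-08-31"},
    "tool_b": {"award_id": "A-1", "amount": 500000, "end_date": "2027-08-13"},
    "tool_c": {"award_id": "A-1", "amount": 50000, "end_date": "2027-08-31"},
}

for pair, rate in pairwise_agreement(outputs).items():
    print(pair, f"{rate:.2f}")
unanimous, divergent = split_fields(outputs)
print("agree:", unanimous)
print("review:", divergent)
```

The divergent list is the review queue. The unanimous list still deserves a spot-check against ground truth, since tools can agree on the same wrong answer.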

Materials

Documents (synthetic; pick any)

NSF Award OAC-2415678 · clean / tier 3
NIH NoA 1R01AI248002 · clean / tier 3
Subaward SUB-24-001 · clean / tier 3

Ground truth and prompt

The prompt is deliberately LLM-agnostic so it runs in any tool. Source: AI4RA/prompt-library

Bigger is not always better. The disagreement pattern across tools is itself a verification signal. If every tool agrees on a field, trust it. If they split, look.

Exercise 3 setup

Structured outputs: does the schema help?

Question. Does constraining the output to a JSON schema reduce errors, or does it introduce new failure modes like silent null-filling?

Run the extraction prompt on the clean PDF in the Vandalizer with schema enforcement on (structured output mode).
Run the same prompt and PDF with schema enforcement off. Ask for JSON in the prompt but don't constrain it.
Parse both outputs. Record parse successes, field completeness, and any fields the model silently filled or skipped.
Score against the ground truth. Separate "parse failure" from "wrong value" in your tally; they are different failure modes.
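A tally that keeps those failure modes apart could be sketched as follows. The raw outputs and the answer key are invented for the example, and the category names are assumptions. Parse failures are counted per output, while the other categories are counted per field.

```python
import json

def tally(raw_outputs, truth):
    """Count parse failures per output and value outcomes per field."""
    counts = {"parse_failure": 0, "null_or_missing": 0, "wrong_value": 0, "correct": 0}
    for raw in raw_outputs:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            counts["parse_failure"] += 1
            continue
        if not isinstance(record, dict):
            counts["parse_failure"] += 1   # valid JSON, but not an object
            continue
        for field, expected in truth.items():
            got = record.get(field)
            if got is None:
                counts["null_or_missing"] += 1
            elif got == expected:
                counts["correct"] += 1
            else:
                counts["wrong_value"] += 1
    return counts

# Hypothetical answer key and model outputs.
truth = {"award_id": "A-1", "amount": 500000}
raw_outputs = [
    '{"award_id": "A-1", "amount": 500000}',
    '{"award_id": "A-1", "amount": null}',      # parses fine, value silently null
    'Here is the JSON: {"award_id": "A-1"',     # chatty prefix, truncated object
    '{"award_id": "A-2", "amount": 500000}',
]
counts = tally(raw_outputs, truth)
print(counts)
```

Run the tally once for the schema-enforced outputs and once for the unconstrained ones. If enforcement drives parse failures to zero while the null count rises, the schema has moved the problem rather than solved it.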

Materials

Documents (synthetic; pick any)

NSF Award OAC-2415678 · clean / tier 3
NIH NoA 1R01AI248002 · clean / tier 3
Subaward SUB-24-001 · clean / tier 3

Ground truth and prompt

Prompt source: AI4RA/prompt-library

Schemas buy you valid parses. They also make null-filling look confident. Separate "did it parse?" from "is the value right?" in your measurements.

Exercise 4 setup

Replicate consensus: what is reproducible?

Question. Same prompt, same document, same model, three runs. Which fields land on the same answer every time, and which shift between runs?

In the Vandalizer, lock the model, prompt, and schema. Run the extraction three times on the tier 3 PDF. Clean PDF works too if the room is tight on time; tier 3 just exposes more variance.
Collect all three JSON outputs.
Compare field by field across the three runs. Mark each field as concordant (all three match) or discordant.
Score the concordant set against ground truth. Route the discordant set to human review.
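The last two steps can be sketched in Python as shown below, assuming each run has been parsed into a dictionary. The field names and values are invented. The example deliberately includes a field that all three runs agree on but get wrong, to show why the concordant set is still scored against ground truth.

```python
import json

def split_by_concordance(runs):
    """runs: list of {field: value} dicts from repeated extractions of one document."""
    fields = sorted(set().union(*(r.keys() for r in runs)))
    concordant, discordant = {}, {}
    for f in fields:
        # Serialize so lists and nested objects compare the same way scalars do.
        seen = {json.dumps(r.get(f), sort_keys=True) for r in runs}
        if len(seen) == 1:
            concordant[f] = runs[0].get(f)
        else:
            discordant[f] = [r.get(f) for r in runs]
    return concordant, discordant

def score_concordant(concordant, truth):
    """Fraction of concordant fields that also match ground truth (None if nothing to check)."""
    checked = [f for f in concordant if f in truth]
    if not checked:
        return None
    return sum(concordant[f] == truth[f] for f in checked) / len(checked)

# Hypothetical runs and answer key.
runs = [
    {"deadline": "2025-03-14", "budget": 500000, "pi_name": "J. Doe"},
    {"deadline": "2025-03-14", "budget": 50000, "pi_name": "J. Doe"},
    {"deadline": "2025-03-14", "budget": 500000, "pi_name": "J. Doe"},
]
truth = {"deadline": "2025-03-14", "budget": 500000, "pi_name": "Jane Doe"}

concordant, discordant = split_by_concordance(runs)
print("concordant:", concordant)
print("route to review:", discordant)
print("concordant accuracy:", score_concordant(concordant, truth))
```

Here `pi_name` is perfectly reproducible and still wrong. Concordance measures reproducibility; only the ground truth comparison measures accuracy.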

Materials

Documents (synthetic; pick any)

NSF Award OAC-2415678 · clean / tier 3
NIH NoA 1R01AI248002 · clean / tier 3
Subaward SUB-24-001 · clean / tier 3

Ground truth and prompt

Prompt source: AI4RA/prompt-library

Concordance is cheap confidence. Discordance is your routing rule for human review. Three runs is the floor for a meaningful signal.

Auditability and provenance

Where did this answer come from?

Accuracy is one half of trust. Provenance is the other — for every extracted value, can we point back to where it came from in the source document?
Our next Vandalizer workflow extracts structured requirements from a real NSF solicitation and populates a data model — deadlines, eligibility, page limits, special conditions, budget rules
This solicitation was chosen on purpose: it has a lot of special requirements that are easy to miss and easy to misread
Before we run the extraction, take a few minutes to skim the solicitation yourself. Notice what would be tedious to catalog by hand.

NSF 25-541 — PCL Test-Bed solicitation

And yes — to get the PDF, NSF tells you to print the page to PDF in your browser. Yes, really.

What to look for

Eligibility constraints and PI requirements
Submission deadlines and required components
Page and format limits
Special conditions and unusual requirements
Budget rules, cost-share, and indirect cost guidance

Prompt — checklist

prompt-checklist.md

Prompt — structured JSON

prompt-structured.md · schema.json

Validated output — NSF 26-508

expected.md · expected.json

Prompts and validated case from the public AI4RA/prompt-library. NSF 26-508 (TechAccess) is a separate validated eval case that lets you preview what the extractor produces before we run it live on NSF 25-541.

Before you send it

What context makes a summary safer?

Define the reporting period and metric rules so the model does not invent them.
Give the model the actual source table or query output, not a vague request for a narrative.
Use a template that separates directly supported facts from interpretation.
Keep a human review step for attribution, causality, anomalies, and edge cases.

What context should do

Reduce ambiguity about definitions
Expose the source behind each number
Make each claim easier to check
Set clear escalation boundaries

Decision rule

Use the thinnest context that preserves traceability

If a claim is directly visible in a trusted table, AI can help draft the wording.
If a claim depends on joins, business rules, or interpretation, show the sources and keep a human in the loop.
If a claim cannot be traced to a source, it is not ready to send.
Context is useful only when it makes the output more defensible.

That is the test in the next exercise: which claims survive once you ask for evidence?

Final demo

Can you defend this number?

Mini reporting table

Prompt: "Write a 3-sentence leadership summary from this table."

Metric	FY2024	FY2025
Proposals submitted	11	6
Awards received	1	9
Award dollars obligated	$825K	$5.45M
Publications linked to awards	7	13

Source: AI4RA synthetic sponsored-research corpus. Every cell traces to a field in records/proposals.json, awards.json, or publications.json.

Audience prompt

Which claims would you feel safe sending to leadership right now?

1. Award dollars obligated grew more than six-fold, from $825K to $5.45M. Supported: both values are visible in the table.
2. Publications linked to awards rose from 7 to 13. Supported: both values are visible in the table.
3. A growing proposal pipeline drove the surge in awards. Contradicted: proposals submitted actually fell from 11 to 6. Awards lag submissions by a review cycle.
4. Industry sponsors powered most of the growth. No provenance: the aggregate hides sponsor mix. Answer lives in awards.json · sponsor_code, not the summary.

Only claims 1 and 2 are defensible from the visible evidence. The rest need either correction or additional source traceability.
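The same check can be run mechanically. The sketch below copies the values from the mini reporting table and tests the four claims in order. The helper names are illustrative, and the check for the fourth claim is a crude stand-in: it only confirms that the table has no sponsor breakdown.

```python
# Values copied from the mini reporting table.
table = {
    "Proposals submitted": {"FY2024": 11, "FY2025": 6},
    "Awards received": {"FY2024": 1, "FY2025": 9},
    "Award dollars obligated": {"FY2024": 825_000, "FY2025": 5_450_000},
    "Publications linked to awards": {"FY2024": 7, "FY2025": 13},
}

def fold_change(metric):
    """FY2025 value divided by FY2024 value."""
    row = table[metric]
    return row["FY2025"] / row["FY2024"]

def rose(metric):
    """True if the metric increased from FY2024 to FY2025."""
    row = table[metric]
    return row["FY2025"] > row["FY2024"]

checks = {
    "claim 1: award dollars grew more than six-fold": fold_change("Award dollars obligated") > 6,
    "claim 2: linked publications rose": rose("Publications linked to awards"),
    "claim 3: proposal pipeline grew": rose("Proposals submitted"),
}
# Claim 4 depends on sponsor mix, which this aggregate table does not contain.
claim_4_checkable = any("sponsor" in metric.lower() for metric in table)

for claim, holds in checks.items():
    print(f"{claim}: {'supported' if holds else 'contradicted'}")
print(f"claim 4 checkable from this table: {claim_4_checkable}")
print(f"award dollars fold change: {fold_change('Award dollars obligated'):.2f}")  # 6.61
```

A claim that can be reduced to a one-line check against the source is defensible. A claim that cannot be expressed as a check at all is the one that needs more provenance before it goes out.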

Five-minute run:
1. Say: "Pretend this is the last slide before a leadership update and AI drafted the narrative for you."
2. Ask the room to vote, by hand, on which claims they would send as written.
3. Reveal the fragments one by one and make them justify their choices from the visible evidence.
4. Ask what evidence they would need before approving claims 3 and 4.
5. Bridge to the next slide: if a human still needs to validate and defend the summary, that is augment. If you cannot trace the claim at all, leave it alone.

Provenance note: every cell on this table is derived from the AI4RA synthetic sponsored-research corpus. If a participant presses on a number, you can literally open the JSON: proposals submitted come from proposals.json (counting submission_deadline by FY), awards received and award dollars come from awards.json (award_received_date, total_obligated), publications linked come from publications.json (year, award_ids). That is what defensibility looks like in practice: a reporter can trace every figure to a field in a file.

Claim 4 callback: the corpus DOES contain an industry sponsor (Pharmakom SRA with Meyer), but it is one award out of nine in FY25. So "industry drove growth" is not just unprovable from the aggregate, it is wrong once you drop into sponsor_code. Good moment to point out that AI summaries often sound plausible AND are defeated by a single query against the source.

Decision framework

Adoption and institutional fit

Automate: repetitive, low-stakes, output-driven tasks with clear validation
Augment: human-in-the-loop tasks where AI prepares and humans decide
Leave alone: process-critical work that requires full auditability

Discussion

What do you need at your institution?

Which dashboard claims at your institution would you send to leadership without manual checking?
What counts as ground truth in one of your real analytics workflows?
Where would reruns, sampling, or automated consistency checks add confidence?
Which outputs need source traceability before staff can act on them?
Which tasks are best framed as automate, augment, or leave alone?

Continuous learning

This landscape changes monthly

Today's best practices may be outdated by next quarter — build a habit, not just a checklist
Sponsor AI policies are updated with each funding cycle — check before every submission
New tools launch constantly — evaluate them against the governance criteria from Module 2
Your institution's AI policy is probably being rewritten right now — stay involved in that process

Trusted sources

EDUCAUSE — Higher ed technology research and AI guidance
NCURA — AI in research administration community
NIH Grants and NSF Policy — Sponsor AI guidance direct from the source
ai4ra.uidaho.edu — This project's ongoing work

Monday morning

Pick one analytics claim you're about to send to leadership. Write down the source rows or fields a skeptic would need to reach to prove it.
Pick one workflow currently owned by humans. Draft its verification plan: ground truth, sampling rate, trigger for full review.
Run the same extraction through two tools you use (Vandalizer + whatever you normally paste into). Compare where they disagree.
Take the "defend this number" exercise to your team. Pass around a dashboard screenshot and vote on which claims would survive a request for evidence.

3:50–4:00 PM

Wrap-up and Q&A

Workshop takeaway

What to take home

The most important model is the data model
Reproducibility comes from organization, not from AI
AI is a tool, not an authority — someone must always own the decision
Start with the thinnest layer that solves the problem reliably

Resources

Keep going

Vandalizer — AI-powered document intelligence for research administration
Promptulus — Practice AI literacy skills
Workshop site — Course content, teaching modules, slide decks
When Can AI Be Used in RA? — The three-zone decision framework
Repository — All materials are open source