Accuracy, Reproducibility, and Provenance Module

Module brief

Help participants decide whether AI output can be trusted.

Learning goal

Evaluate output before it enters a workflow

Participants should be able to define accuracy for a task, explain where reproducibility can fail, and require provenance before anyone acts on AI-generated output.

In-room move

Teach from familiar analytics and extraction tasks

Use document extraction, search, synthesis, and reporting examples from the earlier sessions so evaluation feels like an extension of work they already do.

Participant artifact

An evaluation plan for one workflow

The main reusable output is a short plan naming the ground truth, replication strategy, provenance requirement, and human review threshold for a workflow they care about.

Derived assets

Slide deck, Vandalizer exercises, and provenance example

The current slide deck is a condensed version of this module's evaluation framing, extraction experiments, provenance activity, and adoption decision framework.

Learning objectives

What participants should leave able to do

Measure accuracy against ground truth, schema expectations, or another authoritative source instead of trusting fluent output.
Check reproducibility by rerunning workflows and comparing outputs across controlled variations.
Require provenance and human review before high-stakes answers enter institutional workflows.
Decide what to automate, augment, or leave alone based on auditability and risk.

Lecture framing

Start with trust, not tool enthusiasm

A practical opening is to remind the audience that AI can help with extraction, cleaning, search, and synthesis, but each phase fails differently. The question is not whether the model can do something interesting. The question is what evidence would make the output trustworthy enough to use here.

Core teaching arc

Move from lifecycle risk to practical trust checks.

Lifecycle framing

Every phase has its own failure modes

Extraction can hallucinate fields, cleaning can drop valid records, search can miss relevant items, and synthesis can smooth over uncertainty. Trust requires evaluation moves that fit the phase, not a single generic quality check.

Evaluation moves

Accuracy needs a definition

Compare outputs to known correct answers or another authoritative source.
Use sampling and automated consistency checks when scale makes full review impossible.
Match the amount of validation to the stakes of the workflow.
Teach participants to distrust confident-sounding output and look for evidence instead.
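
As a rough illustration of the first two moves, the sketch below scores extracted fields against a hand-verified answer key and draws a repeatable random sample for human review. The function names, record IDs, and field values are hypothetical and are not taken from the workshop materials.

```python
import random

def field_accuracy(extracted, ground_truth):
    """Share of records where each field exactly matches the answer key."""
    totals, matches = {}, {}
    for record_id, truth in ground_truth.items():
        output = extracted.get(record_id, {})  # a missing record counts as all wrong
        for field, expected in truth.items():
            totals[field] = totals.get(field, 0) + 1
            if field in output and output[field] == expected:
                matches[field] = matches.get(field, 0) + 1
    return {field: matches.get(field, 0) / totals[field] for field in totals}

def review_sample(record_ids, k, seed=0):
    """Pick k records for manual review; a fixed seed keeps the sample repeatable."""
    ids = sorted(record_ids)
    return random.Random(seed).sample(ids, min(k, len(ids)))

# Hypothetical answer key and model output for two documents.
ground_truth = {
    "doc-1": {"deadline": "2025-01-15", "page_limit": 15},
    "doc-2": {"deadline": "2025-03-01", "page_limit": 10},
}
extracted = {
    "doc-1": {"deadline": "2025-01-15", "page_limit": 15},
    "doc-2": {"deadline": "2025-03-01", "page_limit": 12},
}

accuracy = field_accuracy(extracted, ground_truth)
to_review = review_sample(extracted, k=1)
print(accuracy)
print(to_review)
```

Reporting accuracy per field rather than as one overall score makes it easier to see which fields need more validation, which is where the stakes of the workflow come in.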

Reproducibility and guardrails

OCR, schemas, and replication reveal instability

OCR quality sets the ceiling on extraction accuracy, data models catch structural errors, structured outputs reduce format drift, and repeated runs expose variability that a single successful demo hides.
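
To make "data models catch structural errors" concrete, here is a minimal sketch of a schema check that uses only the Python standard library. The three fields and their rules are assumptions chosen for illustration; a real workflow would define its own data model, possibly with a dedicated validation library.

```python
from datetime import date

def is_iso_date(value):
    """True when value is a string such as '2025-01-15'."""
    try:
        date.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False

def is_positive_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value > 0

def is_nonempty_text(value):
    return isinstance(value, str) and value.strip() != ""

# Hypothetical schema: field name -> check the value must pass.
SCHEMA = {
    "title": is_nonempty_text,
    "deadline": is_iso_date,
    "page_limit": is_positive_int,
}

def validate_record(record, schema=SCHEMA):
    """Return a list of structural problems; an empty list means the record fits."""
    problems = []
    for field, check in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not check(record[field]):
            problems.append(f"invalid value for {field}: {record[field]!r}")
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems

# Free-form output tends to drift in format; the check surfaces that drift.
drifted = {"title": "Example solicitation", "deadline": "January 15",
           "page_limit": "15 pages", "sponsor": "NSF"}
for problem in validate_record(drifted):
    print(problem)
```

A check like this catches structural errors only. A well-formed date can still be the wrong date, so it complements comparison against ground truth rather than replacing it.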

Provenance and institutional fit

Ask where the answer came from and whether it can be audited

Participants should expect traceability for extracted requirements and use the automate, augment, or leave-alone framework when auditability is missing or the stakes are high.

Suggested teaching flow

A sequence that matches the current deck

Open with the research analytics lifecycle and ask how each phase can fail.
Define evaluation and emphasize authoritative sources rather than fluent output.
Use the Vandalizer exercise to distinguish ground-truth accuracy from replication.
Show OCR quality and schema design as upstream constraints on extraction quality.
Demonstrate multiple-pass consensus as a practical reproducibility test.
Shift to provenance with the NSF solicitation example and require source traceability.
Use a short context-for-trust setup to show what makes an analytics summary safer: reporting rules, source table, template, and human review.
Close with the final reporting exercise and then the adoption decision framework: what to automate, augment, or leave alone.

Decision aid

What context should establish before a summary is shared

Define the reporting period, metric rules, and scope so the model cannot invent them.
Provide the actual table, query result, or trusted source extract behind the summary request.
Use a template that separates directly supported facts from interpretation.
Keep a human review step whenever attribution, causality, anomaly explanation, or institutional judgment is involved.
If a claim cannot be traced to visible evidence, it is not ready to send.

Example and activity

Use experiments and provenance checks to make trust visible.

Worked example

Turn Vandalizer into an evaluation lab

Rather than demoing extraction as magic, use the Vandalizer to compare clean versus scanned PDFs, one model versus another, schema-enforced versus free-form output, and repeated runs of the same task. Participants can see how accuracy changes when one variable moves at a time.

This makes the session feel empirical. The tool matters less than the method: define the comparison, hold the rest steady, and measure what changes.

Institutional example

Quarterly research metrics reporting still needs evidence

An AI summary can sound polished while using stale numbers, mismatching fiscal years, or inventing trends.
Trusted use requires validated source systems, explicit checks against those systems, and human review for anomalies.
This is the lead-in to the final exercise: some claims are directly supported, some are interpretive, and some have no provenance at all.

Hands-on exercise

Four experiments participants can run

Compare extraction from a clean digital PDF with a scanned version and note how OCR quality caps extraction accuracy.
Run the same task with different models and look for agreement, divergence, and failure modes.
Test structured outputs against free-form outputs and note where schemas prevent or surface errors.
Repeat the same extraction multiple times and use consensus or disagreement as a reproducibility signal.
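
The fourth experiment can be sketched as a simple majority vote over repeated runs. This is one possible way to compute consensus, not necessarily how the Vandalizer does it; the function name, the 0.8 threshold, and the sample runs are all hypothetical.

```python
from collections import Counter

def consensus(runs, min_agreement=0.8):
    """For each field, report the majority value, its share of runs, and a review flag.

    Values must be hashable (strings, numbers, None). A field missing from a
    run is counted as None for that run.
    """
    fields = sorted({field for run in runs for field in run})
    report = {}
    for field in fields:
        values = [run.get(field) for run in runs]
        value, count = Counter(values).most_common(1)[0]
        agreement = count / len(runs)
        report[field] = {
            "value": value,
            "agreement": agreement,
            "needs_review": agreement < min_agreement,
        }
    return report

# Three hypothetical runs of the same extraction on the same document.
runs = [
    {"deadline": "2025-01-15", "page_limit": 15},
    {"deadline": "2025-01-15", "page_limit": 15},
    {"deadline": "2025-01-15", "page_limit": 12},
]

for field, result in consensus(runs).items():
    print(field, result)
```

Agreement across runs measures stability, not correctness: a model can return the same wrong value every time, so fields that pass this check still need a comparison against ground truth.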

Provenance exercise

Ask where the answer came from

Use the NSF solicitation workflow to surface deadlines, eligibility rules, page limits, special conditions, and budget requirements.
Pause before extraction and ask participants what would be easy to miss or misread by hand.
After extraction, require every important field to trace back to the exact passage, page, or section that supports it.
If a value cannot be traced, treat it as unverified and not ready to drive a workflow.
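
One way to enforce the traceability rule above is to require each extracted field to carry a page number and a supporting quote, then confirm that the quote appears on that page and contains the value. The sketch below assumes that record shape and uses invented page text; it is not the format of any particular tool.

```python
def normalize(text):
    """Collapse whitespace and case so line breaks in the PDF text do not block a match."""
    return " ".join(text.split()).lower()

def trace_field(field, pages):
    """Return 'traced' or 'unverified' for one extracted field.

    field: dict with 'value', 'page', and 'quote' keys (hypothetical shape).
    pages: dict mapping page number to that page's text.
    """
    page_text = pages.get(field.get("page"))
    quote = normalize(field.get("quote") or "")
    value = normalize(str(field.get("value") or ""))
    if page_text is None or not quote or not value:
        return "unverified"  # no page, no quote, or no value to check
    if quote not in normalize(page_text):
        return "unverified"  # the quoted passage is not on the cited page
    if value not in quote:
        return "unverified"  # the passage does not actually contain the value
    return "traced"

# Invented page text standing in for a solicitation.
pages = {3: "Project descriptions are limited to\n15 pages. Letters of intent are optional."}

fields = {
    "page_limit": {"value": "15", "page": 3, "quote": "limited to 15 pages"},
    "budget_cap": {"value": "500,000", "page": 3, "quote": "budgets may not exceed $500,000"},
    "eligibility": {"value": "US institutions", "page": None, "quote": ""},
}

for name, field in fields.items():
    print(name, trace_field(field, pages))
```

Exact substring matching is deliberately strict. It will reject paraphrased quotes, which suits the exercise because a paraphrase is not a citation.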

Final interactive demo

Can you defend this number?

End with a tiny two-year research metrics table and a handful of AI-generated claims. Ask the room which statements they would feel safe sending to leadership without checking. The point is to separate claims that are directly supported from claims that are interpretive, false, or missing provenance.

Every figure in the table is derived from the AI4RA synthetic sponsored-research corpus, so each cell traces back to a field in records/proposals.json, awards.json, or publications.json. That is the payoff: if a participant presses on a number, the presenter can open the JSON and point to the record that produced it.

0:00-0:30 Show the mini table and frame it as a dashboard snapshot plus an AI-drafted leadership summary.
0:30-1:30 Read four claims aloud and ask for a quick vote on which ones are safe to send as written.
1:30-3:00 Reveal the answer key: two claims are directly supported, one is contradicted by the numbers, and one has no provenance.
3:00-4:00 Ask what additional source evidence or review would be required before approving the risky claims. Mention that the claim with no provenance, the provenance trap, is defeated by a one-field lookup in awards.json.
4:00-5:00 Bridge to the adoption framework: supported low-stakes outputs may be automatable, checked summaries fall under augment, and anything without traceability belongs under leave alone.
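
For presenters who want to show the answer-key logic on screen, the sketch below sorts claims into the three outcomes used in the reveal. The table values and claims are invented placeholders, not figures from the AI4RA corpus, and the claim structure (a cited cell plus a claimed value) is an assumption.

```python
# Placeholder two-year table: (fiscal year, metric) -> value. Not real corpus data.
TABLE = {
    ("FY2023", "awards"): 40,
    ("FY2024", "awards"): 46,
}

def classify_claim(claim, table):
    """Return 'supported', 'contradicted', or 'no provenance' for a numeric claim."""
    cell = claim.get("cell")
    if cell is None or cell not in table:
        return "no provenance"  # nothing in the table backs this claim
    if table[cell] == claim["value"]:
        return "supported"
    return "contradicted"

# Each hypothetical claim cites the cell it relies on, or None if it cites nothing.
claims = [
    {"text": "Awards reached 46 in FY2024.", "cell": ("FY2024", "awards"), "value": 46},
    {"text": "Awards were 50 in FY2023.", "cell": ("FY2023", "awards"), "value": 50},
    {"text": "Most new awards came from one sponsor.", "cell": None, "value": None},
]

for claim in claims:
    print(classify_claim(claim, TABLE), "-", claim["text"])
```

Interpretive claims, such as explanations of why a number changed, do not reduce to a cell lookup, which is why they stay with a human reviewer.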

Discussion prompt

Questions to ask participants

What counts as ground truth in one of your real workflows?
Where would sampling, reruns, or automated consistency checks add confidence?
Which outputs need source traceability before staff can act on them?
Where is AI helpful as a drafting assistant, and where does auditability force human ownership?
Which tasks at your institution are best framed as automate, augment, or leave alone?

Facilitation support

Keep the conversation anchored in evidence, not confidence.

Speaker notes

Talking points for the presenter

Keep returning to the question: what would count as evidence that this output is right?
Emphasize that OCR quality and data-model design are upstream quality controls, not optional polish.
When participants ask which model is best, redirect to evaluation design, provenance, and auditability.
Use the bench-science analogy for replication because this audience already understands controlled variation and repeated measures.
Treat context as a supporting design choice: it helps only when it makes the resulting summary easier to verify and defend.

REACH sessions to highlight

Complementary sessions on Monday and Tuesday

E5 - Building a Crosswalk: A Practical Framework for Standardizing Messy Research Data (Mon 3:45 PM, NEWPORT).
C2 - Bridging Data Silos in Academia: Smartsheet, Tableau, and Power BI as Catalysts (Mon 1:30 PM, WEATHERLY).
D3 - Demystifying SQL: Interpreting and Building Queries for Beginners (Mon 2:30 PM, COLUMBIA).
I3 - From Manual Downloading to Automated Data Collection with APIs (Tue 2:30 PM, COLUMBIA).
F2 - Can Prompt Engineering Turn General Questions into Actionable Research Intelligence? (Tue 10:15 AM, WEATHERLY).
F1 - Lessons Learned from Implementing AI Agents for Higher Ed Research Compliance (Tue 10:15 AM, FREEDOM).

Workshop close

Wrap-up and takeaway

This is the final session. Close by reinforcing four takeaways: check accuracy against evidence, treat reproducibility as a design requirement, expect provenance for high-stakes answers, and decide whether to automate, augment, or leave alone only after auditability is clear.