Evaluate output before it enters a workflow
Participants should be able to define accuracy for a task, explain where reproducibility can fail, and require provenance before anyone acts on AI-generated output.
Module 3 - Led by Barrie Robison
This module helps participants decide whether AI output is trustworthy enough to use. The arc moves from evaluation across the data lifecycle to reproducibility testing, provenance checks, and the final decision about what to automate, augment, or leave alone.
Module brief
Use document extraction, search, synthesis, and reporting examples from the earlier sessions so evaluation feels like an extension of work they already do.
The main reusable output is a short plan naming the ground truth, replication strategy, provenance requirement, and human review threshold for a workflow they care about.
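One way to make the short plan concrete is to treat it as a record with four required entries. The sketch below is only an illustration: the class name EvaluationPlan, its field names, and the example workflow are assumptions made for this example, not part of the workshop materials.

```python
from dataclasses import dataclass, fields


@dataclass
class EvaluationPlan:
    """Hypothetical container for the four elements of the plan."""
    workflow: str
    ground_truth: str            # what counts as correct, and who verified it
    replication_strategy: str    # how the run is repeated and compared
    provenance_requirement: str  # what each output must cite before use
    human_review_threshold: str  # the condition that routes output to a person


def missing_elements(plan: EvaluationPlan) -> list[str]:
    """Return the names of any plan elements that were left blank."""
    return [f.name for f in fields(plan) if not getattr(plan, f.name).strip()]


# Invented example workflow; the last element is blank on purpose.
plan = EvaluationPlan(
    workflow="Extract deadlines from funding announcements",
    ground_truth="Deadlines hand-checked by staff on a sample of announcements",
    replication_strategy="Run each document several times and compare outputs",
    provenance_requirement="Every deadline cites the page and sentence it came from",
    human_review_threshold="",
)
print(missing_elements(plan))  # ['human_review_threshold']
```

A blank entry is a useful prompt in the room: a plan that cannot say when a person steps in is not ready to use.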
The current slide deck is a condensed version of this module's evaluation framing, extraction experiments, provenance activity, and adoption decision framework.
A practical opening is to remind the audience that AI can help with extraction, cleaning, search, and synthesis, but each phase fails differently. The question is not whether the model can do something interesting. The question is what evidence would make the output trustworthy enough to use here.
Core teaching arc
Extraction can hallucinate fields, cleaning can drop valid records, search can miss relevant items, and synthesis can smooth over uncertainty. Trust requires evaluation moves that fit the phase, not a single generic quality check.
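To show what phase-specific evaluation can look like, here is a minimal sketch with one check for extraction and one for search. The function names, field names, and sample values are invented for illustration; a real exercise would substitute hand-verified ground truth from the participant's own workflow.

```python
def field_accuracy(extracted: dict, truth: dict) -> dict:
    """Compare extracted fields against hand-verified ground truth."""
    if not truth:
        raise ValueError("ground truth must contain at least one field")
    correct = [k for k in truth if k in extracted and extracted[k] == truth[k]]
    wrong = [k for k in truth if k in extracted and extracted[k] != truth[k]]
    missing = [k for k in truth if k not in extracted]
    # Fields the model produced that the ground truth never defined:
    # candidates for hallucinated fields, to be checked by a person.
    unexpected = [k for k in extracted if k not in truth]
    return {
        "accuracy": len(correct) / len(truth),
        "wrong": wrong,
        "missing": missing,
        "unexpected": unexpected,
    }


def recall(retrieved: set, relevant: set) -> float:
    """Share of the known-relevant items that the search actually returned."""
    if not relevant:
        raise ValueError("need at least one known-relevant item")
    return len(retrieved & relevant) / len(relevant)


# Invented sample values, not drawn from any real document.
truth = {"sponsor": "NSF", "deadline": "2025-03-01", "budget_cap": "500000"}
extracted = {"sponsor": "NSF", "deadline": "2025-03-15", "page_limit": "15"}
report = field_accuracy(extracted, truth)
print(report["wrong"], report["missing"], report["unexpected"])
# ['deadline'] ['budget_cap'] ['page_limit']

print(recall({"d1", "d2", "d5"}, {"d1", "d2", "d3", "d4"}))  # 0.5
```

The two functions answer different questions on purpose: extraction is scored field by field against a known answer, while search is scored on what it failed to return.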
OCR quality sets the ceiling, data models catch structural errors, structured outputs reduce format drift, and repeated runs expose variability that a single successful demo hides.
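The structural check and the repeated-run comparison can be sketched in a few lines. This is an illustrative example using only the standard library; the required fields, the three sample runs, and the function names are assumptions, and a real workflow might use a schema library instead of the hand-written check shown here.

```python
from collections import Counter

# Hypothetical data model: field name -> accepted Python type(s).
REQUIRED = {"title": str, "amount": (int, float)}


def structural_errors(record: dict) -> list[str]:
    """List fields that are missing or have the wrong type."""
    errors = []
    for name, expected in REQUIRED.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            errors.append(f"wrong type for {name}")
    return errors


def agreement_by_field(runs: list[dict]) -> dict:
    """For each field, the share of runs that match the most common value."""
    names = sorted({key for run in runs for key in run})
    result = {}
    for name in names:
        values = [repr(run.get(name)) for run in runs]
        top_count = Counter(values).most_common(1)[0][1]
        result[name] = top_count / len(runs)
    return result


# Three invented outputs for the same document and prompt.
runs = [
    {"title": "Soil Study", "amount": 250000},
    {"title": "Soil Study", "amount": 250000},
    {"title": "Soil Study", "amount": "250,000"},  # format drift on run 3
]
print(agreement_by_field(runs))
print(structural_errors(runs[2]))  # ['wrong type for amount']
```

Any one of these runs would pass as a demo. The drift in the amount field only becomes visible when the runs are compared with each other.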
Participants should expect traceability for extracted requirements and use the automate, augment, or leave-alone framework when auditability is missing or the stakes are high.
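A traceability expectation can be expressed as a check that every extracted requirement cites a page and a quote, and that the quote really appears on that page. The sketch below assumes a hypothetical item shape with "page" and "quote" keys and uses invented page text; it illustrates the idea and does not describe how any particular tool stores provenance.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not block a match."""
    return " ".join(text.lower().split())


def provenance_problems(item: dict, pages: dict[int, str]) -> list[str]:
    """Return reasons an extracted requirement cannot be traced to its source."""
    page = item.get("page")
    quote = item.get("quote", "")
    if page is None or not quote:
        return ["no page or quote cited"]
    page_text = pages.get(page)
    if page_text is None:
        return [f"cited page {page} not in source"]
    if normalize(quote) not in normalize(page_text):
        return ["quote not found on cited page"]
    return []


# Invented source text keyed by page number.
pages = {
    1: "Proposals are due March 1, 2025.",
    2: "Budgets may not exceed $500,000 in total costs.",
}
items = [
    {"requirement": "Budget cap of $500,000", "page": 2,
     "quote": "Budgets may not exceed $500,000"},
    {"requirement": "Letters of support required", "page": 3,
     "quote": "Letters of support are required"},
    {"requirement": "Cost share is mandatory"},
]
for item in items:
    print(item["requirement"], "->", provenance_problems(item, pages))
```

An empty list means the requirement can be traced. Anything else is a reason to send the item to human review rather than act on it.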
Example and activity
Rather than demoing extraction as magic, use the Vandalizer to run controlled comparisons: clean versus scanned PDFs, one model versus another, schema enforcement on versus off, and a single run versus repeated runs. Participants can see how accuracy changes when one variable moves at a time.
This makes the session feel empirical. The tool matters less than the method: define the comparison, hold the rest steady, and measure what changes.
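The method of changing one variable at a time can be written down as a list of conditions before anything is run. The sketch below only builds that list; the factor names and values are placeholders chosen for illustration, and it does not call the Vandalizer or any model.

```python
# Hypothetical baseline and the alternatives to test against it.
BASELINE = {"pdf": "clean", "model": "model-a", "schema": "enforced"}
ALTERNATIVES = {"pdf": ["scanned"], "model": ["model-b"], "schema": ["off"]}


def one_factor_conditions(baseline: dict, alternatives: dict) -> list[tuple[str, dict]]:
    """Build the baseline plus one condition per changed factor value."""
    conditions = [("baseline", dict(baseline))]
    for factor, values in alternatives.items():
        for value in values:
            condition = dict(baseline)  # copy so the baseline stays untouched
            condition[factor] = value   # change exactly one factor
            conditions.append((f"{factor}={value}", condition))
    return conditions


conditions = one_factor_conditions(BASELINE, ALTERNATIVES)
for label, condition in conditions:
    print(label, condition)
```

Each condition would then be run several times on the same documents and scored against the same ground truth, so that any change in accuracy can be attributed to the one factor that moved.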
End with a tiny two-year research metrics table and a handful of AI-generated claims. Ask the room which statements they would feel safe sending to leadership without checking. The point is to separate claims that are directly supported from claims that are interpretive, false, or missing provenance.
Every figure on the table is derived from the AI4RA synthetic sponsored-research corpus, so each cell traces back to a field in records/proposals.json, awards.json, or publications.json. That is the payoff: if a participant presses on a number, the presenter can literally open the JSON and point to the row that produced it.
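The claim-checking activity can be sketched as recomputing a figure from records and returning the rows that produced it. The records, field names ("fiscal_year", "amount"), and claims below are invented for illustration and are not taken from the AI4RA corpus; the real files may be structured differently.

```python
# Invented award records standing in for rows of a JSON file.
awards = [
    {"id": "A1", "fiscal_year": 2023, "amount": 100_000},
    {"id": "A2", "fiscal_year": 2024, "amount": 150_000},
    {"id": "A3", "fiscal_year": 2024, "amount": 50_000},
]


def total_awarded(records: list[dict], year: int) -> tuple[int, list[str]]:
    """Recompute a yearly total and return the row ids behind it."""
    rows = [r for r in records if r["fiscal_year"] == year]
    return sum(r["amount"] for r in rows), [r["id"] for r in rows]


def classify(claim: dict, records: list[dict]) -> tuple[str, list[str]]:
    """Label a claim and list the rows that support or contradict it."""
    if claim.get("year") is None or claim.get("value") is None:
        # Nothing checkable was cited: interpretive or missing provenance.
        return "needs review", []
    total, row_ids = total_awarded(records, claim["year"])
    label = "supported" if total == claim["value"] else "false"
    return label, row_ids


claims = [
    {"text": "FY2024 awards totaled $200,000", "year": 2024, "value": 200_000},
    {"text": "FY2023 awards totaled $120,000", "year": 2023, "value": 120_000},
    {"text": "Funding momentum is strong"},
]
for claim in claims:
    print(claim["text"], "->", classify(claim, awards))
```

The third claim is the instructive one: it may well be true, but nothing in the records can confirm it, so it should not go to leadership unchecked.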
Facilitation support
This is the final session. Close by reinforcing four takeaways: check accuracy against evidence, treat reproducibility as a design requirement, expect provenance for high-stakes answers, and decide whether to automate, augment, or leave a task alone only after auditability is clear.
Derived assets
A Reveal.js slide outline is available for live delivery, workshop rehearsal, and follow-up refinement. It should remain a condensed version of the lifecycle framing, Vandalizer evaluation exercises, provenance activity, short context-for-trust setup, final analytics demo, and adoption decision framework documented here.