Part 1
AI-Assisted EDA
15 min
AI4RA Workshop
Led by Nathan Layman
Module structure
🧪
Part 1
15 min
📊
Part 2
20 min
🏗
Part 3
15 min
☕
10 min
Activity 1 · AI-Assisted EDA
Upload a CSV, ask questions in plain English, and see what the AI gets right and what it gets wrong.
Activity 1 · Debrief
AI will confidently analyze incomplete or messy data. A polished narrative can hide the same gaps that a manual review would catch.
This is why governed data, and reproducibility, matters.
Where does the AI run?
Vendor-hosted (Google, OpenAI, Anthropic…)
Self-hosted (Llama, GPT-oss, Qwen…)
Cloud for most work, local for sensitive data
The model behind the demo
The demo used Google's Gemini 2.5 Flash, a cloud-hosted large language model (LLM). Remember Nathan's earlier explanation: LLMs predict the next token based on patterns in their training data. No understanding, no memory. Just very capable pattern matching.
That's how it could read your column names, understand your question in plain English, and write working SQL in response, all without being explicitly programmed for your dataset.
There are a lot of AI models, but
The most important model is your data model.
AI models come and go, but the way an institution organizes, governs, and shares its data determines whether any of them are useful.
Foundations
A data model defines the tables, columns, relationships, and rules that describe your data. It is the difference between a filing cabinet with labeled folders and a pile of papers on the floor.
Shared standards
A UDM is a data model that is shared across institutions. It defines a common set of tables and columns so that every institution can translate its own data into one shared language.
Your post-award system calls it AGENCY_NAME. Your ERP
calls it VENDOR_DESC. Your spreadsheet calls it "Sponsor."
Every system stores the same concept differently, and every integration
has to reconcile them from scratch.
A UDM gives every institution one shared language so tools, reports, and AI workflows work the same everywhere.
Shared standards
| Domain | Tables |
|---|---|
| Reference | Organization, AllowedValues, BudgetCategory |
| Core | Personnel, ContactDetails, Project |
| Pre-Award | RFA, RFARequirement, Proposal, ProposalBudget |
| Post-Award | Award, Modification, Terms, Subaward, CostShare, and more |
| Financial | Fund, Account, FinanceCode, Transaction, IndirectRate, Invoice |
| Personnel & Effort | ProjectRole, Effort |
| Compliance | ComplianceRequirement, ConflictOfInterest |
| Operations | ApplicationSystem, ServiceRequest |
| Faculty Dev | ProjectCohort, CohortParticipation |
| System | Document, ActivityLog |
40 tables, 100 relationships. Build an app on the UDM once and any institution that maps their data to it can use that app immediately.
Shared standards
ProjectRole, AwardBudgetPeriodAward_Number, Start_DateTableName_ID e.g. Award_ID, Personnel_IDSponsor_Organization_ID, not just Organization_IDIs_ prefix e.g. Is_Active, Is_PrimaryConsistent naming means an AI model can infer meaning from column names alone. A developer at a new institution can read the schema without documentation. Every adapter follows the same patterns.
Browse the full model at ui-insight.github.io/AI4RA-UDM
Shared standards
xkcd.com/927 — CC BY-NC 2.5
AI + Data
The hardest part of adopting a shared schema is the translation work. AI changes that equation. It can read a source schema and the UDM, then propose the mapping automatically.
Next question
We have a shared data model. Now we need a place to put it.
Infrastructure
Why a lakehouse
Who changed what, when, and why?
Schemas enforced, anomalies detected, freshness monitored.
Failover, backup, point-in-time recovery.
Storage and compute scale independently.
Open formats, interchangeable engines, no lock-in.
Structured, semi-structured, and unstructured in one platform.
A small team can operate it without dedicated platform engineers.
No redundant copies, compute scales to actual workload.
Architecture
Infrastructure
Open table formats (Apache Iceberg) mean no vendor lock-in. Swap any component without rewriting the stack.
Architecture
Raw data, exactly as received. No transformation.
Per-source mapping to UDM schema. One view per source per table.
Unified UDM tables. UNION ALL across Silver, deduped by primary key.
App-specific views. Joins, aggregations, access control.
Store everything. Retain raw files, soft deletes, and metadata. The audit trail starts here.
Translate once per source. AI or human maps each column to the UDM. No business logic, only standardization.
One truth. If two sources provide the same entity, priority rules pick the winner. Auto-generated by the platform.
Purpose-built. Each app gets exactly the view it needs. Dashboards, AI agents, and APIs all read from here.
Worked example
Tracing VERAS AGENCY_TYPE from source to application.
AGENCY_TYPE = 'Federal'
Raw, exactly as received from the VERAS API. No transformation.
Organization_Type = 'Sponsor'
CASE WHEN maps 'Federal' to UDM enum 'Sponsor'. Column renamed to UDM standard.
Same value, now part of Gold.UDM.Organization. If Banner also provides this org, dedup picks the higher-priority source.
Joined with Award data: "All active awards from Federal sponsors." Flat, access-controlled, ready for a dashboard or AI workflow.
The payoff
End of module
Download a CSV and upload it to the demo. These are synthetic research administration records.