Back to Sessions

AI4RA Workshop

The data lakehouse and data organization

Nathan Layman

Led by Nathan Layman

Module structure

Data Lakehouse Module Preview

Part 1

AI-Assisted EDA

15 min

Part 2

Data Models & the UDM

20 min

Part 3

The Data Lakehouse

15 min

 

Break

10 min

Activity 1 · AI-Assisted EDA

Let's explore a dataset together using AI-assisted analysis.

Upload a CSV, ask questions in plain English, and see what the AI gets right and what it gets wrong.

Launch demo →

Activity 1 · Debrief

What did you notice?

Discussion
  • Did the AI ever sound confident but wrong?
  • Were there missing values or gaps it glossed over?
  • Did you get different answers by rephrasing the same question?
  • Could you tell which results were trustworthy without checking?
  • If a colleague ran the same query, would they get the same answer?
The takeaway

AI analyzes what you give it

AI will confidently analyze incomplete or messy data. A polished narrative can hide the same gaps that a manual review would catch.

This is why governed data, and reproducibility, matters.

Where does the AI run?

Cloud vs. locally hosted models

Vendor-hosted (Google, OpenAI, Anthropic…)

  • Access to the largest, most capable models
  • No hardware to buy or maintain
  • Pay per use, scales up and down
Security: Data leaves your network. Requires enterprise data agreements, BAAs, or similar contracts. Some data (FERPA, HIPAA, export-controlled) may not be allowed to leave your infrastructure at all.

Self-hosted (Llama, GPT-oss, Qwen…)

  • Data never leaves your infrastructure
  • Full control over the model and its environment
  • Fixed cost with hardware investment up front
  • Tools like Ollama and vLLM make local deployment accessible
Security: Data stays local, but requires security expertise, infrastructure, and compute resources.

Cloud for most work, local for sensitive data

  • Use cloud AI for general tasks and high throughput
  • Small teams run desktop servers (NVIDIA DGX Spark, Mac Mini) for restricted data
  • Not built for heavy workloads, but keeps critical data on-site
Challenge: Governance complexity. You need clear policies about which models are approved for which data sources, and who enforces that.

The model behind the demo

Gemini 2.5 Flash

The demo used Google's Gemini 2.5 Flash, a cloud-hosted large language model (LLM). Remember Nathan's earlier explanation: LLMs predict the next token based on patterns in their training data. No understanding, no memory. Just very capable pattern matching.

That's how it could read your column names, understand your question in plain English, and write working SQL in response, all without being explicitly programmed for your dataset.

There are a lot of AI models, but

Which is the most important one?

 

Wrong!

Data model diagram

The most important model is your data model.

AI models come and go, but the way an institution organizes, governs, and shares its data determines whether any of them are useful.

Foundations

What is a data model?

Definition

The blueprint for how your institution stores information

A data model defines the tables, columns, relationships, and rules that describe your data. It is the difference between a filing cabinet with labeled folders and a pile of papers on the floor.

What you already have

Your systems already have data models

  • Your post-award system has one
  • Your ERP has another
  • Your HR system has a third
  • They all describe the same things differently

Shared standards

What is a Universal Data Model?

Definition

A shared data standard across institutions

A UDM is a data model that is shared across institutions. It defines a common set of tables and columns so that every institution can translate its own data into one shared language.

Sound familiar?

Every institution speaks a different data language

Your post-award system calls it AGENCY_NAME. Your ERP calls it VENDOR_DESC. Your spreadsheet calls it "Sponsor." Every system stores the same concept differently, and every integration has to reconcile them from scratch.

A UDM gives every institution one shared language so tools, reports, and AI workflows work the same everywhere.

Shared standards

The AI4RA Universal Data Model

Domain Tables
ReferenceOrganization, AllowedValues, BudgetCategory
CorePersonnel, ContactDetails, Project
Pre-AwardRFA, RFARequirement, Proposal, ProposalBudget
Post-AwardAward, Modification, Terms, Subaward, CostShare, and more
FinancialFund, Account, FinanceCode, Transaction, IndirectRate, Invoice
Personnel & EffortProjectRole, Effort
ComplianceComplianceRequirement, ConflictOfInterest
OperationsApplicationSystem, ServiceRequest
Faculty DevProjectCohort, CohortParticipation
SystemDocument, ActivityLog

40 tables, 100 relationships. Build an app on the UDM once and any institution that maps their data to it can use that app immediately.

Shared standards

A predictable, self-describing schema

Naming conventions
  • Tables: PascalCase e.g. ProjectRole, AwardBudgetPeriod
  • Columns: Snake_Case e.g. Award_Number, Start_Date
  • Primary keys: TableName_ID e.g. Award_ID, Personnel_ID
  • Foreign keys: Named by role e.g. Sponsor_Organization_ID, not just Organization_ID
  • Booleans: Is_ prefix e.g. Is_Active, Is_Primary
Why this matters

AI and humans can read the schema

Consistent naming means an AI model can infer meaning from column names alone. A developer at a new institution can read the schema without documentation. Every adapter follows the same patterns.

Browse the full model at ui-insight.github.io/AI4RA-UDM

Shared standards

The standards proliferation problem

XKCD 927: Standards

xkcd.com/927 — CC BY-NC 2.5

AI + Data

AI makes the UDM practical

Universal converter box

The hardest part of adopting a shared schema is the translation work. AI changes that equation. It can read a source schema and the UDM, then propose the mapping automatically.

  • Map each source once instead of building integrations from scratch
  • AI suggests the mapping; humans review and approve
  • Build tools once and deploy them at every institution
  • Every adapter works without AI too. It is a bonus, not a requirement.

Next question

Now how do we handle all that data?

We have a shared data model. Now we need a place to put it.

Infrastructure

Four approaches to institutional data

  • Structured, governed, fast queries
  • Pre-defined schemas enforce data quality
  • Great for reporting dashboards and compliance queries
  • Rigid schemas are expensive and slow to change
  • Cannot handle unstructured data (PDFs, images, JSON)
Best for: stable reporting needs with predictable schemas.
  • Flexible raw storage, any format, cheap
  • Store everything first, decide structure later
  • Great for archiving raw files and ML training data
  • No schema enforcement or query optimization
  • Without governance, becomes a "data swamp"
Supports structured, semi-structured, and unstructured data natively.
  • Flexible storage + structured governance layers on top
  • Open table formats (Iceberg) enable ACID transactions and schema evolution
  • Medallion architecture transforms raw data into governed views
  • Separation of storage and compute keeps costs down
  • Works for structured, semi-structured, and unstructured data
The architecture the GRANTED project chose for university research data.
  • What happens when a data lake loses governance
  • Data gets dumped in with no schema, no ownership, no documentation
  • Nobody knows what's in there, how old it is, or whether to trust it
  • Queries return results, but nobody can verify if they're correct
  • Every new project starts by re-discovering and re-cleaning the same data
This is the default outcome without deliberate governance. Most institutions are here.

Why a lakehouse

Capabilities of a data lakehouse

Trust & Governance

Who changed what, when, and why?

Data Quality

Schemas enforced, anomalies detected, freshness monitored.

Reliability

Failover, backup, point-in-time recovery.

Scalability

Storage and compute scale independently.

Openness

Open formats, interchangeable engines, no lock-in.

Multi-Modal

Structured, semi-structured, and unstructured in one platform.

Simplicity

A small team can operate it without dedicated platform engineers.

Cost Efficiency

No redundant copies, compute scales to actual workload.

Architecture

The GRANTED data lakehouse

Data Flow Bronze Silver Platinum Live Writeback

Infrastructure

Open technology, interchangeable parts

Technology choices

Every layer is replaceable

  • Ingestion: Airbyte, Fivetran, custom scripts, or our own sync engine
  • Storage: MinIO, AWS S3, PostgreSQL, or Snowflake
  • Query: Trino, DuckDB, Spark, or Dremio
  • Applications: Tableau, Power BI, LLM agents, or custom apps

Open table formats (Apache Iceberg) mean no vendor lock-in. Swap any component without rewriting the stack.

Implementation patterns

Two paths to a governed data layer

  • Medallion (recommended): Full Bronze → Silver → Gold → Platinum pipeline. Best for multi-source, multi-institution use cases.
  • Direct database: Connect existing databases and map tables directly to UDM views. Faster to start, but harder to govern at scale.

Architecture

The medallion layers

Bronze

Raw data, exactly as received. No transformation.

Silver

Per-source mapping to UDM schema. One view per source per table.

Gold

Unified UDM tables. UNION ALL across Silver, deduped by primary key.

Platinum

App-specific views. Joins, aggregations, access control.

Store everything. Retain raw files, soft deletes, and metadata. The audit trail starts here.

Translate once per source. AI or human maps each column to the UDM. No business logic, only standardization.

One truth. If two sources provide the same entity, priority rules pick the winner. Auto-generated by the platform.

Purpose-built. Each app gets exactly the view it needs. Dashboards, AI agents, and APIs all read from here.

Worked example

Following data through the layers

Tracing VERAS AGENCY_TYPE from source to application.

Bronze

AGENCY_TYPE = 'Federal'
Raw, exactly as received from the VERAS API. No transformation.

Silver

Organization_Type = 'Sponsor'
CASE WHEN maps 'Federal' to UDM enum 'Sponsor'. Column renamed to UDM standard.

Gold

Same value, now part of Gold.UDM.Organization. If Banner also provides this org, dedup picks the higher-priority source.

Platinum

Joined with Award data: "All active awards from Federal sponsors." Flat, access-controlled, ready for a dashboard or AI workflow.

The payoff

What well-organized data lets you do

With governed data
  • Ask the same question at any institution and get a comparable answer
  • Build AI workflows that cite their sources and can be audited
  • Generate reproducible reports that match across departments
  • Automate routine tasks with confidence in the underlying data
Without it
  • AI blends contradictory sources into confident-sounding wrong answers
  • Every new report requires a new integration from scratch
  • Nobody knows which spreadsheet is current or authoritative
  • Automation multiplies errors instead of eliminating them

End of module

Back to Sessions

Return to session list

Sample datasets for Data Crawler Carl

Download a CSV and upload it to the demo. These are synthetic research administration records.