Back to Sessions

AI4RA Workshop

The data lakehouse and data organization

Led by Nathan Layman

Module structure

Data Lakehouse Module Preview

Part 1

AI-Assisted EDA

15 min

Part 2

Data Models & the UDM

20 min

Part 3

The Data Lakehouse

15 min

Break

10 min

Activity 1 · AI-Assisted EDA

Let's explore a dataset together using AI-assisted analysis.

Upload a CSV, ask questions in plain English, and see what the AI gets right and what it gets wrong.

Launch demo →

Activity 1 · Debrief

What did you notice?

Discussion

Did the AI ever sound confident but wrong?
Were there missing values or gaps it glossed over?
Did you get different answers by rephrasing the same question?
Could you tell which results were trustworthy without checking?
If a colleague ran the same query, would they get the same answer?

The takeaway

AI analyzes what you give it

AI will confidently analyze incomplete or messy data. A polished narrative can hide the same gaps that a manual review would catch.

This is why governed data, and reproducibility, matters.

Where does the AI run?

Cloud vs. locally hosted models

Vendor-hosted (Google, OpenAI, Anthropic…)

Access to the largest, most capable models
No hardware to buy or maintain
Pay per use, scales up and down

Security: Data leaves your network. Requires enterprise data agreements, BAAs, or similar contracts. Some data (FERPA, HIPAA, export-controlled) may not be allowed to leave your infrastructure at all.

Self-hosted (Llama, GPT-oss, Qwen…)

Data never leaves your infrastructure
Full control over the model and its environment
Fixed cost with hardware investment up front
Tools like Ollama and vLLM make local deployment accessible

Security: Data stays local, but requires security expertise, infrastructure, and compute resources.

Cloud for most work, local for sensitive data

Use cloud AI for general tasks and high throughput
Small teams run desktop servers (NVIDIA DGX Spark, Mac Mini) for restricted data
Not built for heavy workloads, but keeps critical data on-site

Challenge: Governance complexity. You need clear policies about which models are approved for which data sources, and who enforces that.

The model behind the demo

Gemini 2.5 Flash

The demo used Google's Gemini 2.5 Flash, a cloud-hosted large language model (LLM). Remember Nathan's earlier explanation: LLMs predict the next token based on patterns in their training data. No understanding, no memory. Just very capable pattern matching.

That's how it could read your column names, understand your question in plain English, and write working SQL in response, all without being explicitly programmed for your dataset.

There are a lot of AI models, but

Which is the most important one?

A Large Language Model A neural network trained on massive text corpora to predict and generate human-like language. B Generative Pre-trained Transformer The architecture behind GPT. Uses self-attention to process sequences and generate text token by token. C Convolutional Neural Network (CNN) Specialized for image and spatial data. Uses filters to detect patterns like edges, shapes, and objects in images. D Generative Adversarial Network (GAN) Two neural networks compete. One generates fakes, the other detects them, producing increasingly realistic synthetic data. E Diffusion Model (DALL-E, Stable Diffusion) Learns to generate images by gradually removing noise. The engine behind modern AI image and video generation. F Recurrent Neural Network (RNN/LSTM) Processes sequences by maintaining memory of previous inputs. Used for time series, early language models, and speech recognition. G BERT / Encoder Model Reads text bidirectionally to understand meaning and context. Used for classifying text, not generating it. H Reinforcement Learning (RL) Learns by trial and error with rewards. Powers game-playing AI like AlphaGo and robotic control systems. I Random Forest / Decision Tree Classic machine learning. Builds decision rules from data features. Very common in research for classification and prediction. J Support Vector Machine (SVM) Finds the optimal boundary between categories in data. A traditional classification workhorse in science and engineering. K Variational Autoencoder (VAE) Compresses data into a compact representation and generates new samples from it. Used in drug discovery and anomaly detection. L Graph Neural Network (GNN) Learns from relationships between entities. Used for social networks, molecular structures, and knowledge graphs.

Wrong!

The most important model is your data model.

AI models come and go, but the way an institution organizes, governs, and shares its data determines whether any of them are useful.

Foundations

What is a data model?

Definition

The blueprint for how your institution stores information

A data model defines the tables, columns, relationships, and rules that describe your data. It is the difference between a filing cabinet with labeled folders and a pile of papers on the floor.

What you already have

Your systems already have data models

Your post-award system has one
Your ERP has another
Your HR system has a third
They all describe the same things differently

Shared standards

What is a Universal Data Model?

Definition

A shared data standard across institutions

A UDM is a data model that is shared across institutions. It defines a common set of tables and columns so that every institution can translate its own data into one shared language.

Sound familiar?

Every institution speaks a different data language

Your post-award system calls it AGENCY_NAME. Your ERP calls it VENDOR_DESC. Your spreadsheet calls it "Sponsor." Every system stores the same concept differently, and every integration has to reconcile them from scratch.

A UDM gives every institution one shared language so tools, reports, and AI workflows work the same everywhere.

Shared standards

The AI4RA Universal Data Model

Domain	Tables
Reference	Organization, AllowedValues, BudgetCategory
Core	Personnel, ContactDetails, Project
Pre-Award	RFA, RFARequirement, Proposal, ProposalBudget
Post-Award	Award, Modification, Terms, Subaward, CostShare, and more
Financial	Fund, Account, FinanceCode, Transaction, IndirectRate, Invoice
Personnel & Effort	ProjectRole, Effort
Compliance	ComplianceRequirement, ConflictOfInterest
Operations	ApplicationSystem, ServiceRequest
Faculty Dev	ProjectCohort, CohortParticipation
System	Document, ActivityLog

40 tables, 100 relationships. Build an app on the UDM once and any institution that maps their data to it can use that app immediately.

Shared standards

A predictable, self-describing schema

Naming conventions

Tables: PascalCase e.g. ProjectRole, AwardBudgetPeriod
Columns: Snake_Case e.g. Award_Number, Start_Date
Primary keys: TableName_ID e.g. Award_ID, Personnel_ID
Foreign keys: Named by role e.g. Sponsor_Organization_ID, not just Organization_ID
Booleans: Is_ prefix e.g. Is_Active, Is_Primary

Why this matters

AI and humans can read the schema

Consistent naming means an AI model can infer meaning from column names alone. A developer at a new institution can read the schema without documentation. Every adapter follows the same patterns.

Browse the full model at ui-insight.github.io/AI4RA-UDM

Shared standards

The standards proliferation problem

xkcd.com/927 — CC BY-NC 2.5

AI + Data

AI makes the UDM practical

The hardest part of adopting a shared schema is the translation work. AI changes that equation. It can read a source schema and the UDM, then propose the mapping automatically.

Map each source once instead of building integrations from scratch
AI suggests the mapping; humans review and approve
Build tools once and deploy them at every institution
Every adapter works without AI too. It is a bonus, not a requirement.

Next question

Now how do we handle all that data?

We have a shared data model. Now we need a place to put it.

Infrastructure

Four approaches to institutional data

Structured, governed, fast queries
Pre-defined schemas enforce data quality
Great for reporting dashboards and compliance queries
Rigid schemas are expensive and slow to change
Cannot handle unstructured data (PDFs, images, JSON)

Best for: stable reporting needs with predictable schemas.

Flexible raw storage, any format, cheap
Store everything first, decide structure later
Great for archiving raw files and ML training data
No schema enforcement or query optimization
Without governance, becomes a "data swamp"

Supports structured, semi-structured, and unstructured data natively.

Flexible storage + structured governance layers on top
Open table formats (Iceberg) enable ACID transactions and schema evolution
Medallion architecture transforms raw data into governed views
Separation of storage and compute keeps costs down
Works for structured, semi-structured, and unstructured data

The architecture the GRANTED project chose for university research data.

What happens when a data lake loses governance
Data gets dumped in with no schema, no ownership, no documentation
Nobody knows what's in there, how old it is, or whether to trust it
Queries return results, but nobody can verify if they're correct
Every new project starts by re-discovering and re-cleaning the same data

This is the default outcome without deliberate governance. Most institutions are here.

Why a lakehouse

Capabilities of a data lakehouse

Trust & Governance

Who changed what, when, and why?

Data Quality

Schemas enforced, anomalies detected, freshness monitored.

Reliability

Failover, backup, point-in-time recovery.

Scalability

Storage and compute scale independently.

Openness

Open formats, interchangeable engines, no lock-in.

Multi-Modal

Structured, semi-structured, and unstructured in one platform.

Simplicity

A small team can operate it without dedicated platform engineers.

Cost Efficiency

No redundant copies, compute scales to actual workload.

Architecture

The GRANTED data lakehouse

Data Flow Bronze Silver Platinum Live Writeback

Infrastructure

Open technology, interchangeable parts

Technology choices

Every layer is replaceable

Ingestion: Airbyte, Fivetran, custom scripts, or our own sync engine
Storage: MinIO, AWS S3, PostgreSQL, or Snowflake
Query: Trino, DuckDB, Spark, or Dremio
Applications: Tableau, Power BI, LLM agents, or custom apps

Open table formats (Apache Iceberg) mean no vendor lock-in. Swap any component without rewriting the stack.

Implementation patterns

Two paths to a governed data layer

Medallion (recommended): Full Bronze → Silver → Gold → Platinum pipeline. Best for multi-source, multi-institution use cases.
Direct database: Connect existing databases and map tables directly to UDM views. Faster to start, but harder to govern at scale.

Architecture

The medallion layers

Bronze

Raw data, exactly as received. No transformation.

→

Silver

Per-source mapping to UDM schema. One view per source per table.

→

Gold

Unified UDM tables. UNION ALL across Silver, deduped by primary key.

→

Platinum

App-specific views. Joins, aggregations, access control.

Store everything. Retain raw files, soft deletes, and metadata. The audit trail starts here.

Translate once per source. AI or human maps each column to the UDM. No business logic, only standardization.

One truth. If two sources provide the same entity, priority rules pick the winner. Auto-generated by the platform.

Purpose-built. Each app gets exactly the view it needs. Dashboards, AI agents, and APIs all read from here.

Worked example

Following data through the layers

Tracing VERAS AGENCY_TYPE from source to application.

Bronze

AGENCY_TYPE = 'Federal'
Raw, exactly as received from the VERAS API. No transformation.

Silver

Organization_Type = 'Sponsor'
CASE WHEN maps 'Federal' to UDM enum 'Sponsor'. Column renamed to UDM standard.

Gold

Same value, now part of Gold.UDM.Organization. If Banner also provides this org, dedup picks the higher-priority source.

Platinum

Joined with Award data: "All active awards from Federal sponsors." Flat, access-controlled, ready for a dashboard or AI workflow.

The payoff

What well-organized data lets you do

With governed data

Ask the same question at any institution and get a comparable answer
Build AI workflows that cite their sources and can be audited
Generate reproducible reports that match across departments
Automate routine tasks with confidence in the underlying data

Without it

AI blends contradictory sources into confident-sounding wrong answers
Every new report requires a new integration from scratch
Nobody knows which spreadsheet is current or authoritative
Automation multiplies errors instead of eliminating them

End of module

Back to Sessions

Return to session list

The data lakehouse and data organization

Data Lakehouse Module Preview

AI-Assisted EDA

Data Models & the UDM

The Data Lakehouse

Break

Let's explore a dataset together using AI-assisted analysis.

What did you notice?

AI analyzes what you give it

Cloud vs. locally hosted models

Gemini 2.5 Flash

Which is the most important one?

Wrong!

What is a data model?

The blueprint for how your institution stores information

Your systems already have data models

What is a Universal Data Model?

A shared data standard across institutions

Every institution speaks a different data language

The AI4RA Universal Data Model

A predictable, self-describing schema

AI and humans can read the schema

The standards proliferation problem

AI makes the UDM practical

Now how do we handle all that data?

Four approaches to institutional data

Capabilities of a data lakehouse

Trust & Governance

Data Quality

Reliability

Scalability

Openness

Multi-Modal

Simplicity

Cost Efficiency

The GRANTED data lakehouse

Open technology, interchangeable parts

Every layer is replaceable

Two paths to a governed data layer

The medallion layers

Following data through the layers

What well-organized data lets you do

Back to Sessions

Sample datasets for Data Crawler Carl