This is a workshop demo for exploring CSV data with AI. The app runs entirely in your browser — no data is stored on any server and the full dataset never leaves your device.
However, your questions, column names, a 5-row sample of your data, and actual SQL query results are sent to Google Gemini for processing. Anything you ask Carl about passes through Google's infrastructure.
Do not upload sensitive, confidential, proprietary, or personally identifiable data unless your institution has an enterprise-grade data processing agreement with Google and you have independently reviewed how that data is handled in transit. When in doubt, use synthetic, anonymized, or public example data.
Your key is stored only in your browser's session memory — it is never saved to disk or sent anywhere except directly to Google's API. It is cleared when you close the tab.
Workshop facilitators: Pre-fill keys for attendees by sharing a link with ?key=YOUR_KEY appended.
Data Crawler Carl is a static website hosted on GitHub Pages. There is no backend server, no database server, no application server — the entire application is a collection of HTML, CSS, and JavaScript files served as static assets. Everything runs in your browser. The source code is open source on GitHub.
Why this architecture? By running everything client-side, we get a fully dynamic AI-powered application hosted for free with zero infrastructure. There are no servers to maintain, no scaling concerns, and critically — no concurrency issues. During a workshop with 50+ participants all using the tool simultaneously, each person's browser runs its own independent instance. There is no shared server that could become a bottleneck or crash under load.
Carl uses Google Gemini 2.5 Flash, a large language model (LLM). LLMs work by predicting the most likely next token (word/symbol) given context — they are sophisticated pattern-matching systems trained on vast amounts of text. The browser communicates directly with the Google Generative AI API using the @google/genai JavaScript SDK (loaded from an ESM CDN). API calls go straight from your browser to Google — no proxy, no middleware.
You provide your own API key, which is stored in sessionStorage (browser session memory). The key is never written to disk, never sent to any server other than Google's API, and is automatically cleared when you close the tab.
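The key handling described above could be sketched roughly as follows. This is an illustration only, not Carl's actual source: the function name `resolveApiKey` and the storage key `"carl_api_key"` are hypothetical, and the storage object is passed in so the logic can be read (and exercised) outside a browser. Only the `?key=` parameter and the use of `sessionStorage` come from the text.

```javascript
// Hypothetical storage key name; the real app may use a different one.
const STORAGE_KEY = "carl_api_key";

/**
 * Return the API key for this session.
 * A ?key=... URL parameter (the facilitator pre-fill link) wins and is cached;
 * otherwise fall back to whatever is already in session storage.
 * `storage` is any object with getItem/setItem, e.g. window.sessionStorage.
 */
function resolveApiKey(search, storage) {
  const fromUrl = new URLSearchParams(search).get("key");
  if (fromUrl && fromUrl.trim() !== "") {
    const key = fromUrl.trim();
    storage.setItem(STORAGE_KEY, key);
    return key;
  }
  return storage.getItem(STORAGE_KEY); // null when no key has been provided yet
}

// In the browser this would be called as:
//   const apiKey = resolveApiKey(window.location.search, window.sessionStorage);
```

Because `sessionStorage` is scoped to a single tab, a key stored this way is gone once the tab closes, which matches the behavior described above.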
When you upload a CSV file, three JavaScript libraries work together to make it queryable:
Step 1 — PapaParse reads the raw CSV text and converts it into structured JavaScript arrays. It auto-detects delimiters, handles quoted fields, infers column types (string, number, boolean), and streams large files efficiently. This all runs client-side — the file is read with the browser's FileReader API and never uploaded anywhere. The output is an array of column names and an array of row arrays.
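As a rough sketch of what happens between parsing and table creation, the function below maps parsed cell values to SQLite column types. The name `inferColumnTypes` and the exact rules (any string makes the column TEXT, booleans are stored as integers) are assumptions made for illustration, not taken from Carl's source. The PapaParse call in the trailing comment uses the library's `dynamicTyping`, `skipEmptyLines`, and `complete` options.

```javascript
/**
 * Infer a SQLite column type for each column from already-parsed rows.
 * Assumes `rows` is an array of row arrays whose cells are strings, numbers,
 * booleans, or null (what PapaParse yields with dynamicTyping enabled).
 */
function inferColumnTypes(columns, rows) {
  return columns.map((_, i) => {
    let sawInteger = false;
    let sawReal = false;
    for (const row of rows) {
      const cell = row[i];
      // Blank cells carry no type information.
      if (cell === null || cell === undefined || cell === "") continue;
      if (typeof cell === "boolean") {
        sawInteger = true; // booleans would be stored as 0/1
        continue;
      }
      if (typeof cell === "number") {
        if (Number.isInteger(cell)) sawInteger = true;
        else sawReal = true;
        continue;
      }
      return "TEXT"; // any string forces the whole column to TEXT
    }
    if (sawReal) return "REAL";
    if (sawInteger) return "INTEGER";
    return "TEXT"; // entirely empty column: default to TEXT
  });
}

// Browser usage with PapaParse (header row split off by hand):
//   Papa.parse(file, {
//     dynamicTyping: true,
//     skipEmptyLines: true,
//     complete: ({ data }) => {
//       const columns = data[0].map(String);
//       const rows = data.slice(1);
//       const types = inferColumnTypes(columns, rows);
//     },
//   });
```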
Step 2 — sql.js is the full SQLite database engine compiled from its original C source code to WebAssembly (Wasm). WebAssembly is a binary instruction format that lets compiled native code run at near-native speed inside the browser's sandbox. When sql.js loads, it initializes a complete SQLite instance in memory. We call db.run() to create a table with the detected column names and types, then insert each parsed row using parameterized SQL statements (INSERT INTO data VALUES (?, ?, ...)).
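The table creation and row insertion described above might be built along these lines. The helper names (`quoteIdent`, `buildCreateTable`, `buildInsert`) are hypothetical; the table name `data` and the `INSERT INTO data VALUES (?, ?, ...)` shape come from the text. Quoting column names is an assumption about how headers containing spaces or punctuation would be handled.

```javascript
// Quote an identifier for SQLite: wrap in double quotes, double any embedded quotes.
function quoteIdent(name) {
  return '"' + String(name).replace(/"/g, '""') + '"';
}

// Build the CREATE TABLE statement for the table named `data`.
// Columns without a known type fall back to TEXT.
function buildCreateTable(columns, types) {
  const defs = columns.map((col, i) => `${quoteIdent(col)} ${types[i] || "TEXT"}`);
  return `CREATE TABLE data (${defs.join(", ")})`;
}

// Build a parameterized INSERT with one "?" placeholder per column.
function buildInsert(columnCount) {
  const placeholders = Array.from({ length: columnCount }, () => "?").join(", ");
  return `INSERT INTO data VALUES (${placeholders})`;
}

// With a sql.js Database instance `db`, loading might look like:
//   db.run(buildCreateTable(columns, types));
//   const stmt = db.prepare(buildInsert(columns.length));
//   for (const row of rows) stmt.run(row);
//   stmt.free();
```

Binding cell values through placeholders, rather than concatenating them into the SQL string, means a cell containing quotes or SQL keywords is stored as data and never parsed as SQL.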
Step 3 — Querying: Once loaded, the in-memory SQLite database supports SQLite's SQL dialect, including SELECT, GROUP BY, JOIN, subqueries, and aggregate functions. Each query is executed by calling db.exec(sql), which returns an array of result objects, each holding a columns array of column names and a values array of rows. There are no network calls; the entire query engine runs locally in the browser's WebAssembly runtime.
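For display or further processing, a query result is often easier to work with as a list of row objects. The helper below is a small sketch of that conversion; the name `resultToObjects` is hypothetical, and it assumes the `{ columns, values }` shape that sql.js returns for each statement.

```javascript
/**
 * Convert one entry of a db.exec() result ({ columns, values }) into an
 * array of plain row objects. A missing entry is treated as an empty
 * result set and yields [].
 */
function resultToObjects(result) {
  if (!result) return [];
  return result.values.map((row) =>
    Object.fromEntries(result.columns.map((col, i) => [col, row[i]]))
  );
}

// Browser usage with a sql.js Database instance `db`:
//   const [first] = db.exec("SELECT Dept, AVG(Salary) AS avg FROM data GROUP BY Dept");
//   const rows = resultToObjects(first);
```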
LLMs are prone to hallucination — confidently generating plausible-sounding but factually incorrect information. This is especially dangerous in data analysis, where a model might invent statistics, fabricate trends, or report numbers that don't exist in the dataset. A naive approach (sending all the data to the AI and asking it to analyze) produces results that look authoritative but may be partly or entirely made up.
Carl's two-round architecture is designed to minimize (but not eliminate) this problem by separating what to compute from computing it:
Round 1 — AI writes SQL: Your question is sent to Gemini along with only the column names and a 5-row sample. Gemini's job is strictly to write SQL queries — it never sees the full dataset and is explicitly instructed not to guess at results. The SQL is executed locally against your in-browser SQLite database, producing real, verified results.
Round 2 — AI analyzes real data: The actual query results (real numbers from your database) are sent back to Gemini. Only now does it provide analysis, insights, and chart specifications — grounded in data that was computed locally, not predicted.
This separation means every number you see in a query result table was computed by SQLite on your actual data. The AI's role is reduced to two things it's good at: translating natural language into SQL and interpreting real results in plain English.
Limitations: This approach does not eliminate hallucination entirely. Gemini can still write incorrect SQL (wrong joins, bad filters, misunderstood column semantics), misinterpret results in its prose, or draw conclusions the data doesn't support. The two-round design makes these errors auditable — you can always see the SQL that ran and the actual results it produced — but critical findings should still be verified independently.
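To make the two rounds concrete, here is a sketch of what each request might contain. The function names and the prompt wording are invented for illustration and are not Carl's actual prompts; what the sketch takes from the text is which information goes into each round: column names plus a 5-row sample in Round 1, and the executed SQL plus its real results in Round 2.

```javascript
const SAMPLE_ROWS = 5; // the text says only a 5-row sample is shared

// Round 1: the model sees the question, the column names, and a small sample only.
function buildRound1Prompt(question, columns, rows) {
  const sample = rows.slice(0, SAMPLE_ROWS);
  return [
    "You write SQLite queries against a table named data.",
    "Return SQL only. Do not guess at or state any results.",
    `Columns: ${JSON.stringify(columns)}`,
    `Sample rows: ${JSON.stringify(sample)}`,
    `Question: ${question}`,
  ].join("\n");
}

// Round 2: the model sees the SQL that ran and the rows SQLite actually returned.
// `result` is assumed to have the { columns, values } shape of a db.exec() entry.
function buildRound2Prompt(question, sql, result) {
  return [
    "Analyze the query results below. Use only the numbers shown.",
    `Question: ${question}`,
    `SQL: ${sql}`,
    `Result columns: ${JSON.stringify(result.columns)}`,
    `Result rows: ${JSON.stringify(result.values)}`,
  ].join("\n");
}
```

Between the two calls, the SQL returned by Round 1 is executed locally, so the numbers placed into the Round 2 prompt come from SQLite rather than from the model.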
Plotly.js is an open-source JavaScript charting library. Normally you call Plotly.newPlot(element, traces, layout) with arrays of data. Carl uses a secure pipeline that separates the AI from the data and from the rendering:
1. Gemini returns a spec: A declarative JSON object with a chart type, a SQL query, and column mappings — e.g. {"type": "bar", "sql": "SELECT Dept, AVG(Salary) as avg FROM data GROUP BY Dept", "columns": {"x": "Dept", "y": "avg"}}. Gemini never provides the actual data values.
2. We execute the SQL: The chart's SQL query is run against the local SQLite database using db.exec(). The result columns are mapped to Plotly trace properties (x, y, labels, values, etc.) based on the column mappings.
3. We build the Plotly trace: A dedicated buildTrace() function constructs the Plotly trace object. Only allowlisted properties are copied — the function has a hard-coded set of safe fields per chart type. All string values pass through sanitizeString() (strips HTML tags) and all arrays pass through sanitizeArray() (rejects anything that isn't a primitive value).
4. We call Plotly: The sanitized trace and a layout object (constructed by us, not Gemini) are passed to Plotly.newPlot(). The chart types are restricted to an allowlist: bar, scatter, line, pie, histogram, box, and heatmap. There is no eval(), no Function() constructor, no dynamic script injection at any point in this pipeline.
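The four steps above can be sketched as follows. The function names `buildTrace`, `sanitizeString`, and `sanitizeArray` and the list of chart types come from the text, but the bodies here are a guess at one reasonable implementation, not Carl's actual code. In particular, the per-type field allowlist `ALLOWED_FIELDS` is hypothetical, as is the choice to throw on a non-primitive value.

```javascript
// Chart types named in the text.
const ALLOWED_TYPES = new Set(["bar", "scatter", "line", "pie", "histogram", "box", "heatmap"]);

// Hypothetical per-type allowlist of trace fields a spec may map columns onto.
const ALLOWED_FIELDS = {
  bar: ["x", "y"],
  scatter: ["x", "y"],
  line: ["x", "y"],
  pie: ["labels", "values"],
  histogram: ["x"],
  box: ["x", "y"],
  heatmap: ["x", "y", "z"],
};

// Strip anything that looks like an HTML tag.
function sanitizeString(value) {
  return String(value).replace(/<[^>]*>/g, "");
}

// Accept only primitive cells; throw on objects, arrays, or functions.
function sanitizeArray(values) {
  return values.map((v) => {
    if (v === null || typeof v === "number" || typeof v === "boolean") return v;
    if (typeof v === "string") return sanitizeString(v);
    throw new TypeError("Non-primitive value in chart data");
  });
}

// Build a Plotly trace from a model-provided spec and a locally computed
// db.exec() result entry ({ columns, values }). Only allowlisted fields are copied.
function buildTrace(spec, result) {
  if (!ALLOWED_TYPES.has(spec.type)) {
    throw new Error(`Chart type not allowed: ${spec.type}`);
  }
  // Plotly has no "line" trace type; a line chart is a scatter trace drawn with lines.
  const trace = spec.type === "line" ? { type: "scatter", mode: "lines" } : { type: spec.type };
  for (const field of ALLOWED_FIELDS[spec.type]) {
    const column = spec.columns ? spec.columns[field] : undefined;
    if (column === undefined) continue;
    const index = result.columns.indexOf(column);
    if (index === -1) throw new Error(`Unknown column: ${column}`);
    trace[field] = sanitizeArray(result.values.map((row) => row[index]));
  }
  return trace;
}

// Browser usage, with `layout` constructed by the app rather than the model:
//   const [result] = db.exec(spec.sql);
//   Plotly.newPlot(element, [buildTrace(spec, result)], layout);
```

Because the loop iterates over the allowlist rather than over the keys of the spec, any extra property the model adds to the spec is simply never read.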
The full dataset is never uploaded. All SQL execution, all chart data extraction, and all rendering happen locally.
An AI-powered CSV data explorer that runs entirely in your browser. Upload any dataset, ask questions in plain English, and get back SQL queries, charts, and analysis.
No eval() or arbitrary code runs.
Part of the REACH 2026 AI4RA Workshop. Facilitators can pre-fill keys by sharing links with ?key=YOUR_KEY.
View on GitHub — Licensed under the GNU GPL v3.0.