Data Crawler Carl

Upload CSV. New achievement! Ask questions. Get answers.

How Data Crawler Carl Works

Architecture

Data Crawler Carl is a static website hosted on GitHub Pages. There is no backend server, no database server, and no application server: the entire application is a collection of HTML, CSS, and JavaScript files served as static assets. Everything runs in your browser, and the source code is open on GitHub.

Why this architecture? By running everything client-side, we get a fully dynamic AI-powered application hosted for free with zero infrastructure. There are no servers to maintain, no scaling concerns, and critically — no concurrency issues. During a workshop with 50+ participants all using the tool simultaneously, each person's browser runs its own independent instance. There is no shared server that could become a bottleneck or crash under load.

The AI Model

Carl uses Google Gemini 2.5 Flash, a large language model (LLM). LLMs work by predicting the most likely next token (a word, part of a word, or a symbol) given the preceding context; they are sophisticated pattern-matching systems trained on vast amounts of text. The browser communicates directly with the Google Generative AI API using the @google/genai JavaScript SDK (loaded from an ESM CDN). API calls go straight from your browser to Google, with no proxy and no middleware.
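As a rough sketch of what a direct browser-to-Google call can look like with the @google/genai SDK: the helper names buildRequest and askGemini are hypothetical, and the client is passed in as a parameter so the sketch makes no assumption about how Carl loads the SDK from its CDN.

```javascript
// Model name taken from the description above.
const MODEL = "gemini-2.5-flash";

// Build the request object handed to the SDK's generateContent call.
function buildRequest(prompt) {
  return { model: MODEL, contents: prompt };
}

// `ai` is expected to be a GoogleGenAI client created in the page, e.g.
//   import { GoogleGenAI } from "@google/genai";
//   const ai = new GoogleGenAI({ apiKey });
// Passing it in keeps this helper independent of how the SDK is loaded.
async function askGemini(ai, prompt) {
  const response = await ai.models.generateContent(buildRequest(prompt));
  return response.text; // plain-text content of the model's reply
}
```

Because the request is made by the SDK inside the page, the only network hop is from the browser to Google's endpoint.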

You provide your own API key, which is stored in sessionStorage (per-tab browser session storage). The key is never placed in cookies or localStorage, never sent to any server other than Google's API, and is cleared when you close the tab.
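A minimal sketch of session-scoped key handling follows. The storage slot name gemini_api_key and the three helper names are assumptions for illustration, not Carl's actual identifiers. The helpers take the storage object as an argument; in the page you would pass window.sessionStorage.

```javascript
// Hypothetical slot name; the real app may use a different one.
const API_KEY_SLOT = "gemini_api_key";

// Save the key for the lifetime of the tab. `storage` is any object with the
// Web Storage interface (getItem/setItem/removeItem), e.g. window.sessionStorage.
function saveApiKey(storage, apiKey) {
  const trimmed = String(apiKey).trim();
  if (trimmed === "") {
    throw new Error("API key must not be empty");
  }
  storage.setItem(API_KEY_SLOT, trimmed);
}

// Return the stored key, or null if none has been saved in this tab.
function loadApiKey(storage) {
  return storage.getItem(API_KEY_SLOT);
}

// Remove the key explicitly, for example from a "forget key" button.
function clearApiKey(storage) {
  storage.removeItem(API_KEY_SLOT);
}
```

Usage in a browser might look like saveApiKey(window.sessionStorage, inputElement.value), where inputElement is whatever form field collects the key.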

CSV → SQLite Pipeline

When you upload a CSV file, three JavaScript libraries work together to make it queryable:

Step 1 — PapaParse reads the raw CSV text and converts it into structured JavaScript arrays. It auto-detects delimiters, handles quoted fields, infers column types (string, number, boolean), and streams large files efficiently. This all runs client-side — the file is read with the browser's FileReader API and never uploaded anywhere. The output is an array of column names and an array of row arrays.
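The parsing step could be sketched roughly as below. The function name parseCsv is hypothetical, and the Papa object is passed in rather than assumed to be a global; dynamicTyping and skipEmptyLines are Papa Parse configuration options, but whether Carl uses exactly these settings is an assumption.

```javascript
// Parse CSV text into { columns, rows }. `Papa` is the Papa Parse object
// (window.Papa when the library is loaded from a script tag).
function parseCsv(Papa, csvText) {
  const result = Papa.parse(csvText, {
    dynamicTyping: true,  // convert numeric and boolean strings to JS values
    skipEmptyLines: true, // ignore blank lines such as a trailing newline
  });
  if (result.data.length === 0) {
    throw new Error("CSV is empty");
  }
  // Without `header: true`, data is an array of row arrays; row 0 is the header.
  const [header, ...rows] = result.data;
  const columns = header.map((name) => String(name).trim());
  return { columns, rows, errors: result.errors };
}
```

Returning result.errors alongside the data lets the caller decide whether malformed rows should block the upload or only produce a warning.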

Step 2 — sql.js is the full SQLite database engine compiled from its original C source code to WebAssembly (Wasm). WebAssembly is a binary instruction format that lets compiled native code run at near-native speed inside the browser's sandbox. When sql.js loads, it initializes a complete SQLite instance in memory. We call db.run() to create a table with the detected column names and types, then insert each parsed row using parameterized SQL statements (INSERT INTO data VALUES (?, ?, ...)).
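A sketch of the load step, under stated assumptions: the table is named data as in the INSERT statement above, column types are guessed from the parsed values, and all helper names (quoteIdent, inferType, buildCreateTable, buildInsert, toSqlValue, loadRows) are hypothetical. The db argument stands for a sql.js Database; db.run(), db.prepare(), and the statement's run() and free() methods are part of the sql.js API.

```javascript
// Quote an identifier for SQLite: wrap in double quotes, double any embedded ones.
function quoteIdent(name) {
  return '"' + String(name).replace(/"/g, '""') + '"';
}

// REAL if every non-null value in the column is a number or boolean, else TEXT.
function inferType(rows, colIndex) {
  let sawValue = false;
  for (const row of rows) {
    const v = row[colIndex];
    if (v === null || v === undefined) continue;
    if (typeof v !== "number" && typeof v !== "boolean") return "TEXT";
    sawValue = true;
  }
  return sawValue ? "REAL" : "TEXT";
}

function buildCreateTable(columns, rows) {
  const defs = columns.map((c, i) => `${quoteIdent(c)} ${inferType(rows, i)}`);
  return `CREATE TABLE data (${defs.join(", ")})`;
}

// One "?" placeholder per column, so values are bound rather than concatenated.
function buildInsert(columns) {
  return `INSERT INTO data VALUES (${columns.map(() => "?").join(", ")})`;
}

// Missing cells become NULL; booleans become 0/1.
function toSqlValue(v) {
  if (v === undefined) return null;
  if (typeof v === "boolean") return v ? 1 : 0;
  return v;
}

// Create the table and insert every parsed row with a prepared statement.
function loadRows(db, columns, rows) {
  db.run(buildCreateTable(columns, rows));
  const stmt = db.prepare(buildInsert(columns));
  try {
    for (const row of rows) {
      stmt.run(columns.map((_, i) => toSqlValue(row[i])));
    }
  } finally {
    stmt.free(); // release the statement's Wasm-side memory
  }
}
```

Binding values through placeholders means cell contents are never interpreted as SQL, even if a CSV field contains quotes or semicolons.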

Step 3 — Querying: Once loaded, the in-memory SQLite database supports SQLite's SQL dialect, including SELECT, GROUP BY, JOIN, subqueries, and aggregate functions. Each query is executed by calling db.exec(sql), which returns column names and row data as JavaScript arrays. There are no network calls; the entire query engine runs locally in the browser's WebAssembly runtime.
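To illustrate the shape of what db.exec(sql) returns, here is a small hypothetical helper (runQuery is not a name from Carl's source) that turns the first result set into an array of row objects. In sql.js, exec returns an array with one { columns, values } entry per statement that produced rows.

```javascript
// Run a query and return { columns, rows }, where each row is an object
// keyed by column name. `db` is a sql.js Database.
function runQuery(db, sql) {
  const results = db.exec(sql);
  if (results.length === 0) {
    // No statement produced any rows.
    return { columns: [], rows: [] };
  }
  const { columns, values } = results[0];
  const rows = values.map((row) => {
    const obj = {};
    columns.forEach((name, i) => {
      obj[name] = row[i];
    });
    return obj;
  });
  return { columns, rows };
}
```

For example, runQuery(db, "SELECT Dept, COUNT(*) AS n FROM data GROUP BY Dept") would yield rows of the form { Dept: ..., n: ... }.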

The Two-Round Conversation — Minimizing Hallucinations

LLMs are prone to hallucination — confidently generating plausible-sounding but factually incorrect information. This is especially dangerous in data analysis, where a model might invent statistics, fabricate trends, or report numbers that don't exist in the dataset. A naive approach (sending all the data to the AI and asking it to analyze) produces results that look authoritative but may be partly or entirely made up.

Carl's two-round architecture is designed to minimize (but not eliminate) this problem by separating what to compute from computing it:

Round 1 — AI writes SQL: Your question is sent to Gemini along with only the column names and a 5-row sample. Gemini's job is strictly to write SQL queries: it never sees the full dataset and is explicitly instructed not to guess at results. The SQL is then executed locally against your in-browser SQLite database, so the results are computed from your actual data rather than predicted.

Round 2 — AI analyzes real data: The actual query results (real numbers from your database) are sent back to Gemini. Only now does it provide analysis, insights, and chart specifications — grounded in data that was computed locally, not predicted.

This separation means every number you see in a query result table was computed by SQLite on your actual data. The AI's role is reduced to two things it's good at: translating natural language into SQL and interpreting real results in plain English.
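The two rounds can be sketched as follows. The prompt wording, the function names, and the assumption that the model replies with bare SQL are all illustrative; Carl's real prompts and response parsing are not shown in this document. The ask and runSql callbacks are injected so the flow does not depend on a particular SDK or database object.

```javascript
const SAMPLE_SIZE = 5; // rows of sample data shared with the model

// Round 1 prompt: schema and a small sample only, never the full dataset.
function buildRound1Prompt(columns, rows, question) {
  return [
    "You write SQLite queries for a table named data.",
    "Do not guess at results; reply with SQL only.",
    "Columns: " + JSON.stringify(columns),
    "Sample rows: " + JSON.stringify(rows.slice(0, SAMPLE_SIZE)),
    "Question: " + question,
  ].join("\n");
}

// Round 2 prompt: the locally computed results, ready for interpretation.
function buildRound2Prompt(question, sql, result) {
  return [
    "Question: " + question,
    "SQL that was executed: " + sql,
    "Query results: " + JSON.stringify(result),
    "Analyze these results. Use only the numbers shown.",
  ].join("\n");
}

// ask(prompt) returns a Promise of the model's text; runSql(sql) runs locally.
async function answerQuestion({ ask, runSql, columns, rows, question }) {
  const sql = await ask(buildRound1Prompt(columns, rows, question));
  const result = runSql(sql); // computed by SQLite in the browser
  const analysis = await ask(buildRound2Prompt(question, sql, result));
  return { sql, result, analysis }; // sql and result are kept for auditing
}
```

Returning the SQL and the raw result next to the analysis is what makes the model's claims checkable against what actually ran.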

Limitations: This approach does not eliminate hallucination entirely. Gemini can still write incorrect SQL (wrong joins, bad filters, misunderstood column semantics), misinterpret results in its prose, or draw conclusions the data doesn't support. The two-round design makes these errors auditable — you can always see the SQL that ran and the actual results it produced — but critical findings should still be verified independently.

Safe Chart Rendering with Plotly.js

Plotly.js is an open-source JavaScript charting library. Normally you call Plotly.newPlot(element, traces, layout) with arrays of data. Carl uses a secure pipeline that separates the AI from the data and from the rendering:

1. Gemini returns a spec: A declarative JSON object with a chart type, a SQL query, and column mappings — e.g. {"type": "bar", "sql": "SELECT Dept, AVG(Salary) as avg FROM data GROUP BY Dept", "columns": {"x": "Dept", "y": "avg"}}. Gemini never provides the actual data values.

2. We execute the SQL: The chart's SQL query is run against the local SQLite database using db.exec(). The result columns are mapped to Plotly trace properties (x, y, labels, values, etc.) based on the column mappings.

3. We build the Plotly trace: A dedicated buildTrace() function constructs the Plotly trace object. Only allowlisted properties are copied — the function has a hard-coded set of safe fields per chart type. All string values pass through sanitizeString() (strips HTML tags) and all arrays pass through sanitizeArray() (rejects anything that isn't a primitive value).

4. We call Plotly: The sanitized trace and a layout object (constructed by us, not Gemini) are passed to Plotly.newPlot(). The chart types are restricted to an allowlist: bar, scatter, line, pie, histogram, box, and heatmap. There is no eval(), no Function() constructor, no dynamic script injection at any point in this pipeline.
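The four steps above might look roughly like this in code. The function names buildTrace, sanitizeString, and sanitizeArray come from the description, but their bodies here are a guess at one reasonable implementation, and the per-type field allowlist is hypothetical. The result argument has the { columns, values } shape that db.exec() returns.

```javascript
const ALLOWED_TYPES = new Set([
  "bar", "scatter", "line", "pie", "histogram", "box", "heatmap",
]);

// Hypothetical allowlist of trace fields a spec may map columns onto.
const ALLOWED_FIELDS = {
  bar: ["x", "y"], scatter: ["x", "y"], line: ["x", "y"],
  pie: ["labels", "values"], histogram: ["x"], box: ["x", "y"],
  heatmap: ["x", "y", "z"],
};

// Strip anything that looks like an HTML tag.
function sanitizeString(value) {
  return String(value).replace(/<[^>]*>/g, "");
}

// Accept only primitives; strings are sanitized, anything else is rejected.
function sanitizeArray(values) {
  return values.map((v) => {
    if (v === null || typeof v === "number" || typeof v === "boolean") return v;
    if (typeof v === "string") return sanitizeString(v);
    throw new TypeError("non-primitive value in chart data");
  });
}

// Build a Plotly trace from a chart spec and a local query result.
function buildTrace(spec, result) {
  if (!ALLOWED_TYPES.has(spec.type)) {
    throw new Error("chart type not allowed: " + sanitizeString(spec.type));
  }
  let trace;
  if (spec.type === "line") {
    trace = { type: "scatter", mode: "lines" }; // drawn as a scatter trace in line mode
  } else if (spec.type === "scatter") {
    trace = { type: "scatter", mode: "markers" };
  } else {
    trace = { type: spec.type };
  }
  const mapping = spec.columns || {};
  for (const field of ALLOWED_FIELDS[spec.type]) {
    const colName = mapping[field];
    if (colName === undefined) continue; // field not requested by the spec
    const idx = result.columns.indexOf(colName);
    if (idx === -1) throw new Error("unknown column in chart spec");
    trace[field] = sanitizeArray(result.values.map((row) => row[idx]));
  }
  return trace;
}
```

The resulting trace would then be passed to Plotly.newPlot(element, [trace], layout) together with a layout object built by the application. Because only fields named in the allowlist are ever copied, any extra properties in the model's spec are ignored.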

What Gets Sent to Google

Column names and a 5-row sample of your CSV (so Gemini understands the schema).
Your natural-language questions and the conversation history.
Actual query results in Round 2 (so Gemini can analyze them).
Your API key (sent directly to Google's API endpoint for authentication).

The full dataset is never uploaded. All SQL execution, all chart data extraction, and all rendering happen locally.

Security Summary

No server. Static GitHub Pages site — no backend, no database, no server-side code.
No persistent storage. API key in sessionStorage only — cleared on tab close. No cookies, no localStorage.
No AI-generated JavaScript. Charts rendered from validated JSON specs, with no eval() and no Function() constructor. Property allowlists per chart type. All strings sanitized (HTML tags stripped).
No tracking. No cookies, no analytics, no third-party scripts beyond the libraries described above (the Gemini SDK, PapaParse, sql.js, and Plotly.js).