(Crossposted to LinkedIn)
In a recent essay, I argued that “Vibe Analysis” couldn’t be a real thing—that analysis is too different from traditional programming for AI to fully replace human analysts. But as I noted in that essay, this does not mean AI has no place in analytic work. Far from it. AI has enormous potential to accelerate analytic programming; it simply must be applied cautiously, with careful attention to quality, rigor, and security.
Security—and how to enable analysts to work with AI in a secure way—is the primary focus here. It is a central aspect of my work at Urban Labs. While much of our analytic code is intended for public release, the same cannot be said of our data. In fact, it is rarely our data to begin with: the vast majority consists of administrative data entrusted to us by partners. As a result, we are obligated—ethically, legally, and reputationally—to safeguard that data and use it only in contractually specified ways. Each dataset is governed by agreements that precisely define who may access it and how it can be used.
Unsurprisingly, none of these agreements permit sending partner data to Anthropic or OpenAI. Nearly all require that data be stored and processed on servers owned and maintained by the University of Chicago. These environments adhere to strict security standards that ensure partners can trust their data is protected and used only for agreed purposes. Installing new applications in this environment is already difficult; installing an AI coding agent is simply not feasible.
Each analyst does have a local machine used to connect to the secure server, and in theory these machines could connect to AI tools. However, for analytic purposes, they function largely as “thin clients,” storing no analytic code or data. Even so, they may still contain sensitive business information. Allowing AI agents to run freely on them raises legitimate security concerns.
This creates a fundamental problem: we have no safe place to run LLM-based tools, and we cannot expose our data to them even if we did. Compounding this is the reality of how analysts work. Our team—and every analyst I’ve worked with—develops code interactively and iteratively: write a small piece, run it, inspect results, revise, repeat. Writing an entire analytic pipeline in one pass without repeatedly validating against real data is not how analysis works in practice.
Taken together, these constraints make a straightforward AI coding workflow impossible:
- Constraint 1: Our data is private and can never leave the secure server environment, including being sent to external LLMs.
- Constraint 2: Our servers must meet strict security standards and cannot host LLM tools like Claude Code or Codex.
- Constraint 3: Local machines may contain sensitive information and cannot safely run unrestricted AI agents.
- Constraint 4: Analysts work iteratively, and any solution must support that workflow.
So how do we proceed? How can analysts benefit from AI without compromising security or disrupting how they actually work?
I don’t have a final answer, but I believe I do have a workable approach built on four components: LLM coding agents, containers, Git, and synthetic datasets. Together, these can form a secure and practical AI-assisted analytic environment.
At a high level, the approach works as follows: an LLM coding agent runs inside an isolated container. The LLM has access to project code via Git, but not to real data, which never leaves the secure server. Instead, the LLM interacts with synthetic datasets that mimic the structure of real data. This setup allows analysts to collaborate with AI during development, then push finalized code to the secure server for execution on real data.
Breaking this down further:
The process begins with containers. Containers allow you to create isolated computing environments—effectively “computers within computers.”[1] The container is completely separated from the host system. You define its operating system and installed software, and that’s all it can access. Running the LLM inside a container strictly limits its scope: it cannot read arbitrary files, execute unknown programs, or affect the host system—because, from its perspective, those resources do not exist.
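To make this concrete, here is a minimal sketch of how such a sandbox might be defined with Docker. Everything in it is an assumption for illustration—the base image, the packages, the user name—the point is simply that the container holds only what you put in it.

```dockerfile
# Illustrative sandbox image for an AI coding agent.
# Only the tools installed here exist inside the container.
FROM python:3.12-slim

# git gives the agent access to project code;
# no data directories are ever copied in or mounted.
RUN apt-get update \
    && apt-get install -y --no-install-recommends git \
    && rm -rf /var/lib/apt/lists/*

# A typical analytic toolchain, so the agent can run the code it writes.
RUN pip install --no-cache-dir pandas

# Run as an unprivileged user for defense in depth.
RUN useradd --create-home analyst
USER analyst
WORKDIR /home/analyst/workspace
```

Building and entering the sandbox would then look like `docker build -t analysis-sandbox .` followed by `docker run --rm -it analysis-sandbox bash`; nothing on the host filesystem is visible inside unless it is explicitly mounted.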
This isolation creates a safe sandbox, but it raises a new question: how does the LLM do anything useful without access to real resources?
This is where Git comes in. Git is one of the tools installed inside the container. By cloning a repository into the container, the LLM gains access to the project’s codebase—allowing it to read, modify, and suggest improvements. Crucially, this includes only code, not data (and if you are storing data in Git, that is a separate problem). From the LLM’s perspective, the data still does not exist.
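One guardrail worth making explicit: the repository itself must never contain data. A few illustrative `.gitignore` patterns enforce this at the source—the directory names and file extensions are assumptions about a typical analytic project, not a prescription.

```gitignore
# Keep all data out of version control -- only code travels through Git.
data/
*.csv
*.parquet
*.dta
*.rds

# Outputs and logs can also leak row-level information; exclude them too.
output/
*.log
```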
However, code alone is often insufficient context. Many analytic tasks depend on understanding the structure and content of the data itself. Asking an AI to summarize variables in a dataset is impossible if it cannot see those variables. But, as described above, we can’t expose our data to AI.
The solution, I believe, is synthetic data. Synthetic datasets replicate the structure and statistical properties of real data without containing any actual sensitive information.[2] By providing synthetic data within the container, the LLM gains enough context to be useful while still preserving data security. Analysts can continue their normal iterative workflow—writing, testing, and refining code—using synthetic data alongside the LLM.
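As a sketch of what “structure without content” can mean in the simplest case, the snippet below fabricates records for an invented schema using only the Python standard library. The column names and distributions here are assumptions for illustration; in practice the synthetic schema would be derived from the real data’s schema on the secure server.

```python
# Hypothetical sketch: generate fake records that mimic the *structure*
# of an administrative table (column names, types, value ranges)
# without containing any real information. The schema is invented.
import csv
import random

random.seed(42)  # reproducible fake data

# Assumed schema: the kind of fields an administrative dataset might have.
SCHEMA = {
    "participant_id": lambda i: f"P{i:05d}",
    "enrollment_year": lambda i: random.randint(2015, 2024),
    "program_group": lambda i: random.choice(["treatment", "control"]),
    "outcome_score": lambda i: round(random.gauss(50, 10), 1),
}

def make_synthetic_rows(n):
    """Return n fake records matching the assumed schema."""
    return [{col: gen(i) for col, gen in SCHEMA.items()} for i in range(n)]

# Write a synthetic file the agent can see inside the container.
rows = make_synthetic_rows(100)
with open("synthetic_participants.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(SCHEMA))
    writer.writeheader()
    writer.writerows(rows)
```

The agent can now load `synthetic_participants.csv`, summarize its variables, and test code against it, while the real table never leaves the secure server.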
When development is complete, the workflow is simple: push the code via Git to the secure server, execute it against real data in the secure environment of that server, and generate results.
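The hand-off itself is plain Git. The sketch below is deliberately self-contained so it can run anywhere: a local bare repository stands in for the repository hosted on the secure server, and every path, name, and script is illustrative.

```shell
set -e
ROOT=/tmp/handoff-demo
rm -rf "$ROOT" && mkdir -p "$ROOT"

# A bare repository standing in for the repo on the secure server.
git init --quiet --bare "$ROOT/secure.git"

# Development side: the code-only repository inside the container.
git init --quiet "$ROOT/dev"
cd "$ROOT/dev"
git remote add secure "$ROOT/secure.git"
echo 'print("pipeline ran")' > run_pipeline.py
git add run_pipeline.py
git -c user.name="Analyst" -c user.email="analyst@example.org" \
    commit --quiet -m "Finalize pipeline"
git push --quiet secure HEAD

# Secure-server side: pull the finalized code next to the real data...
git clone --quiet "$ROOT/secure.git" "$ROOT/server-checkout"
# ...and execute it there, e.g.:
#   python run_pipeline.py --data /secure/path/to/real_data.csv
ls "$ROOT/server-checkout"
```

Only code moves through the remote; the real data stays on the server, and the container never learns it exists.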
This approach does not eliminate all challenges, but it creates a viable path forward. LLMs will not replace analysts, but they will augment them. The responsibility is on us to ensure that augmentation happens in a way that is secure, controlled, and aligned with how analysis is actually done.
[1] In practice, containers are more complex and offer many additional capabilities, but this captures the essential concept.
[2] Generating high-quality synthetic data can be complex. In this case, however, the requirements are modest—it only needs to approximate the structure and behavior of the real data for development purposes.
