The Brownfield Problem: Why Most AI Development Advice Ignores Your Actual Codebase

New research confirms what enterprise engineers already suspect — AI reliability is a governance problem, not a capability problem.

A couple of weeks ago I wrote about the Autonomous Maintainability Index — this idea that your test suite and documentation might be the most important predictors of how much AI can realistically help with your codebase. Then I followed it up with What If the Lights Go Out? — my honest attempt to sit with the dark factory concept without dismissing it or drinking the Kool-Aid.

Since then, a research paper landed on my desk that put language around something I've been feeling but couldn't quite articulate. And I think it connects the dots between those two posts in a way I wasn't expecting.

The Paper That Made Me Stop Scrolling

Researchers at Florida International University and the University of Florida published a paper called "A Dual-Helix Governance Approach Towards Reliable Agentic AI for WebGIS Development." I know — the title alone could put you to sleep. But stay with me, because what they found matters.

They were trying to use AI agents to refactor a 2,265-line monolithic JavaScript application. A legacy WebGIS tool built for coastal climate research — sea level rise modeling, interactive maps, data visualization. The kind of thing that started as a quick research prototype and then became the production system because that's what always happens.

Here's the part that caught my attention: the original codebase was built by one developer and a principal investigator who had no formal training in software development, both working part-time. Sound familiar? Because if I'm being honest, that description applies to a startling amount of enterprise software I've encountered over two decades. Maybe not the "no formal training" part — though sometimes — but definitely the "built under constraints by a small team that prioritized shipping over architecture" part.

That's brownfield. That's the world most of us actually live in.

The Advice Gap

If you've been following the AI-assisted development conversation — and at this point, it's hard not to — you've probably noticed something. Almost all of the exciting demos, the viral tweets, the conference talks? They're greenfield. Start from scratch. Clean room. Here's a spec, here's a test suite, go build it.

Anthropic's C compiler from scratch. Cloudflare's Next.js rewrite. These are remarkable achievements, and I wrote about them because they are. But they share a common luxury: they got to start over. The AI didn't have to understand why that weird edge case handler exists on line 847. It didn't have to navigate the global variables named tr_data that three different functions mutate. It didn't have to figure out that the reason the map breaks when you change the sea level slider is because someone hardcoded a tidal datum offset directly into a DOM event listener six years ago.

The FIU researchers didn't have that luxury. They had a monolithic codebase with global state vulnerability, hardcoded domain logic, tightly coupled libraries, and minimal documentation. In other words, they had a Tuesday.

Five Problems That Sound Like My Weekly Stand-Up

The researchers cataloged five fundamental LLM limitations they hit when trying to use AI agents on this codebase. Reading them felt like someone had been sitting in my engineering meetings:

Context overflow. The 2,265-line file exceeded the effective attention range of the model. It could technically fit in the context window, but comprehension degraded as the file got larger. This is the brownfield tax — your legacy code doesn't come in neat, modular pieces that an AI can reason about independently.

Cross-session forgetting. The refactoring spanned multiple sessions over days. The AI couldn't remember what it decided yesterday. Every new session was a fresh start, requiring re-explanation of architectural decisions that a human engineer would just know. I've watched this happen in real-time with our own teams using AI tools — the agent produces beautiful work in session one, then contradicts itself in session two.

Output stochasticity. Same task, different results every time. The agent would handle coordinate reference systems one way in one module and a completely different way in another. For a GIS application, that's the difference between your map working and your data rendering at "Null Island" — latitude zero, longitude zero, somewhere in the Gulf of Guinea. For those of us building travel platforms, the equivalent is an agent that handles date formatting differently across booking flows. Users notice.

Instruction-following failure. The AI treated explicit constraints as suggestions. Domain-specific standards — precise scientific values, exact DOM element IDs, accessibility requirements — got "normalized" or quietly ignored as the context grew. The model rounded exact sea level rise thresholds. It renamed element IDs to be "cleaner." It dropped accessibility attributes during refactoring. All things that would break the application in ways that aren't immediately obvious.

Adaptation rigidity. The only way to improve the agent's behavior was fine-tuning, which takes weeks and produces opaque changes you can't easily audit or roll back. For a team that needs to ship this quarter, that's not a real option.

If you're leading engineering teams, you've seen every one of these. You might not have had this vocabulary for them, but you've felt them. And here's the thing — these aren't problems that GPT-6 or Claude 5 or whatever comes next will magically solve. The researchers used gpt-5.2. The model wasn't the bottleneck. The structure around the model was.

The Fix Isn't a Smarter Model

This is where the paper gets interesting. Instead of waiting for a better model, they built a governance framework around the existing one. They call it the "dual-helix" — two intertwined axes that stabilize what the AI produces.

The first axis is knowledge externalization. Take everything the AI needs to know — the architectural patterns, the domain rules, the project-specific context that currently lives in your engineers' heads — and put it into a persistent, version-controlled knowledge graph. Not a prompt. Not a README that the model might ignore. A structured substrate that the agent must reference.
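The paper doesn't publish its knowledge graph schema, but the idea can be sketched as ordinary version-controlled data. Everything below (the `project_knowledge` structure, the node names, and the `rules_for` helper) is my own hypothetical illustration of the shape, not the researchers' actual format:

```python
# Hypothetical sketch: project knowledge externalized as structured,
# version-controlled data instead of living in engineers' heads.
# Node and edge names here are illustrative, not the paper's schema.

project_knowledge = {
    "nodes": {
        "module:map_view":  {"kind": "module", "pattern": "ES6 module"},
        "rule:tidal_datum": {"kind": "domain_rule",
                             "text": "Tidal datum offsets come from config, never the DOM."},
        "rule:crs":         {"kind": "domain_rule",
                             "text": "All layers use EPSG:4326 unless a node says otherwise."},
    },
    "edges": [
        ("module:map_view", "must_follow", "rule:tidal_datum"),
        ("module:map_view", "must_follow", "rule:crs"),
    ],
}

def rules_for(module_id):
    """Return the domain rules a given module is bound to."""
    bound = [dst for src, rel, dst in project_knowledge["edges"]
             if src == module_id and rel == "must_follow"]
    return [project_knowledge["nodes"][r]["text"] for r in bound]

# Before touching a module, the agent queries the graph instead of
# relying on whatever survived in its context window.
print(rules_for("module:map_view"))
```

The point isn't the data structure; it's that this file lives in git, survives between sessions, and can be diffed and reviewed like any other artifact.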

The second axis is behavioral enforcement. Take your non-negotiable constraints — your coding standards, your accessibility requirements, your naming conventions, your deployment rules — and make them executable protocols, not suggestions. Before the agent can act, it has to validate its plan against these constraints. If it fails validation, it doesn't execute.
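A minimal version of that gate might look like the sketch below. The constraint names and the sea level value are invented for illustration; the mechanism (each constraint is an executable check, and a failing plan never runs) is the part that mirrors the paper:

```python
# Hypothetical sketch of a pre-execution gate: each constraint is an
# executable check over the agent's proposed plan, and a plan that
# fails any check is rejected before it can execute.

def keeps_exact_thresholds(plan):
    # Domain values must be copied verbatim, never rounded. The
    # threshold "1.37" is a made-up example value.
    return plan.get("sea_level_threshold_m", "") == "1.37"

def preserves_dom_ids(plan):
    # Refactors may not rename element IDs the UI depends on.
    return plan.get("renamed_dom_ids", []) == []

CONSTRAINTS = [keeps_exact_thresholds, preserves_dom_ids]

def validate(plan):
    """Return the violated constraints (empty list means the plan may run)."""
    return [check.__name__ for check in CONSTRAINTS if not check(plan)]

good = {"sea_level_threshold_m": "1.37", "renamed_dom_ids": []}
bad  = {"sea_level_threshold_m": "1.4",  "renamed_dom_ids": ["slider"]}

print(validate(good))  # passes every check, so it may execute
print(validate(bad))   # fails both checks, so it is blocked
```

Notice these are assertions over the plan, not polite instructions in a prompt. That's the difference between advisory and structural.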

The third piece — which they call "skills" — is where knowledge and behavior intersect to form validated, repeatable workflows. Not "generate code" but "generate a module that follows this specific pattern, respects these specific constraints, and produces this specific output format."
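The shape of a skill can be sketched as a fixed pipeline: look up knowledge, generate, validate, and only release output that passes. The function names and toy stand-ins below are my own illustration of that loop, not the paper's implementation:

```python
# Hypothetical sketch: a "skill" is a repeatable workflow that wires
# knowledge lookup and constraint validation around the generation step.

def run_skill(task, fetch_rules, generate, validate):
    """Generate output for `task`, releasing it only if validation passes."""
    rules = fetch_rules(task)            # knowledge axis: what must hold
    draft = generate(task, rules)        # the model's proposed output
    violations = validate(draft, rules)  # behavior axis: does it hold?
    if violations:
        return {"status": "rejected", "violations": violations}
    return {"status": "accepted", "output": draft}

# Toy stand-ins for the real components:
fetch = lambda task: ["module must have exactly one default export"]
gen   = lambda task, rules: f"// {task}\nexport default makeMapView;"
check = lambda draft, rules: [] if "export default" in draft else rules

result = run_skill("extract map view module", fetch, gen, check)
print(result["status"])
```

Because the loop is the same every run, the variance lives only in the generation step, and the validator catches the variance that matters. That's where the consistency gains come from.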

Here's what happened: the governed agent refactored the monolith into six cohesive ES6 modules, cut cyclomatic complexity by 51%, and improved the maintainability index by 7 points. And when they ran a controlled experiment comparing the governed agent against an unguided agent and a traditional prompt-engineered agent — all using the same model — the governed agent didn't just perform better. It performed more consistently. The standard deviation across trials dropped by more than half.

That last part is the finding that matters most for anyone running real engineering teams. A system that reliably produces good results beats a system that occasionally produces brilliant results but also occasionally breaks your application. Consistency is what lets you trust a process. And trust is what lets you scale it.

What This Means for the Brownfield World

This connects back to the Autonomous Maintainability Index I wrote about earlier. There, I asked: for any given service in your portfolio, how much of its maintenance could realistically be handled by an AI agent with minimal human oversight? The factors I identified — test coverage, documentation quality, observability, deployment safety, service isolation — those are all knowledge externalization concerns. They're about making implicit knowledge explicit so an agent can use it.

What I was missing — and what this research fills in — is the behavioral enforcement layer. It's not enough to give the AI good information. You have to constrain what it does with that information. Your coding standards, your architectural patterns, your compliance requirements — these can't be advisory. They have to be structural.

Think about it in manufacturing terms. A car factory doesn't work because the robots are smart. It works because the assembly line, the quality gates, the tolerances, and the jigs constrain what the robots can do at each step. The intelligence of the robot matters, sure. But the process design is what makes the factory reliable.

Software engineering has been running on human judgment as its primary quality gate for decades. Code review. Architecture review. Senior engineers who just know that you don't do it that way. That works when humans are doing all the work. But if AI is going to take on a larger share of the development workload — and I think it will — then we need to encode that judgment into something the AI can actually use.

The Question Nobody's Asking

Here's what I keep coming back to, and what the paper quietly sidesteps. In their case study, a human researcher served as the "Agent Builder" — the person who constructed and maintained the knowledge graph, defined the behavioral constraints, and validated the skill workflows. The AI was the "Domain Expert" — it did the work. But the human designed the system that the AI worked within.

That's a new job. Or more accurately, it's a new dimension of an existing job. Who builds and maintains the governance substrate? Who encodes your architectural standards into executable constraints? Who reviews the knowledge graph when the AI discovers new patterns and wants to persist them?

At a small scale, it's the tech lead. At enterprise scale? I'm not sure we've figured that out yet. But I'm pretty confident it's the highest-leverage investment an engineering organization can make right now. Because the alternative is to throw more powerful models at ungoverned brownfield codebases and hope for the best. And hope, as they say, is not a strategy.

Where This Is Going

I started this series asking whether an AI could rebuild your software. The answer, it turns out, depends less on the AI than I initially thought. It depends on what you've built around it.

Your test suite defines what the software should do. Your documentation explains why decisions were made. Your governance structure constrains how the work gets done. Without all three, you're running a powerful engine with no steering wheel.

For those of us in the brownfield world — and let's be honest, that's most of us — the path forward isn't waiting for smarter models. It's investing in the structures that make current models reliable. It's less glamorous than a viral demo of an AI building an app from scratch. But it's the work that actually matters for the teams shipping software tomorrow morning.

I'll keep pulling on this thread. Next, I want to dig into what a practical governance framework looks like at portfolio scale — not for a single WebGIS application, but for an organization with dozens of teams, hundreds of services, and twenty years of accumulated decisions living in people's heads.

If you're navigating this transition too, I'd love to hear how you're thinking about it.

Cheers,

~ John
