Can an AI Rebuild Your Software?

Your test suite might be the most valuable asset you didn't know you had.

I was sitting on my couch this morning watching a YouTube video about open source software potentially changing forever. The video was from ThePrimeagen — if you haven't watched him, he's one of those creators who manages to make software drama genuinely entertaining. The gist was about companies using AI to rewrite open source projects entirely from scratch, using the existing test suites as their specification. And then I wondered:

"Is there a metric hiding in here — something that measures how much agentic development a project can support, based on how well it's tested and documented? If an agent had nothing but the test suite and the docs, could it rebuild the whole thing from scratch?"

I didn't have a clean framework for this. I just had a feeling that something important was hiding in plain sight. So let me try to work through it with you.

It's Already Happening

Before I get into the idea, let me ground this in what's actually going on — because this isn't theoretical.

Anthropic recently had 16 AI agents build a working C compiler from scratch. In Rust. In two weeks. For about $20,000. No human wrote a single line of compiler code. The agents used GCC's torture test suites as their specification: write code, run the tests, fix what fails, repeat. The result was 100,000 lines of Rust that can compile the Linux kernel across three architectures.
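That write-code, run-tests, fix-failures loop is simple enough to sketch. Here's a minimal Python version of the idea — every name here is hypothetical, not Anthropic's actual harness — where the test suite is the fixed specification and passing it is the only success criterion:

```python
# A sketch of the agentic build loop described above (all names hypothetical):
# the model never sees a spec document, only the tests and their failures.
def agent_build_loop(run_tests, generate_patch, apply_patch, max_iterations=1000):
    """run_tests() returns (passed: bool, failure_log: str)."""
    for iteration in range(max_iterations):
        passed, failures = run_tests()
        if passed:
            return iteration              # spec satisfied: every test passes
        patch = generate_patch(failures)  # the model sees only the failures
        apply_patch(patch)                # write code, run tests, fix, repeat
    return None                           # budget exhausted without converging
```

Notice what the loop depends on: nothing but a runnable test suite with meaningful failures. That's the whole point.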

Is it a perfect compiler? No. Critics rightly pointed out that it's not efficient, it's not production-ready, and it was built on the shoulders of decades of human compiler engineering baked into those test suites and the model's training data. Fair points, all of them.

And then just this week, a single Cloudflare engineer used AI to rewrite Next.js — Vercel's flagship framework — on top of Vite. In one week. For $1,100 in tokens. Almost every line of code was written by AI. The engineer ported the existing Next.js test suite directly and used it as the target specification. As he put it, the project worked because of the combination of a well-documented API, a comprehensive test suite, and a model capable enough to handle the complexity. Take any one of those away, and it falls apart.

Now, two examples don't make a trend. I'm not here to declare a movement. But these are hints. And if you've been in this industry long enough, you learn to pay attention to hints — especially when they come from companies operating at this kind of scale.

What both of these projects have in common is the thing I can't stop thinking about: the test suite was the blueprint, and the AI was the builder. The quality of the specification determined the quality of the output.

The Pasta We Made

If you've been building software for any length of time, you know the reality. Most of us don't get to start fresh. We inherit things. We inherit codebases that were built by people who left the company three years ago. We inherit architectural decisions that made perfect sense in 2018 and make absolutely no sense now. We inherit the organizational chart baked directly into the software — because that's exactly what Conway's Law predicts will happen, and it does, every single time.

I've been in this industry long enough to remember when we were making pasta all day in our web applications and loving it. jQuery spaghetti everywhere, CSS floats holding the whole thing together with prayers and !important declarations. And you know what? We shipped product. It worked. Humans are remarkable at holding chaos together with tribal knowledge and Slack threads.

But here's the thing — AI can't do that. AI can't walk over to the engineer who built the original service and ask why that weird edge case handler exists. AI can't pick up on the context from a hallway conversation or read between the lines of a three-year-old Jira ticket. All it has is what's written down.

And for most of our codebases... what's written down isn't much.

Tests as Specification

What Cloudflare and Anthropic are demonstrating — even if we're still in the early innings — is something we've known forever but rarely acted on: tests aren't just a safety net. They're the closest thing we have to a machine-readable specification of what software is supposed to do.

We've been measuring test coverage for years as a quality signal for human development teams. "Get your coverage up." "We need 80% coverage." But what if that same coverage metric is also telling us something completely different? What if it's telling us how ready that piece of software is for an AI to work on it?

Not to replace anyone. Let me be really clear about that. But to handle the routine stuff — the bug fixes, the dependency updates, the feature work that follows well-established patterns — so that the engineers on your team can focus on the problems that actually require a human brain. The architecture decisions. The creative solutions. The stuff that makes this job interesting in the first place.

So What Would You Actually Measure?

This is a rough first pass at naming it. I've been calling it an "Autonomous Maintainability Index" in my head, and I'll be honest, the name needs work. But the idea is this: for any given service or application in your portfolio, how much of its maintenance could realistically be handled by an AI agent with minimal human oversight?

And it's not just about tests. Think about everything a new engineer needs when they join your team and pick up an unfamiliar service. I've onboarded enough people over the years to know exactly what that list looks like:

Can an agent understand what this software is supposed to do? That's your tests — but not just line coverage. Behavioral tests. Tests that describe intent, not just exercise code paths. If your tests say "this function returns true" but don't explain why it should return true in that context, that's not a spec. That's a smoke test.
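To make that distinction concrete, here's a small Python sketch using a hypothetical pricing function. Both tests pass, but only the behavioral ones would tell an agent what to rebuild:

```python
# Hypothetical example: a pricing rule an agent might need to reconstruct.
def apply_discount(price: float, is_loyalty_member: bool) -> float:
    """Loyalty members get 10% off; prices never drop below zero."""
    discount = 0.10 if is_loyalty_member else 0.0
    return max(price * (1 - discount), 0.0)

# A smoke test: exercises the code path but explains nothing about intent.
def test_apply_discount_smoke():
    assert apply_discount(100.0, True) == 90.0

# Behavioral tests: each name states a business rule, so together they
# double as a specification an agent could rebuild the function from.
def test_loyalty_members_receive_ten_percent_discount():
    assert apply_discount(100.0, True) == 90.0

def test_non_members_pay_full_price():
    assert apply_discount(100.0, False) == 100.0

def test_price_never_drops_below_zero():
    assert apply_discount(0.0, True) == 0.0
```

Same assertions, same coverage number. The difference is entirely in how much intent is written down.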

Can an agent understand why decisions were made? This is documentation, but not the kind most of us write. I'm talking about architecture decision records, design docs, the stuff that captures the reasoning behind choices. Every time someone on your team says "oh yeah, we did it that way because..." and the answer lives only in their head — that's a gap an AI can't bridge.

Can an agent tell when something is broken? Structured logging, meaningful alerts, health checks. If your on-call engineer needs to SSH into a box and read raw log files to figure out what went wrong at 3am, an AI agent isn't going to fare any better.
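The difference between "readable by an agent" and not is often just structure. Here's a minimal structured-logging sketch using Python's stdlib `logging` — the field names are hypothetical — where every record is one JSON object a machine can query instead of a sentence it has to parse:

```python
import json
import logging

# Minimal structured-logging sketch: each record becomes a single JSON
# object, so an agent (or a log pipeline) can filter on fields mechanically.
class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Context attached at the call site via the `extra=` argument.
            "service": getattr(record, "service", None),
            "order_id": getattr(record, "order_id", None),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

# "Payment declined for order 123" as a queryable event, not free text:
log.info("payment_declined", extra={"service": "checkout", "order_id": 123})
```

A human on call at 3am benefits from this exactly as much as an agent does.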

Can an agent make changes safely? Feature flags, automated rollback, canary deployments. These are the guardrails that make it safe for anyone to ship changes — human or AI. Without them, you're handing someone the keys to production with no seatbelt.
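A feature flag with percentage rollout is the simplest of those guardrails. Here's a toy sketch — the in-memory dict stands in for a real config service, and all the names are made up — showing how a risky change ships dark, canaries to a slice of users, and rolls back with a flag flip rather than a redeploy:

```python
import hashlib

# Toy flag store; in production this would live in a config service.
FLAGS = {
    "new_pricing_engine": {"enabled": True, "rollout_percent": 5},
}

def is_enabled(flag: str, user_id: str) -> bool:
    config = FLAGS.get(flag)
    if not config or not config["enabled"]:
        return False
    # Hash the user id to a stable bucket in [0, 100), so each user
    # consistently sees the same variant as the rollout widens.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < config["rollout_percent"]

def compute_price(user_id: str, base: float) -> float:
    if is_enabled("new_pricing_engine", user_id):
        return round(base * 0.95, 2)  # the risky new path, canaried
    return base                       # the known-good path, instant rollback
```

Flipping `enabled` to `False` is the rollback: no deploy, no merge, no waiting — which is exactly what makes it safe to let anyone, human or agent, ship the change in the first place.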

Can an agent reason about this service in isolation? Clean API contracts, explicit dependencies, minimal hidden coupling. The more your service operates as a well-defined unit, the less an agent needs to understand about the entire ecosystem to do useful work on it.

Here's what I find fascinating about this list: every single one of these things also makes your software better for humans. There's no trade-off here. Investing in AI readiness is just investing in software quality — the kind we've always talked about but often deprioritized because the humans could compensate.

Why I Can't Stop Thinking About This

I lead engineering teams at scale — across multiple portfolios and over a dozen teams. We have a lot of software. A lot. Built over many years, by many people, at many different levels of quality and documentation. Some of it is beautifully tested and well-documented. Some of it needs some love and attention.

I'm navigating the transition every brownfield company is facing right now. Engineers are at different points on their AI adoption curve. Some are pair-programming with AI tools every day. Others haven't changed their workflow in years. Infrastructure — CI/CD pipelines, code review processes, deployment gates — all designed around human decision-making. Human speed. Human judgment.

What I don't have, and what I think most engineering leaders don't have, is a shared language for navigating a transition where AI is playing a larger role in the development process. Something that lets me look at a service and say "this one is ready for AI-assisted development" or "this one needs investment before we can get there." Something that gives my teams a roadmap instead of a vague directive to "adopt AI."

That's what I think this metric and this line of thinking can reveal. Not a silver bullet — I've been around long enough to be deeply suspicious of anyone selling silver bullets — but a practical set of tools and a shared language for honest conversations about where we are, where we need to go, and what opportunities are ahead of us.

Imagine being able to code and respond instantly, in all places at once. Isn't that a fascinating prospect?

What's Next

I'm going to keep pulling on this thread. In my next post, I want to zoom out from individual services and look at the bigger picture — the portfolio level, the organizational level, the human level. Because software is only one component of the larger system: the businesses that deploy it and rely on it to deliver their value proposition, the people who build it, and the structures those people create around themselves. Conway's Law doesn't just describe how software gets organized. It describes how change gets organized. And if we're going to navigate the shift from human-driven to AI-assisted development, we need to talk about all of it.

I'd love to hear what you think. Are you looking at your codebases through this lens? Have you found ways to measure readiness that work? Let's have a conversation about it.

Cheers,

~ John
