    Agents don’t know what good looks like. And that’s exactly the problem. – O’Reilly

    By Admin, April 10, 2026



    Luca Mezzalira, author of Building Micro-Frontends, originally shared the following article on LinkedIn. It’s being republished here with his permission.

    Every few years, something arrives that promises to change how we build software. And every few years, the industry splits predictably: One half declares the old rules dead; the other half folds its arms and waits for the hype to pass. Both camps are usually wrong, and both camps are usually loud. What’s rarer, and more useful, is someone standing in the middle of that noise and asking the structural questions: Not “What can this do?” but “What does it mean for how we design systems?”

    That’s what Neal Ford and Sam Newman did in their recent fireside chat on agentic AI and software architecture during O’Reilly’s Software Architecture Superstream. It’s a conversation worth pulling apart carefully, because some of what they surface is more uncomfortable than it first appears.

    The Dreyfus trap

    Neal opens with the Dreyfus model of skill acquisition, best known for its application to the nursing profession but applicable to any domain. The model maps learning across five stages:

    • Novice
    • Advanced beginner
    • Competent
    • Proficient
    • Expert

    His claim is that current agentic AI is stuck somewhere between novice and advanced beginner: It can follow recipes, it can even apply recipes from adjacent domains when it gets stuck, but it doesn’t understand why any of those recipes work. This isn’t a minor limitation. It’s structural.

    The canonical example Neal gives is beautiful in its simplicity: An agent tasked with making all tests pass encounters a failing unit test. One perfectly valid way to make a failing test pass is to replace its assertion with assert True. That’s not a hack in the agent’s mind. It’s a solution. There’s no ethical framework, no professional judgment, no instinct that says this isn’t what we meant. Sam extends this immediately with something he’d literally seen shared on LinkedIn that week: an agent that had modified the build file to silently ignore failed steps rather than fix them. The build passed. The problem remained. Congratulations all-round.

    What’s interesting here is that neither Ford nor Newman is dismissive of AI capability. The point is more subtle: The creativity that makes these agents genuinely useful, their ability to search solution space in ways humans wouldn’t think to, is inseparable from the same property that makes them dangerous. You can’t fully lobotomize the improvisation without destroying the value. This is a design constraint, not a bug to be patched.

    And when you zoom out, this is part of a broader signal. When experienced practitioners who’ve spent decades in this industry independently converge on calls for restraint and rigor rather than acceleration, that convergence is worth paying attention to. It’s not pessimism. It’s pattern recognition from people who’ve lived through enough cycles to know what the warning signs look like.

    Behavior versus capabilities

    One of the most important things Neal says, and I think it gets lost in the overall density of the conversation, is the distinction between behavioral verification and capability verification.

    Behavioral verification is what most teams default to: unit tests, functional tests, integration tests. Does the code do what it’s supposed to do according to the spec? This is the natural fit for agentic tooling, because agents are actually getting pretty good at implementing behavior against specs. Give an agent a well-defined interface contract and a clear set of acceptance criteria, and it will produce something that broadly satisfies them. This is real progress.

    Capability verification is harder. Much harder. Does the system exhibit the operational qualities it needs to exhibit at scale? Is it properly decoupled? Is the security model sound? What happens at 20,000 requests per second? Does it fail gracefully or catastrophically? These are things that most human developers struggle with too, and agents have been trained on human-generated code, which means they’ve inherited our failure modes as well as our successes.
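The distinction can be sketched in code. The thresholds and names below are hypothetical: behavioral verification asks "is this answer right?"; capability verification judges aggregate operational measurements (in practice fed by load tests or production telemetry) against declared budgets.

```python
# Behavioral verification vs. capability verification, as a sketch.
# All thresholds here are invented for illustration.

def handler(x: int) -> int:
    return x * 2

def verify_behavior() -> bool:
    # Behavioral: one input, one expected output, per the spec.
    return handler(21) == 42

def verify_capability(latencies_ms: list[float], errors: int, total: int,
                      p99_budget_ms: float = 250.0,
                      max_error_rate: float = 0.001) -> bool:
    # Capability: does the system hold up in aggregate?
    ranked = sorted(latencies_ms)
    idx = min(len(ranked) - 1, (99 * len(ranked)) // 100)
    p99 = ranked[idx]
    return p99 <= p99_budget_ms and (errors / total) <= max_error_rate
```

A system can pass the first check on every input and still fail the second one catastrophically, which is precisely the gap agents (and humans) tend to leave open.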

    This brings me to something Birgitta Boeckeler raised at QCon London that I haven’t been able to stop thinking about. The example everyone cites when making the case for AI’s coding capability is that Anthropic built a C compiler from scratch using agents. Impressive. But here’s the thing: C compiler documentation is extraordinarily well-specified and battle-tested over decades, and the test coverage for compiler behavior is some of the most rigorous in the entire software industry. That’s as close to a solved, well-bounded problem as you can get.

    Enterprise software is almost never like that. Enterprise software is ambiguous requirements, undocumented assumptions, tacit knowledge living in the heads of people who left three years ago, and test coverage that exists more as aspiration than reality. The gap between “can build a C compiler” and “can reliably modernize a legacy ERP” is not a gap of raw capability. It’s a gap of specification quality and domain legibility. That distinction matters enormously for how we think about where agentic tooling can safely operate.

    The current orthodoxy in agentic development is to throw more context at the problem: elaborate context files, architecture decision records, guidelines, rules about what not to do. Ford and Newman are appropriately skeptical. Sam makes the point that there’s now empirical evidence suggesting that as context file size increases, you see degradation in output quality, not improvement. You’re not guiding the agent toward better judgment. You’re just accumulating scar tissue from previous disasters. This isn’t unique to agentic workflows either. Anyone who has worked seriously with code assistants knows that summarization quality degrades as context grows, and that this degradation is only partially controllable. That has a direct impact on decisions made over time. Now imagine that degradation compounding across an enterprise codebase, with many teams working across different time zones. Don’t get me wrong, the tools help, but the help is bounded, and that boundary is often closer than we’d like to admit.

    The more honest framing, which Neal alludes to, is that we need deterministic guardrails around nondeterministic agents. Not more prompting. Architectural fitness functions, an idea Ford and Rebecca Parsons have been promoting since 2017, feel like they’re finally about to have their moment, precisely because the cost of not having them is now immediately visible.
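As a minimal sketch of what a deterministic guardrail can look like (the layer names and rule are hypothetical): a fitness function is a check, runnable in CI, that fails the build when code, generated or not, violates a declared architectural boundary, regardless of whether the behavioral tests pass. In practice the import graph would be extracted with `ast` or a tool such as import-linter; here it is passed in as data to keep the example self-contained.

```python
# A toy architectural fitness function: the "domain" layer must never
# depend on the "web" layer. Deterministic, so it constrains a
# nondeterministic agent without relying on prompting.

FORBIDDEN = {("domain", "web")}  # (importing layer, imported layer)

def layer_of(module: str) -> str:
    # Convention: the top-level package name is the layer, e.g. "web.views".
    return module.split(".")[0]

def boundary_violations(imports: dict[str, list[str]]) -> list[tuple[str, str]]:
    """imports maps each module to the modules it imports."""
    return [(mod, dep)
            for mod, deps in imports.items()
            for dep in deps
            if (layer_of(mod), layer_of(dep)) in FORBIDDEN]
```

Wired into CI as `assert not boundary_violations(...)`, this turns an architectural intention into something an agent cannot quietly route around.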

    What should an agent own then?

    This is where the conversation gets most interesting, and where I think the field is most confused.

    There’s a seductive logic to the microservice as the unit of agentic regeneration. It sounds small. The word micro is in the name. You can imagine handing an agent a service with a defined API contract and saying: implement this, test it, done. The scope feels manageable.

    Ford and Newman give this idea fair credit, but they’re also honest about the gap. The microservice level is attractive architecturally because it comes with an implied boundary: a process boundary, a deployment boundary, often a data boundary. You can put fitness functions around it. You can say this service must handle X load, maintain Y error rate, expose Z interface. In theory.

    In practice, we barely enforce this stuff ourselves. The agents have learned from a corpus of human-written microservices, which means they’ve learned from the vast majority of microservices that were written without proper decoupling, without real resilience thinking, without any rigorous capacity planning. They don’t have our aspirations. They have our habits.

    The deeper problem, which Neal raises and which I think deserves more attention than it gets, is transactional coupling. You can design five beautifully bounded services and still produce an architectural disaster if the workflow that ties them together isn’t thought through. Sagas, event choreography, compensation logic: This is the stuff that breaks real systems, and it’s also the stuff that’s hardest to specify, hardest to test, and hardest for an agent to reason about. We made exactly this mistake in the SOA era. We designed lovely little services and then discovered that the interesting complexity had simply migrated into the integration layer, which nobody owned and nobody tested.
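To make the workflow-level concern concrete, here is a toy saga sketch (the step names and structure are invented): each step pairs an action with a compensation, and on failure the compensations for completed steps run in reverse order. This cross-service logic is exactly what individual service contracts don't express, which is why it is so easy to omit and so hard for an agent to infer.

```python
# A toy saga with compensation. Each step is (name, action, compensation);
# if a step fails, undo the completed steps in reverse order.

def run_saga(steps, log):
    """steps: list of (name, action, compensation); log records what ran."""
    done = []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"did:{name}")
            done.append((name, compensate))
        except Exception:
            log.append(f"failed:{name}")
            for cname, comp in reversed(done):
                comp()
                log.append(f"undo:{cname}")
            return False
    return True
```

The mechanics fit in a dozen lines; deciding what each compensation must actually do, and proving the whole thing converges under partial failure, is where real systems break.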

    Sam’s line here is worth quoting directly, roughly: “To err is human, but it takes a computer to really screw things up.” I suspect we’re going to produce some genuinely legendary transaction management disasters before the field develops the muscle memory to avoid them.

    The sociotechnical gap nobody is talking about

    There’s a dimension to this conversation that Ford and Newman gesture toward but that I think deserves much more direct examination: the question of what happens to the humans on the other side of this generated software.

    It’s not completely accurate to say that all agentic work is happening on greenfield projects. There are tools already in production helping teams migrate legacy ERPs, modernize old codebases, and tackle the modernization challenge that has defeated conventional approaches for years. That’s real, and it matters.

    But the challenge in those cases isn’t merely the code. It’s whether the sociotechnical system, the teams, the processes, the engineering culture, the organizational structures built around the existing software are ready to inherit what gets built. And here’s the thing: Even if agents combined with deterministic guardrails could produce a well-structured microservice architecture or a clean modular monolith in a fraction of the time it would take a human team, that architectural output doesn’t automatically come with organizational readiness. The system can arrive before the people are prepared to own it.

    One of the underappreciated functions of iterative migration, the incremental strangler fig approach, the slow decomposition of a monolith over 18 months, is not primarily risk reduction, though it does that too. It’s learning. It’s the process by which a team internalizes a new way of working, makes mistakes in a bounded context, recovers, and builds the judgment that lets them operate confidently in the new world. Compress that journey too aggressively and you can end up with architecture whose operational complexity exceeds the organizational capacity to manage it. That gap tends to be expensive.
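The strangler fig mechanism itself is almost trivially simple, which is part of the point: the value is in the pacing, not the code. A sketch, with hypothetical route names: a thin facade sends migrated paths to the new system and everything else to the legacy one, so each slice is a bounded, reversible step the team can learn from.

```python
# A strangler-fig routing facade. MIGRATED grows one slice at a time;
# rolling a slice back is deleting one entry.

MIGRATED = {"/orders", "/invoices"}  # hypothetical migrated route prefixes

def route(path: str) -> str:
    prefix = "/" + path.strip("/").split("/")[0]
    return "new-service" if prefix in MIGRATED else "legacy-monolith"
```

An agent could generate the target architecture in a weekend; this facade is what lets the organization absorb it over months.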

    At QCon London, I asked Patrick Debois, after a talk covering best practices for AI-assisted development, whether applying all of those practices consistently would make him comfortable working on enterprise software with real complexity. His answer was: It depends. That felt like the honest answer. The tooling is improving. Whether the humans around it are keeping pace is a separate question, and one the industry is not spending nearly enough time on.

    Existing systems

    Ford and Newman close with a subject that almost never gets covered in these conversations: the vast, unglamorous majority of software that already exists and that our society depends on in ways that are easy to underestimate.

    Most of the discourse around agentic AI and software development is implicitly greenfield. It assumes you’re starting fresh, that you get to design your architecture sensibly from the beginning, that you have clean APIs and tidy service boundaries. The reality is that most valuable software in the world was written before any of this existed, runs on platforms and languages that aren’t the natural habitat of modern AI tooling, and contains decades of accumulated decisions that nobody fully understands anymore.

    Sam is working on a book about this: how to adapt existing architectures to enable AI-driven functionality in ways that are actually safe. He makes the interesting point that existing systems, despite their reputation, sometimes give you a head start. A well-structured relational schema carries implicit meaning about data ownership and referential integrity that an agent can actually reason from. There’s structure there, if you know how to read it.
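Sam's point about schemas carrying readable meaning can be demonstrated mechanically. In this toy example (the tables are invented), declared foreign keys are recovered by introspection, giving a dependency map, who owns what, who references whom, that a tool or an agent can reason from without any documentation.

```python
# Recovering data-ownership structure from a relational schema by
# introspecting declared foreign keys (SQLite, in-memory toy schema).

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id)
    );
""")

def dependencies(conn) -> list[tuple[str, str]]:
    """Return (child_table, parent_table) pairs from declared foreign keys."""
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    return [(t, fk[2])  # column 2 of PRAGMA foreign_key_list is the parent table
            for t in tables
            for fk in conn.execute(f"PRAGMA foreign_key_list({t})")]
```

A schema where that query comes back empty, everything joined by convention and tribal knowledge, is the legacy system with no head start at all.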

    The general lesson, which he states without much drama, is that you can’t just expose an existing system through an MCP server and call it done. The interface is not the architecture. The risks around security, data exposure, and vendor dependency don’t go away because you’ve wrapped something in a new protocol.
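One way to picture "the interface is not the architecture" in code, heavily hedged, since this is a generic sketch and not how any particular MCP server works: wrapping a legacy system for agent access still requires explicit policy. Here a hypothetical tool layer exposes only an allowlisted, vetted surface instead of passing arbitrary operations through to the backend.

```python
# A hypothetical policy layer in front of a legacy backend: only vetted,
# read-only operations are exposed; everything else is refused, not proxied.

ALLOWED_TOOLS = {"get_order_status", "list_open_invoices"}  # vetted surface

def dispatch(tool: str, backend: dict) -> str:
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"tool not exposed: {tool}")
    return backend[tool]()
```

The protocol wrapper is the easy part; deciding what belongs in `ALLOWED_TOOLS`, and what the blast radius is when that judgment is wrong, is the architecture.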

    This matters more than it might seem, because the software that runs our financial systems, our healthcare infrastructure, our logistics and supply chains, is not greenfield and never will be. If we get the modernization of those systems wrong, the consequences are not abstract. They are social. The instinct to index heavily on what these tools can do in ideal conditions, on well-specified problems with good documentation and thorough test coverage, is understandable. But it’s exactly the wrong instinct when the systems in question are the ones our lives depend on. The architectural mindset that has served us well through previous paradigm shifts, the one that starts with trade-offs rather than capabilities, that asks what we are giving up rather than just what we are gaining, is not optional here. It’s the minimum requirement for doing this responsibly.

    What I take away from this

    Three things, mostly.

    The first is that introducing deterministic guardrails into nondeterministic systems is not optional. It’s imperative. We are still figuring out exactly where and how, but the framing needs to shift: The goal is control over outcomes, not just oversight of output. There’s a difference. Output is what the agent generates. Outcome is whether the system it generates actually behaves correctly under production conditions, stays within architectural boundaries, and remains operable by the humans responsible for it. Fitness functions, capability tests, boundary definitions: the boring infrastructure that connects generated code to the real constraints of the world it runs in. We’ve had the tools to build this for years.

    The second is that the people saying this is the future and the people saying this is just another hype cycle are both probably wrong in interesting ways. Ford and Newman are careful to say they don’t know what good looks like yet. Neither do I. But we have better prior art to draw on than the discourse usually acknowledges. The principles that made microservices work, when they worked, real decoupling, explicit contracts, operational ownership, apply here too. The principles that made microservices fail, leaky abstractions, distributed transactions handled badly, complexity migrating into integration layers, will cause exactly the same failures, just faster and at larger scale.

    The third is something I took away from QCon London this year, and I think it might be the most important of the three. Across two days of talks, including sessions that took diametrically opposite approaches to integrating AI into the software development lifecycle, one thing became clear: We are all beginners. Not in the dismissive sense but in the most literal application of the Dreyfus model. Nobody, regardless of experience, has figured out the right way to fit these tools inside a sociotechnical system. The recipes are still being written. The war stories that will eventually become the prior art are still happening to us right now.

    What got us here, collectively, was sharing what we saw, what worked, what failed, and why. That’s how the field moved from SOA disasters to microservices best practices. That’s how we built a shared vocabulary around fitness functions and evolutionary architecture. The same process has to happen again, and it will, but only if people with real experience are honest about the uncertainty rather than performing confidence they don’t have. The speed, ultimately, is both the opportunity and the danger. The technology is moving faster than the organizations, the teams, and the professional instincts that need to absorb it. The best response to that isn’t to pretend otherwise. It’s to keep comparing notes.

    If this resonated, the full fireside chat between Neal Ford and Sam Newman is worth watching in its entirety. They cover more ground than I’ve had space to react to here. And if you’d like to learn more from Neal, Sam, and Luca, check out their most recent O’Reilly books: Building Resilient Distributed Systems, Architecture as Code, and Building Micro-Frontends, second edition.


