Scale Forem: Nick Talwar

6 Things Your AI Agents Need That You're Probably Not Building

Nick Talwar — Tue, 19 May 2026 17:35:06 +0000

The infrastructure that separates agents that demo well from agents that actually run

You would never bring a new hire onto your team without performance feedback, escalation paths, or a way to know when they're struggling. Yet that's exactly how most organizations deploy AI agents. MIT Sloan and BCG's 2025 research found that 76% of executives now describe agents as coworkers rather than tools, but almost none of them are managing agents that way. They ship the agent and move on.

Deciding to call your agents “coworkers” is easy. Setting up the feedback loops, escalation paths, and failure signals that actually make one is where teams stall. It's almost entirely an infrastructure problem, and these are the six pieces most teams skip.

1. Evaluation Frameworks

A working agent and a reliable agent are two different things. Evaluation frameworks give you the ability to measure the difference before your users discover it for you. This means building structured test suites that run against your agent's outputs on a regular cadence, scoring for accuracy, relevance, and task completion across a range of realistic scenarios.

Good evaluation suites include both deterministic checks (did the agent call the right tool with the right parameters?) and judgment-based scoring (was the response actually useful to the person asking?).

The key is that evaluation has to be continuous, running in CI/CD pipelines and against live traffic, because agent behavior shifts as underlying models update and data distributions change. LLMs, the technology that undergirds agents, are at their core probabilistic in nature, which means there is an often opaque statistical distribution that can shift over time, which affects performance and accuracy.

Anthropic's engineering team has written publicly about maintaining evaluation suites as living artifacts, with dedicated teams owning the infrastructure while domain experts contribute tasks and run the tests themselves.

2. Fallback and Escalation Logic

Every agent will encounter situations it cannot handle. The question is whether you've decided in advance what happens next, or whether the agent improvises.

Fallback logic defines the boundaries. When confidence drops below a threshold, when a tool call returns unexpected data, when the task exceeds the agent's defined scope, the system needs a predetermined path. That path might route to a simpler deterministic process, a different model, or a human operator. Escalation logic layers on top of that by adding severity awareness.

Without explicit escalation tiers, every failure gets the same treatment, which means either everything gets flagged (and humans stop paying attention) or nothing does (and real problems slip through). The organizations successfully scaling agents build these paths before deployment, treating them as load-bearing architecture.

3. Monitoring for Drift

AI agents degrade quietly. Model updates, shifts in input data, changes to upstream APIs, seasonal variation in user behavior. Any of these can erode agent performance without triggering a single error.

Drift monitoring tracks the gap between how your agent performed when you validated it and how it performs now. This includes statistical monitoring of output distributions, latency tracking across individual tool calls, and automated quality scoring against baseline benchmarks. In practice, effective drift detection requires capturing baseline metrics during your evaluation phase and then running the same scoring pipeline against production traffic on an ongoing basis. When scores diverge from your baseline by more than an acceptable margin, you have a concrete signal to investigate rather than a vague feeling that things seem off.

4. Human-in-the-Loop Checkpoints

Full autonomy sounds efficient until you realize what it costs when the agent is wrong. Human-in-the-loop checkpoints create structured moments where a person reviews, approves, or redirects agent output before it reaches the end user or triggers a downstream action.

The design challenge is placement. Too many checkpoints and you've built an expensive autocomplete system. Too few and you've handed off accountability to a system that can't actually hold it. The right approach maps checkpoints to consequence.

Low-risk, reversible actions can run autonomously. High-stakes decisions, anything involving money, legal exposure, or customer-facing commitments, need a human gate. As agents take on more complex workflows, these checkpoints also become your training data pipeline. Every human correction is a signal about where the agent needs improvement, but only if you're logging it (which brings us to the next point).

5. Logging for Auditability

When an agent makes a decision, you need to be able to reconstruct exactly how it got there. Full execution logging captures the chain of reasoning, tool invocations, retrieved context, intermediate outputs, and final actions across every run.

This serves three purposes simultaneously:

First, debugging. When something goes wrong, you need the trace, not a guess.

Second, compliance. Regulated industries require demonstrable decision trails, and even unregulated ones are moving in that direction.

Third, improvement. Logged executions become the dataset you use to identify failure patterns, tune prompts, and build better evaluation suites.

The tooling for this has matured significantly. OpenTelemetry-based tracing, structured span capture, and production replay capabilities now exist across multiple frameworks. The infrastructure cost is low relative to the cost of operating an agent you cannot inspect.

6. A Defined Handoff Protocol

Agents rarely operate in isolation. They pass work to other agents, to human operators, to downstream systems, and occasionally back to the user. Every one of those transitions is a potential failure point.
A handoff protocol specifies what information transfers with the task, what context the receiving party needs, what constitutes a successful handoff versus a dropped one, and who owns the outcome after the transition.

This gets more complex in multi-agent systems where one agent's output becomes another agent's input. If the first agent summarizes a customer issue and strips out a critical detail before passing it along, the second agent makes a decision on incomplete information. Neither agent has failed individually, but the system has failed completely.

Without this kind of structural clarity, you get the agent equivalent of a game of telephone. Context gets lost between steps, responsibilities blur, and when something fails mid-workflow, nobody can pinpoint where.

The Management Layer You Can't Skip

These six elements share a common thread. They're all infrastructure that exists to manage the agent after it's built.

The agent itself, the model, the prompts, the tool integrations, that's maybe 40% of what a production deployment actually requires.
The other 60% is the system that keeps the agent honest, visible, and recoverable when things go sideways.

Organizations that treat agent deployment as a build-and-ship exercise will spend the next six months doing manual cleanup on failures they could have prevented. The ones that invest in this management layer first will find that their agents get better over time instead of quietly getting worse.

The technology is mature enough. The question is whether your operational infrastructure is ready to match it.

…

Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.

→ Follow him on LinkedIn to catch his latest thoughts.
→ Subscribe to his free Substack for in-depth articles delivered straight to your inbox.
→ Watch the live session to see how leaders in highly regulated industries leverage AI to cut manual work and drive ROI.

Your Product Doesn't Need GPT-5. And It’s Costing You More Than You Think.

Nick Talwar — Tue, 12 May 2026 12:13:37 +0000

How Fine-Tuned Small Models Outperform Frontier AI for Most Production Workloads

Serving a 7B parameter model costs roughly $0.0004 per 1,000 tokens. A frontier model like GPT-5 charges up to $0.09 for the same volume. That's a 200x spread on per-token cost, and at production scale, it compounds into the kind of line item that makes CFOs start asking uncomfortable questions.

Yet most enterprise AI strategies still start in the same place. Frontier model API, default configuration, build everything on top.

I’ve heard the same reasoning for this decision countless times. The plan is to start here, and optimize later. But "optimize later" rarely happens. The API dependency becomes load-bearing, and switching costs quickly accumulate. More often than not, teams discover much too late that 70-80% of their inference calls are handling structured, repeatable tasks that never needed frontier-class reasoning in the first place. Meanwhile, a fine-tuned small model handles all of it at a fraction of the cost, often with better accuracy on the specific domain, and without the vendor dependency.

The question worth asking before you architect anything isn't "which model is most powerful." It's whether the task even requires that power.

The Compounding Cost Problem

The per-token price gap between frontier and small models tells only part of the story. The real damage happens at volume.

Gartner’s analysis found that agentic AI workflows consume 5 to 30 times more tokens per task than standard chatbot interactions. When your agents are running thousands of structured, repeatable tasks per day, each one burning frontier-priced tokens, monthly inference bills can scale from manageable to alarming before anyone notices. A system handling 50,000 daily agent tasks on frontier APIs accumulates costs that a finance team will eventually flag, and "but the model is really smart" isn't a satisfying answer when 80% of those tasks are pattern execution.

API pricing has dropped significantly. Frontier-quality model costs fell roughly 80% between 2025 and early 2026. But cheaper tokens don't change the underlying architectural mistake. You're still paying for general-purpose reasoning capacity on tasks that need specialized precision. It's the equivalent of provisioning a 256-core cluster to run a cron job.

Where Small Models Win (And Where They Don't)

Small language models, typically under 10 billion parameters, have crossed a performance threshold that changes the production calculus. Research from late 2025 demonstrated that a fine-tuned 350M parameter model outperformed generalist frontier models on structured tool-calling and API orchestration tasks. A 3B parameter model trained on domain-specific data can match frontier accuracy on classification, extraction, and routing while delivering 150 to 300 tokens per second compared to the 50 to 100 range typical of large models.

The production evidence is growing. An analysis of 287 documented SLM deployments found companies like Checkr, NVIDIA, Bayer, and DoorDash replacing frontier models with 7B to 14B parameter alternatives at 5 to 150 times lower cost, with equal or better performance on their specific tasks.

But small models have real limits. They fall apart on tasks requiring deep reasoning across long, unstructured documents. Complex multi-step inference, novel problem synthesis, and ambiguous decision-making still belong to frontier architectures. Pretending otherwise leads to brittle systems.

A Decision Framework for Model Selection

The architectural question isn't "which model is best." It's what the specific task actually requires.

Route to a small model when the task is structured, repeatable, and well-defined. Classification, entity extraction, document routing, templated generation, API orchestration, and status parsing all fit. If you can describe the task with clear input-output examples and the domain is bounded, a fine-tuned small model will likely match frontier performance at a fraction of the cost.

Route to a frontier model when the task demands open-ended reasoning, novel problem-solving, or synthesis across large unstructured contexts. Strategic analysis, complex code generation, multi-document research, and ambiguous judgment calls still benefit from frontier-scale reasoning. These tasks involve genuine inference, not pattern execution.

The hybrid architecture is where most production systems should land. Use a frontier model as the orchestration layer for planning, decision routing, and edge cases. Deploy fine-tuned small models as the execution layer for the high-volume structured tasks that account for the bulk of actual inference calls. One documented deployment using this approach, a frontier model as "master controller" with specialized small models handling task execution, showed a 90% reduction in monthly API costs and a 70% improvement in response speed.

The Vendor Lock-In Problem

There's a second cost that doesn't show up on the monthly invoice. Every API call to a frontier model is a dependency you don't control. Pricing changes, rate limits, model deprecations, and terms-of-service updates all happen on someone else's timeline.

Fine-tuned small models running on your own infrastructure eliminate that variable. You control the model weights, the serving stack, the update cycle, and the data pipeline. For regulated industries where sensitive data can't touch third-party APIs, self-hosted small models aren't just a cost optimization. They're the compliance baseline.

The breakeven point for self-hosting versus API consumption is lower than most teams assume. Analysis across production deployments puts the threshold around 8,000 conversations per day, or roughly $500 per month in API spend. Above that line, owning your inference infrastructure starts paying for itself.

Right-Sizing as an Engineering Discipline

Treating model selection with the same rigor you'd apply to database provisioning or infrastructure architecture is the move that separates production-grade AI systems from expensive experiments.

A frontier model is a tool. A small model is a tool. The discipline is knowing which tool fits which job, and building the architectural flexibility to use both without locking yourself into either. For most production workloads running structured, repeatable agent tasks at scale, the 7B parameter model on your own infrastructure will outperform the frontier API call to a model that's three orders of magnitude larger than what the task requires.

The smartest infrastructure decision you make this year might be choosing the smaller model, most of the time.

…
Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.

$4M Revenue Per Employee Is the New Benchmark. Most Companies Can’t Get There.

Nick Talwar — Tue, 05 May 2026 14:00:00 +0000

What AI-Native Operations Actually Look Like and Why Retrofitting Falls Short

Cursor crossed $2 billion in annualized revenue in early 2026. The team that built it? Roughly 300 people. Gamma, the AI presentation platform, hit $100 million ARR with about 50 employees and has been profitable for over two years. Midjourney generates hundreds of millions in annual revenue with a team you could fit in a mid-sized conference room. Lovable reached $100M ARR in eight months with 45 people.

Meanwhile, the median private SaaS company generates about $130,000 per employee. Five years ago, $100K was considered a reasonable benchmark. At scale, the best traditional SaaS companies were proud to reach $300K.

The gap between these numbers tells you something specific about how these companies are built. All four companies I mentioned initially have something in common beyond the headcount math.

From the first hire, they were built around AI as a core operator, with every workflow, every role, and every system designed on that assumption. The label for this is AI-native.

And for founders and executives running $5-30M ARR companies right now, the gap between AI-native operations and everyone else is a competitive timeline that is already shrinking.

What "AI-Native" Actually Means at the Operational Level

The phrase gets thrown around loosely, so let me be specific. An AI-native company designs its workflows from scratch around what AI can do. Every process, every role, every system assumes AI as a core participant from day one.

This is fundamentally different from what most companies do, which is take their existing workflows and add AI tools to them. The distinction matters because the architecture of your operations determines the ceiling of your efficiency.

Consider how a traditional SaaS company handles content. A marketing team writes briefs. Writers produce drafts. Editors review. Designers format. A project manager coordinates the whole thing. Five or six people touch every piece of content before it ships.

An AI-native company designs that workflow differently from the start. AI generates first drafts from structured inputs. A single editor shapes the output. Distribution happens programmatically. The entire pipeline might involve one or two people instead of six, and the throughput is three to five times higher.

Multiply that across customer support, engineering, sales enablement, onboarding, and internal operations. The compounding effect explains how Cursor runs at $6 million per employee while companies with similar revenue require ten times the headcount.

Why Retrofitting Existing Operations Fails

The instinct most established companies have is to layer AI tools onto what already exists. Buy a few licenses, integrate a copilot, maybe automate some ticket routing. This feels productive. It rarely moves the needle in a meaningful way.

The problem is structural. Your existing workflows were designed around human throughput. Your org chart reflects that design. Your hiring plans, your meeting cadences, your approval chains, your reporting structures all assume that humans do the work and other humans coordinate that work.

Bolting AI onto this foundation creates an awkward hybrid. AI generates a draft, but then it still goes through the same five-person review chain that existed before. AI triages support tickets, but the staffing model hasn't changed to reflect the reduced load. The tool saves twenty minutes per task, but the organizational overhead around that task stays identical.

The Realistic Options for Established Companies

If you're running a $5-30M ARR company, you probably aren't going to tear everything down and rebuild from scratch. That's fine. But pretending the efficiency gap will close on its own is a mistake with a deadline.

Here's what actually works for companies that aren't starting from zero.

Start with one workflow, redesigned from zero. Pick your highest-volume, most repeatable process and redesign it from scratch with AI as the primary operator. Don't optimize the existing process. Design the new one as if the old one didn't exist. Customer onboarding, content production, and first-line support are common starting points because they're high-volume and have clear inputs and outputs. The goal is to prove to your own organization what redesigned throughput looks like before you try to scale the approach.

Hire for the new architecture. The next time you open a role, ask whether the function that role serves could be restructured around AI instead. This doesn't mean replacing people. It means designing the role so one person with AI leverage can do what previously required three. The companies generating $2M+ per employee didn't get there by giving existing employees AI tools. They built teams where every person operates as a force multiplier.

Measure the right ratio. Track revenue per employee quarterly. If you're below $150K and growing, you're adding headcount faster than you're adding efficiency. That was fine in 2020. Today, it means you're falling behind the curve that AI-native competitors are setting. For context, top-quartile SaaS companies now generate $350K-$700K per employee, and the AI-native outliers are running at five to ten times that range.

Accept that partial adoption produces partial results. A company that redesigns 30% of its operations around AI-native principles will capture meaningful efficiency gains. A company that gives everyone a ChatGPT license and calls it transformation will not. Architectural commitment drives the outcome here. Tool selection alone never has.

Sequence your investment around leverage. Most companies adopt AI where it's easiest to implement. The better approach is to start where the ratio of human labor to repeatable output is highest. That's usually operations and fulfillment, where the actual throughput gains live.

The Clock Is Running

The revenue-per-employee gap between AI-native companies and everyone else keeps widening. Gartner projects a wave of companies generating $2M+ per employee by 2030, and the leaders are already well past that mark.

For operators and founders at the $1-5M stage, this isn't a future problem. Your next funding round, your next hire, your next operational decision is happening in a market where competitors might need one-fifth the headcount to deliver the same output.

The companies that approach this as an architectural challenge will adapt. The ones running a tool-buying exercise will learn the hard way that efficiency at this scale comes from how you build, from how you design the work itself.

…

Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.

→ Follow him on LinkedIn to catch his latest thoughts.

→ Subscribe to his free Substack for in-depth articles delivered straight to your inbox.

→ Watch the live session to see how leaders in highly regulated industries leverage AI to cut manual work and drive ROI.

The Job Title That Didn’t Exist Last Year

Nick Talwar — Tue, 21 Apr 2026 12:07:28 +0000

Why Enterprise AI Needs a Translation Layer Between Data and Decisions

Gartner projects that over 40% of agentic AI initiatives will be abandoned by 2027. Reading that, a reasonable person might conclude that there is an inherent issue with the technology.

However, I know from my own experience building agents that when done correctly, they deliver.

The failure pattern we keep hearing about has nothing to do with model quality or infrastructure maturity. It's that organizations have no single agreed-upon definition for their own data.

Different departments define the same terms differently, and agents consume whatever definition they hit first at 10x the speed any human team would. Humans reconciled those gaps in quarterly meetings and footnotes. Agents just produce confident, expensive wrong answers.

The real fix requires a role that most companies haven't named yet.

When "Revenue" Means Different Things

Humans have always tolerated semantic drift inside organizations. If marketing and finance calculate revenue differently, they reconcile the gap in quarterly meetings or bury it in footnotes. The cost of ambiguity stayed low because humans processed data slowly enough to catch the mismatches.

AI agents don't reconcile by themselves. They ingest whatever schema they can access, apply whatever definition they encounter first, and produce output that sounds authoritative regardless of whether the underlying logic holds.

The confidence of the output actually makes the problem worse, because stakeholders trust polished summaries more than they trust raw numbers.

The Role Sitting Between Data and Meaning

The people solving this problem function as a translation layer between raw enterprise data and business meaning. They define what terms actually mean across the organization, map those definitions into the semantic structures that AI systems rely on, and maintain the consistency of that layer as business logic evolves.

The skillset is specific and rare. You need someone who understands data modeling well enough to audit pipeline logic, but who also understands the business well enough to know that "active customer" means something different to the retention team than it does to the billing team. You need someone who can sit in a room with a CFO and a data engineer simultaneously and translate in both directions.

Most companies don't have this person because the job didn't exist until AI agents started consuming enterprise data fast enough to make the gaps visible.

Some organizations are calling this a semantic architect. Others are folding it into "context engineering," which has emerged as a recognized discipline for designing the information environment that AI models operate within.

Cognizant's CIO, Neal Ramasamy, recently described context engineering as the factor that separates enterprise AI experimentation from sustainable scale, noting that most of the critical context in organizations still lives in people's heads rather than in systems where agents can access it.

Whatever you call the role, the function is the same: someone owns the relationship between what the data says and what the business means.

What This Role Could Look Like

Here's how I'd scope this role if I were hiring for it today.
This person sits between the data engineering team and business leadership. They own the company's business glossary, the single source of truth that defines what every key term means across the organization.

Before any new data source enters the AI pipeline, they confirm that field names map to actual business logic. When two departments define "customer" differently, they make the call on which definition the system uses. And they have enough authority to make that call stick.

The technical work is straightforward. The hard part is the authority. A semantic layer without organizational backing is just a wiki nobody reads.

The semantic layer market is projected to grow from $2.7 billion to $7.7 billion by 2030 precisely because companies are realizing that the technical infrastructure only works when someone with real authority governs it.

The Org Chart Hasn't Caught Up

Companies are spending millions on model selection, compute infrastructure, and agent orchestration while leaving the semantic layer as an afterthought managed by whichever data engineer happens to notice the inconsistency. It's the organizational equivalent of building a Formula 1 car and forgetting to hire someone who reads the track map.

The companies getting reliable output from their AI systems in 2026 will be the ones that treated this translation function as a first-class strategic hire, reporting to the CTO or CDO with real authority over definitions. The ones still debugging confident-sounding garbage will be the ones who assumed the data would speak for itself.

It won't. It never did. Humans just papered over the gaps. AI agents don't have that option.

…

Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.

The 8-Hour Agent Doesn’t Fit Into Your Business Model

Nick Talwar — Tue, 14 Apr 2026 13:26:25 +0000

Why AI Workstream Duration Changes Everything About Hiring, Teams, and Accountability

A year ago, agents could reliably handle about an hour of autonomous work. Tasks like summarizing a document or running a data pull. Useful, but contained. You could bolt those tasks onto existing workflows without changing anything structural.

That window is closing fast.

METR, the AI evaluation research organization, published findings last year that reframed how I think about planning horizons:

The length of tasks that frontier AI agents can complete with 50% reliability has been doubling approximately every seven months.
In the 2024-2025 period, the pace accelerated to roughly every four months.
Agents that managed one-hour workflows in early 2025 will be handling full eight-hour workstreams by late 2026.

An eight-hour workstream is a fundamentally different unit of work than a one-hour task. And most companies have no operating model for that.

The Staffing Problem Nobody's Solving Yet

When an agent handles a one-hour task, it fits neatly inside your existing org chart. But when an agent handles an eight-hour workflow, you've crossed into project-level work.

This raises questions your org chart wasn't designed to answer. Who scopes the work? Who reviews quality at intermediate checkpoints, not just at the end? If the agent makes a judgment call four hours in that sends the remaining four hours in the wrong direction, whose problem is that?

Most executives are still thinking about AI as a task-level tool, something that makes individual contributors faster. The planning shift required here goes deeper. If an agent can own a full workday of output, you're making staffing decisions, not automation decisions. And staffing decisions cascade. They affect headcount planning, team composition, project timelines, and how you think about accountability for deliverables.

Consider a concrete example. A three-person analytics team currently handles weekly reporting, ad hoc data pulls, and quarterly business reviews. At the one-hour level, agents might handle the data pulls. The team stays intact, just faster. At the eight-hour level, an agent can own the entire weekly reporting cycle, from data extraction through visualization to narrative summary. Now you're looking at a different team shape entirely. Maybe two analysts and one workflow architect who designs and monitors the agent pipelines. Same output, different organizational logic.

Tomasz Tunguz has been writing about this transition from the venture side. He's running 31 agent tasks a day through his own workflows and watching software engineers manage 15 parallel AI workstreams through GitHub. The throughput numbers are real. But throughput without organizational redesign just creates a different kind of mess.

What Breaks When You Map Agent Capabilities Onto Human Structures

Here's where most companies get stuck. They take their existing team structure, identify tasks within that structure, and hand those tasks to agents. That works fine at the one-hour level. At the eight-hour level, you start hitting structural mismatches.

Human team structures assume certain things. People accumulate context over days and weeks. They build judgment through repeated exposure to similar decisions. They escalate ambiguity upward. But agents don't operate on any of those assumptions. They start fresh each time (unless you architect context persistence). And they'll confidently proceed through a six-hour workflow on a flawed assumption made in hour one.

That's a critical insight for anyone planning around agent-length workflows. The longer the workflow, the more you need architectural guardrails, not because the agent is incompetent, but because compounding errors over eight hours of unsupervised work can waste the entire output.

Designing Work Around Agent-Length Workflows

So what actually changes in practice? Three things.

First, decomposition becomes an engineering discipline. When you're handing off an eight-hour workstream, the quality of your work breakdown determines the quality of the output. Vague briefs that a senior employee could interpret and correct on the fly become expensive failures when an agent executes them literally for a full workday. The skill shifts from "manage the person doing the work" to "architect the specification precisely enough that autonomous execution succeeds."

Second, review cadence matters more than review depth. A single end-of-day review of eight hours of agent work is a recipe for rework. The Deloitte research on agentic AI adoption found that organizations succeeding with agent workflows redesigned their review processes around intermediate checkpoints, not final deliverable review. The parallel in software engineering is obvious. You don't wait for the entire codebase to be written before doing a code review. You review at the pull request level. Agent workflows need the same kind of incremental quality gates.

Third, accountability has to be redesigned, not just reassigned. When a human employee produces bad work, the feedback loop is straightforward. When an agent produces bad work after eight hours, the accountability question splits in several directions. Was the specification wrong? Was the workflow architecture missing a checkpoint? Did the person who scoped the work understand what the agent could and couldn't handle? These are systems questions, not performance questions. And they require a different management muscle than most organizations have built.

The Planning Horizon Question

Companies that wait until agents can reliably own full workdays before restructuring will be rebuilding their operating models under time pressure. Companies that start now, rethinking work decomposition, review cadences, and accountability frameworks, will have the organizational muscle in place when the capability arrives.

The point here goes beyond headcount replacement. The unit of work you're managing is about to change scale. A hiring plan built around task-level automation looks very different from one built around project-level agent staffing. The team structure that works when agents handle one-hour tasks won't hold when they handle eight.

The businesses that get this right won't be the ones with the best AI models. They'll be the ones that redesigned their operations to match what agents can actually own.

…

Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.

4 Questions to Redesign Your Org for AI Agents

Nick Talwar — Tue, 07 Apr 2026 16:01:04 +0000

What High-Performing AI Companies Have Already Figured Out

Every workflow has invisible seams, steps that only function because a human with ten years of context fills the gaps.

Most companies don't notice these gaps because the process works well enough and the entire human’s job is not documented, step-by-step (an unreasonable expectation, of course). What usually happens in these cases is people route around the broken handoff, apply judgment where the documentation runs out, and quietly absorb complexity that was never formally accounted for.

Oftentimes, humans supporting and filling gaps is great when humans run the workflow. But as the use of AI agents begins to rise, things start to change and each one of these gaps become places where the agent fails and never picks up.

Drop an agent into a workflow built on informal human compensation, and the agent will execute the process exactly as written. Which means the real question is whether the workflow itself was ever designed to run without a human quietly holding it together.

For most companies, the answer is no. And that means the work needs to start with workflow redesign.

Why Pilots Succeed and Scaling Breaks

Pilots work because a small team compensates for every gap the agent can't handle. Scale via Agents and technology removes that team. What's left is a workflow designed for humans, now being executed by software with zero tolerance for ambiguity.

Agents don't adapt to broken handoffs. They don't infer ownership when it's unclear. All they do is follow the process as defined.
If the process is being held together by informal knowledge and human workarounds, the agent will expose every seam.

About 90% of the function-specific AI use cases that hold real transformative potential are still stuck in pilot, according to McKinsey. The problem is process and workflows, not technology. High-performing AI companies are roughly three times more likely to redesign workflows from scratch rather than layer agents onto what already exists. The redesign is where the real value lives.

What Workflow Redesign Actually Looks Like

Redesigning a workflow for agents means answering four questions at every stage of the process.

Question 1: Which steps can an agent fully own?

These are tasks with clear inputs, defined outputs, and minimal need for contextual judgment. Data extraction. Standardized formatting. Pulling records from structured sources. If the step can be described as a contract (this input produces this output, within these constraints), an agent can own it.

Question 2: Which steps require a human decision point?

Anywhere the process involves evaluating trade-offs, exercising risk tolerance, or making a call that depends on relationships or institutional context. These steps don't disappear when agents arrive. They become more visible, because the agent will stop and wait rather than guess.

Question 3: Where does the agent hand back?

The handoff points matter more than most teams realize. A poorly defined handoff creates the same ambiguity problem that broke the original workflow. Every transition between agent and human needs an explicit output contract. The agent delivers a specific artifact, in a specific format, with a clear expectation for what the human does next. Vague handoffs like "the agent prepares a draft for review" just move the ambiguity to a different part of the chain.

Question 4: What does the output contract look like at each stage?

This is where most redesigns fail quietly. Teams define what the agent does but skip defining what "done" looks like at each step. Without an output contract, downstream steps inherit uncertainty, and the compounding effect makes the whole workflow fragile.

This Is an Org Design Decision

Most conversations about AI agents stay in the technology lane. Which model, which framework, which vendor.

But deploying an agent into an existing workflow is an organizational design decision. You're changing who does what, where decisions get made, and what information flows where. That makes it a structural change to how your operation runs, and it deserves the same rigor you'd apply to any reorg.

Skipping the redesign means the agent will faithfully execute a process that was already broken. It will do it faster, at scale, and with none of the informal corrections that made it barely work before. Every workaround your team normalized over the years becomes a failure point. Every undocumented decision becomes a gap in the chain.

The companies pulling real value from agents share one thing in common. They were willing to look at a workflow that "works fine" and admit it only works because humans have been compensating for design flaws the org stopped noticing years ago.

…

Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.

→ Follow him on LinkedIn to catch his latest thoughts.

→ Subscribe to his free Substack for in-depth articles delivered straight to your inbox.

→ Watch the live session to see how leaders in highly regulated industries leverage AI to cut manual work and drive ROI.

Why Engineering-Led AI and Agent Initiatives Collapse in Production

Nick Talwar — Tue, 17 Mar 2026 11:40:35 +0000

The staffing and governance gaps that turn working demos into unmaintainable systems

Your engineering team just showed off a new AI feature, and everyone left the room feeling good about the future of the initiative.

But fast forward three months and the system is crashing twice a week. The team is spending weeks trying to reproduce bugs that only appear in production.

In my time as a fractional CTO serving AI-first organizations, I’ve noticed that many companies structure AI projects the same way they structure any other software build. Leadership sets a roadmap, hands it to engineering, and expects execution to follow the usual patterns.

However, the underlying assumption here is that building intelligent systems follows the same rules as building deterministic ones. This assumption kills most AI initiatives within six months of launch.

The Talent Gap Shows Up Too Late

Machine learning systems break three key assumptions:

Predictable behavior
— A model that returns one answer today might return a different answer tomorrow given identical input.
Testable edge cases
— Edge cases don’t come from a finite list of scenarios you can test against. They emerge from novel combinations of features your training data never represented.
Debuggable logic
— When something fails, you can’t just step through the code to find the bug because the decision logic was learned through statistical optimization, not explicitly programmed.

Your engineering team wasn’t hired to handle probabilistic systems. They won’t naturally catch biased training data, misleading accuracy metrics, or model architectures that can’t explain their predictions. That requires ML expertise.

These aren’t skills you can pick up by reading documentation. They come from building and breaking enough ML systems to recognize patterns that lead to failure.

All too often, teams don’t realize they need these skills until it’s too late. By that time, you’re hiring someone to audit months of work and explain which architectural decisions need to be unwound.

Senior ML engineers know which approaches create technical debt you can’t maintain, which data quality problems cause drift, and which evaluation strategies mislead you during development. They catch these issues before roadmaps lock and budgets get allocated, not after engineering has already committed to the wrong direction.

Demos That Look Great Until Production

Demos operate in carefully controlled environments. The team selects clean input data, constrains the problem space to tested scenarios, and tunes prompts until the output looks impressive.

Under these conditions, AI and Agentic systems seem remarkably capable.

Production removes every safety rail. Real users submit malformed inputs and unexpected data formats. Your data pipelines fail intermittently for reasons that don’t show up in logs. Third-party APIs change their response formats without warning. Models encounter distribution shifts (patterns in the data that differ fundamentally from training data) and produce outputs ranging from subtly wrong to completely nonsensical.

Faced with these issues, an inexperienced engineering team will add retry logic, improve logging, and write better error handling. These help at the margins, but won’t fix what the team doesn’t understand.

Without instrumentation built specifically for model behavior, you’re stuck just treating symptoms. The system logs show normal operation. The model is still running. But somewhere between input and output, quality degraded in ways you never instrumented for.

This is where the lack of ML expertise during architecture becomes expensive. ML engineers build observability into the system from the start because they know models behave unpredictably in production. They instrument confidence thresholds, track prediction distributions, monitor for data drift, and create alerts when model behavior deviates from expected patterns.

Without that foundation, you’re trying to add monitoring for problems you don’t fully understand while simultaneously keeping a broken system running.

What Actually Needs to Change

The very first thing teams should do is bring in a senior ML or data science lead before finalizing the roadmap. You need ML expertise in decision-making before commitments happen, not after engineering has spent two months building in the wrong direction.

Build your operating model around daily collaboration between ML and engineering, not sequential handoffs. The traditional approach where product writes specifications, engineering builds features, and ML practitioners “add intelligence” creates silos that guarantee failure. ML engineers need to work directly with the people building data pipelines, API interfaces, and monitoring systems. These components depend on each other in ways that don’t map to separate work streams.

Establish governance before launch, not after the first incident. Define explicit boundaries: which predictions execute automatically, which require human review, and which should fail safely rather than guess. Implement monitoring that tracks model behavior, confidence score distributions, and output quality trends over time. Create clear escalation paths so when something breaks (and it will) there’s an obvious owner who can diagnose root cause and implement fixes.

This feels like overhead until you ship without it and realize nobody can answer basic questions about system behavior.

Build Systems That Actually Work

Team composition should match the problem:

ML engineers bring expertise in navigating probabilistic systems and understanding where models break.

Software engineers bring discipline around building maintainable infrastructure that operates at scale.

Product brings judgment about where automation creates value and where it introduces unacceptable risk.

All three perspectives need equal weight in planning. Companies that understand this stop launching impressive demos that collapse under real-world load. They build reliable systems that work consistently because they planned for production complexity from day one.

Get the team structure, governance, and collaboration patterns right, and technical challenges become tractable. Skip these foundational changes, and engineering will keep building systems that work beautifully until the moment they encounter reality.

…

Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.

→ Follow him on LinkedIn to catch his latest thoughts.

→ Subscribe to his free Substack for in-depth articles delivered straight to your inbox.

→ Watch the live session to see how leaders in highly regulated industries leverage AI to cut manual work and drive ROI.

The #1 Reason Agentic AI Fails in Production

Nick Talwar — Tue, 10 Mar 2026 11:52:03 +0000

What happens when you let the LLM make every decision in Agentic AI use cases (and how to fix it)

A few months ago, I watched a Series B startup demo their “production-ready” Agentic AI system. In testing, it worked just fine. But when they gave it real users and edge cases started appearing, the behavior became unpredictable.

The issue was architectural: they’d given the LLM complete autonomy over execution decisions, and LLMs simply aren’t built to provide deterministic control at that level.

Gartner predicts that over 40% of Agentic AI projects will fail to reach production by 2027. The difference between systems that scale reliably and those that collapse under real-world conditions comes down to whether you separate reasoning from execution.

Where Failures Actually Originate

The latest LLMs demonstrate remarkable reasoning capabilities. They can break down complex tasks, weigh tradeoffs, and generate sophisticated action plans. The problem emerges when organizations confuse reasoning capability with execution reliability.

LLMs are probabilistic pattern matchers trained on text. These characteristics propagate to Agentic AI systems built on top of LLMs. They excel at understanding context and generating plausible responses. But they struggle with deterministic execution, maintaining consistent behavior across edge cases, and guaranteeing the same output given similar inputs. Even when they appear to be well understood during pre-production testing and simulation.

Zenity Labs found that classifiers fail when inputs take unexpected paths through activation space. The classifier works perfectly on inputs it recognizes, but novel paths (even semantically similar ones) can produce completely different classifications. The same dynamic applies to Agentic AI: systems trained and tested on known scenarios encounter unfamiliar patterns in production, and their responses become unpredictable.

When you let the LLM make execution decisions directly, you’re betting that production will only present scenarios the model has learned to handle reliably. That bet fails more often than teams expect.

Full Autonomy Creates Unpredictability

In production environments, Agents don’t receive clean, well-formatted inputs. They encounter ambiguity, partial information, conflicting signals, and edge cases that fall outside training distributions.

Consider an Agent tasked with processing refund requests. In testing, requests follow predictable patterns. In production, you get:

Requests that qualify for refunds but use non-standard phrasing
Borderline cases where policy interpretation matters
Situations requiring escalation that don’t match trained escalation triggers

Inputs that combine multiple issues in ways the model hasn’t seen
When the Agent has full autonomy, it must decide in real-time which action to take. Small variations in input phrasing can trigger entirely different action sequences. Run the same ambiguous request twice, and you might get different outcomes. This happens not because the model is malfunctioning, but because probabilistic systems don’t guarantee determinism.

This behavior compounds across interactions. An Agent processing hundreds or thousands of decisions daily will inevitably encounter scenarios that push it outside reliable operating ranges. Without external controls, there’s no mechanism to catch these situations before they produce incorrect actions.

The Control Layer Solution

The Control Layer architectural fix separates what LLMs do well (reasoning) from what they do poorly (deterministic execution).

In this model:

The Agent analyzes the situation and proposes an action
A control layer validates whether that action is permitted
Only validated actions execute

The control layer uses rule-based logic that encodes business constraints, compliance requirements, and operational boundaries. When the Agent proposes an action, the control layer checks:

Does this action fall within permitted operations?
Do the action parameters meet safety constraints?
Are required conditions satisfied?
Does the user context allow this operation?

If validation passes, the action executes. If not, the Agent receives feedback and can propose an alternative. Taking time to address these questions as a team, distill it into requirements, and then work with engineering to distill them into a Control Layer architecture is a core mitigation strategy for these business risks.

This architecture maintains the Agent’s flexibility while ensuring predictable boundaries. The Agent can still reason about complex scenarios and adapt to novel situations. The control layer ensures that adaptation happens within defined limits.

The Right Level of Control

Building systems that consistently do the right things matters more than maximizing autonomy.

Control layers define boundaries that let Agents operate confidently within them. Inside those boundaries, Agents can be remarkably flexible, adapting to novel scenarios and learning from outcomes. The boundaries simply ensure that adaptation doesn’t violate business requirements or create unpredictable behavior. It also gives you a backstop to monitor and close feedback loops, slowly improving the system over time so less escalations occur.

Organizations that skip this step typically discover the need for controls after production failures. By then, retrofitting governance becomes significantly harder than building it from the start (akin to putting a genie back in a bottle).

The systems that succeed in production share a common architecture: they separate reasoning from execution, maintain clear decision boundaries, and enforce validation before actions reach production systems. That architectural choice (more than model selection, training approach, or testing strategy) determines whether Agentic AI delivers predictable value or unpredictable failures.

.…

Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.

→ Follow him on LinkedIn to catch his latest thoughts.

→ Subscribe to his free Substack for in-depth articles delivered straight to your inbox.

→ Watch the live session to see how leaders in highly regulated industries leverage AI to cut manual work and drive ROI.

8 Core Constraints for Building Production-Grade AI Agents

Nick Talwar — Sun, 01 Mar 2026 19:11:31 +0000

The engineering requirements most teams ignore

Most AI agent implementations fail between prototype and production. Teams focus on conversational fluency and assume the LLM (underneath each Agent) handles complexity. Then they deploy, and realize the system wasn’t built to run reliably.

Agents are stateful, tool-orchestrating systems that operate across multiple services and failure domains. They require explicit architectural constraints at every layer, from how state persists between turns to how tools enforce security boundaries.

This list covers the eight foundational constraints required for agents to run reliably in production environments where observability, recoverability, and maintainability matter more than demo magic.

1. Explicit State Management Architecture

Agents maintain context across multi-turn workflows, often spanning minutes or hours. State management determines whether that context survives failures, supports concurrent sessions, or creates race conditions that corrupt data.

Production agents require persistent state stores with transactional semantics. In-memory state works for development but disappears on restart. External stores like Redis, Postgres, or vector databases provide durability. The architecture must define checkpoint boundaries where state snapshots are persisted, enabling recovery from interruptions or system crashes without losing workflow progress.

Agents handling multiple users simultaneously need session isolation to prevent cross-contamination. The state schema must version transitions to support rollback when agents make incorrect decisions that require human override.

2. Deterministic Tool Interface Contracts

Tool contracts must define exact input schemas, output formats, and failure modes. JSON schemas with strict type validation prevent the agent from passing malformed parameters. Return values need consistent structure, whether success or error, so the agent’s reasoning layer can parse results reliably. Omitting error handling creates black holes where tool failures cascade into hallucinatory responses instead of graceful degradation.

Tool descriptions matter more than most teams assume. The agent uses these descriptions to decide when and how to invoke each tool. Vague descriptions produce incorrect tool selection. Precise descriptions that include constraints, prerequisites, and side effects guide the agent toward correct behavior. For example, a database query tool should specify read-only vs write permissions, maximum result set size, and timeout behavior.

Idempotency becomes critical for tools that modify state. If the agent retries a failed API call, the tool should handle duplicate requests without double-charging, double-booking, or creating duplicate records. Either implement idempotency keys at the tool layer or design tools to check state before executing write operations.

3. Testable Prompt Design and Versioning

Prompts are code. They define agent behavior, and like code, they change frequently. Without versioning and testing, prompt updates break production agents in ways teams discover only through user complaints.

Each deployment should reference a specific prompt version with rollback capability. Changes should go through diff reviews where teams evaluate how modified instructions affect agent reasoning. Semantic versioning applies here as well: minor tweaks get patch versions, instruction changes get minor versions, and persona overhauls get major versions.

Testing prompts requires adversarial scenarios beyond happy paths. Agents need guardrails against prompt injection where user input attempts to override system instructions. Test cases should include malformed inputs, edge cases that expose reasoning gaps, and scenarios where the agent should refuse to act. Evaluation frameworks that score prompt versions against test suites enable objective comparison before deployment.

Prompt complexity compounds maintenance burden. Long system prompts with dozens of edge case instructions become brittle and contradictory. Factor complex prompts into modular components where base instructions handle general behavior and tool-specific prompts augment reasoning for particular contexts. This reduces prompt debugging from parsing 5000-token blocks to isolating which module broke.

4. Scoped Memory Architectures with Retention Policies

Memory determines whether agents provide personalized, context-aware responses or repeat themselves like stateless chatbots. But unmanaged memory becomes a liability where agents over-index on outdated information or leak sensitive data across sessions.

Three scopes matter here. User-level memory stores preferences and historical context specific to an individual. Session-level memory handles current conversation state that should expire after task completion. System-level memory tracks operational metadata like feature flags or configuration changes affecting all agents. Mixing these scopes is where things break. Privacy violations when session data bleeds into system memory, performance issues when user context loads globally.

None of this works without retention policies. Conversation history might keep the last 50 turns with automatic summarization of older content. Personal preferences persist indefinitely but should support deletion for compliance. Skip this step and memory stores grow linearly with usage until queries slow to a crawl. Every piece of stored memory needs a defined lifespan or an explicit reason to persist.

Then there’s the retrieval problem, and it’s the one most teams underestimate. When an agent has thousands of past interactions, pulling all of them for every query tanks both latency and relevance. Semantic search over embedded memories solves this by surfacing only what’s contextually useful. Layer in ranking by recency, relevance, or explicit user priority, and agents start behaving less like databases and more like colleagues who actually remember what matters.

5. Comprehensive Observability and Tracing

Production agents fail in ways demos never encounter. Without observability, debugging becomes guesswork where teams reproduce issues locally but can’t diagnose production failures.

Distributed tracing captures the full execution path. Each agent decision, tool call, and LLM invocation becomes a span with timing data, inputs, outputs, and metadata. Nested spans show hierarchical relationships where a high-level task decomposes into subtasks. This visibility turns opaque failures into clear sequences showing exactly where and why the agent diverged from expected behavior.

Metrics track operational health. Token usage per request prevents runaway costs. Latency distribution identifies slow operations that degrade user experience. Error rates by tool or reasoning step highlight specific failure modes. These metrics feed into dashboards where teams monitor production agents and set alerts for anomalies.

Logging complements tracing with semantic events. When an agent makes a decision, log the reasoning steps and confidence scores. When a tool call fails, log the error and the agent’s recovery strategy. Structured logs with consistent schemas enable aggregation and analysis across thousands of agent sessions, revealing patterns that individual traces miss.

6. Guardrails and Safety Boundaries

Agents with unrestricted access to tools become security liabilities. Guardrails enforce what agents can and cannot do, preventing both accidental misuse and malicious exploitation.

Input validation happens before reasoning. User prompts should pass through filters that detect prompt injection attempts, personally identifiable information, or requests that violate usage policies. Agents should never receive raw, unvalidated input directly from external sources. Preprocessing layers sanitize inputs and reject requests that exceed safety thresholds.

Output validation prevents harmful responses. Even when reasoning appears sound, agent outputs should go through guardrails checking for toxicity, bias, hallucinated facts, or leaked secrets. Automated checks combined with sample-based human review catch issues before users encounter them.

An agent should only invoke tools necessary for its designated tasks. Role-based access control maps agent roles to permitted tool subsets. For example, a customer support agent might query databases but never write to them. An internal automation agent might trigger workflows but never access customer data. Enforcing these boundaries at the orchestration layer prevents privilege escalation.

7. Error Handling and Graceful Degradation

How an agent handles failures determines whether it recovers gracefully or collapses into unusable states.

Retry logic with exponential backoff handles transient failures. If a tool call fails with a 503 error, the agent should retry after a delay rather than immediately halting. But retries need circuit breakers to prevent cascading failures where repeated attempts overload already struggling services. After consecutive failures, the circuit opens and the agent switches to degraded mode.

Fallback strategies maintain functionality when primary paths fail. If real-time data retrieval fails, the agent can fall back to cached data with appropriate disclaimers about staleness. If the preferred LLM provider is unavailable, routing to an alternative model allows continued operation with possibly reduced quality. Explicitly designed fallbacks prevent complete service outages.

When automated recovery fails, agents should recognize their limitations and request human intervention. This requires defining escalation triggers based on confidence scores, repeated failures, or task criticality. Clear handoff protocols ensure humans receive sufficient context to take over without starting from scratch.

8. Security Controls for Tool Execution

Tools give agents power to act on external systems. Without security controls, compromised agents or malicious inputs can cause real damage.

Authentication and authorization apply to every tool invocation. Agents should authenticate to tools using credentials scoped to specific operations. OAuth tokens, API keys, or mutual TLS certificates ensure only authorized agents access sensitive resources. Credentials should never appear in prompts or logs, stored instead in secure vaults with automatic rotation.

Data validation prevents injection attacks. When agents construct SQL queries, API requests, or shell commands, parameterized inputs prevent injection. Never interpolate user input directly into executable statements. Sanitization layers validate data types, ranges, and formats before tools process them.

Audit trails track every tool execution. Who invoked which tool, with what parameters, at what time, and with what result should be immutably logged. These audit logs support security investigations, compliance requirements, and forensic analysis when things go wrong. Retention policies must balance storage costs against regulatory and operational needs.

Rate limiting protects against abuse. Agents might loop on tool calls or malicious inputs might trigger excessive API usage. Per-agent, per-tool, and per-user rate limits prevent runaway resource consumption. Adaptive limits that adjust based on historical behavior provide flexibility while maintaining safety boundaries.

Constraint-First Design as Production Requirement

Production-grade AI agents require constraint-first design. The conversational interface obscures the fact that these systems persist state, orchestrate tools, and make decisions affecting real operations. Each constraint in this list addresses a failure mode that becomes evident only after deployment, when agents face concurrent users, degraded services, and adversarial inputs.

These constraints interconnect. State management enables graceful error handling through checkpointing. Observability depends on deterministic tool interfaces that produce consistent, traceable outputs. Security controls layer on top of explicit memory scopes that prevent cross-session contamination. The architecture succeeds when these constraints compose into systems that handle both expected operations and the inevitable failures production environments create.

…

Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.

→ Follow him on LinkedIn to catch his latest thoughts.

→ Subscribe to his free Substack for in-depth articles delivered straight to your inbox.

→ Watch the live session to see how leaders in highly regulated industries leverage AI to cut manual work and drive ROI.

The Hybrid AI Model Framework: Own What Matters, Rent What Doesn't

Nick Talwar — Fri, 20 Feb 2026 14:17:42 +0000

A two-layer architecture that treats enterprise data like a true asset

LLMs come with fundamental operational and security-related problems:
They hallucinate

They don't understand your specific business context without extensive prompt engineering

Once your data enters external systems, monitoring who accesses it becomes extremely difficult

A hybrid AI model helps to combat these issues. Instead of retrofitting security onto external systems, you build with two distinct layers from the start. You run a proprietary core trained on your fragmented internal data. You use generalized LLMs as utilities for non-sensitive tasks.

Different problems require different tools, and your most valuable data deserves more than API-level protection.

How a Hybrid Model Works

A hybrid setup operates with two distinct layers, each designed for different types of work.

The Core Proprietary Model handles everything that requires institutional knowledge or contains sensitive information. This layer gets trained or fine-tuned specifically on your internal data. The fragmented information sitting across databases, documentation systems, and tribal knowledge that actually runs your business. You deploy it privately (air-gapped, on-premises, or in tightly controlled infrastructure). You own it, govern it, version it.

The Generalized LLM Layer functions as a utility, similar to electricity or cloud compute. Use it for broad reasoning tasks, general drafting, summarization, anything that doesn't touch sensitive context.

Regulated customer data, competitive intelligence, and process IP stay in the proprietary core. General business tasks that could happen anywhere go to the utility layer.

Why This Works

It Eliminates Prompt Engineering Overhead

When your core model already understands domain-specific terminology, business rules, and institutional patterns, the prompt complexity drops. You stop spending cycles explaining your context in every interaction.
In my work with companies moving domain-specific work to fine-tuned internal models, I've seen prompt engineering overhead drop by 50-60%. The model knows product SKUs, understands compliance requirements, recognizes org structure. Questions that would require three paragraphs of context setup with ChatGPT work with a single sentence.

It Turns Fragmented Data Into an Asset

Fine-tuning a model on this distributed knowledge creates something actually useful. A unified intelligence layer that has ingested and made sense of information across silos. The model becomes a practical interface to knowledge that was previously locked away.

It Preserves Privacy Without Killing Usability

The user experience can look nearly identical to ChatGPT. What changes is what sits behind that interface.

The sensitive operations happen in infrastructure you control:

Customer PII never touches OpenAI's servers
Competitive analysis stays internal
Compliance teams can audit exactly what data moves where

Once data enters a big tech system, monitoring who accesses it becomes extremely difficult. Current privacy regulations create genuine liability when you can't track data lineage.

It Reduces Black-Box Provider Risk

The hybrid model limits exposure by keeping your most sensitive information completely separate from external systems. You're not trusting a third party to respect your anonymization or to maintain proper access controls. The data simply never leaves your environment.

When you own the core, you control the governance model, the retention policies, the access logs. When you rent utilities, you're only exposing information you'd be comfortable seeing anywhere.

When to Own, When to Rent

The decision framework comes down to three questions.

Does this task require institutional knowledge?

If the answer depends on understanding your specific processes, products, or customer context, it belongs in the proprietary core. If any competent professional could handle it with general knowledge, it can run through the utility layer.

What's the sensitivity level?

Regulated data, competitive intelligence, unreleased product details all stay internal. General business writing, research summaries, basic analysis can use external LLMs.

What's the cost of being wrong?

If a hallucination or data leak creates regulatory exposure, reputational damage, or competitive harm, you need the control that comes with ownership. If mistakes are cheap to catch and fix, utility models work fine.

Most enterprises find that 20-30% of their AI workload truly requires the proprietary core. The rest can run on general utilities, where you benefit from continuous model improvements without the maintenance burden.

Building for the Long Term

The hybrid approach requires upfront investment. You need to train or fine-tune models, set up private deployment infrastructure, and establish data pipelines. But the payoff is control over your most sensitive operations and ownership of the intelligence you develop.

The risks of sending enterprise data through external systems are very real: data leakage, compliance violations, and loss of competitive intelligence are real outcomes that enterprises can't afford. The hybrid model eliminates these exposures by keeping sensitive work on infrastructure you control.

Once you're operational, your most frequent queries run at marginal cost. Every interaction with your proprietary model generates data you can use to improve it. The intelligence stays with you.

…
_Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.

Stop Funding Stanford Grads. Start Funding These AI Founders Instead

Nick Talwar — Tue, 17 Feb 2026 15:01:04 +0000

Why bootstrapped operators with 18 months of AI-first operations are beating polished pitch decks

The $40 billion OpenAI round made headlines. The $13 billion Anthropic raise dominated tech news. Yet small AI businesses you haven’t heard of are generating revenue with two-person teams and unit economics that would make a Series B CFO jealous.

Most started bootstrapped, figured out AI integration through necessity, and built operational efficiency that funded companies spend a year trying to achieve. They’re not in your pitch meetings because they’re serving customers, not raising capital. That’s what makes them worth finding.

These bootstrapped operators are valuable because of the operational maturity they’ve already built. The profile is specific, the advantages are real, and the sourcing approach requires a different strategy than waiting for warm intros.

Signals That Actually Matter

Revenue at this scale tells you three things that matter more than credentials:

1) They found product-market fit without burning millions to discover it. Someone generating $50K MRR with a two-person team validated the problem, built something people pay for, and figured out unit economics that work. The hard part is done. Capital scales what’s proven rather than funding the search for what works.

2) They already have distribution figured out. TikTok channels with engaged audiences. Content engines that drive consistent traffic. Communities in their vertical that trust them. They have channels that already work and capital makes them more effective.

3) They built capital-efficient operations by necessity. Bootstrappers automate everything possible because they have to. That discipline compounds when you add capital.

Vertical Focus Creates Actual Moats

Here’s where the real defensibility lies. The bootstrapped operator already owns a specific niche with proprietary data accumulated through serving real customers.

Help them go deeper in that vertical rather than pushing them horizontal. A healthcare billing tool with 18 months of claims data and AI models trained on actual adjudication patterns has a moat. A generic “AI business assistant” has nothing but hope that OpenAI won’t crush them next Tuesday.

Horizontal AI products face commoditization risk from foundation model providers. OpenAI, Anthropic, Microsoft, and Google are actively building horizontal capabilities. They have more capital, more compute, and faster iteration cycles than any startup. Competing there is choosing to run uphill into machine gun fire. On the other hand, vertical AI with proprietary data and tight context wins because foundation models don’t have access to that specific corpus.

Where to Look

The best AI and Agent founders are building businesses and talking to customers.

This creates a sourcing problem for VCs used to founders seeking them out. You’ll need to go find these operators rather than waiting for them to apply. Look at who’s building in public on social media, who has small but profitable SaaS businesses, who’s actually shipping AI features that customers pay for.

The signal you’re looking for is revenue combined with operational maturity. Someone generating $30K MRR with a two-person team has already solved the hardest problems: finding customers, building something people pay for, and making the economics work. Capital helps them scale what’s proven, not figure out if it works.

When you find these operators, the pitch might look different than what you’re used to. They’re not asking you to believe in a vision. They’re showing you a working business and asking for help growing it. The unit economics are there. The customer feedback is real. The operational playbook exists. Your job becomes easier because you’re funding execution, not theory.

…

Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.

→ Follow him on LinkedIn to catch his latest thoughts.

→ Subscribe to his free Substack for in-depth articles delivered straight to your inbox.

→ Watch the live session to see how leaders in highly regulated industries leverage AI to cut manual work and drive ROI.

How To Build an AI or Agent Business With a Moat

Nick Talwar — Fri, 13 Feb 2026 14:53:16 +0000

Why sustainable AI advantage comes from what you capture, not what you rent

The conversation around AI and Agentic defensibility has become oddly ideological. One camp insists you must own your models and run everything locally to avoid vendor lock-in. The other argues that APIs are the only economically rational choice and worrying about moats is premature optimization.

In my personal opinion, both camps miss the point.

The real question isn’t whether you use APIs or local models. It’s whether you’re designing systems that generate defensible advantages regardless of whose intelligence you’re using.

Foundation model capabilities are commoditizing rapidly. What doesn’t commoditize is the proprietary data your product generates through deeply embedded user workflows.

The Dependency Problem Isn’t What You Think

When people worry about API dependency, they usually frame it as vendor lock-in or cost exposure. Those concerns are valid, but in my opinion secondary.

The actual problem is subtler. When you build on rented intelligence without capturing high-signal data from your users’ workflows, you’re constructing a business on sand. Your competitors can access the same models, implement similar features, and match your outputs. There’s nothing structural keeping users with you.

Model commoditization accelerates this dynamic. For example, GPT-4 was a leap forward when it launched. Within months, multiple providers offered comparable capabilities. Claude Sonnet raised the bar again. The cycle continues.

If your competitive advantage depends on having access to a “better” model, you’re playing a game with a six-month time horizon.

This is why horizontal products built on foundation models alone are vulnerable. They’ll get subsumed by the model providers themselves or by anyone else with API access. But AI in a vertical niche, tied to proprietary data that the big companies can’t crawl, is a different story. It gets better with tighter context over time, compounding advantages that generic horizontal tools can’t replicate.

Start With APIs and Design for Moat, Using this Architecture

Here’s where pragmatism matters. Starting with APIs makes sense for almost everyone.

The unit economics are clear. Google GCP, Microsoft Azure, and Amazon AWS achieve economies of scale you cannot replicate early on. They’ve optimized inference costs, distributed infrastructure globally, and handle reliability at a level that would consume your entire engineering budget to approximate. For validation, iteration, and early growth, APIs are the rational choice.

The strategic error isn’t using APIs. Rather it’s treating them as permanent infrastructure without building anything that compounds independently.

Think about what happens when you route user requests through an API, get responses, and return results. You’ve delivered value, but you haven’t captured anything proprietary. You have server logs and usage metrics, but those don’t differentiate you. Your competitors can implement the same flow tomorrow.

Now consider a different architecture:

User makes request.
System captures structured context about the request.
Routes to API.
Gets response.
User provides feedback (implicit or explicit).
System logs the feedback alongside the original context.

Over time, you accumulate a dataset mapping contexts to outcomes, refined by actual user behavior.

This is data exhaust. It’s high-signal, structured data generated as a byproduct of delivering value. And unlike API access, it’s yours.

Three Mechanisms That Create Distance

Defensibility in AI and Agents comes from three interlocking mechanisms. You need all three working together.

Workflow integration means your AI is embedded in the actual workflow, handling steps that used to require manual effort. This is what it means to build AI-first and AI-native operations. The more deeply integrated, the higher the switching cost. Users aren’t just losing access to a tool. They’re losing a system they’ve built their process around.

I’ve seen companies build AI features that users love but treat as optional supplements. Those products get replicated easily.

Compare that to systems where the AI handles core workflow steps and generates structured outputs that feed into downstream processes. Pulling that out means rebuilding workflows, not just swapping tools.

Data exhaust generation requires intentional design. Not all product usage produces useful data. You need to capture context, actions, and outcomes in a format that improves future model performance or product decisions.

The best data exhaust comes from correction loops. User generates output through your system, edits or refines it, approves final version. You now have ground truth for that context. Do this across thousands of users and you have a training corpus competitors can’t access.

Feedback loops turn data exhaust into compounding advantage. Every user interaction generates data. That data improves model performance, prompt engineering, or product features. Better outputs increase usage. More usage generates more data. The cycle accelerates.

This is where time horizon becomes a moat. A competitor can replicate your current product, but they start the flywheel from zero. You’re already three thousand iterations ahead. The gap widens with each cycle.

Building the Feedback Loop That Compounds

The most defensible AI and Agentic businesses design feedback loops from day one, even when using third-party APIs for inference.

Here’s what this looks like in practice. A user submits a request through your product. Before sending to the API, you capture structured context: user role, task type, key parameters. The API returns a response. You present it to the user. The user takes action: accepts, modifies, rejects. You log the outcome alongside the original context.

Over time, you build a proprietary dataset mapping contexts to outcomes, weighted by user behavior. This dataset has immediate value (you can fine-tune models, improve prompts, optimize for user preferences) and compounding value (it grows with usage and becomes harder to replicate as it scales).

This approach works identically whether you’re using Anthropic’s API or running Llama locally. The moat comes from the data layer, not the inference layer.

Companies that implement these loops see the impact quickly. Within 6 months, you’ll have accumulated enough signal to meaningfully outperform competitors starting from zero. The efficiency gains compound.

But you have to be willing to roll up your sleeves and actually instrument these systems properly, not just subscribe to ChatGPT and call yourself AI-native.

What This Means for Builders

The strategic landscape for AI and Agentic businesses is clarifying. Model access is commoditizing. Speed to market still matters, but not as much as designing for defensibility early.

The right approach is API-first for validation and growth, coupled with deliberate architecture for data moat. Don’t avoid APIs because of misplaced concerns about dependency. Use them pragmatically. But design your product so it generates and captures high-signal data from embedded workflows.

Your moat isn’t the model you use. It’s what you capture while using it.

…

Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.

→ Follow him on LinkedIn to catch his latest thoughts.

→ Subscribe to his free Substack for in-depth articles delivered straight to your inbox.

→ Watch the live session to see how leaders in highly regulated industries leverage AI to cut manual work and drive ROI.