Your “Simple” LLM Feature Isn’t Simple After Launch

Five production patterns from the first five issues of AI in Production Field Notes

If you own the ML platform, run infra, or ship LLM features that have to survive real traffic, you already know the punchline:

Most ML/GenAI systems don’t fail because the model is “bad.”
They fail because everything around the model gets stressed the moment users show up.

That’s why we launched the first five issues of AI in Production Field Notes: long-form writeups grounded in real production architectures, metrics, and decision frameworks. Not thought leadership. Not “AI takes.” Post-deployment notes.

This post is a short front-door summary: the five patterns we keep seeing, and what they imply for how you build.


Who this is for

  • ML platform owners
  • MLOps / infrastructure leads
  • Applied ML engineers shipping to production
  • Research engineers responsible for system reliability

If you’re still in demo-land (no latency budget, no access control, no incident response), bookmark this one for later.


Pattern 1: RAG doesn’t fail because embeddings are “bad”

RAG fails because retrieval becomes a systems problem.

In the notebook, RAG looks clean: chunk → embed → retrieve → generate.
In production, it turns into:

  • Latency budgets you didn’t plan for (p95, not average)
  • Cost creep every time someone says “just add more docs”
  • Metadata chaos (what version, what source, what scope, what ownership)
  • Permissions bolted on late (and then everything breaks)
  • Evaluation gaps that let regressions ship quietly

The reason this hurts is simple: retrieval is now part of your app’s critical path. You’re not “adding context.” You’re operating a distributed system that decides what the model is allowed to see, under time pressure.
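
To make that concrete, here is a minimal sketch of retrieval on the critical path: entitlements applied as a query-time filter, a latency budget enforced, and per-query cost logged. The `VectorStore` protocol, the filter syntax, and the budget number are illustrative assumptions, not any particular library’s API.

```python
import logging
import time
from typing import Protocol

logger = logging.getLogger("retrieval")

RETRIEVAL_BUDGET_MS = 250  # illustrative p95 budget for the retrieval step


class VectorStore(Protocol):
    def search(self, query: str, top_k: int, filter: dict) -> list[dict]: ...


def retrieve_context(
    query: str,
    user_id: str,
    allowed_scopes: list[str],
    store: VectorStore,
    top_k: int = 8,
) -> list[dict]:
    """Fetch only the chunks this caller is entitled to see, within a latency budget."""
    start = time.monotonic()

    # Entitlements are a query-time filter, not a post-hoc cleanup step.
    results = store.search(
        query=query,
        top_k=top_k,
        filter={"scope": {"$in": allowed_scopes}},  # filter syntax varies by store
    )

    elapsed_ms = (time.monotonic() - start) * 1000
    if elapsed_ms > RETRIEVAL_BUDGET_MS:
        # Over budget: degrade deliberately (fewer chunks, cached fallback)
        # instead of silently blowing the end-to-end p95.
        results = results[: top_k // 2]

    # Per-query logging is what makes "just add more docs" show up on a dashboard.
    logger.info("retrieval user=%s chunks=%d latency_ms=%.0f",
                user_id, len(results), elapsed_ms)
    return results
```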

Opinion: If your RAG system doesn’t have a real plan for entitlements, cost tracking, and retrieval eval, it’s not a production system. It’s a demo with a pager attached.


Pattern 2: Teams confuse agents with workflows

Agents are seductive because they feel like progress. They also make failure modes harder to see.

Here’s the practical rule we keep coming back to:

If you can solve it with a workflow, an agent is usually expensive overengineering.

A workflow is deterministic. You can reason about it. You can test it. You can budget it.
An agent adds loops, tool calls, dynamic step counts, and “it depends” everywhere, which is exactly what you don’t want when reliability and cost predictability matter.

That doesn’t mean “never use agents.” It means you earn agents when simpler approaches stop working:

  • Start with a strong prompt + examples
  • Then try sequential steps with validation gates
  • Only then consider agentic loops
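
To make the middle rung concrete, here is a minimal sketch of sequential steps with validation gates, assuming a placeholder `call_llm` client and deliberately boring checks. The point is that each step is deterministic enough to test and budget on its own.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for whatever model client you actually use."""
    raise NotImplementedError


def summarize_ticket(ticket_text: str) -> str:
    """Two LLM steps, each followed by a cheap deterministic gate."""
    # Step 1: extract the issue, then gate on an obvious sanity check.
    extraction = call_llm(f"Extract the customer's issue from:\n{ticket_text}")
    if len(extraction.strip()) < 10:
        raise ValueError("extraction gate failed: output too short to be real")

    # Step 2: summarize the validated extraction, then gate again.
    summary = call_llm(f"Summarize this issue in two sentences:\n{extraction}")
    if summary.count(".") > 3:
        raise ValueError("summary gate failed: too long for the downstream UI")

    return summary
```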

Opinion: Most agent failures aren’t “the agent isn’t smart enough.” They’re “we added agency when we needed control.”


Pattern 3: Drift isn’t an event. It’s Tuesday.

Traditional monitoring tells you servers are up. It doesn’t tell you your model is quietly becoming wrong.

In production, drift shows up as:

  • behavior shifts (users, fraudsters, markets, seasonality)
  • upstream changes (schema updates, instrumentation, pipelines)
  • label shifts that cascade through downstream systems

What separates mature teams isn’t “we detect drift.” It’s what happens next.

The Field Notes pattern here is that drift management becomes an ops loop:

  1. detect anomalies
  2. confirm drift vs noise
  3. classify severity
  4. diagnose likely root causes
  5. intervene safely (retrain, rollback, repair features, canary, escalate)
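
Steps 1–3 don’t need anything exotic. As one common example (not necessarily what any given team in the Field Notes uses), a Population Stability Index check per feature, with conventional thresholds, covers detect, confirm, and classify:

```python
import numpy as np


def population_stability_index(reference: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    """PSI between a reference window and a live window for one feature."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf            # catch out-of-range live values
    ref_frac = np.histogram(reference, edges)[0] / len(reference)
    live_frac = np.histogram(live, edges)[0] / len(live)
    ref_frac = np.clip(ref_frac, 1e-6, None)         # avoid log(0)
    live_frac = np.clip(live_frac, 1e-6, None)
    return float(np.sum((live_frac - ref_frac) * np.log(live_frac / ref_frac)))


def classify_drift(psi: float) -> str:
    # Thresholds are conventional starting points, not universal truths.
    if psi < 0.1:
        return "noise"       # step 2 outcome: not real drift, no action
    if psi < 0.25:
        return "watch"       # confirm on the next window before escalating
    return "intervene"       # hand off to steps 4-5: diagnose, then retrain/rollback/repair
```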

Opinion: If your pipeline can’t diagnose → intervene with guardrails, you’re not running ML. You’re running a permanent incident queue.


Pattern 4: “Single-call LLM apps” turn into orchestration systems

A lot of LLM products start as: “It’s just one call.”

Then the real world arrives:

  • malformed outputs
  • partial failures
  • timeouts
  • retries that amplify cost
  • edge cases that break the “happy path”
  • evaluation you can’t do manually anymore

So the “one call” becomes a system:

  • decomposition into steps that each handle a specific failure mode
  • validation inside the loop (not downstream)
  • retries that are targeted (not blind)
  • structured outputs (stop relying on JSON prompt hacks)
  • asynchronous orchestration so the whole job doesn’t stall on one chunk
  • bulk evaluation so you can change the system without guessing

This is what production looks like: not a bigger prompt, but an orchestra of small controls that keep the app reliable under load.
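
Two of those controls, structured outputs and targeted retries, fit in a few lines. Here is a hedged sketch using Pydantic for validation (one common choice, not the only one), with `call_llm` again standing in for your real client:

```python
from pydantic import BaseModel, ValidationError


class InvoiceFields(BaseModel):
    vendor: str
    total_usd: float
    due_date: str  # keep as a string here; parse and range-check downstream


def call_llm(prompt: str) -> str:
    """Placeholder client, assumed to return JSON text."""
    raise NotImplementedError


def extract_invoice(text: str, max_attempts: int = 3) -> InvoiceFields:
    prompt = f"Return JSON with vendor, total_usd, due_date for:\n{text}"
    last_error: ValidationError | None = None
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            # Validation happens inside the loop, so only this step retries,
            # not the whole pipeline.
            return InvoiceFields.model_validate_json(raw)
        except ValidationError as err:
            last_error = err
            # Targeted retry: feed the specific failure back instead of re-rolling blind.
            prompt = f"{prompt}\n\nPrevious output was invalid: {err.errors()[:2]}"
    raise RuntimeError(f"extraction failed after {max_attempts} attempts: {last_error}")
```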

Opinion: If you can’t test, measure, and retry each step independently, you don’t have an LLM app; you have a demo that occasionally works.


Pattern 5: Enterprise GenAI speed is constrained by governance + integration

In enterprise environments, “move fast” doesn’t fail because the model is slow.

It fails because:

  • access control can’t be an afterthought
  • auditability is required
  • data boundaries are political and technical
  • compliance rules change
  • integration with legacy systems is the actual delivery path

The Field Notes pattern: teams that ship repeatedly don’t treat governance as a blocker to “get through.” They treat it as part of the system design: embedded early, updated dynamically, and enforced consistently.
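
One small, hedged illustration of “embedded early”: every model call goes through an entitlement check and leaves an audit record by construction, rather than relying on callers to remember. `is_entitled` and the log fields are placeholders for whatever your identity and audit systems actually provide.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Callable

audit_log = logging.getLogger("genai.audit")


def is_entitled(user_id: str, resource: str) -> bool:
    """Placeholder for your real entitlement service."""
    raise NotImplementedError


def governed_call(user_id: str, resource: str, prompt: str,
                  model_fn: Callable[[str], str]) -> str:
    """Authorize first, audit always: governance lives in the call path itself."""
    if not is_entitled(user_id, resource):
        audit_log.warning(json.dumps({"user": user_id, "resource": resource,
                                      "action": "denied"}))
        raise PermissionError(f"{user_id} is not entitled to {resource}")

    response = model_fn(prompt)
    audit_log.info(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user_id,
        "resource": resource,
        "action": "completion",
        "prompt_chars": len(prompt),  # log sizes, not raw content, if data boundaries require it
    }))
    return response
```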

Opinion: You don’t “add governance later.” If you try, you’ll rebuild everything under pressure.


What to do next (a quick operator checklist)

If you’re building or inheriting one of these systems, here are five questions worth answering before you scale anything:

  1. What’s the p95 latency budget, and what happens when retrieval misses it?
  2. Where does cost get tracked and capped (per query, per user, per day)?
  3. Where do permissions and entitlements live, and how are they tested?
  4. What’s your eval loop (offline + online) for catching regressions fast?
  5. When drift hits, what’s the safe intervention path (rollback/canary/repair)?

If those questions are fuzzy, the model won’t save you.
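
For question 2, the smallest honest answer is often a per-user daily cap enforced before the call goes out. A sketch, assuming an in-process counter (in production you would back this with a shared store, and the budget number is a placeholder):

```python
from collections import defaultdict
from datetime import date

DAILY_BUDGET_USD = 5.00  # placeholder cap per user per day

_spend: dict[tuple[str, date], float] = defaultdict(float)


def charge(user_id: str, estimated_cost_usd: float) -> None:
    """Record spend and refuse the call once the user's daily cap is hit."""
    key = (user_id, date.today())
    if _spend[key] + estimated_cost_usd > DAILY_BUDGET_USD:
        raise RuntimeError(f"daily LLM budget exceeded for {user_id}")
    _spend[key] += estimated_cost_usd
```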


Read the full Field Notes

This post is the short version. The long version (architectures, metrics, decision frameworks) is in the first five issues of:

AI in Production Field Notes (Substack) 
