
We asked the 2026 MLOps World Steering Committee one question: what’s the hardest problem your team is actually dealing with right now?
“How do you actually know your system is working well enough to stay in production?” came back more often than any other answer.
Not evals as an abstract concept.
The specific, operational problem of building evaluation infrastructure while simultaneously running the system it’s meant to assess, under real constraints, on real timelines, with stakeholders who don’t share a common definition of what good looks like.
The technical side of that problem is hard enough.
You’re measuring systems that don’t behave deterministically, where the same prompt under similar conditions produces outputs that are different yet each defensible.
The frameworks you have were built for systems that return the right answer or don’t.
They don’t have a native concept of distribution, drift, or the relationship between model behavior and the business outcome it’s supposed to serve.
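To make the contrast concrete, here’s a minimal sketch of what a distribution-aware check looks like, assuming a generic `generate` model call and a `score` function you supply; every name and threshold here (`pass_score`, `min_pass_rate`) is illustrative, not drawn from any real framework. Instead of asserting that one output equals one expected string, it samples the model repeatedly and gates on aggregate statistics:

```python
import random
import statistics
from typing import Callable

# A minimal sketch. Every name here is illustrative, not from any
# particular framework; substitute your own model client and scorer.

def eval_distribution(
    generate: Callable[[str], str],   # the model call under test
    score: Callable[[str], float],    # maps an output to a quality score in [0, 1]
    prompt: str,
    n: int = 20,                      # sample size per prompt
    pass_score: float = 0.7,          # per-output bar for "acceptable"
    min_pass_rate: float = 0.9,       # the aggregate gate stakeholders must sign off on
) -> dict:
    """Sample the model n times and gate on aggregate statistics,
    instead of the single exact-match assertion a unit test would use."""
    scores = [score(generate(prompt)) for _ in range(n)]
    pass_rate = sum(s >= pass_score for s in scores) / n
    return {
        "pass_rate": pass_rate,
        "mean_score": statistics.mean(scores),
        "stdev": statistics.pstdev(scores),  # the spread a pass/fail assert has no concept of
        "ok": pass_rate >= min_pass_rate,
    }

# Toy demo: a fake "model" whose outputs vary run to run.
if __name__ == "__main__":
    def fake_model(prompt: str) -> str:
        return random.choice(["good answer", "fine answer", "off-topic"])

    def keyword_score(output: str) -> float:
        return 1.0 if "answer" in output else 0.0

    print(eval_distribution(fake_model, keyword_score, "any prompt"))
```

The interesting line isn’t the sampling; it’s `min_pass_rate`. That one number encodes what “good enough to stay in production” means, which is exactly the part engineering can’t set alone.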
But what the committee kept raising was the organizational side.
Defining acceptable performance for a production LLM isn’t an engineering decision.
It touches regulatory exposure, customer trust, and business-critical outcomes, which means it requires sign-off from people who don’t share a language for risk.
Engineering teams can build the harness. They can’t unilaterally define what passes.
That sign-off is frequently what stalls deployment after the technical work is done.
The practitioners shaping this year’s program have been working through both sides of that problem. They’re not looking for frameworks.
They’re looking for honest accounts from teams who’ve made specific decisions under real constraints: what they measured, what they traded off, what they got wrong first.
If that’s work you’ve done, the 2026 Call for Speakers is open.
MLOps World | GenAI Summit · November 17–18, 2026 · Austin, TX