
Evaluation and testing were the most frequently named constraints in our 2026 MLOps World Steering Committee survey. Parts 1 and 2 covered patterns including incomplete metrics and proxy strategies, grounding over scoring, explainability failures, and why experienced practitioners working through all of it still converge on talking to the user. If you are responsible for production ML, platform, infra, applied ML, or agentic systems, this final part is for you.
It covers three takeaways from the session: why domain expertise is an evaluation method in its own right, how teams ship under uncertainty without shipping blindly, and why setting the right expectations with stakeholders matters as much as building the system.
Domain expertise is an evaluation method, not just an input
In production AI, domain experts serve a different function than they do in traditional ML. They are not just providing labels or training data. They are the evaluation layer.
Teams described using subject matter experts for periodic manual review and calibration. The question is not only whether the model is right, but whether a human expert with 20 years in the domain would give a different answer, and if so, why. That reframes evaluation from a model metric problem into a human judgment problem. It is harder to automate, but it is closer to what actually matters.
For tasks where ground truth is unstable, subjective, or context-dependent, expert agreement as the anchor is more reliable than an automated score that will inevitably drift.
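A minimal sketch of what that anchoring can look like, with hypothetical labels and a simple pairwise agreement rate (Cohen's kappa or Krippendorff's alpha are common alternatives): the model is judged against the band within which experts disagree with each other, not against a fixed automated score.

```python
from itertools import combinations

def agreement_rate(a, b):
    """Fraction of items on which two label sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Illustrative labels: two experts reviewing the same four cases.
expert_labels = {
    "expert_a": ["approve", "escalate", "approve", "reject"],
    "expert_b": ["approve", "escalate", "reject", "reject"],
}
model_labels = ["approve", "escalate", "approve", "reject"]

# The anchor: how often the experts agree with each other.
expert_ceiling = min(
    agreement_rate(x, y) for x, y in combinations(expert_labels.values(), 2)
)

# Judge the model against that ceiling, not against a perfect 1.0.
model_score = min(
    agreement_rate(model_labels, labels) for labels in expert_labels.values()
)

print(f"expert agreement ceiling: {expert_ceiling:.2f}")
print(f"model agreement with experts: {model_score:.2f}")
```

If the model sits inside the band of expert disagreement, pushing its automated score higher tells you little; the ceiling itself is the honest target.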
Shipping under uncertainty is possible if the scope is controlled
Part of not shipping blindly is knowing which signals you can trust, and explainability outputs can look convincing without being causal.
Models produce reasoning traces and feature attributions that appear to explain their behaviour. When those features are tested through ablation (blanking out specific dimensions and checking whether the output changes), the explanation often falls apart.
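A minimal sketch of that ablation check, using a toy two-layer linear model as a stand-in (the ToyModel class and its encode/decode split are illustrative assumptions, not any particular framework's API):

```python
import numpy as np

class ToyModel:
    """Stand-in for a model whose internal features we can ablate."""
    def __init__(self, n_features=8, seed=0):
        rng = np.random.default_rng(seed)
        self.w_in = rng.normal(size=(4, n_features))    # input -> features
        self.w_out = rng.normal(size=(n_features, 2))   # features -> output

    def encode(self, x):
        return x @ self.w_in                            # internal features

    def decode(self, h):
        return h @ self.w_out                           # output from features

    def predict(self, x):
        return self.decode(self.encode(x))

def ablation_changes_output(model, x, dims, atol=1e-6):
    """Blank the attributed feature dimensions and check for output change."""
    baseline = model.predict(x)
    hidden = model.encode(x)
    hidden[:, dims] = 0.0            # ablate the candidate dimensions
    ablated = model.decode(hidden)
    return not np.allclose(baseline, ablated, atol=atol)

model = ToyModel()
x = np.ones((1, 4))
# If this prints False, the attribution was correlation, not explanation.
print(ablation_changes_output(model, x, dims=[0, 3]))
```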
Causality testing is the only way to build real trust in interpretability. Teams using sparse autoencoders to isolate which activation dimensions drive specific agent behaviours found that contrasted pairs (similar inputs requiring different tool selections) were the most effective way to learn what the model is actually relying on. The approach is promising. It is not yet task-agnostic and does not currently scale for real-time inference.
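A hedged sketch of the contrasted-pairs idea, assuming you already have an SAE encoder over model activations (the encoder, inputs, and tool labels below are all hypothetical): diff the sparse activations of two near-identical inputs that should trigger different tool selections, and the largest differences are the candidate behaviour-driving dimensions.

```python
import numpy as np

def contrastive_feature_diff(sae_encode, input_a, input_b, top_k=5):
    """Rank sparse-autoencoder dimensions by how strongly they separate
    two near-identical inputs that require different tool selections."""
    z_a = sae_encode(input_a)        # sparse activations for input A
    z_b = sae_encode(input_b)        # sparse activations for input B
    diff = np.abs(z_a - z_b)
    return np.argsort(diff)[::-1][:top_k]

# Toy stand-in for an SAE encoder over model activations.
rng = np.random.default_rng(1)
proj = rng.normal(size=(16, 64))
sae_encode = lambda x: np.maximum(x @ proj, 0.0)   # sparse-ish via ReLU

act_a = rng.normal(size=16)      # activations for a "search the web" request
act_b = act_a.copy()
act_b[3] += 2.0                  # near-identical input nudged toward "run code"

print(contrastive_feature_diff(sae_encode, act_a, act_b))
```

Dimensions surfaced this way are candidates only; the ablation check above is what tests whether they are causal.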
The practical test: if removing a feature does not change the output, you have correlation, not explanation.
AI is a decision support tool. Treat it like one.
One observation in the session reframed how the expectation problem is usually discussed. Executives routinely make decisions with unclear metrics and act on judgment anyway. The expectation that AI systems should produce precise, trustworthy outputs when human decision-makers operate without them is itself a misalignment.
The practical implication: teams need to set expectations with clients and internal stakeholders before the system is wrong, not after. If the people consuming the output expect certainty, they will lose trust the first time it fails. If they expect informed guidance under uncertainty, the same system can be useful and sustainable.
The distinction is not in the system. It is in what stakeholders are told to expect from it.
This wraps up the three-part series from our first MLOps World Steering Committee session. The committee meets throughout the year, and the conversation continues to shape the 2026 program.
Why this matters for MLOps World
MLOps World | GenAI Summit is built for MLOps practitioners, infra leads, platform owners, and production ML teams running systems under real operating constraints. Discussions like this shape the program, and they shape the kinds of conversations that are only possible in an undiluted room of operators.
MLOps World | GenAI Summit 2026 · Nov 17–18 · Austin, TX