The Biggest Constraint Facing the MLOps World 2026 Committee, and What It Reveals About Evals (Pt. 2)

Evaluation and testing were the most frequently named constraints in our 2026 MLOps World Steering Committee survey.

Part 1 covered three patterns: incomplete metrics, why hallucination resists stable measurement, and why proxy strategies are the operating reality, not a fallback. If you’re responsible for production ML, platform, infra, applied ML, or agentic systems, Part 2 picks up where that left off.

Part 2 covers four takeaways: why grounding outperforms scoring, why explainability outputs can look convincing without being causal, why efficiency metrics can go green while the user experience gets worse, and why experienced practitioners who work through all of the above still converge on the same place.

Grounding works better than scoring

Scoring a response is not the same as verifying it.

The approaches that held up moved away from overall correctness labels and toward checking outputs against something known. One pattern: build a structured facts library from trusted documents, then validate agent responses against those facts directly. The result is not a score. It is an audit trail. Hallucination is not eliminated, but it becomes traceable.
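To make the pattern concrete, here is a minimal sketch, assuming a separate extraction step has already turned the agent's response into (subject, attribute, value) claims. The facts library, field names, and source documents below are illustrative placeholders, not anything discussed in the session.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical facts library built from trusted documents: each fact keeps a
# pointer back to its source so failures can be traced, not just counted.
FACTS = {
    ("refund_window", "days"): ("30", "policy_v4.pdf, section 2.1"),
    ("support_hours", "weekdays"): ("9am-6pm ET", "ops_handbook.md"),
}

@dataclass
class CheckResult:
    claim: tuple            # (subject, attribute, value) asserted by the agent
    status: str             # "supported", "contradicted", or "unverifiable"
    source: Optional[str]   # trusted document behind the fact, if any

def audit_response(claims):
    """Validate extracted claims against the facts library.

    Returns an audit trail (one result per claim) rather than a single score:
    a reviewer can see exactly which claim failed and against which source.
    """
    trail = []
    for subject, attribute, value in claims:
        known = FACTS.get((subject, attribute))
        if known is None:
            trail.append(CheckResult((subject, attribute, value), "unverifiable", None))
        elif known[0] == value:
            trail.append(CheckResult((subject, attribute, value), "supported", known[1]))
        else:
            trail.append(CheckResult((subject, attribute, value), "contradicted", known[1]))
    return trail

# Claims would come from an extraction step over the agent's response;
# they are hard-coded here to keep the sketch self-contained.
claims = [
    ("refund_window", "days", "30"),        # matches the facts library
    ("support_hours", "weekdays", "24/7"),  # contradicts it: traceable, not just "wrong"
]
for result in audit_response(claims):
    print(result.status, result.claim, result.source)
```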

The same principle applied to summarization. Most of a generated summary is usually fine; the problem is the one sentence where a misinterpretation gets injected. An overall score cannot locate it. Identifying which sentence is wrong and fixing that is more useful than rating the output as a whole.
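As a rough illustration of what per-sentence checking buys you, the sketch below splits a summary into sentences and flags the ones weakly supported by the source. Token overlap is standing in for a real entailment or fact check, and the threshold and example strings are invented.

```python
import re

def sentence_support(summary, source, threshold=0.5):
    """Flag which summary sentences are weakly supported by the source text.

    Token overlap is a crude stand-in for a real check; the point is that it
    runs per sentence, so a failure points at one sentence instead of
    lowering an overall score.
    """
    source_tokens = set(re.findall(r"\w+", source.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", summary.strip()):
        tokens = set(re.findall(r"\w+", sentence.lower()))
        if not tokens:
            continue
        overlap = len(tokens & source_tokens) / len(tokens)
        if overlap < threshold:
            flagged.append((sentence, round(overlap, 2)))
    return flagged  # empty list means every sentence cleared the bar

source = "The outage lasted two hours and affected the EU region only."
summary = "The outage lasted two hours. It affected all regions worldwide."
print(sentence_support(summary, source))
# Only the second sentence is flagged: the fix targets that sentence, not the whole summary.
```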

The consistent lesson: the point of failure matters more than the aggregate score.

Correlation is not explanation. Models are good at pretending.

Explainability outputs can look convincing without being causal.

Models produce reasoning traces and feature attributions that appear to explain their behaviour. When those features are tested through ablation (blanking out specific dimensions and checking whether the output changes), the explanation often falls apart.

Causality testing is the only way to build real trust in interpretability. Teams using sparse autoencoders to isolate which activation dimensions drive specific agent behaviours found that contrasted pairs (similar inputs requiring different tool selections) were the most effective way to learn what the model is actually relying on. The approach is promising. It is not yet task-agnostic and does not currently scale for real-time inference.

The practical test: if removing a feature does not change the output, you have correlation, not explanation.
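A minimal version of that test, using a toy stand-in for the real model, could look like the sketch below. The feature indices, values, and toy model are invented for illustration; the only point is the mechanism: zero a dimension, rerun, and see whether the output moves.

```python
import numpy as np

def ablation_test(model, features, candidate_dims, atol=1e-6):
    """Zero out each candidate dimension and check whether the output changes.

    Dimensions whose removal changes the output are causal candidates; the
    rest were only correlated with the attribution, not driving the behaviour.
    """
    baseline = model(features)
    causal, correlated = [], []
    for dim in candidate_dims:
        ablated = features.copy()
        ablated[dim] = 0.0
        if np.allclose(model(ablated), baseline, atol=atol):
            correlated.append(dim)   # output unchanged: explanation is not causal
        else:
            causal.append(dim)       # output changed: the model actually relies on it
    return causal, correlated

# Toy stand-in for the real model: only dimensions 0 and 2 affect the output,
# even if an attribution method claims dimension 5 is "important".
def toy_model(x):
    return float(2.0 * x[0] - 1.5 * x[2])

features = np.array([0.8, 0.1, 0.4, 0.9, 0.3, 0.7])
print(ablation_test(toy_model, features, candidate_dims=[0, 2, 5]))
# -> ([0, 2], [5]): dimension 5 looked important but was only correlated
```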

Speed and efficiency metrics can mask weak systems

A fast agent is not the same as a reliable one.

When testing agents across models, teams observed that some models skip steps, make odd tool calls, or jump straight to an answer without completing the expected reasoning loop. Some get the right answer anyway, through shortcuts rather than sound reasoning. Measuring only latency or token throughput does not surface this.
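One way to surface this is to audit the trace rather than time it. The sketch below assumes a trace is a list of tool-call and final-answer steps; the trace format and tool names are hypothetical.

```python
def audit_trace(trace, required_tools):
    """Check whether the agent completed the expected reasoning loop.

    `trace` is assumed to be a list of step dicts such as
    {"type": "tool_call", "tool": "retrieve_docs"} or {"type": "final_answer"}.
    Latency alone would not notice that a fast run skipped a required step.
    """
    called = [step["tool"] for step in trace if step["type"] == "tool_call"]
    missing = [tool for tool in required_tools if tool not in called]
    answered = any(step["type"] == "final_answer" for step in trace)
    return {
        "answered": answered,
        "tools_called": called,
        "skipped_steps": missing,                  # non-empty means a shortcut was taken
        "completed_loop": answered and not missing,
    }

# A fast run that jumps straight to an answer: low latency, green on a
# throughput dashboard, but the retrieval step never happened.
fast_but_shallow = [{"type": "final_answer"}]
print(audit_trace(fast_but_shallow, required_tools=["retrieve_docs", "cite_sources"]))
```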

The same gap shows up in deflection metrics. Fewer calls reaching a human agent looks like an efficiency gain, and it might be. But if outcome signals are never captured (was the issue resolved, did the user drop off, were they satisfied), the metric and reality can decouple silently. The dashboard goes green while the user experience gets worse.
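A small sketch of what pairing the two signals could look like, assuming session records with an escalation flag and an often-missing resolution field; the field names and numbers are illustrative, not real data.

```python
def deflection_vs_outcomes(sessions):
    """Report deflection alongside outcome signals instead of on its own.

    Each session dict is assumed to record whether it escalated to a human
    and, when the signal was captured at all, whether the issue was resolved.
    """
    total = len(sessions)
    deflected = [s for s in sessions if not s["escalated_to_human"]]
    resolved = [s for s in deflected if s.get("issue_resolved") is True]
    unknown = [s for s in deflected if s.get("issue_resolved") is None]
    return {
        "deflection_rate": len(deflected) / total,
        "resolved_rate_among_deflected": len(resolved) / max(len(deflected), 1),
        "deflected_with_unknown_outcome": len(unknown),  # where metric and reality can decouple
    }

sessions = [
    {"escalated_to_human": False, "issue_resolved": True},
    {"escalated_to_human": False, "issue_resolved": False},
    {"escalated_to_human": False, "issue_resolved": None},  # outcome never captured
    {"escalated_to_human": True,  "issue_resolved": True},
]
print(deflection_vs_outcomes(sessions))
# Deflection looks like 0.75, but only one of three deflected sessions is known to be resolved.
```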

The consistent lesson: efficiency proxies measure output volume, not outcome quality. Treating them as interchangeable is where the measurement fails.

When automated signals fail, the fallback is human

Multiple participants arrived at the same conclusion from different directions. When the metric cannot be trusted, talk directly to the people affected by the system’s outputs.

Every metric has a visible component and a hidden one. The hidden component is almost always tied to human behaviour, preferences, and context that dashboards do not capture. In regulated industries, where decisions carry downstream consequences, that hidden layer matters more.

Experienced practitioners can work through proxy strategies, explainability frameworks, and causal testing, and still converge on the same place. The real signal lives with the user. That convergence is not a failure of the frameworks. It is what the frameworks are pointing at.


These are four of the ten takeaways noted in our meeting. In the next part, we will discuss:

  • Domain expertise is an evaluation method, not just an input
  • Shipping under uncertainty is possible if the scope is controlled
  • Setting the right expectations matters as much as building the system

MLOps World | GenAI Summit is built for MLOps practitioners, infra leads, platform owners, and production ML teams running systems under real operating constraints. Discussions like this shape the program, and they shape the kinds of conversations that are only possible in an undiluted room of operators.

MLOps World | GenAI Summit 2026 · Nov 17–18 · Austin, TX

Have something to share on stage? If you’re working through these problems and want to bring your experience to the committee or the program, we’d like to hear from you:

[email protected]

PART 1
