Evaluation and monitoring are at the heart of any reliable machine learning system. But finding a stable reference point, a reliable comparison baseline, or even a decent performance metric can be surprisingly difficult in a world beset by changing conditions, feedback loops, and shifting distributions. In this talk, we will look at some of the ways these conditions show up in more traditional settings like click-through prediction, and then see how they reappear in the emerging world of productionized LLMs and generative models.