In production, the question is not whether the model is correct.
It is whether the system leads to better decisions.
Those are different questions. And most evaluation is designed for the first one.
Accuracy is measurable. It is testable against a held-out dataset, expressible as a number, easy to report. This is part of why it dominates the conversation about model performance.
But accuracy in a controlled environment tells you almost nothing about what happens in the real world.
Data changes. The objective shifts. The people using the output make judgment calls the model never saw. What held in evaluation breaks down in ways that are hard to predict and even harder to attribute.
The evaluation passed. The system still underperformed.
The real measure is whether decisions improved.
Consider a system that supports inventory decisions.
You can measure forecast accuracy, model error, mean absolute deviation.
But the meaningful question is whether the decisions actually got better.
Did stockouts decrease? Did overstock shrink? Did the team start trusting the output enough to act on it more quickly?
Those outcomes depend on more than the model. They depend on how the output was understood, which constraints were applied, and what happened in the steps between the prediction and the decision.
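To make the distinction concrete, here is a minimal sketch in Python, with hypothetical numbers and field names, contrasting a model-level metric (mean absolute deviation of the forecast) with decision-level metrics (stockout and overstock rates). The point is that the two can diverge: the forecasts below are close to actual demand, yet the decisions stocked out half the time.

```python
# Hypothetical daily records for one SKU: the model's forecast, actual
# demand, and the stocking decision the team actually made.
records = [
    {"forecast": 100, "actual": 104, "stocked": 95},   # understocked
    {"forecast": 80,  "actual": 78,  "stocked": 120},  # overstocked
    {"forecast": 60,  "actual": 61,  "stocked": 62},   # sound decision
    {"forecast": 90,  "actual": 95,  "stocked": 90},   # understocked
]

# Model-level metric: how close were the forecasts to actual demand?
mad = sum(abs(r["forecast"] - r["actual"]) for r in records) / len(records)

# Decision-level metrics: what happened to the business?
stockout_rate = sum(r["stocked"] < r["actual"] for r in records) / len(records)
overstock_rate = sum(r["stocked"] > 1.2 * r["actual"] for r in records) / len(records)

print(f"MAD: {mad:.1f} units")                 # the number in the eval report
print(f"Stockout rate: {stockout_rate:.0%}")   # the numbers that matter
print(f"Overstock rate: {overstock_rate:.0%}")
```

Here the forecast is accurate (MAD of 3 units against demand near 100), but the decisions made downstream of it stocked out 50% of the time. The evaluation report and the business outcome tell different stories.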
A model can be highly accurate and still fail this test.
If the output arrives in a format no one trusts, it will be ignored. If the recommendation conflicts with incentives no one has acknowledged, it will be overridden. If the system is right in aggregate but wrong in the cases that matter most, the damage is real regardless of the average.
Accuracy is a property of the model.
Impact is a property of the system.
They are not the same.
There is also a timing problem that is harder to solve than it appears.
Many of the effects of a decision only become visible much later.
A pricing change affects customer behavior over months. A supply chain decision compounds through a season. A planning recommendation shapes outcomes weeks after it was acted on.
By the time you observe the result, the context has already shifted. Other decisions have been made. External conditions have changed. Isolating what the system contributed becomes genuinely difficult.
This is not a modeling problem. It is a measurement problem.
And it affects more than evaluation. It affects trust.
When impact is hard to demonstrate, teams fall back on what they can show.
Usage rates. Model performance on historical data. Engagement. Metrics that are measurable and defensible, even when they are not the ones that matter.
This is not laziness. It is a rational response to measurement difficulty.
But it has consequences.
The metrics you can defend become the metrics you pursue. Over time, the system gets optimized for what is easy to report rather than what it was built to do. The gap between the nominal goal and the actual goal widens, quietly, in ways that are hard to see from inside the project.
The deeper problem is that decision quality is contextual in ways that are hard to encode.
What is "correct" depends on tradeoffs that were never fully articulated. A decision that looks suboptimal against a narrow metric may be exactly right given constraints the model did not see. A system that optimizes for a clean objective can create problems in adjacent areas that never appear in the evaluation.
Getting evaluation right means deciding in advance what good looks like. Not just technically, but in terms of the decisions the system is supposed to support.
That requires a clarity about problem definition that most projects do not have when they start.
None of this makes evaluation impossible.
But it does mean the right signals are messier than accuracy scores.
Are the decisions the system supports improving over time?
Is the output being used in the moments that matter?
Do the people using it know when to trust it and when to override it?
Are failure cases visible and recoverable?
These require longitudinal observation, not a held-out test set. They require honesty about what the system cannot do, not just what it can.
The difficulty of evaluation is one of the reasons production AI systems are harder to build than they appear.
Not because the measurement is technically complex.
Because the thing you are trying to measure, whether the system leads to better decisions, is slow to observe, easy to approximate with something that looks like measurement but isn't, and easy to defer until after launch.
Most teams never fully solve it.
The ones that come closest treat evaluation as a design problem from the beginning.
Not as a report you write at the end.