
The Data Problem No One Wants to Talk About

Most discussions about data in AI focus on quality. The harder problem is ownership: who controls the data, who is accountable when it is wrong, and what happens when two teams disagree about which version is right.

The problem is not that data is messy.

Every organization with data knows it is messy. That has been true for a long time, and most teams have found ways to work around it.

The real problem is that data is owned, and the ownership is almost never clean.

In most organizations, data does not exist as a single coherent asset.

It exists as a collection of systems, each built for a different purpose, owned by a different team, with different assumptions about what matters.

Those assumptions are invisible until you need to act on the data.

Then they surface everywhere.

A decision system that needs to combine inventory, pricing, and store-level signals will depend on several different systems. Each has its own definition of accuracy. Its own update cadence. Its own gaps and failure modes.

When something goes wrong, the question is not just "what is the right number?"

It is "which system is the source of truth?"

And often, there is no clear answer.

This is where most AI projects quietly slow down.

Not in modeling. In reconciliation.

What looks like a straightforward input becomes a negotiation.

Which version of this metric do we use?
Why does this number not match what finance reports?
Who is responsible for fixing it?
What happens if we proceed anyway?

These are not technical questions. They are organizational ones. And they take longer to resolve than most projects are designed to tolerate.

Even when data is technically available, it is often not operationally usable.

It arrives too late.
It is aggregated at the wrong level.
It is stored in a format that does not match the workflow.
It changes without notice.

The model can be correct. The system can still fail.

Because the data does not behave the way the decision requires.
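The gap between "technically available" and "operationally usable" can be made concrete. Below is a minimal sketch of the kind of checks a decision system might run before acting on a feed; every column name, threshold, and grain here is a hypothetical illustration, not a real schema:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical operational expectations for an inventory feed.
# The column names, freshness window, and grain are illustrative assumptions.
EXPECTED_COLUMNS = {"store_id", "sku", "on_hand", "as_of"}
MAX_AGE = timedelta(hours=6)          # the decision needs data fresher than this
EXPECTED_GRAIN = ("store_id", "sku")  # one row per store/SKU pair

def operationally_usable(rows):
    """Return a list of reasons the feed cannot be acted on (empty = usable)."""
    problems = []
    if not rows:
        return ["feed is empty"]
    # Schema drift: the format changed without notice.
    missing = EXPECTED_COLUMNS - set(rows[0])
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    # Freshness: data that arrives too late is available but not usable.
    newest = max(r["as_of"] for r in rows)
    if datetime.now(timezone.utc) - newest > MAX_AGE:
        problems.append(f"stale: newest record is {newest.isoformat()}")
    # Grain: aggregation at the wrong level shows up as duplicate keys.
    keys = [tuple(r[k] for k in EXPECTED_GRAIN) for r in rows]
    if len(keys) != len(set(keys)):
        problems.append("duplicate store/SKU rows: wrong aggregation level")
    return problems
```

Checks like these do not resolve who owns the feed; they only make the failure visible at the moment of use instead of after the decision.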

There is also a layer that is harder to see.

Data is not neutral.

It encodes how the organization understands its own work: what gets measured, what gets ignored, what gets simplified, what gets approximated.

A model trained on how decisions were reported will behave differently from a model trained on how decisions were actually made. In most organizations, those are not the same dataset. The reported version is cleaner, more consistent, and less true.

The model learns that version of the world. And it operates accordingly.

This is not an accuracy problem in the usual sense. The numbers are often correct. They just do not represent what they claim to represent.

Ownership surfaces again at the point of failure.

When data is wrong, who is responsible?

Not in theory. In practice.

Who gets notified when a number is off? Who has the authority to change the definition? Who is accountable for the consequences of acting on incorrect data?

If those answers are unclear, the system has no stable foundation.

The model ends up sitting on top of something that shifts over time, without a reliable mechanism to detect or correct it.

This is why data readiness is consistently overestimated.

Organizations do not lack data.

They lack agreement.

Agreement on definitions. Agreement on ownership. Agreement on what is good enough to act on.
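Teams that do reach these agreements often write them down. One common form is a data contract; the sketch below shows what recording all three agreements in one place might look like, with every field name and value being a hypothetical illustration rather than any standard:

```python
from dataclasses import dataclass

# A data contract is one way to make the three agreements explicit:
# definition, ownership, and the threshold for "good enough to act on".
# All names and values below are hypothetical illustrations.
@dataclass(frozen=True)
class DataContract:
    metric: str                # agreed definition: what the number means
    definition: str
    owner: str                 # agreed ownership: who can change it, who gets paged
    escalation_channel: str
    max_staleness_hours: int   # agreed threshold: when the number is still actionable
    max_null_rate: float

on_hand_contract = DataContract(
    metric="on_hand_units",
    definition="Sellable units physically in store, excluding reserved stock",
    owner="inventory-platform-team",
    escalation_channel="#inventory-data-alerts",
    max_staleness_hours=6,
    max_null_rate=0.01,
)
```

The artifact itself is trivial; the hard part is the negotiation it records.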

Without that agreement, the system cannot stabilize. Improvements to the model get absorbed by the instability underneath.

The conversation about data in AI is almost entirely about pipelines and quality.

Those are real problems. They are also the easier ones.

The harder problem is who owns the number, who is accountable when it is wrong, and what happens when two teams disagree about which version is right.

Those are not data engineering problems.

They are organizational problems that happen to manifest in data.

And they do not get solved by better tooling.