Why your quality data is probably not as clean as you think

Here's something nobody really warns you about when you start working with data: getting the data is not the hard part. The hard part is figuring out all the ways it went wrong before it landed on your desk.

This article is about exactly that: the common, frustrating, and often invisible ways data gets corrupted in quality and process analysis. And the honest truth is, it happens far more often than most people want to admit.

So Where Does Data Come From?

In most factories and production environments, data comes from four main places:

  1. Real-time sensors — machines and devices that collect measurements automatically while the process is running
  2. Lab tests — samples taken from production and tested separately, usually at a slower pace
  3. Historical records — old reports, specs, supplier documents, and process notes
  4. Expert opinion — what experienced people on the ground actually know and observe

Each of these comes with its own set of problems. Let's go through them one by one.

Real-Time Data — Too Much of a Good Thing

Picture a cement factory. Sensors inside the kiln measure temperature every single second, which works out to 86,400 readings per day (60 × 60 × 24) from just one sensor. Add sensors for pressure, speed, and fuel, and the volume becomes enormous very quickly.

Now here's the problem. All that data needs to be stored somewhere. And storage has limits. So what do many systems do? They simply delete old data and replace it with new data on a regular cycle. That temperature spike from three weeks ago — the one that might explain why today's batch of cement is crumbling — is gone. Permanently.
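To make that concrete, here is a minimal Python sketch (buffer size and readings invented for illustration) of a historian that keeps only the most recent N readings, the way many plant systems recycle their storage:

```python
from collections import deque

# Keep only the most recent RETENTION readings; older ones are
# dropped automatically, exactly like a rolling-deletion cycle.
RETENTION = 5  # tiny here for illustration; real systems keep days or weeks

buffer = deque(maxlen=RETENTION)

readings = [1450, 1620, 1455, 1452, 1449, 1451, 1453]  # 1620 C is the spike
for temp_c in readings:
    buffer.append(temp_c)

print(list(buffer))  # [1455, 1452, 1449, 1451, 1453] - the 1620 spike is gone
```

By the time anyone asks about the spike, the buffer has already overwritten it.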

Some systems try to save space by storing only averages and summaries instead of every individual reading. That might look fine in a weekly report. But if you're trying to investigate a defect, an average won't tell you that the kiln overheated for twelve minutes at 2 a.m. on a Tuesday. That detail matters — and it's lost.
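A short sketch shows why (one reading per minute and a 1500 °C limit are assumptions for illustration):

```python
# One hour of one-minute kiln temperatures, in degrees C.
normal = [1450] * 48   # 48 minutes of normal operation
spike  = [1580] * 12   # a 12-minute overheat above the 1500 C limit
hour   = normal + spike

print(f"hourly average: {sum(hour) / len(hour):.0f} C")  # 1476 C - looks fine
print(f"max reading:    {max(hour)} C")                  # 1580 C - over the limit
```

If only the 1476 °C average is archived, the twelve minutes over the limit leave no trace at all.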

In construction, imagine a project site where three different contractors are all logging data about concrete pours. One uses a spreadsheet. Another uses a phone app. The third hands in paper forms at the end of the week. When the site manager tries to put it all together, the result is a messy, patchy dataset full of gaps and contradictions. Getting a clear picture from that is genuinely difficult.
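Here is a sketch of what that reconciliation involves (field names, formats, and values are all invented): the same pour arrives in three shapes, and someone has to normalize them before anything can be compared.

```python
from datetime import date

# The same concrete pour, as logged by three different contractors.
spreadsheet_row = {"Pour Date": "2024-03-05", "Volume (m3)": "12.5", "Mix": "C30/37"}
app_record      = {"date": date(2024, 3, 5), "volume_m3": 12.5, "mix_class": "C30/37"}
paper_form      = {"date": "05/03/2024", "volume": "12,5 m3", "mix": "C30"}

def normalize_paper(rec):
    """Paper forms here use day/month/year and a decimal comma."""
    day, month, year = rec["date"].split("/")
    volume = float(rec["volume"].removesuffix(" m3").replace(",", "."))
    return {"date": date(int(year), int(month), int(day)),
            "volume_m3": volume, "mix_class": rec["mix"]}

print(normalize_paper(paper_form))
```

And even after the formats line up, "C30" versus "C30/37" is a contradiction no script can resolve; a person has to decide which one is right.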

Lab Data — Cleaner on the Surface, Messy Underneath

Lab testing feels reliable. You have trained technicians, proper equipment, and set procedures. But even here, things can quietly go wrong.

Take a steel factory that tests the strength of its rods by pulling samples off the line. Standard lab procedure says to throw out the highest and lowest results before calculating the average. Seems fair. But what if that lowest result wasn't a mistake? What if it was a genuinely weak rod that made it through production? Removing it doesn't fix the problem — it just hides it. And hidden problems in steel can become very serious problems on a construction site later on.
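A few invented numbers make the danger easy to see:

```python
# Tensile strengths in MPa; 310 is a genuinely weak rod, not a bad measurement.
strengths = [498, 502, 495, 310, 505, 499]

trimmed = sorted(strengths)[1:-1]   # drop lowest and highest, per lab procedure
print(sum(trimmed) / len(trimmed))  # 498.5 MPa - the batch looks healthy
print(min(strengths))               # 310 MPa - the real problem, now invisible
```

The trimmed average says the batch is fine. The raw minimum says it isn't.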

There's also the issue of destructive testing. When you crush a concrete sample to test its strength, that sample is finished. You can't go back and retest it. If the lab conditions weren't quite right that day — wrong humidity, different technician, slightly off procedure — you're stuck with whatever result you got. There's no second chance.

Old Records and Expert Knowledge — Useful but Hard to Tap Into

Think about a construction company that's been around for 30 years. Their oldest records are in paper files. Data from the 2000s sits in an outdated software system. Recent projects are on a cloud platform. Trying to combine all three to spot long-term patterns is a genuine headache. The formats don't match, the definitions have shifted, and some records were kept for legal reasons rather than analysis purposes.
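One small example of a shifted definition (the systems and numbers here are hypothetical): suppose the legacy system logged defects per 1,000 units while the cloud platform logs a percentage. Concatenating the two columns naively produces a trend that never existed.

```python
# Same metric, two eras, two definitions.
legacy = {"year": 2005, "defect_rate": 8.0}  # defects per 1,000 units
cloud  = {"year": 2023, "defect_rate": 0.9}  # already a percentage

def as_percent(record, per_thousand):
    rate = record["defect_rate"]
    return rate / 10 if per_thousand else rate

print(as_percent(legacy, per_thousand=True))   # 0.8 (%)
print(as_percent(cloud,  per_thousand=False))  # 0.9 (%)
# Without the conversion, 2005 looks almost nine times worse than 2023.
```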

Expert knowledge is a different kind of challenge. A factory supervisor with 20 years of experience knows things no database captures — like how a specific machine behaves differently in cold weather, or how a certain supplier's materials cause problems in humid conditions. But that knowledge lives in her head and her handwritten notes. It rarely makes it into a formal dataset, which means analysts often miss it entirely.

So How Serious Is This?

Very. Research on real-world datasets suggests that missing data alone affects anywhere from 15% to 70% of them, and even one or two missing values in the wrong place can throw off final results by as much as 300%.
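To see how little it takes, here is a sketch with invented strength values, where two failing results simply never got entered:

```python
# Concrete strengths in MPa; two failing samples went missing from the system.
full    = [22, 24, 3, 23, 25, 4, 24]
entered = [v for v in full if v > 10]   # the two low results were never logged

print(sum(full) / len(full))            # ~17.9 MPa - below a 20 MPa spec
print(sum(entered) / len(entered))      # 23.6 MPa - comfortably "passing"
```

Two missing values turn a failing batch into a passing one. The exact percentages depend on the dataset, but the direction of the bias is the point.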

The Simple Takeaway

Always assume your data has some level of problems. Not because people are careless — but because real-world data collection is naturally messy. Knowing that from the start puts you miles ahead of anyone who just assumes the numbers are fine.
