Real-World Data Cleaning: Why It's the Most Important Skill in Data Analytics

Tess Ogamba

5/1/2026 · 5 min read

I have worked with data for almost 8 years, and I can say with complete confidence that nobody prepares you for what real data actually looks like.

Not the courses. Not the bootcamps. Not Kaggle.

You can spend months sharpening your SQL skills, learning Python, mastering Power BI, and then walk into your first real data analytics job and find a column where someone has been typing their feelings into a date field for three years. And nothing in your training told you what to do next.

The gap between tutorial data and real data

Tutorial datasets are designed to teach you techniques. They are clean, structured, and logical. The column types make sense. The values are consistent. The relationships are clear.

Real data is none of those things.

Real data is a free-text field where people have been typing whatever they want since 2019. It is a URL where a name should be. A person's relationship status where an organisation name should be. Punctuation marks as the entire entry. The same organisation spelled fourteen different ways. Missing entries that nobody documented as missing.
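
To make that concrete, here is a minimal sketch of a first triage pass over a free-text organisation field, using only the Python standard library. The function names, the labels, and the sample values are hypothetical, made up for illustration rather than taken from any real project. It sorts entries into rough buckets (missing, URL, punctuation only, candidate) and then groups the candidates that differ only in case, punctuation, or spacing. It will not catch genuine misspellings; those still need a person.

```python
import re
from collections import defaultdict

# Assumption: an entry starting with http://, https:// or www. is a URL, not a name.
URL_PATTERN = re.compile(r"^(https?://|www\.)", re.IGNORECASE)


def classify_entry(value):
    """Return a coarse label for one raw free-text entry."""
    if value is None or value.strip() == "":
        return "missing"
    text = value.strip()
    if URL_PATTERN.match(text):
        return "url"
    # No letters or digits at all: the entry is only punctuation or symbols.
    if not any(ch.isalnum() for ch in text):
        return "punctuation_only"
    return "candidate"


def normalise_key(value):
    """Build a comparison key: lowercase, punctuation to spaces, spaces collapsed."""
    text = re.sub(r"[^\w\s]", " ", value.lower())
    return " ".join(text.split())


def group_variants(values):
    """Group candidate entries that share the same comparison key."""
    groups = defaultdict(list)
    for value in values:
        if classify_entry(value) == "candidate":
            groups[normalise_key(value)].append(value)
    return dict(groups)


sample = ["Red Cross Kenya", "red-cross  kenya.", "...", "www.redcross.example", ""]
variants = group_variants(sample)
```

The grouping only proposes which spellings might belong together. Deciding that they really are the same organisation is still a judgment call.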

I once spent two full days cleaning a single column (a column of nationality entries) before I could use it in any meaningful analysis. Not because I was slow, but because the data was messy and inconsistent. Thirty years of different staff members entering the same information differently, with no shared definitions and no data validation at the point of entry.

That is not an unusual experience. According to a 2025 IBM Institute for Business Value report, 43% of chief operations officers identify data quality issues as their most significant data priority. And the financial consequences are significant. Gartner estimates that bad data costs organisations an average of $12.9 million per year, while McKinsey found that poor quality data leads to a 20% decrease in productivity and a 30% increase in costs.

Those numbers are not the result of sophisticated system failures. They are the accumulated cost of free-text fields, inconsistent entries, and data that was never cleaned.

The invisible work that makes everything else possible

Here is what the data industry talks about: dashboards, models, AI, machine learning, and data visualisation.

Here is what the data industry does not talk about enough: the hours spent before any of that is possible.

According to Monte Carlo, data teams spend an estimated 30 to 40% of their time handling data quality issues instead of working on revenue-generating activities. That is not a productivity problem. That is the job. The cleaning, the standardising, the validating: none of it is a precursor to the real work. It is the real work.

The dashboard is the 10% everyone sees. The cleaning is the 90% that makes it honest.

What data cleaning actually teaches you

A senior data analyst commented on one of my LinkedIn posts recently and said something that stopped me mid-scroll:

"Data cleaning is where junior analysts quit, and senior analysts are made. The dashboard was never the hard part."

He is right. And here is why.

  • It teaches you patience. Real data does not care about your deadline. It will have fourteen spelling variations of the same organisation name, and it will wait for you to find all fourteen. The ability to stay methodical when a dataset is fighting you is not a personality trait; it is a professional skill.

  • It teaches you domain knowledge. You cannot clean what you do not understand. I once had to learn an entire service delivery model just to know which entries were genuinely invalid and which were just unusual. The cleaning forced me to understand the data in a way that no dashboard brief ever would have.

  • It teaches you judgment. When do you correct an entry, and when do you flag it? When is a blank genuinely missing, and when did someone deliberately leave it empty? When does "N/A" mean not applicable, and when does it mean nobody knew what to type? No algorithm makes those calls. You do. That judgment is the thing that separates someone who produces outputs from someone who produces insight.

  • It teaches you to trust your numbers. When you have personally worked through every row, you know exactly what that dashboard is built on. That confidence changes how you present, how you defend your findings, and how you respond when a senior leader questions your data. You do not hesitate. You know.
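
The judgment calls in that list can at least be written down as explicit rules, so that they are visible and open to challenge. Here is a minimal sketch, assuming a nationality field. The correction table, the set of "not applicable" tokens, and the status labels are all hypothetical, and whether "UK" should really map to "British" is exactly the kind of decision that needs domain knowledge. The point is the shape: correct only what you are sure of, keep a blank and a stated "N/A" as separate outcomes, and flag everything else for a person to review.

```python
# Hypothetical lookup of corrections the analyst has already agreed on.
KNOWN_CORRECTIONS = {
    "kenyan": "Kenyan",
    "kenya": "Kenyan",
    "british": "British",
    "uk": "British",
}

# Hypothetical tokens treated as an explicit "not applicable".
NOT_APPLICABLE_TOKENS = {"n/a", "na", "not applicable"}


def review_entry(raw):
    """Return (cleaned_value, status) for one raw entry without guessing."""
    if raw is None or raw.strip() == "":
        # Nothing was typed: kept separate from a stated "N/A".
        return (None, "blank")
    key = raw.strip().lower()
    if key in NOT_APPLICABLE_TOKENS:
        return (None, "stated_not_applicable")
    if key in KNOWN_CORRECTIONS:
        cleaned = KNOWN_CORRECTIONS[key]
        status = "ok" if cleaned == raw else "corrected"
        return (cleaned, status)
    # Unknown value: leave it untouched and hand it to a human.
    return (raw, "flagged_for_review")


results = [review_entry(v) for v in ["Kenyan", " kenya ", "N/A", "", "SW1A 1AA"]]
```

Nothing here decides what a flagged entry means. It only makes sure the postcode in the nationality field reaches someone who can.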

The AI question

One of the most common responses I get when I talk about data cleaning is: "But can't AI just do that now?"

The honest answer is: sometimes, partially, with significant caveats.

AI can help identify patterns, flag anomalies, and suggest corrections at scale. But even the most advanced AI systems are only as reliable as the data they are trained on, and poor-quality data fed into AI systems compounds errors rather than resolving them. The human who understands the context, knows what a valid entry looks like, and can make judgment calls about edge cases, that person is not being replaced. They are being assisted.

The data janitor enables the intelligence. That is not going away.

What this means if you are learning data

If you are currently learning data analytics and you are spending most of your time on visualisation tools, dashboard design, or machine learning models, that is fine. Those skills matter.

But if you have never sat with a genuinely messy real-world dataset and tried to make it trustworthy, you are missing the foundation everything else is built on.

Find a messy dataset. Not a Kaggle competition dataset that has already been cleaned and formatted for you. A real one: a public sector dataset, a charity's annual report data, or a scraped web dataset. Something with inconsistencies, missing values, and formatting chaos. Clean it. Document every decision you make. Understand why you made it.
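
One way to document every decision is to keep a small log alongside the cleaning itself. The sketch below is only an illustration: the CleaningLog class, its fields, and the sample values are hypothetical, and a notebook or a plain text file would do the same job. Each step records which rule was applied, why, and how many rows it changed.

```python
from dataclasses import dataclass, field


@dataclass
class CleaningLog:
    """Collects one record per cleaning step that was applied."""
    entries: list = field(default_factory=list)

    def apply(self, values, rule_name, reason, func):
        """Apply func to every value and record how many values changed."""
        cleaned = [func(v) for v in values]
        changed = sum(1 for before, after in zip(values, cleaned) if before != after)
        self.entries.append(
            {"rule": rule_name, "reason": reason, "rows_changed": changed}
        )
        return cleaned


log = CleaningLog()
raw_values = [" Kenya", "Kenya", "UK "]
step_one = log.apply(
    raw_values,
    rule_name="strip_whitespace",
    reason="Leading and trailing spaces split identical entries into separate groups",
    func=str.strip,
)
```

When someone later asks why a total moved, the log gives the answer row count by row count.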

That process will teach you more about data than any course module on advanced analytics.

The Data Janitor badge

The #datajanitor hashtag started as a joke, the unglamorous version of what we actually spend our time doing. But I wear it seriously now.

The analysts who stay through the mess, who do not skip the cleaning because it is tedious, who build the foundation before they build the dashboard, those are the analysts whose numbers can be trusted. Those are the ones whose presentations do not fall apart under questioning.

So proudly add it to your CV: Data Janitorial Services - Advanced Level. Because the dashboard is only as good as what it is built on. And what it is built on is you, in a spreadsheet, at 11 pm, working out why someone typed their postcode into the nationality field.

About the author

Tess Ogamba is a Data Analyst and data systems professional with 8 years of experience across research, NGOs, government, and the private sector in Kenya and the UK. She writes about data, governance, careers, and what real analytical work actually looks like.