Data Cleaning Tips

They say that a majority of Data Science time is spent cleaning data. I suspect that partly it’s because many data systems are still maturing and need to go through a messy phase, and partly the universe is just a haphazard, chaotic, messy place that we’re trying to squeeze into little boxes of 1’s and 0’s.

As long as reasonable data cleaning time has been allocated to a project, I find it really enjoyable. Like, INSANELY enjoyable (especially because I use R to do it 😉 ).

Someone new to analysis asked me if I could share any tips, so here are a few to squeeze the most value out of your data-cleaning time:

*Gets on soapbox*

  1. Cleaning is never truly “done”, so figure out which work will have the most impact on a cost/benefit basis.
  2. Clean as close to the source as possible (fixing it at the data source itself is best).
  3. Make notes on how to improve the system overall as you go (you’ll see so many niggly issues that you’ll forget them all otherwise).
  4. Keep a log of “to-dos” (the stuff you can’t get to now but need to make a cost/benefit decision on later).
  5. If there’s A LOT to do and others need to see the progress of your work, have a shared list of the tasks and mark them as Done as you get through them.
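If you'd like that log to live in a script rather than a spreadsheet, here's one rough way tips 1, 4 and 5 could look in code. It's only a sketch: the `CleaningTask` class, the 1-to-5 `benefit` and `cost` guesses, and the example tasks are all made up for illustration, and it's in Python purely to keep it dependency-free (the same idea works just as well in R).

```python
from dataclasses import dataclass


@dataclass
class CleaningTask:
    """One entry in the cleaning log (hypothetical structure)."""
    description: str
    benefit: int  # rough 1-5 guess at how much the fix helps
    cost: int     # rough 1-5 guess at how much effort it takes
    done: bool = False

    @property
    def score(self) -> float:
        # Simple cost/benefit ratio: higher means more value per unit of effort.
        return self.benefit / self.cost


def prioritise(tasks):
    """Return open tasks first, with the best benefit-per-cost at the top."""
    return sorted(tasks, key=lambda t: (t.done, -t.score))


def mark_done(tasks, description):
    """Tick off the first task matching the description."""
    for task in tasks:
        if task.description == description:
            task.done = True
            return task
    raise KeyError(description)


def as_checklist(tasks):
    """Render the log as a plain-text checklist that can be shared."""
    return "\n".join(
        f"[{'x' if t.done else ' '}] {t.description} "
        f"(benefit {t.benefit}, cost {t.cost})"
        for t in prioritise(tasks)
    )


# Made-up example tasks, purely for illustration.
tasks = [
    CleaningTask("Standardise date formats", benefit=5, cost=2),
    CleaningTask("Merge duplicate customer records", benefit=4, cost=4),
    CleaningTask("Fix typos in free-text notes", benefit=1, cost=5),
]

mark_done(tasks, "Standardise date formats")
print(as_checklist(tasks))
```

The scoring is deliberately crude. The point is just that every task gets an explicit cost/benefit guess, and anyone following along can see what's done and what's next.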

This little bit of structure is going to add a lot of value to the process. Happy cleaning!
