They say that the majority of Data Science time is spent cleaning data. I suspect that’s partly because many data systems are still maturing and need to go through a messy phase, and partly because the universe is just a haphazard, chaotic, messy place that we’re trying to squeeze into little boxes of 1s and 0s.
As long as reasonable data cleaning time has been allocated to a project, I find it really enjoyable. Like, INSANELY enjoyable (especially because I use R to do it 😉 ).
Someone new to analysis asked me if I could share any tips, so here are a few to squeeze the most value out of your data-cleaning time:
*Gets on soapbox*
- Cleaning is never truly “done”, so figure out which work is going to have the most impact in cost/benefit terms.
- Clean as close to the source as possible (at the data source itself is best).
- Make notes of how to improve the system overall as you go (you’ll see so many niggly issues that you’ll forget them otherwise).
- Keep a log of “to-dos” (the stuff you can’t get to now but need to make a cost/benefit decision on later).
- If there’s A LOT to do and others need to see the progress of your work, have a shared list of the tasks and mark them as Done as you get through them.
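If you want the to-do log to be a bit more than a sticky note, here is a minimal sketch of one in Python; the same idea would carry over to R or a shared spreadsheet. Everything in it is made up for illustration: the `CleaningTask` class, the `next_up` and `progress` helpers, the example tasks, and the rough 1 to 5 scores for benefit and cost. Ranking by benefit divided by cost is just one simple way to make the cost/benefit call, not the only one.

```python
from dataclasses import dataclass


@dataclass
class CleaningTask:
    """One entry in a hypothetical data-cleaning log."""
    description: str
    benefit: int        # rough 1-5 guess at how much the fix helps
    cost: int           # rough 1-5 guess at the effort (must be > 0)
    done: bool = False

    @property
    def priority(self) -> float:
        # Assumed scoring rule: benefit per unit of effort.
        return self.benefit / self.cost


def next_up(tasks):
    """Return the open tasks, best benefit-per-cost first."""
    open_tasks = [t for t in tasks if not t.done]
    return sorted(open_tasks, key=lambda t: t.priority, reverse=True)


def progress(tasks):
    """Short progress summary to share with others."""
    finished = sum(t.done for t in tasks)
    return f"{finished}/{len(tasks)} done"


# Illustrative entries only.
tasks = [
    CleaningTask("Standardise date formats", benefit=5, cost=2),
    CleaningTask("Fix misspelt city names", benefit=2, cost=4),
    CleaningTask("Drop duplicate customer rows", benefit=4, cost=1),
]

tasks[0].done = True  # mark items as done as you get through them

for task in next_up(tasks):
    print(f"{task.priority:.1f}  {task.description}")
print(progress(tasks))
```

Low scorers stay in the log rather than being deleted, so the cost/benefit decision can be revisited later.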
This little bit of structure is going to add a lot of value to the process. Happy cleaning!
Want to practice data cleaning with a real-world, messy dataset? Check out: