In a previous post I introduced a dataset that I think will give Data Analysts good practice in cleaning and analyzing real-world, messy data.
One such analyst reached out to me with some great questions and I’ve posted them here along with my answers. If you have questions of your own, you can email them to email@example.com and I’ll do my best to answer them. All correspondence will be anonymized :).
Subject: Data Cleaning Advice
I was cleaning the Salary Survey dataset and got really confused with the ‘salary’ column. Some salaries are too low or too high. I replaced the low numbers with the annual minimum salary in that particular country, but I have no idea how to clean the extremely high salaries. I don’t really want to remove all those rows, but I can’t leave them as they are either, since they will influence the results.
I attached a screenshot of the highest salaries in the USA. For sure, the Operations Manager can’t earn that much money 🙂
If you have any idea how to clean it, please let me know! 🙂
There are a few ways to approach this, and as you gain experience you’ll feel less confused about what to do :). It’s worth noting that there’s unlikely to be one right or wrong answer, only some approaches that are more sound than others.
If a salary looks too low or too high, I would exclude it with a filter. There’s no rule that says you have to include all the data (and I would argue the expectation with this sort of messy data is precisely to find what you should exclude). There are some conventions when it comes to excluding outliers, one of which is to calculate the mean and Standard Deviation of the column, and then exclude any value that falls more than some multiple of the Standard Deviation away from the mean. Common conventions are to filter out any value beyond 2 or even 3 standard deviations.
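To make that convention concrete, here is a minimal pandas sketch. The salary figures are invented for illustration, not taken from the survey:

```python
import pandas as pd

# Toy stand-in for the survey data; all figures here are invented.
salaries = [48_000, 52_000, 61_000, 55_000, 70_000, 66_000, 58_000,
            73_000, 64_000, 59_000, 4_500_000]  # one implausible entry
df = pd.DataFrame({"salary": salaries})

mean, sd = df["salary"].mean(), df["salary"].std()
k = 2  # common conventions use 2 or 3 standard deviations

# Exclude rather than alter: drop rows outside the band, and keep the
# filter in code so the decision is documented and easy to revisit.
cleaned = df[(df["salary"] - mean).abs() <= k * sd]
```

Note that with a heavily skewed column the outlier itself inflates the mean and SD, so on small samples you may need the stricter `k = 2` (or an iterative pass) before the implausible values actually fall outside the band.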
Rather than replacing salaries that are too low with a minimum, I might exclude them entirely (see above). Personally, I find it better to leave data out than to change it, unless I have a good idea of how to change it. In this case it’s unlikely that people mistakenly entered the wrong value (so changing the value is probably not a good approach); more likely they haven’t converted their figure to the right period. Perhaps they put in a monthly, weekly or hourly figure instead of an annual one. With this dataset I think it may be too difficult to tell which period they had in mind. If I could make a good assessment of this, then I might apply a conversion from, say, a monthly to an annual value.
If you are doing this analysis in a scripting language like R, Python or SQL, then it’s good practice to add a comment about why you are adding the filter. This will help your future self and others see what your thinking is. If you are doing it in Excel or something else, then it’s good to keep some notes about the reasons for your filtering.
Thank you a lot. It really helped. I have never used Standard Deviation 🙂
I calculated the SD and AVG, added them together, and removed rows with a salary higher than that. But I can still see, e.g., a librarian earning tons of money (USD) a year. Should I leave it as it is?
If there’s still values that look too high, I would remove them and add notes about which ones to remove.
The approach to cleaning (in a work environment) is to decide how much time the effort is worth. Let’s say you were working somewhere where understanding the data is super important, you will do lots of analyses on it and/or many people will rely on the results – then I might spend more time researching what a librarian might get paid, for example, before removing it. In most cases it’s not possible to spend that much time, and you need to rely on your experience to make the judgement call (in fact, those asking for the analysis need you to make judgement calls, otherwise they might just dump the data into some sort of automated analysis machine and blindly use the results :). Ultimately you need to be comfortable with the analyses, so if you think a number is too high, remove it. This is different when you get to a number that looks high but plausible (maybe a Wall Street executive?).
Sometimes you will just be dealing with really big ranges of data. I would say that salaries are not an example of this, but in cases where it happens (say, looking at net wealth per person where Bill Gates and the poorest of the poor are in the same dataset), people sometimes switch to log-scale graphs or to analysis based on interquartile ranges.
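For illustration, here is a minimal pandas sketch of the interquartile-range idea, using Tukey’s common 1.5 × IQR rule. The net-wealth-style figures are invented:

```python
import pandas as pd

# Invented figures with one enormous value, to mimic a huge-range column.
values = pd.Series([1_200, 3_400, 2_800, 5_000, 4_100, 3_900,
                    2_500, 6_000, 4_700, 250_000_000])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Tukey's rule: flag anything beyond 1.5 * IQR outside the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
within = values[values.between(lower, upper)]
```

Because quartiles barely move when one value is extreme, this approach is more robust to skew than a mean-and-SD filter.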
I got you 🙂
There are quite a few suspicious rows like that (attached). I’m afraid deleting 20-30 rows will affect the data insights in general. The dataset has around 800 rows.
Is it safe to remove them?
Yes, I would remove them. You could also add a comment that it’s worth revisiting if you can get some info on salaries for MDs, Equity Partners and so on.
These are the type of decisions that there can always be debate about, but understanding your own reasoning is the most important thing. The point is that you are trying to get the best available answer within a reasonable time and cost.
thank you so much again!
Last question, what was the longest time you spent on cleaning a data-set? 🙂 Just curious
Hmmmm, good question and depends a bit on your definition :), but I would say a good day or two or even three for the type of dataset you’re looking at.
Keep your questions coming, they’re really interesting!
This question came from a reddit post:
Hi I’m relatively new to data analytics and I’m working on a portfolio project using data found here.
The country column (column K) is relatively dirty; it allowed manual entry rather than a dropdown country menu, so the entries are formatted inconsistently. These inconsistencies take form in misspellings and the fact that some people entered more detailed information in this field.
My questions are:
- For cleaning data regarding small things like misspellings, would Excel be the best way to approach this? My bias is that, in my (inexperienced) opinion, it FEELS like Excel is the most “primitive” option to use, but it seems like it would be a nightmare to deal with in something like R or SQL.
- How often do you use your own discretion to interpret and correct what you perceive was the survey taker’s intended response? For an example from the dataset, if someone put down “Zimbabwe” as a troll answer, but the rest of the survey response seems to be within reason, would you delete this whole entry or would you just enter in “USA”, since the rest of the data suggests USA (they inputted state/city)?
Thanks, just want to understand the data analyst’s thought process when approaching such problems.
- I’ve found cleaning data like this is much easier in R. You can use a combination of case_when() and regex to cover lots of variations of things very easily, and I suspect there must be data cleaning packages precisely for this type of data.
- I weigh up how much effort the person would have put in relative to the troll effect. If everything else looks legit then I’d leave it, but if not, I’d remove the record.
PS a bonus point for using something other than Excel is that it’s much easier to plug in a new or updated dataset and/or expand and finesse different cleaning steps over time.
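For readers working in Python rather than R, here is a hypothetical sketch of the same pattern-matching idea the reply describes for case_when(). The regex patterns and labels are illustrative, not a complete country mapping:

```python
import re

# Hypothetical patterns mapping messy country entries to a single label.
PATTERNS = [
    (re.compile(r"^(usa?|u\.s\.a?\.?|united states.*|america)$", re.I), "USA"),
    (re.compile(r"^(uk|u\.k\.?|united kingdom.*|england|scotland)$", re.I), "UK"),
]

def clean_country(raw: str) -> str:
    entry = raw.strip()
    for pattern, label in PATTERNS:
        if pattern.match(entry):
            return label
    return entry  # leave unrecognized values as-is for manual review

print(clean_country("United States of America"))  # → USA
```

Leaving unmatched values untouched (rather than guessing) means each cleaning pass surfaces the remaining variants, which you can then fold into the pattern list over time.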
Newsletter subscribers get a free copy of Project Management Fundamentals for Data Analysts worth $12.
Once you’ve subscribed, you’ll get a follow up email with a link to your free copy.