A real-world, messy dataset to practice on

At some point you may be looking for a “real world” dataset to practice analysis on or to give to students.

The value of such data is that it gives analysts a chance to develop skills they need for their work but that are hard to master with “clean” datasets, especially inside a guided course.

I’ve found the dataset below which, apart from being actual, real-life data, has a few characteristics that make it a good set for learning about data cleaning and further analysis.

The dataset

The data is a Salary Survey from AskAManager.org. It’s US-centric-ish but does allow for a range of country inputs. I find salary surveys inherently interesting, but here are some other notable aspects of this dataset.

A spreadsheet showing salary survey responses. Column headers include age, industry, job title, annual salary, currency, country, city, years of experience, gender and race amongst others.
  • There are 17 variables, so it’s not too overwhelming
  • 6 of the variables are free-form text entry, which always results in lots of data cleaning to be done!
  • All variables make intuitive sense, so you don’t need any domain expertise to figure out what they are. HOWEVER…
  • You can apply some domain expertise to a subset of the data that you are familiar with, be it country, state, job title or sector knowledge.
  • The dataset is “live” and constantly growing. In the time it’s taken me to write the first lines of this post, the responses grew from 11,588 to 11,603. This means that fixes you made to earlier analysis may not hold for all new entries.
  • When downloading the dataset, there’s also a “timestamp” variable (column A), so you can simulate a growing list by filtering data by longer and longer timespans if it’s no longer receiving any updates.
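
To make the last point concrete, here is a minimal Python sketch of the "growing list" idea: take snapshots of the data at later and later cutoffs and check that your cleaning still holds on each one. The rows are made up for illustration, and the column name `Timestamp` and the date format are my assumptions, so check them against your own download.

```python
import csv
import io
from datetime import datetime

# Made-up rows for illustration only. The column names and the
# timestamp format are assumptions; adjust them to match your export.
SAMPLE_CSV = """Timestamp,Industry,Annual salary
4/27/2021 11:02:10,Education,55000
4/27/2021 11:05:42,Tech,98000
4/28/2021 09:15:00,Nonprofit,47000
4/29/2021 16:30:21,Tech,120000
"""

TIMESTAMP_FORMAT = "%m/%d/%Y %H:%M:%S"


def rows_up_to(rows, cutoff):
    """Return the rows whose timestamp is at or before the cutoff."""
    return [
        row for row in rows
        if datetime.strptime(row["Timestamp"], TIMESTAMP_FORMAT) <= cutoff
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Each cutoff is the end of a day, so every snapshot is a longer timespan.
cutoffs = [
    datetime(2021, 4, 27, 23, 59, 59),
    datetime(2021, 4, 28, 23, 59, 59),
    datetime(2021, 4, 29, 23, 59, 59),
]
snapshot_sizes = [len(rows_up_to(rows, cutoff)) for cutoff in cutoffs]
print(snapshot_sizes)
```

Rerunning your cleaning steps on each snapshot is a cheap way to see whether a fix that worked on the early responses breaks on the later ones.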

If you’re using R, you can read the sheet using the googlesheets4 package.

You can of course make a copy of the sheet directly in Google Sheets, or you can download it in multiple formats.

File menu in Google Sheets showing how to download the data: File, then Download, then a selection of format options such as XLS and CSV.
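
If you go the CSV route, a quick first check on any free-form column is to count how many distinct spellings it contains before and after some light normalizing. Below is a rough Python sketch using only the standard library. The rows are invented for illustration, and the `Country` column name and the `salary_survey.csv` filename are assumptions rather than the sheet's actual names.

```python
import csv
import io
from collections import Counter

# Invented rows for illustration only; the column names are assumptions.
SAMPLE_CSV = """Country,Annual salary
United States,55000
USA,98000
 usa ,47000
U.S.,120000
Canada,70000
canada,64000
"""


def normalize(text):
    """Trim, lowercase and drop full stops so trivial variants collapse."""
    return text.strip().lower().replace(".", "")


# For a real export, swap io.StringIO(SAMPLE_CSV) for something like
# open("salary_survey.csv", newline="", encoding="utf-8")
# (hypothetical filename).
reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
raw_values = [row["Country"] for row in reader]

raw_counts = Counter(raw_values)
normalized_counts = Counter(normalize(value) for value in raw_values)

print(len(raw_counts), "distinct raw values")
print(len(normalized_counts), "distinct values after normalizing")
print(normalized_counts.most_common())
```

Normalizing only gets you part of the way. In this sample, "united states", "usa" and "us" still survive as separate values, so you would follow up with a manual mapping for whatever is left.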

Questions to start with

If you’re not sure where to start analyzing, here are some sample questions to get you going:

  1. Which industry pays the most?
  2. How does salary increase given years of experience?
  3. How do salaries compare for the same role in different locations?
  4. How much do salaries differ by gender and years of experience?
  5. How do factors like race and education level correlate with salary?
  6. Is there a “sweet spot” between total work experience and years in the specific field?
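
As a starting point for the first question, here is a small Python sketch that ranks industries by median salary. The rows are made up, the column names are assumptions, and it presumes the salary column has already been cleaned into plain numbers. It keeps to a single currency, because mixing currencies would make the comparison meaningless.

```python
import csv
import io
from collections import defaultdict
from statistics import median

# Made-up rows for illustration only; the column names are assumptions.
SAMPLE_CSV = """Industry,Annual salary,Currency
Tech,98000,USD
Tech,120000,USD
Tech,105000,USD
Education,55000,USD
Education,61000,USD
Nonprofit,47000,USD
Tech,70000,GBP
"""


def median_salary_by_industry(rows, currency="USD"):
    """Group salaries by industry for one currency and take each median."""
    salaries = defaultdict(list)
    for row in rows:
        if row["Currency"] == currency:
            salaries[row["Industry"]].append(float(row["Annual salary"]))
    return {industry: median(values) for industry, values in salaries.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
medians = median_salary_by_industry(rows)

# Highest-paying industry first.
ranked = sorted(medians.items(), key=lambda item: item[1], reverse=True)
for industry, value in ranked:
    print(industry, value)
```

I’ve used the median rather than the mean here because salary data tends to have a few very large values that drag the mean upwards.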

Some Q&A

An analyst emailed me with some data cleaning questions and I’ve posted my responses. You’ll get a good idea of how I would approach such a task.

Happy analyzing!

Ready to learn more?

📃 Ever been given a dataset and had no idea where to start? Here’s a checklist for you.

📃 You’ll spend a lot of your time cleaning data. Here are 5 data cleaning tips to make the most of that time.

