How data gets cleaned – an example

Image courtesy of flickr https://www.flickr.com/photos/compacflt/34496068126/

Anyone interested in data analysis or the broader field of data science may have heard that about 80% of the work entails cleaning data.

About a year ago I did some basic analysis on South African schools data, and I happened to document the cleaning process – mostly out of interest, but also because if I ever needed to do this many times over, the documentation would help me develop a script to automate the process.

I’ll briefly outline the process in this post. This is of course just one example of how one data set was cleaned – there are countless possible approaches.

What I would like this post to do is:

  1. Show an example of what basic data cleaning might entail.
  2. Highlight that you don’t need fancy tools or methods – this was all done in MS Excel.
  3. Show that data cleaning can be boring and interesting at the same time.
  4. Spark a conversation about how people document their data cleaning, and encourage the practice.

See this Twitter thread – a recent discussion on the topic of how people document their cleaning, and there are some really good ideas and suggestions there.

The Data

South Africa’s Department of Education releases Master Lists of all the schools in the country. At the time I was doing the analysis (some time during 2017), the latest available data set was Q1 2016.

There are data sets for each of South Africa’s 9 provinces. There are thankfully also “combined” master lists for all provinces, but these have been split between Ordinary Schools, Special Needs Education Centres and Early Childhood Development Centres. It’s these 3 data sets that I was interested in.

Note: In this post, I’ll denote column names like this: [Column A]

 

Inspecting the data sets

So, first thing off the bat: each data set had a different number of columns. This makes it difficult to simply append one data set to the next, because the columns might not match up and similar things might have different names.

Ordinary Schools had 52 columns, Special Needs had 55 and Early Childhood Centres had 33.

The number of records (i.e. rows) differed in each, although this isn’t a problem at all.
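If you wanted to do this first bit of inspection in code rather than by eyeballing the spreadsheets, a quick pandas sketch might look something like this (the file names are placeholders, not the actual master list names):

```python
import pandas as pd

# Placeholder file names -- not the actual master list file names.
files = {
    "ordinary": "ordinary_schools_2016q1.xlsx",
    "special_needs": "special_needs_centres_2016q1.xlsx",
    "early_childhood": "ecd_centres_2016q1.xlsx",
}

# Load each master list and report its shape (rows, columns).
sheets = {name: pd.read_excel(path) for name, path in files.items()}
for name, df in sheets.items():
    print(name, df.shape)

# Which Ordinary Schools columns are missing from the other two data sets?
base_cols = set(sheets["ordinary"].columns)
for name in ("special_needs", "early_childhood"):
    print(name, "is missing:", sorted(base_cols - set(sheets[name].columns)))
```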

 

Summary:

National Ordinary Schools: 25 374 records
Early Childhood Development Centres: 3 839 records
Special Needs Education Centres: 465 records
Total: 29 678 records

 

Cleaning the data

The first thing I had to do was merge these data sets. I opted to use National Ordinary Schools as the “base” and add the other data sets to it – just because the largest data set held most of the interest for me in any case. I had to identify which columns in the other data sets were relevant, rearrange them in Excel to match the Ordinary Schools data set, and then copy and paste.

 

Yes, there are better/easier ways of doing this – see my notes near the end.
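For instance, here’s a rough pandas sketch of the same merge, carrying on from the inspection snippet above – it’s not what I actually did, and it assumes the similar-but-differently-named columns have already been renamed to match:

```python
import pandas as pd

# Use the Ordinary Schools columns as the "base" layout and align the other
# two data sets to it. reindex keeps matching columns and fills the rest
# with NaN.
base_cols = list(sheets["ordinary"].columns)

combined = pd.concat(
    [
        sheets["ordinary"],
        sheets["special_needs"].reindex(columns=base_cols),
        sheets["early_childhood"].reindex(columns=base_cols),
    ],
    ignore_index=True,
)

print(len(combined))  # should come to 29 678 records
```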

 

Here’s my documented cleaning procedure, pasted as-is.

  1. Hand-picked columns to match with National Ordinary Schools.
  2. [Province]: Changed GT to GP for consistency. 994 replacements.
  3. [Phase]: Changed:
    • Combined to COMBINED SCHOOL. 4 132 replacements.
    • Intermediate to INTERMEDIATE SCHOOL. 588 replacements.
    • Pre-Primary to PRE-PRIMARY SCHOOL. 3 854 replacements.
    • Primary to PRIMARY SCHOOL. 18 621 replacements.
    • Secondary to SECONDARY SCHOOL. 6 180 replacements.
  4. [Specialization]: Replaced UNKNOWN with NULL. 1 511 replacements.
  5. [OwnerLand], [OwnerBuild], [Ownership]: Replaced UNKNOWN with NULL. 5 116 replacements.
  6. [Ownership]: Replaced UNKOWN (spelling error) with NULL. 18 210 replacements.
  7. [ExDept]: Replaced UNKNOWN and UNKOWN (spelling error) with NULL. 7 477 & 6 248 replacements.
  8. [Paypoint], [Component], [ExamNo]: Deleted all entries of “99”.
  9. [ExamNo]: Converted from string to number.
  10. [GISSource]: Replaced UNKNOWN with NULL. 4 408 replacements.
  11. [Magisterial_District], [DMunName], [LMunName]: Replaced UNKNOWN and UNKOWN (spelling error) with NULL. 3 120 & 9 replacements.
  12. [ElRegion], [ElDistrict], [ElCircuit]: Replaced UNKNOWN and UNKOWN (spelling error) with NULL. 2 & 1 231 replacements.
  13. [Addressee]: Replaced UNKNOWN UNKNOWN (repeated word) and UKOWN (spelling error) with NULL. 4 408 replacements.
  14. [Township_Village], [T_Village]: Replaced 99 with NULL. 2 964 replacements.
  15. [Suburb], [Town_City], [StreetAddress]: Replaced UNKNOWN and UNKOWN (spelling error) with NULL. 5 010 & 851 replacements.
  16. [PostalAddress]: Deleted all entries of UNKNOWN.
  17. [Telephone], [Fax], [cellno]: Deleted all entries of “99”.
  18. [Section21], [Section21_Func], [NAS], [NodalArea], [RegistrationDate], [NoFeeSchool], [Allocation], [Urban_Rural], [Open_Boarding_school], [Full_Service_School]: Replaced UNKNOWN and UNKOWN (spelling error) with NULL. 40 419 & 174 replacements.
  19. [RegistrationDate]: Deleted all entries of “99”.
  20. [Section21]: Replaced 99 with NULL. 661 replacements.
  21. [NodalArea]: Replaced 99 with NULL. 7 236 replacements.
  22. Deleted GPS co-ordinates that were obviously wrong (using Tableau to visualise) – approx. 3 entries.
  23. Corrected 2 or 3 entries whose GPS co-ordinates were positive rather than negative values.
  24. Replaced “99” entries in GPS co-ordinates with NULL.

I did all of this almost exclusively with the handy “Find/Replace” feature in Excel.
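If you’d rather script it, here’s a rough pandas sketch of a handful of the steps above. It’s not something I actually ran – the column names are copied from my notes and may not match the real master lists exactly – and it carries on from the `combined` data frame in the merge sketch earlier:

```python
import numpy as np
import pandas as pd

# "Unknown" in its various (mis)spellings becomes a proper null everywhere.
combined = combined.replace(["UNKNOWN", "UNKOWN", "UNKNOWN UNKNOWN"], np.nan)

# "99" placeholders, only in the columns where my notes used them
# (exact column spellings may differ in the real master lists).
placeholder_cols = ["Paypoint", "Component", "ExamNo", "Telephone", "Fax",
                    "Township_Village", "T_Village", "RegistrationDate",
                    "Section21", "NodalArea"]
combined[placeholder_cols] = combined[placeholder_cols].replace(["99", 99], np.nan)

# Standardise [Phase] so mixed-case entries join the existing all-caps labels.
phase_map = {
    "Combined": "COMBINED SCHOOL",
    "Intermediate": "INTERMEDIATE SCHOOL",
    "Pre-Primary": "PRE-PRIMARY SCHOOL",
    "Primary": "PRIMARY SCHOOL",
    "Secondary": "SECONDARY SCHOOL",
}
combined["Phase"] = combined["Phase"].replace(phase_map)

# [ExamNo]: from text to numbers; anything non-numeric becomes null.
combined["ExamNo"] = pd.to_numeric(combined["ExamNo"], errors="coerce")
```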

Of course a lot more could be done, but this was good enough for what I wanted to do.

 

Analysing the results

After completing this data cleaning, I was able to visualise the data sets in a meaningful way. Visually analysing results is my preferred method of analysing anything. I think humans’ visual processing capacity is so strong that we lose a lot of insight if we jump straight to looking at numbers and tables.

I used Tableau to do the visualisation – I’ve found it to be a great tool for quickly getting results that also look great. I was able to share what I’d found with colleagues and peers quite quickly after completing the cleaning.

Here are some of the results:

What’s interesting here is that the distribution of schools doesn’t really match the population distribution, i.e. the number of schools on the eastern coast is much larger than the population there would suggest. The “combined” primary and high schools are also clearly concentrated on this eastern side, i.e. KwaZulu-Natal and the Eastern Cape.

 

There are also some interesting features, like how these schools follow the river.

 

I love using this one to catch people out / help them realise that data is useful for correcting misconceptions. Gauteng (GP) is by far the most populous province in South Africa, yet it only ranks 4th for total number of schools. Everyone assumes it would be 1st.

I did some other analyses but I think you get the idea. Anyway, let’s get back to why I cleaned some of the data as I did.

 

Data cleaning detail

You’ll note that many records had “Unknown” recorded, and I changed this to a null (empty) value. This is because you often want to count how many times a certain column has a record. If “Unknown” is in the record, that record will be counted because it has something in it, even though you may not want to count it.
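To make that concrete, here’s a toy pandas example (the values are made up):

```python
import numpy as np
import pandas as pd

# With "UNKNOWN" still in the column, every row looks populated.
specialization = pd.Series(["TECHNICAL", "UNKNOWN", "AGRICULTURAL", "UNKNOWN"])
print(specialization.count())                             # 4 -- "UNKNOWN" is counted too

# After replacing "UNKNOWN" with a null, the count reflects real entries only.
print(specialization.replace("UNKNOWN", np.nan).count())  # 2
```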

Another common feature of messy data sets is some sort of placeholder value, for example the number “99”. There are many reasons why this may be so, but often it’s because somewhere a user or system is forced to input a value, and if one isn’t available then any number (usually easily distinguishable, thank goodness!) is used.

Often you’ll need to analyse something by category. That’s why I changed the [Phase] entries to all be in all-caps – there was a mixture of cases in the records. If I then wanted to analyse by category, differently named, spelled or capitalised records would show up as different categories, even if they are the same thing, e.g. primary_school and PRIMARY_SCHOOL will show up as two distinct categories.
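In script form that’s just a case-normalisation step – again a toy example:

```python
import pandas as pd

# The same phase spelled three ways reads as three separate categories.
phases = pd.Series(["primary_school", "PRIMARY_SCHOOL", "Primary_School"])
print(phases.nunique())               # 3 distinct "categories"

# Normalising the case collapses them into one.
print(phases.str.upper().nunique())   # 1
```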

 

Where coding can come in

All of the above can also be done via scripts, using something like R or Python rather than Excel as I did. I have enough Python skill to do so, and I may well use it in future. I might also just do the same thing in Excel again. It really comes down to how many times I might end up doing this. Writing and testing a script takes time, but then running it takes just seconds – so it’s all about the balance. If I were going to do this on a regular or semi-regular basis (say more than 3 times) I would certainly script it.

Apart from helping to figure out what data needs to be cleaned and how to clean it, I find that messing about in Excel is sometimes a nice way to explore the data, because it’s always in front of you in tabular form and you sometimes spot things that summaries would hide if you were doing it in a more efficient way.

 

Another interesting tool: Open Refine

Probably a better way to do all of this would be to use a tool like Open Refine. I’ve not used it yet, and come to think of it, I probably will next time I have data sets like this. I’ve heard great things about it and watched a few clips of people using it. You can do some very powerful data cleaning without having to know any code.

 

I hope you found this useful. If you have any thoughts or would like to share your own stories, reach out on Twitter.