This is a draft chapter of a guide I’m writing to help people hired as the first “data person” in an organization. To make sure you don’t miss any new drafts or the release of the full guide, sign up to my mailing list.
This is the type of post that I will keep adding to over the years, so consider this V1.0 :). I think it may warrant its own book at some point.
If you’re starting from scratch, it can be really tough to know where to start or what to prioritise. You’ll have a natural tendency to start implementing data systems where you’re strongest and I think that’s as good a place as any to start.
Below I describe a broad framework for all the aspects of data that an organisation will need to consider. I suggest you use this as a starting point and add or remove things as needed. This core set of elements will probably cater for 95% of what any small to medium sized organisation needs to think about. I don’t think this applies to large enterprises, and I doubt there are any that still require a “first data hire”.
It can be really easy to get carried away or overwhelmed by the list of “to do’s” that you’ll generate from this framework. The idea is to make a pretty exhaustive list of what needs to be addressed, then prioritise the most valuable and critical items. Some items won’t warrant any actual work in your situation, but everything needs at least some consideration, followed by either a decision of “Nope, we don’t need this” or a few sentences documented somewhere on how and when it might be addressed more fully later on.
The framework follows the flow of data i.e. Collection, Analysis and Communication. The underlying themes of Governance and Training & Support are also detailed.
Alright, here we go. I’ve presented some starter questions and considerations to spark your own! Most of the time you’re also the one who’ll need to answer these questions, not just ask them ;).
Collection & Storage
Where is our data coming from? How is it generated? Is it once-off or continuously created? Do the sources change? Do we want to store all of this or is there a reliable third party source we can call on? Is it spreadsheet based or coming from online forms?
Do we have a database? What state is it in? Which cloud provider should we use? Does it need to be cloud-based? OK, so if Excel is our database, how can I improve it? Can I make the case for a better system? How will we move our data to and from it? Who needs access to this internally? Do any other organisations need access to our data? What should the schema be? Does this work for the majority of the analysis we need to do? What about backups? Who should get which access permissions? How will I monitor performance? Should we have API access?
How reliable is this data? Are there definitions for variables? How do we know the data adheres to them? Is there a way for us to check consistency? Do we have known consistency or quality issues? (You definitely do have issues, but do you know what they are?)
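To make the “how do we know the data adheres to them?” question a bit more concrete, here is a minimal sketch of checking rows against variable definitions. The variables (`age`, `email`), their rules and the sample rows are all made up for illustration; your own definitions will look different.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class VariableDefinition:
    """One entry in a (hypothetical) data dictionary."""
    name: str
    description: str
    is_valid: Callable[[Any], bool]


# Hypothetical definitions; replace with your organisation's own variables.
DEFINITIONS = [
    VariableDefinition(
        "age", "Age in whole years",
        lambda v: isinstance(v, int) and 0 <= v <= 120,
    ),
    VariableDefinition(
        "email", "Contact email address",
        lambda v: isinstance(v, str) and "@" in v,
    ),
]


def find_violations(rows, definitions=DEFINITIONS):
    """Return (row_index, variable_name, value) for each value that breaks its definition."""
    violations = []
    for index, row in enumerate(rows):
        for definition in definitions:
            # A missing value comes back as None, which fails both checks above.
            value = row.get(definition.name)
            if not definition.is_valid(value):
                violations.append((index, definition.name, value))
    return violations


# Made-up sample data with three deliberate problems.
sample_rows = [
    {"age": 34, "email": "a@example.org"},
    {"age": -2, "email": "b@example.org"},
    {"age": 51, "email": "not-an-email"},
    {"email": "c@example.org"},
]
violations = find_violations(sample_rows)
```

Even a rough check like this, run regularly, turns “we probably have quality issues” into a list of specific rows you can go and fix.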
Processing & Analysis
Do we need to do any processing of our data, and where will the processed data be stored? How will we process it? Via SQL queries or in Python/R scripts?
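As a rough illustration of the SQL-versus-script choice, the sketch below computes the same totals twice: once inside the database and once in Python. It uses an in-memory SQLite database and a made-up `donations` table purely so the example is self-contained; neither is a recommendation.

```python
import sqlite3
from collections import defaultdict

# Hypothetical data, used only for illustration.
rows = [("north", 100.0), ("north", 50.0), ("south", 75.0)]

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE donations (region TEXT, amount REAL)")
connection.executemany("INSERT INTO donations VALUES (?, ?)", rows)

# Option 1: let the database do the aggregation.
sql_totals = dict(
    connection.execute(
        "SELECT region, SUM(amount) FROM donations GROUP BY region ORDER BY region"
    ).fetchall()
)

# Option 2: pull the raw rows and aggregate them in a script.
python_totals = defaultdict(float)
for region, amount in connection.execute("SELECT region, amount FROM donations"):
    python_totals[region] += amount

connection.close()
```

Both routes give the same answer here. The practical difference is where the logic lives, who can read and maintain it, and how much data has to leave the database.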
Statistical / Analytical methods
What software and tooling should we use? Open source or proprietary software? Are we staying with high-level analysis? Is basic analysis sufficient? Do we need more robust and in-depth methods? Do we need to employ any machine learning and if so, do we have the right data quality to warrant it? What are the simplest methods we can get away with? Do we need to be able to reproduce results? If so, how should we do this?
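On reproducing results: one small habit that helps is making any randomness explicit. The sketch below is an assumption-laden example (a simple percentile bootstrap of a mean, with a hypothetical `seed` argument), not a recommended method for your analysis. The point is only that the same inputs and seed give the same numbers every time.

```python
import random
import statistics


def bootstrap_mean_interval(values, seed, resamples=1000):
    """Approximate 95% interval for the mean using a percentile bootstrap."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(resamples)
    )
    lower = means[int(0.025 * resamples)]
    upper = means[int(0.975 * resamples)]
    return lower, upper


# Made-up measurements for illustration.
measurements = [12.0, 15.5, 9.8, 14.2, 11.1, 13.7, 10.4]
first_run = bootstrap_mean_interval(measurements, seed=2024)
second_run = bootstrap_mean_interval(measurements, seed=2024)
```

Recording the seed alongside the code and the data version is usually enough for someone else (or future you) to regenerate the same figures.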
When should I automate some processes? What can be automated? Will the automations themselves require maintenance and if so, who will maintain them?
Is our organisation ready for self-serve analytics? Will it provide value? What is the level of data literacy in the company? How many Excel sheets are flying around? Who will build and maintain these systems?
Reports, dashboards, graphs, visualisations, tables and answering questions. What are the ways people might need this data? Are we building interfaces to other systems?
How do we field and screen the need for things like dashboards and reports? How can we MVP some of this? Who are our internal and external audiences? What else do they read and share? Are we getting asked follow-up questions? Are stakeholders uninterested? Why? What do they care about? What concerns them? What about automated alerts and notifications? Might they be better than reports and dashboards? Who needs them and why? Who needs to act on what information?
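If you want to try alerts before building a dashboard, a first version can be very small. Below is a minimal sketch: the metric names, thresholds and owners are hypothetical, and how the message actually gets delivered (email, chat, etc.) is left out on purpose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str       # name of the metric to watch
    threshold: float  # alert when the metric falls below this value
    owner: str        # who needs to act on the alert


def evaluate_rules(metrics, rules):
    """Return one message per rule whose metric is missing or below its threshold."""
    alerts = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None:
            alerts.append(f"{rule.owner}: no value reported for {rule.metric}")
        elif value < rule.threshold:
            alerts.append(
                f"{rule.owner}: {rule.metric} is {value}, below {rule.threshold}"
            )
    return alerts


# Hypothetical rules and this week's made-up numbers.
rules = [
    AlertRule("weekly_signups", 50, "growth team"),
    AlertRule("form_completion_rate", 0.6, "product team"),
]
metrics = {"weekly_signups": 42, "form_completion_rate": 0.71}
alerts = evaluate_rules(metrics, rules)
```

Writing the rule down forces the useful conversation: which number matters, what level is worrying, and who is expected to do something about it.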
Policies and procedures
How long do we store data for? How do we share data, and with whom? Does anything need to be anonymised, and how much anonymisation should we apply? Are we GDPR compliant? Do we need to be? Do we maximise or minimise the types of data we gather, and why? Should we vet where we get our data from and who we share it with, if anyone?
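Two of these questions translate fairly directly into code, so here is a hedged sketch. It replaces an identifier with a keyed hash and checks a record against a retention period. The key, the two-year period and the function names are all assumptions for illustration. Note that this is pseudonymisation rather than full anonymisation: anyone holding the key can re-link the records, so it does not settle your GDPR questions on its own.

```python
import hashlib
import hmac
from datetime import date, timedelta

# Assumption: in practice the key is stored securely, outside the dataset.
SECRET_KEY = b"replace-with-a-secret-kept-outside-the-dataset"
# Hypothetical policy: keep records for two years.
RETENTION_DAYS = 365 * 2


def pseudonymise(identifier: str) -> str:
    """Replace an identifier with a keyed hash so records can still be linked."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def is_past_retention(collected_on: date, today: date) -> bool:
    """True if the record is older than the retention period."""
    return today - collected_on > timedelta(days=RETENTION_DAYS)


token = pseudonymise("person@example.org")
expired = is_past_retention(date(2020, 1, 1), today=date(2023, 1, 1))
```

The same input always maps to the same token, which keeps joins and counts working without the raw identifier appearing in your analysis tables.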
Do we have a data dictionary in place? How do we maintain it? Who gets access? Do we need a query library (you do!)? How do we decide on standard vs ad hoc queries? Should we have controls for deviations from the standards we set? Where are our policies stored?
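A query library doesn’t need to be fancy to be useful. The sketch below is one possible shape, assuming named, documented, parameterised SQL kept in one place. The `signups` table, the query name and the helper function are hypothetical, and SQLite is used only to keep the example self-contained.

```python
import sqlite3

# Hypothetical query library: each standard query has a name,
# a description and parameterised SQL.
QUERY_LIBRARY = {
    "signups_by_month": {
        "description": "Number of signups per calendar month since a given date.",
        "sql": """
            SELECT substr(signed_up_on, 1, 7) AS month, COUNT(*) AS signups
            FROM signups
            WHERE signed_up_on >= :since
            GROUP BY month
            ORDER BY month
        """,
    },
}


def run_standard_query(connection, name, **params):
    """Run a query from the library; unknown names raise KeyError."""
    entry = QUERY_LIBRARY[name]
    return connection.execute(entry["sql"], params).fetchall()


# Made-up data for illustration (dates stored as ISO-8601 text).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE signups (signed_up_on TEXT)")
connection.executemany(
    "INSERT INTO signups VALUES (?)",
    [("2023-01-15",), ("2023-01-20",), ("2023-02-03",), ("2022-12-30",)],
)
result = run_standard_query(connection, "signups_by_month", since="2023-01-01")
connection.close()
```

Anything that goes through `run_standard_query` is, by definition, a standard query. Everything else is ad hoc, which gives you a simple line to draw when deciding what needs review.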
Training & Support
How can I help the rest of the team with their queries? How can I manage the workload? When do I do things for people vs training them to do it themselves? What level of data literacy and analytical proficiency do we want or need? How far are we from that? Is it a priority to upskill anyone, or will we hire in as needed? What about their current role? Can we empower people with a little bit of knowledge and skill? Are there small wins that make the rest of our systems easier to build and maintain?
These elements should give you a good basis to work from. Some you’ll start implementing right away, others you might only come back to in a few years’ time. It’s worth reviewing these perhaps once a year to see if you’re on the right track.
That’s it for now! What have I missed?