13 January 2023
We’re nearing the end of a large upgrade to our data pipeline. It’s a big milestone for our team and certainly a career highlight for me. Of course it’s taken a lot longer than I’d hoped, but that’s the nature of doing new things that have a lot of unknown complexities.
I’ve been reflecting on what’s made it work well, what we could have done better and some lessons I’ll be taking forward.
Some of the first things that come to mind are:
1. Build time in for refactoring
Once you’ve developed a feature or portion of the pipeline, you can double the value you gain by having that code refactored (i.e. made simpler). Only once you know something is working and you’ve ironed out all the bugs can you go back and clean up the temporary fixes, digressions and dead ends you had to explore to get there. This makes future development much smoother and lowers the cognitive overhead for everyone developing the pipeline.
2. First build the new pipeline, then port the legacy data over
We attempted to bring the old data along with us while we were building the new pipeline. This slowed down our development A LOT, because we spent 80% of the effort playing whack-a-mole with issues that invariably turned out to be legacy data problems, not new-pipeline problems. In future, I will first build the new pipeline, then migrate the old data into the new system.
3. If there’s something that could save you development time in future, test it as soon as possible
Too often we’d identify a possible way to speed up some part of the development process, but because we were halfway through some feature we’d put it off, then jump straight into the next urgent piece of development without investigating the time-saver. When we eventually did test it, most of the time we’d end up implementing it, but had we done so sooner we would have saved ourselves even more time.
When there’s time pressure to deliver something, it feels like a no-brainer to keep delivering what you know is needed rather than potentially waste time testing something that may not work. Looking back, though, the time-savers that did work more than made up for the ones that didn’t.
4. Have an overarching goal and framework in place
This was probably the biggest contributor to making our pipeline work so well and meet requirements on the first attempt. We have an overarching question which we’d like to answer using our data, and this singular focus is what all our hypotheses, metrics and data collection flow from. It’s all formalized in a framework, and with different teams working independently but within that framework, integration happens seamlessly, there’s a common language and interface to the components of our work, and, with relatively little feedback or review from the users of the data, the pipeline we delivered was what the end users were expecting. From what I’ve seen and heard, meeting organizational needs on the first attempt is a rare occurrence.
5. Build as if everything will go wrong
I see this in many spheres of work: we (humans) tend to go about our tasks as if everything will go right, and if we meet a few bumps along the way, we’ll fix them. The problem when you’re building a data pipeline (or anything where code, data and dynamism are at play) is that problems are absolutely unforgiving. They can’t be reasoned with, bargained with or bribed. Even a small error can propagate and cascade through the system, causing errors orders of magnitude more numerous. Things I’ll always be doing in future: build in logging and notifications for key processes, build audit tables to track changes to data, build pipeline monitoring in from the start, and document as you go (the holy grail!). Build all the scaffolding you need to catch errors, validate things before uploading, and tell you when something didn’t go 100% as expected. A rough sketch of what that scaffolding can look like is below.
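To make that concrete, here’s a minimal sketch in Python, assuming a pandas DataFrame of orders. The column names, the checks and the upload_to_warehouse step are hypothetical stand-ins for whatever your own stack uses; the idea is simply to validate before uploading, log what happened, and keep a small audit trail.

```python
import logging
from datetime import datetime, timezone

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pipeline")


def validate_orders(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable validation failures (empty if the batch is clean)."""
    problems = []
    required_cols = {"order_id", "customer_id", "amount", "created_at"}
    missing = required_cols - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
        return problems  # no point running the other checks without the columns
    if df["order_id"].duplicated().any():
        problems.append("duplicate order_id values")
    if df["amount"].lt(0).any():
        problems.append("negative amounts")
    if df["created_at"].isna().any():
        problems.append("null created_at timestamps")
    return problems


def audit_record(df: pd.DataFrame, step: str) -> dict:
    """Small audit entry describing what a step did; append these to an audit table."""
    return {
        "step": step,
        "rows": len(df),
        "run_at": datetime.now(timezone.utc).isoformat(),
    }


def load_orders(df: pd.DataFrame, audit_log: list[dict]) -> None:
    problems = validate_orders(df)
    if problems:
        # Fail loudly *before* uploading, so bad rows never reach downstream tables.
        for p in problems:
            log.error("validation failed: %s", p)
        raise ValueError(f"orders batch rejected: {problems}")
    audit_log.append(audit_record(df, "load_orders"))
    log.info("orders batch validated: %s rows ready to upload", len(df))
    # upload_to_warehouse(df)  # hypothetical upload step, depends on your stack


if __name__ == "__main__":
    audit_log: list[dict] = []
    batch = pd.DataFrame(
        {
            "order_id": [1, 2, 3],
            "customer_id": [10, 11, 12],
            "amount": [25.0, 40.0, 13.5],
            "created_at": pd.to_datetime(["2023-01-10", "2023-01-11", "2023-01-12"]),
        }
    )
    load_orders(batch, audit_log)
    log.info("audit entries: %s", audit_log)
```

The point isn’t the specific checks — it’s that a bad batch fails loudly before anything is uploaded, and every step leaves a log line and an audit entry behind, so problems can’t quietly cascade through the rest of the system.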
I’ll post more of these as I spend some more time thinking about everything I’ve learned over the last few years.