Software Engineering practices: Testing
These days, software developers are used to writing unit tests for their code. These tests ensure that every part of a codebase works as intended and provide a safety net that makes it easier to change code without accidentally introducing bugs. Higher-level tests exercise components and the whole system, while integration tests ensure the code works correctly when interacting with external systems. Together, these tests allow developers to move quickly with confidence.
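As a minimal sketch of what this looks like in practice (the function and its behaviour are hypothetical, and any test runner such as pytest would discover the test automatically):

```python
# A tiny unit under test and its unit test (all names hypothetical).
def normalise_name(name: str) -> str:
    """Trim whitespace and title-case a customer name."""
    return name.strip().title()

def test_normalise_name():
    # Each assertion documents one expected behaviour of the unit under test.
    assert normalise_name("  ada lovelace ") == "Ada Lovelace"
    assert normalise_name("GRACE HOPPER") == "Grace Hopper"

test_normalise_name()  # raises AssertionError if any case fails
```

Because each test pins down one small behaviour, a future change that breaks that behaviour fails immediately and visibly.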
For all the same reasons that having a suite of automated tests is hugely valuable for any software application, we need to do the same for all artefacts of a data project. This means having tests for data schemas, data transforms, SQL queries and data algorithms. In fact, we should view every configuration artefact that’s part of a project as code, and test and deploy it with the same care as any other code.
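One way to make a SQL query testable, sketched here with an in-memory SQLite database standing in for the real warehouse (the table, query, and test data are all hypothetical):

```python
# Sketch: treating a SQL transform as code with its own test, using an
# in-memory SQLite database as a stand-in for the real data platform.
import sqlite3

DAILY_TOTALS_SQL = """
    SELECT day, SUM(amount) AS total
    FROM orders
    GROUP BY day
    ORDER BY day
"""

def run_daily_totals(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (day TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn.execute(DAILY_TOTALS_SQL).fetchall()

def test_daily_totals():
    rows = [("2024-01-01", 10.0), ("2024-01-01", 5.0), ("2024-01-02", 7.5)]
    assert run_daily_totals(rows) == [("2024-01-01", 15.0), ("2024-01-02", 7.5)]

test_daily_totals()
```

The same pattern applies to schemas and transforms: feed in small, known inputs and assert on the exact outputs, so the query is versioned, tested, and deployed like any other code.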
Automating the system lifecycle
A data system consists of many parts, ranging from cloud infrastructure, such as storage buckets and databases, to the code that runs and processes data. To build these systems, we don't just write code; we also need to deploy it and provision the infrastructure on which it runs.
Too many systems rely on manual steps to build and deploy parts of the system, or to perform maintenance tasks such as data updates and bug fixes. The result can be a brittle system with plenty of scope for human error. Manual processes are also slow and costly to maintain, and that cost grows as the system becomes bigger and more complex. To avoid these problems, we should aim to automate every aspect of the system lifecycle: building, deploying, provisioning, and maintaining all parts of the system.
Cloud platforms make this goal significantly more achievable. Every piece of infrastructure can be configured via code so that no manual steps are necessary. This means that, ideally, we should be able to recreate a whole system from scratch just by running code.
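The core idea can be sketched as a declarative description of desired resources plus an idempotent "apply" step. This is a toy illustration only, not a real provisioning tool; actual platforms (e.g. Terraform, CloudFormation, Pulumi) follow the same shape but talk to real cloud APIs:

```python
# Toy illustration of infrastructure-as-code: a declarative spec plus an
# idempotent apply() step. Every resource name and field here is hypothetical.
DESIRED = {
    "bucket:raw-data": {"type": "storage_bucket", "region": "eu-west-1"},
    "db:analytics": {"type": "database", "engine": "postgres"},
}

def apply(desired, existing):
    """Create anything missing; return the resulting state of the world."""
    state = dict(existing)
    for name, spec in desired.items():
        if name not in state:   # only create what does not exist yet,
            state[name] = spec  # so re-running apply() is a no-op
    return state

state = apply(DESIRED, existing={})       # recreate the system from scratch
assert apply(DESIRED, state) == state     # rerunning changes nothing
```

Because the whole system is described in code, rebuilding it from scratch is just a matter of running that code again, and the description itself can be reviewed, tested, and versioned like any other artefact.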