Why you should consider Developing a Culture of Resilience

Russ Miles, eSynergy Lead Associate

Speed (of change delivery, i.e. agility) or Reliability: that has been the question. You can go fast, or you can be reliable (and secure); which is it going to be?

That was before research by the DevOps Research & Assessment program, captured in the wonderful book “Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations” by Nicole Forsgren, Jez Humble and Gene Kim, showed that high-performing, technology-driven engineering organisations do not exactly make this trade-off.

It’s not Speed or Reliability, it is Speed and Reliability, or, perhaps more accurately, Reliability because of Speed.

Speed and Resilience

Going fast has traditionally been a minefield of trade-offs between security concerns, source code quality, predictability, scalability and reliability itself! So what are high-performing teams and organisations doing to handle this veritable minefield? What will it take to be the type of technology-enabled business that eats others’ breakfasts, rather than being the caterer for others?

One answer is that high-performing teams are investing in a culture of resilience. This culture supports your speed of change by building the organisational capacities that help you thrive through, and learn from, the inevitable challenges your system will face. You can go fast, in terms of speed of change, and be reliable and secure, through a culture of resilience that embraces the reality of complex systems.

A Culture of Resilience thrives on Fail (Safe)

Attempting to go fast and never make a mistake is, with today’s highly complex and interdependent systems, a futile endeavour. If you base the principles of your socio-technical system’s architecture, design and processes on “Never fail”, then you’re back to choosing between Speed and Reliability (and Security). So the first thing to realise is that resilience requires you to think a little differently about your systems.

In the dim and distant past I used to work on safety-critical systems, where failure is the norm. That might sound counter-intuitive, but in the face of complexity and reality it was just a fact: systems fail. Designing to never fail was not possible, but designing to fail “safe” was.

The key was to develop the right signals and reactions as feedback loops within the systems such that people, with all their creativity, unpredictability and potential for well-intentioned missteps, could safely work in and inhabit these systems while performing at their very best. Sound familiar? Whether we’re talking about the cockpit of a fast jet, or a team responsible for creating and operating valuable software services, the needs are startlingly similar.

Resilience Embraces the Real World of Fast, Complex, Socio-technical System Evolution

Resilience is your capacity to anticipate and adapt to surprising circumstances, and it is essential because the alternative, anticipating and being completely prepared for all circumstances, is simply not plausible.

In other words, “s*&t happens” even when we construct our systems as perfectly as we think we can, even with the best minds involved. So in order to invest in speed, reliability and security we have to invest in resilience; resilience embraces the way the real world works.

Compete through Investment in Resilience

High-performing, technology-driven organisations are leveraging resilience to compete in the marketplace without slowing down their speed of change. Speed of change means better and better experiences for their users, and resilience means that improved experience is not increasingly marred by problems with security and reliability. Resilience doesn’t promise an absence of problems; it promises that you’ll be much better prepared for them when they arise.

In a future article I’m going to explore how you can get started investing in resiliency by understanding the 4 resilience capacities, and then developing the 7 resilience properties for high priorities such as reliability and security.

In the meantime, I wholeheartedly recommend checking out this resource by Lorin Hochstein on GitHub for some excellent background reading and research on engineering resiliency.

The Author

Russ Miles, a Lead Associate of eSynergy, is on a mission to help organisations establish agile, reliable, secure and, ultimately, resilient and humane socio-technical systems that enable all stakeholders, from the users and customers to the builders and operators, to thrive inside and outside of those systems.
 
Russ is currently a lead engineer with Segovia Technology at Crown Agents Bank, where his team develop the payment and foreign exchange systems that help incredible organisations such as the UN and Save the Children distribute much-needed funds to hard-to-reach countries and markets.

Russ is co-founder of the free and open source Chaos Toolkit project. He’s also an international consultant, trainer, speaker, and author. He is a recognised expert in Chaos Engineering and has contributed to “Chaos Engineering: System Resiliency in Practice” from O’Reilly Media as well as having written “Learning Chaos Engineering”, also by O’Reilly Media, where he explores how to build trust and confidence in modern, complex systems by applying chaos engineering to surface evidence of system weaknesses before they affect your users.

Russ can be reached on LinkedIn and on Twitter.