DevSecOps Engineer, Michael Hogg, explores how chaos engineering has grown across disciplines – and is now taking on an important role in DevOps.

Why would you build something, only to try to break it? Breaking things on purpose hasn’t generally been common practice in the IT space. But that’s changing. As systems become more complex, the potential risks and failures that could bring everything to a halt are increasing. What’s more, seamless, uninterrupted digital customer experiences are becoming a vital differentiator for organizations in all sectors – from entertainment to retail and, even more crucially, the public sector.

Intentionally breaking technology systems to gauge their resilience – known as chaos engineering – was first developed by Netflix in 2008. Read on to discover how chaos engineering has grown across disciplines – and is now taking on an important role in DevOps.

What is chaos engineering?

Chaos engineering defined: The discipline of experimenting on a system in order to build confidence in its capability to withstand turbulent conditions in production.

Netflix, the pioneer of chaos engineering, first developed the Chaos Monkey tool to “…pseudo-randomly pluck a server from our production deployment on AWS and kill it.” 1 Netflix reasoned that server failures are bound to happen and they wanted to ensure they had the capability to fix them during business hours, without customers even noticing. 
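The idea of pseudo-randomly picking a server and killing it during business hours can be sketched in a few lines of Python. This is an illustration only, not Netflix's implementation: the function names, the 9:00 to 17:00 weekday window, the instance names and the placeholder `terminate` function are all assumptions made for the example.

```python
import random
from datetime import datetime, time


def in_business_hours(now, start=time(9, 0), end=time(17, 0)):
    """Return True on a weekday between start and end (an assumed window)."""
    return now.weekday() < 5 and start <= now.time() < end


def pick_victim(instances, now, rng=random):
    """Pick one instance at random, or None if the fleet is empty
    or the current time falls outside business hours."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(list(instances))


def terminate(instance_id):
    # Placeholder only: a real tool would call the cloud provider's API here.
    print(f"terminating {instance_id}")


if __name__ == "__main__":
    fleet = ["web-1", "web-2", "web-3"]  # hypothetical instance names
    victim = pick_victim(fleet, datetime.now())
    if victim is not None:
        terminate(victim)
```

Restricting the selection to business hours reflects the reasoning above: failures are injected only when engineers are available to respond.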

Next, they built Chaos Kong – as the name suggests, a bigger, badder version of Chaos Monkey – designed to kill an entire AWS region for Netflix. The company was able to identify systemic weaknesses, fix them and avert significant impacts; Netflix continues to run Chaos Kong exercises regularly. A senior chaos engineer at Netflix explains it in more detail here.

Based on these tools, Netflix developed a new discipline to apply across other projects and departments, coining the term “chaos engineering”. The Principles of Chaos Engineering is a living document published by Netflix so that other organizations can contribute their knowledge.

Why integrate chaos engineering into DevOps methodology

There are tests. And then there are tests.

Several types of testing are typically performed in a development environment, including: 

Unit testing – An individual unit, such as a component or small piece of code, is tested to verify its correctness and validate that it performs as expected.

Integration testing – Independently developed software modules are combined and tested to ensure they work together as expected and to expose faults in the interaction between integrated units.
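As a rough illustration of the difference, the sketch below tests one function in isolation and then tests two functions working together. The functions `parse_price` and `cart_total` are hypothetical and exist only for this example.

```python
def parse_price(text):
    """Convert a price string such as "$4.50" to integer cents."""
    return round(float(text.strip().lstrip("$")) * 100)


def cart_total(prices, tax_rate):
    """Sum prices (in cents) and add tax, rounded to the nearest cent."""
    subtotal = sum(prices)
    return subtotal + round(subtotal * tax_rate)


def test_parse_price_unit():
    # Unit test: one function, checked on its own.
    assert parse_price("$4.50") == 450


def test_checkout_integration():
    # Integration test: parsing and totalling are combined and checked together.
    prices = [parse_price(p) for p in ["$4.50", "$5.50"]]
    assert cart_total(prices, tax_rate=0.10) == 1100
```

Both tests run before release, against code under the team's control. Neither says anything about how the running system behaves when a server or network link fails.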

Chaos engineering isn’t intended to replace these types of testing, but to work in harmony with them to provide a system with maximum availability. A key differentiator: chaos engineering is performed in a production environment – so the stakes are high. 

According to a recent Gartner survey, the top three priorities for CIOs were digital initiatives, revenue/business growth and operational excellence. 2 None of these priorities can be achieved if infrastructure systems are not adequately reliable. Further, Gartner anticipates that 40% of organizations will implement chaos engineering practices as part of DevOps initiatives by 2023, reducing unplanned downtime by 20%.

Integrating chaos engineering into DevOps methodology

When chaos is introduced into any environment in a deliberate attempt to break things, it is intended to provide peace of mind that the service you’re offering is resilient. By integrating it in DevOps, you can build more robust applications to support the business.

Here’s how to integrate chaos engineering into a DevOps practice:

  • Establish a baseline – Define the “normal” steady state, using both technical and business metrics.
  • Develop a hypothesis – Spell out what you expect to happen in the chaos experiment; ideally, the steady state will hold in both the control and challenge groups.
  • Introduce real-world incidents into a challenge group – Kill a critical process, terminate a random server, sever a connection.
  • Solve issues – Develop automated, self-healing fixes as problems arise. 
  • Continue attempts to invalidate the hypothesis – Identify differences between control and challenge groups and incrementally expand the confines of your hypothesis.
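The steps above can be sketched as a small simulation. This is a toy model under stated assumptions, not a production tool: the `Service` class, its failover behaviour, the request success rate used as the steady-state metric and the 1% tolerance are all invented for illustration.

```python
import random


class Service:
    """Toy service with several replicas and optional failover (simulation only)."""

    def __init__(self, replicas, failover, seed=0):
        self.healthy = [True] * replicas
        self.failover = failover
        self.rng = random.Random(seed)

    def kill_random_replica(self):
        """Inject a failure by marking one randomly chosen replica unhealthy."""
        victim = self.rng.randrange(len(self.healthy))
        self.healthy[victim] = False
        return victim

    def handle_request(self):
        """Route a request to a random replica; return True if it is served."""
        first = self.rng.randrange(len(self.healthy))
        if self.healthy[first]:
            return True
        # With failover enabled, the request is retried on any healthy replica.
        return self.failover and any(self.healthy)


def success_rate(service, requests=1000):
    """Steady-state metric: the fraction of requests served successfully."""
    return sum(service.handle_request() for _ in range(requests)) / requests


def run_experiment(failover, tolerance=0.01):
    control = Service(replicas=3, failover=failover, seed=1)
    challenge = Service(replicas=3, failover=failover, seed=2)
    baseline = success_rate(control)      # establish a baseline
    challenge.kill_random_replica()       # introduce an incident in the challenge group
    observed = success_rate(challenge)
    # Hypothesis: the challenge group stays within tolerance of the control group.
    return {
        "baseline": baseline,
        "observed": observed,
        "hypothesis_holds": baseline - observed <= tolerance,
    }


if __name__ == "__main__":
    print(run_experiment(failover=False))
    print(run_experiment(failover=True))
```

In this model the hypothesis fails without failover, because requests routed to the dead replica are lost, and holds once failover is enabled. That mirrors the "solve issues" step: the experiment exposes the weakness, and an automated fix closes it.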

Chaos engineering in DevOps helps your enterprise mitigate security risks, improve the effectiveness of your IT team by providing deeper insight into how apps work, reduce maintenance costs and deliver a more consistent, positive customer experience.

Learn more about how DevOps and Cloudreach’s DevOps as a Service offering can help your organization with cloud transformation.