Date: 2024-11-26

Netflix wasn't always the tech giant we know it as today. With its roots in shipping DVDs, and a now-notorious attempt to sell itself to Blockbuster in 2000, it has been a long journey to the cord-cutting streaming platform we all know now. Along the way, Netflix has taken a whole host of steps forward in technology, some of them proprietary and many more openly accessible.

But one of the more interesting open tech ideas to come out of Netflix is something you might not have heard of, or even noticed. Chaos engineering was essential to building a stable, resilient platform at Netflix in the early days of streaming and, whether you know it or not, it is keeping your content stable even now. But what on earth is chaos engineering - and why are monkeys involved?

What is chaos engineering?

And what do monkeys have to do with it?

Chaos engineering is a proactive approach to reliability: deliberately breaking your own services. In practice, this means injecting faults, such as surges in traffic or server crashes, into various parts of a distributed system in order to identify its vulnerabilities.

The idea is that every distributed system beyond a certain size will inevitably experience failures - nothing can ever be 100% reliable. By injecting faults proactively, the designers of the system can test how it responds to these failures and ensure that it is architected to recover safely without service disruption. Each weakness found and fixed this way improves the system's overall resilience.

History of chaos engineering

Chaos engineering wasn't invented at Netflix, but Netflix was one of the first major companies to adopt the practice widely. Amazon and Google both dabbled in chaos engineering, but on a smaller scale. Netflix's motivation was a widespread outage in 2008, caused by database corruption (the original post-mortem is down, but there's still a write-up available from ZDNet). The same outage also prompted Netflix to begin migrating to AWS, a move it completed in early 2016 when it closed the last of its own streaming data centers.

Chaos engineering at Netflix

Move fast and break things - literally

As part of its chaos engineering program, Netflix built a tool known as 'Chaos Monkey', which randomly terminates instances (read: servers) in production, forcing engineers to account for these sorts of breakages when designing their services. The idea was to simulate the kind of common, high-level failure that can affect customers and ultimately break their immersion in the platform.
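To make the idea concrete, here is a minimal sketch of random instance termination in Python. It is not Netflix's Chaos Monkey: the `Instance`, `ServiceGroup`, and `unleash_monkey` names are hypothetical, and the 'servers' are just in-memory objects. It only shows the principle that a service built with redundancy keeps answering requests after one instance is killed at random.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Instance:
    """A stand-in for a server; hypothetical, not a real cloud resource."""
    instance_id: str
    running: bool = True


@dataclass
class ServiceGroup:
    """A hypothetical group of interchangeable instances behind one service."""
    name: str
    instances: List[Instance] = field(default_factory=list)

    def healthy(self) -> List[Instance]:
        return [i for i in self.instances if i.running]

    def handle_request(self) -> str:
        # Route to any healthy instance; fail only when none are left.
        candidates = self.healthy()
        if not candidates:
            raise RuntimeError(f"{self.name}: no healthy instances")
        return random.choice(candidates).instance_id


def unleash_monkey(group: ServiceGroup, rng: random.Random) -> Optional[Instance]:
    """Terminate one randomly chosen running instance, if any remain."""
    candidates = group.healthy()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.running = False
    return victim


if __name__ == "__main__":
    rng = random.Random(42)
    group = ServiceGroup("api", [Instance(f"i-{n}") for n in range(3)])
    victim = unleash_monkey(group, rng)
    print(f"terminated {victim.instance_id}")
    print(f"request still served by {group.handle_request()}")
```

With three instances, losing one still leaves two to serve traffic. A service that runs on a single instance would raise an error on the next request, which is exactly the kind of design flaw this approach is meant to surface.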

The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables — all the while we continue serving our customers without interruption (Netflix - via Medium/Techblog)

This idea was clearly successful, as Netflix went on to build out a full suite of similar tools, covering everything from simulated security intrusions to network throttling issues and general misconfigurations. This suite was originally known as Netflix's Simian Army, and its monkeys ranged from a 'Janitor Monkey', which cleans up unused resources, to a 'Latency Monkey', which introduces artificial delays in service response times. Each monkey was designed to target a different aspect of service resilience, helping Netflix handle a wide range of failures.
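The latency idea is easy to illustrate on a small scale. The sketch below is an assumption-laden toy, not Netflix's tool: `latency_monkey` is a hypothetical decorator that delays a configurable fraction of calls to a function, and `fetch_title` is a made-up service call used only to show it in action.

```python
import functools
import random
import time


def latency_monkey(delay_seconds, probability, rng=None, sleep=time.sleep):
    """Return a decorator that delays a fraction of calls before they run.

    `sleep` is injectable so the delay can be observed without waiting.
    """
    if rng is None:
        rng = random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Inject the artificial delay on roughly `probability` of calls.
            if rng.random() < probability:
                sleep(delay_seconds)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@latency_monkey(delay_seconds=0.05, probability=0.5, rng=random.Random(7))
def fetch_title(title_id):
    """Hypothetical service call; about half of all calls are slowed down."""
    return f"title-{title_id}"


if __name__ == "__main__":
    start = time.perf_counter()
    results = [fetch_title(n) for n in range(4)]
    print(results, f"{time.perf_counter() - start:.2f}s")
```

Callers of `fetch_title` get the same results as before, just more slowly some of the time, which makes it possible to check whether timeouts and fallbacks upstream behave sensibly.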

Netflix has slowly been modernizing and updating its Simian Army - some of these tools are now deprecated or have been rolled into other tools within Netflix.

Taking chaos engineering to the extreme

Now, if you think this sounds brave, I'd agree with you. Chaos engineering can go a long way toward helping companies avoid regressions and mistakes in the design of highly available, distributed systems. However, it is difficult to implement and relies on a very high engineering standard.

Netflix has taken this one step further since 2010, though, integrating its disaster-recovery testing into its chaos engineering program. 'Chaos Gorilla' simulates the loss of an entire availability zone of Netflix's main cloud partner, AWS, while 'Chaos Kong' simulates the loss of an entire region. This allows Netflix to design for resiliency even beyond the standards set by its cloud provider.
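The reasoning behind zone-level and region-level tests can be sketched in a few lines of Python. Everything here is illustrative: the `PLACEMENT` table, the instance IDs, and the function names are assumptions made for this example and say nothing about how Netflix's tools work internally. The sketch simply asks which instances would survive if a whole zone or region disappeared, and flags any zone or region whose loss would leave nothing running.

```python
from typing import Dict, List, Tuple

# Hypothetical placement: instance id -> (region, availability zone).
Placement = Dict[str, Tuple[str, str]]

PLACEMENT: Placement = {
    "i-1": ("us-east-1", "us-east-1a"),
    "i-2": ("us-east-1", "us-east-1b"),
    "i-3": ("us-west-2", "us-west-2a"),
}


def survivors_after_zone_outage(placement: Placement, zone: str) -> List[str]:
    """Instances still running if one availability zone is lost."""
    return sorted(i for i, (_, z) in placement.items() if z != zone)


def survivors_after_region_outage(placement: Placement, region: str) -> List[str]:
    """Instances still running if one whole region is lost."""
    return sorted(i for i, (r, _) in placement.items() if r != region)


def single_points_of_failure(placement: Placement) -> Dict[str, List[str]]:
    """Zones and regions whose loss would leave no instance running."""
    zones = sorted({z for _, z in placement.values()})
    regions = sorted({r for r, _ in placement.values()})
    return {
        "zones": [z for z in zones
                  if not survivors_after_zone_outage(placement, z)],
        "regions": [r for r in regions
                    if not survivors_after_region_outage(placement, r)],
    }


if __name__ == "__main__":
    print(survivors_after_zone_outage(PLACEMENT, "us-east-1a"))
    print(survivors_after_region_outage(PLACEMENT, "us-east-1"))
    print(single_points_of_failure(PLACEMENT))
```

A real exercise goes much further than this bookkeeping, since the surviving instances also have to absorb the redirected traffic, but the same question sits at its core: is there any single zone or region the service cannot afford to lose?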

Chaos engineering at Netflix has been a success

Do you remember Netflix's last downtime?

Netflix's success with chaos engineering has inspired many other companies to adopt similar practices. Companies such as Amazon, Google, LinkedIn, and Microsoft have implemented chaos engineering principles to improve the reliability of their systems, and this branch of technology has helped to ensure that the services we use the most remain stable, even when things go wrong.

The philosophy behind chaos engineering extends beyond technology. It reflects a cultural shift toward embracing uncertainty and fostering innovation. Tech companies have long embraced a blameless culture for outages, treating failures as opportunities for learning rather than as crises to be avoided. Big tech has also learned from other industries, such as aviation and nuclear power generation, to design for an extreme level of resilience.

Chaos engineering might seem silly, but it has had a huge impact

The idea of deliberately breaking your own services can (and maybe should) seem a bit silly on the surface. You might be inclined to think that surely there must be a way to test these kinds of outages outside of production, in less critical environments with far less complexity. Testing outside production is indeed more common beyond the largest tech companies, where the resources for in-production chaos engineering may not exist. But by keeping this kind of chaos testing so closely tied to production, tech companies ensure that bad engineering practices have nowhere to hide, preventing regressions and forcing their engineers to build with scale and reliability in mind at all times.

So, while you might not have heard of chaos engineering, and almost certainly won't have noticed it, it's a little part of the secret sauce that keeps your favorite services stable - though we wouldn't recommend introducing it to your home NAS anytime soon.