Chaos Is Good! — In Tech

--

At Lifion by ADP, we strongly believe that DevOps, Security Engineering, and SRE best practices help make our systems not just agile but robust. We are also heavy users of the AWS Cloud, and our Kubernetes cluster proudly hosts more than 200 microservices that power our award-winning Next Gen “Human Capital Management” (HCM) platform.

To say we are “proud” to be part of the infrastructure and reliability team responsible for such an important system is an understatement, and the role comes with significant responsibilities: we must maintain our stakeholders’ confidence in our infrastructure and applications.

To help with this, we decided to adopt Chaos Engineering at Lifion, and that is the focus of this post. We hope that sharing some insights into our approach may help you in your organization too.

Photo by Michał Parzuchowski on Unsplash

How can we ensure our foundation is strong? — By randomly shaking some pillars — By creating turbulence — Welcome to Chaos Engineering

The Definition

Photo by Ágatha Depiné on Unsplash

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

The easiest way to explain Chaos Engineering is to compare it with a fire drill:

A fire drill is a method of practicing how a building would be evacuated in the event of a fire or other emergency.

Just as a fire drill ensures our evacuation procedures, machinery, and tools are intact before real havoc happens, with Chaos Engineering we ensure our Quality of Service (QoS) is intact by randomly introducing failure conditions, reviewing the results, and addressing any gaps in resiliency.

The Environment

Before we get into the framework, tools, and “game days”, I would like to emphasize that Chaos Engineering makes the most sense when it is run in the production environment. Our goal in applying Chaos Engineering was not to create downtime, but rather to ensure our reliability, scalability, and availability SLAs were intact. In other words, if there is a known instability, the first step is to fix it; we can then keep using Chaos Engineering to test randomly from time to time and maintain confidence.

For our Chaos Engineering efforts to be successful we needed a neat framework with an easy-to-use toolset, a developer-friendly pipeline, reporting of outcomes, and last but not least a feedback loop to fix any new findings.

The Framework

When we started evaluating Chaos Engineering tools and frameworks a couple of years back, there were only a few viable options that would meet our requirements.

We were looking for a toolkit that was open source, actively maintained, extensible enough to integrate with our ecosystem, and easy for our developers to adopt.

Some of the options we considered were Netflix Chaos Monkey, Gremlin, and LitmusChaos. However, Chaos Monkey seemed to no longer be very actively maintained, Gremlin was not open source and did not appear to have some of the functionality we needed at the time, and LitmusChaos was still very new. We ended up choosing Chaos Toolkit.

Ref: https://chaostoolkit.org/

Getting started with Chaos Toolkit was very easy: write a JSON experiment, trigger the CLI, and boom, our first experiment was running. This was great for trying it out and getting familiar with the toolkit. But how could we get the hundreds of developers across our organization to buy into and adopt Chaos Engineering as a practice along with this toolkit? We decided to build a framework to streamline their work, support our engineering teams’ various use cases and needs, and integrate well into our specific environment. The framework needed to make it easy for engineering teams to write, test, run, fail, and fix their experiments so that they would start using it and keep using it.
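As a minimal illustration (experiment.json is just a placeholder file name, not one of our actual experiments), the whole loop is a pip install and one CLI call:

# Install the Chaos Toolkit CLI, then execute an experiment definition.
# Each run is recorded in journal.json, which can later be turned into a report.
pip install chaostoolkit
chaos run experiment.json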

We built our framework around our existing ecosystem of tools, namely Bitbucket, Jenkins, and Artifactory.

The Framework Repositories

We wrapped the following repositories into our framework:

  • chaos-runner: Base Chaos Toolkit plus pre-packaged extensions. This produces a single Docker image that takes the Chaos Toolkit as its base and installs the required extensions on top, such as the AWS and Kubernetes packages, as sketched below.
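A minimal sketch of what such a Dockerfile can look like follows; the base image tag, the extension list, and the paths are illustrative assumptions rather than our exact build:

# Sketch of a chaos-runner image: Chaos Toolkit as the base tooling,
# public extensions installed on top, plus our custom extension modules.
FROM python:3.9-slim

# Public Chaos Toolkit packages (pin exact versions in a real build)
RUN pip install --no-cache-dir \
      chaostoolkit \
      chaostoolkit-aws \
      chaostoolkit-kubernetes

# Custom actions, probes, and controls from the chaos-extensions repository
COPY chaos-extensions/ /opt/chaos-extensions/
ENV PYTHONPATH=/opt/chaos-extensions

# The image is invoked as: docker run chaos-runner run <experiment.json>
ENTRYPOINT ["chaos"]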
  • chaos-extensions: Chaos Toolkit comes with prebuilt extensions, but how could we integrate our organization’s ecosystem of tooling into it? We achieved this by writing custom modules in Python. Some examples included probing metrics from our monitoring platform to determine system stability, integrating our custom functional validation framework APIs, and more. Essentially, we were able to create our own actions, controls, and probes as we wished. It is also worth noting that these modules are installed in the Dockerfile shown above. The repository is laid out as follows, with a sketch of one such probe after the tree:
.
├── common
│   ├── formatting.py
│   └── utils.py
├── actions
│   ├── aws
│   │   └── elasticsearch.py
│   ├── k6
│   │   └── actions.py
│   └── lifion
│       └── persistence.py
├── controls
│   ├── jira
│   │   └── tickets.py
│   └── platform
│       ├── lifion_aws_auth.py
│       └── resources
│           └── docker-compose.yml
└── probes
    ├── aws
    │   ├── elasticsearch.py
    │   └── route_53.py
    ├── http
    │   └── make_multiple_calls.py
    └── platform
        ├── availability.py
        ├── error_rate_percentage.py
        └── responsetime_95_percentile.py
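To give a flavor of these modules, here is a hypothetical sketch of what a probe such as probes/platform/availability.py might look like; the monitoring endpoint, the environment variable, and the response shape are assumptions, not our actual integration:

# Hypothetical probe: ask the monitoring platform for a service's availability.
# Chaos Toolkit calls this from a steady-state hypothesis and compares the
# returned value against the tolerance declared in the experiment.
import os

import requests


def availability(service: str, window_minutes: int = 5) -> float:
    """Return the availability percentage reported for the given service."""
    base_url = os.environ["MONITORING_API_URL"]  # assumed configuration
    response = requests.get(
        f"{base_url}/availability",
        params={"service": service, "window": window_minutes},
        timeout=10,
    )
    response.raise_for_status()
    return float(response.json()["availability_percentage"])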
  • chaos-experiments: In Chaos Toolkit, every fault injection is called an experiment. These experiments are written in JSON, as per the example below. We keep an isolated repository just for them, so that developers need not worry about the base toolkit or custom modules and can focus on their experiments.
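The sketch below is a simplified experiment in the standard Chaos Toolkit format; the service name, namespace, and health-check URL are placeholders rather than real Lifion values. The steady-state hypothesis asserts that the service answers its health check, and the method kills a matching pod through the chaostoolkit-kubernetes extension:

{
  "version": "1.0.0",
  "title": "terminate-service-pod",
  "description": "The service stays available while one of its pods is killed",
  "steady-state-hypothesis": {
    "title": "Service responds to its health check",
    "probes": [
      {
        "type": "probe",
        "name": "service-must-respond",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "https://example.internal/my-service/health",
          "timeout": 3
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "terminate-one-pod",
      "provider": {
        "type": "python",
        "module": "chaosk8s.pod.actions",
        "func": "terminate_pods",
        "arguments": {
          "label_selector": "app=my-service",
          "qty": 1,
          "rand": true,
          "ns": "default"
        }
      }
    }
  ],
  "rollbacks": []
}

When the CLI runs a file like this, it verifies the hypothesis before and after injecting the fault and records the whole run in journal.json for reporting.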
  • chaos-scheduler: Easy-to-use, Jenkins-based scheduling for experiments, which can take parameters such as environment variables and VPC IDs. This is just a simple Jenkinsfile per experiment, and a single scheduler can drive one or more experiments; a hypothetical sketch is shown below.
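The following is a hypothetical sketch of such a Jenkinsfile; the cron schedule, parameter defaults, image name, and experiment path are assumptions, and it expects the chaos-experiments repository to already be checked out into the workspace:

// Hypothetical per-experiment Jenkinsfile for the chaos-scheduler repository.
pipeline {
    agent any

    parameters {
        string(name: 'TARGET_ENV', defaultValue: 'dev', description: 'Environment to target')
        string(name: 'VPC_ID', defaultValue: '', description: 'VPC to run the experiment against')
    }

    triggers {
        // Run the experiment on a recurring schedule (weekday mornings here)
        cron('H 9 * * 1-5')
    }

    stages {
        stage('Run chaos experiment') {
            steps {
                // Job parameters are exposed to the shell as environment variables.
                sh '''
                    docker run --rm \
                      -e TARGET_ENV="${TARGET_ENV}" \
                      -e VPC_ID="${VPC_ID}" \
                      -v "${WORKSPACE}:/work" -w /work \
                      chaos-runner:latest \
                      run chaos-experiments/terminate-service-pod.json
                '''
            }
        }
    }

    post {
        always {
            // journal.json captures the full run for reporting and review.
            archiveArtifacts artifacts: 'journal.json', allowEmptyArchive: true
        }
    }
}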

With these components, we provided a well-defined framework for developers to easily adopt and start contributing to our experiment test suite.

Shared Responsibility Model

In our organization we have modular teams and a microservices architecture. We wanted each team to be able to own and manage their chaos experiments independently.

We defined the process below for onboarding experiments: our team would own the framework, engineering teams would be responsible for their own experiments, and we would come together to run game days, in which we run tests together in a given environment, address any issues, and iterate on and improve our experiments with rapid feedback from the organization.

Shared Responsibility Model

Summary

Adopting Chaos Engineering has been very beneficial for us. It has helped us efficiently identify and fix potential weaknesses in critical areas of our platform and client-facing system processes, such as payroll runs and disaster recovery testing, through controlled experiments before our clients can be impacted by real failures. Some of our key learnings and takeaways are summarized below. We hope you found this article useful; if this is an area of interest to you, let’s connect. We would love to hear your thoughts and experiences too!

  • Choose the tool that fits your needs — Now there are many options. For example, AWS has its AWS Fault Injection Simulator, and there are cloud-native tools emerging such as LitmusChaos and Gremlin.
  • Build a framework — Support your developers with an easy-to-use framework for writing and creating chaos :)
  • Share the responsibility — Let developers own their tests and help them to onboard new scenarios easily. Consider using game days also.
