Failure As A Use Case: Trouble Maker

Attention: The following article was published over 8 years ago, and the information provided may be aged or outdated. Please keep that in mind as you read the post.

For too many reasons to count, it is nearly a guarantee that your production software systems will fail in some way.

We attempt to QA our software, which essentially tests the application’s “business” functions, but it’s very difficult to test what can go wrong in an actual production environment: memory utilization and leaks, port exhaustion, connection pool timeouts, too many open file handles, and numerous other problems.

More potential issues can be introduced when distributed systems such as Microservices are adopted, as the entire system has many more moving parts than a standard monolithic web application. Service registries, load balancing and failover, and redundancy are essential, so there is even more surface area for these types of potential failures. Handling these types of failures is a characteristic of system stability.

So, how can system stability be tested and validated outside of production?

Well, it’s difficult (and expensive) to do this. Creating and maintaining a production-like environment, and then applying comparable usage and load to it, is complex and costly.

Our Solution

Our answer: engineer failures into your platform’s production environment.

Sound crazy? Maybe. But Netflix has done this using a framework called Chaos Monkey, which can be configured to randomly take down AWS resources (e.g., load balancers) during normal business hours. When this happens, automated or manual procedures should remediate the problem while the system continues to operate and serve users.

So, consider this thought experiment: if you know that system failures are occurring, but you are not getting help desk calls and your pager and alerts aren’t going off, then your system is stable and durable.

Introducing Trouble Maker

Netflix’s Chaos Monkey is based on the Amazon EC2 API, so we wanted to implement a solution that did not depend on the cloud and could be used within an enterprise environment.

Trouble Maker was implemented for Java-based web and Microservices-based applications. It randomly takes down application services and provides a web console for running stability tests against servers. Here’s a diagram of how it works:

[Image: Trouble Maker diagram]

Trouble Maker communicates with a servlet registered in each Java-based client Microservice. It also communicates with a service registry (Eureka by default) to determine the location of the services to operate against.
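To make the lookup step concrete, here is a minimal Java sketch of how a tool of this kind might resolve a service name to its instances and then address the servlet on each one. It is not taken from the Trouble Maker source: the ServiceRegistry interface, the in-memory registry that stands in for Eureka, the service name "orders", the host names, and the /trouble/kill path are all assumptions made up for illustration.

```java
import java.net.URI;
import java.util.List;
import java.util.Map;

final class ServiceLookupSketch {

    /** Hypothetical stand-in for a service registry client such as Eureka. */
    interface ServiceRegistry {
        List<URI> instancesOf(String serviceName);
    }

    /** In-memory registry so the sketch runs without a real registry server. */
    static final class InMemoryRegistry implements ServiceRegistry {
        private final Map<String, List<URI>> services;

        InMemoryRegistry(Map<String, List<URI>> services) {
            this.services = Map.copyOf(services);
        }

        @Override
        public List<URI> instancesOf(String serviceName) {
            // Unknown services simply have no instances.
            return services.getOrDefault(serviceName, List.of());
        }
    }

    /** Builds the URL of the client-side servlet; the path is an assumption. */
    static URI killEndpoint(URI instanceBase) {
        String base = instanceBase.toString();
        if (base.endsWith("/")) {
            base = base.substring(0, base.length() - 1);
        }
        return URI.create(base + "/trouble/kill");
    }

    public static void main(String[] args) {
        ServiceRegistry registry = new InMemoryRegistry(Map.of(
                "orders", List.of(
                        URI.create("http://app-1:8080"),
                        URI.create("http://app-2:8080"))));

        // Print the endpoint that would be called on each registered instance.
        for (URI instance : registry.instancesOf("orders")) {
            System.out.println(killEndpoint(instance));
        }
    }
}
```

Keeping the registry behind a small interface means the same logic could be pointed at Eureka or at any other registry without changing the code that talks to the client servlet.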

Here’s a screenshot of the Trouble Maker dashboard:

[Image: Trouble Maker dashboard]

Random Kill

Trouble Maker can be configured using a cron expression to randomly select a Java app server instance and kill it.
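The selection step can be sketched in a few lines of Java. This is an illustration under stated assumptions, not Trouble Maker’s actual implementation: the class name, the method names, and the instance URLs are hypothetical, the kill action is passed in as a callback so the sketch only prints what it would do, and the cron trigger itself is left out (a scheduler would call killRandom each time the expression fires).

```java
import java.util.List;
import java.util.Optional;
import java.util.Random;
import java.util.function.Consumer;

final class RandomKillSketch {

    /** Picks one instance uniformly at random, or empty if none are registered. */
    static Optional<String> pickVictim(List<String> instances, Random random) {
        if (instances.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(instances.get(random.nextInt(instances.size())));
    }

    /** Applies the supplied kill action to one randomly chosen instance. */
    static Optional<String> killRandom(List<String> instances,
                                       Random random,
                                       Consumer<String> killAction) {
        Optional<String> victim = pickVictim(instances, random);
        victim.ifPresent(killAction);
        return victim;
    }

    public static void main(String[] args) {
        // Hypothetical instance URLs; a real run would get these from the registry.
        List<String> instances = List.of(
                "http://app-1:8080", "http://app-2:8080", "http://app-3:8080");

        // The kill action only logs here; nothing is actually shut down.
        killRandom(instances, new Random(), url -> System.out.println("Would kill " + url));
    }
}
```

Passing the Random and the kill action in as parameters keeps the selection logic testable: a seeded Random makes a run repeatable, and a recording callback shows exactly which instance was chosen.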

Doing this in production might seem a little risky, but if you want a stable Microservices platform, this will test its durability. Truly understanding its durability will allow your team members to enjoy their time off and recharge their batteries so they can come back to work and implement killer applications.

Open Source Project

Trouble Maker is an open source project hosted on GitHub that can be found here: https://github.com/in-the-keyhole/khs-trouble-maker.

Please feel free to make any suggestions or submit pull requests. Our goal is for this to help organizations that are implementing Microservices to implement stable and durable platforms.

— David Pitt & The Keyhole Team