Hystrix To Prevent Hysterix

Dallas Monson Java, Microservices, Spring 2 Comments

Your application just launched into production after 6+ months of development and bug fixes. In true Agile fashion, you and the team released your MVP (minimum viable product). You did your best to simulate load and emulate actual user actions so that you could anticipate how your app would behave in the wild.

Now, however, you have thousands of actual users trying to break your app with every keystroke, gesture and mouse click. It seems that their sole focus for at least 8 hours per day is to try and expose the limitations and missteps in your code.

At least you know that the operations folks are more than happy to answer calls in the middle of the night to deal with database resets and cycling of your app and services, and they will most likely send you flowers for allowing them to work on the weekends troubleshooting your app because the end users are complaining of slowness and unresponsiveness. You are in trouble. If you don’t go broke buying coffees for the support folks, you will almost surely have your database rights revoked by ‘accident’ several dozen times by the DBA on call. Couple this with daily ‘check-ins’ from the product owner, and your nice days of coding are soon replaced with endless slogs of support tickets and war rooms for production downtimes.

How could something that started out with such good intentions lead you to such a hell on earth? Simple – no thought was given to what happens after launch.

A common misconception of Agile is that you don’t have to plan, and that once you get things launched the work is done. But we are using Microservices so we are protected from this scenario, right? You ask yourself this question and you are quickly reminded that while you did go micro, you now have several downstream dependencies that your app relies on to perform all the stuff you used to code into your old macro app.

So, in this type of environment how do you avoid this very common scenario? There are some environmental things you can do like load balancers, clustering the database, and of course you can always throw hardware at the problem. None of these, however, are in your control as a developer/software engineer.

Luckily, if you code in the Java or Java Spring world you have the gift of Hystrix.

What is Hystrix, you may ask? It is an open source library developed by Netflix that focuses on fault tolerance and on containing latency between distributed services. If you live in the Spring universe, then things are even easier. Spring Cloud wraps the Netflix Hystrix project and builds on Hystrix-Javanica, the annotation support that comes from the Hystrix project’s contrib modules. Like everything Spring, this is a bit heavier than the raw Hystrix library, but out of the box you get annotations and easy configuration. So, without further delay – let’s Hystrix to prevent these Hysterix.

What’s covered in this blog

This blog post will focus on the Spring/Java implementation of Hystrix, and will unfortunately not address these common problems in the .NET universe. However, if you are looking for a starter to Microservices and some of the patterns in this blog post, then Azure is where you should start. Also, please reference the blog post from fellow Keyhole Consultant, Jeff Hopper, on Microservices in .NET.

This post makes a few assumptions: the examples all use Spring and leverage Maven for dependencies, and readers understand the concepts of annotations and dependency injection.

In this post we will cover the following topics:

  • Taxonomy around Hystrix and SOA to help level-set
  • What Hystrix is and what design patterns drive its success underneath
  • How do we configure a Spring application to take advantage of all Hystrix has to offer
  • Some sample code to show how Hystrix can be implemented at the method level
  • Finally, a summary of what problems we are solving with Hystrix, as well as some important considerations for implementation

Taxonomy

To help ensure that everyone is on the same page, let’s define some terminology or taxonomy, as it is used in this post:

  • SOA: Service-Oriented Architecture. At a high level, it defines frameworks and patterns for services to communicate via messaging protocols.
  • Service: Services are activities that an application performs, for example maintaining a list of users and their permissions. In this example, this would be done by a user service that executes those actions to maintain the list (CRUD).
  • Microservices: A variant on SOA that emphasizes smaller, decoupled services to focus on one activity in contrast to legacy, monolithic applications that perform all activities for an entire application in a single, tightly coupled code base.
  • Spring: A Java framework that emphasizes dependency injection and a managed IoC container.

And now, Hystrix

As our applications mature and become more distributed, we bring more failure points into our system. Downstream dependencies, decoupled services, and distributed data stores are all pieces of the system that could fail. A failure in a downstream dependency in this architecture can cause cascading failures that end up crashing the entire system.

Circuit Breaker

In order to protect our applications from this sort of scenario, we can leverage a Circuit Breaker in our architecture. Circuit Breaker is a Microservices design pattern that helps applications be more resilient and durable, especially under heavy load and with several dependencies.

When the load on a downstream dependency becomes too much to handle and the calls to it begin to fail, the circuit breaker has a failure threshold that it tracks these failures against. Once the threshold is reached, the circuit opens. This prevents any further requests to the dependency until a time-out period has passed (default usually 5 seconds). Then a single request is attempted to see if it fails or succeeds. If it succeeds, then the circuit is closed and requests continue as normal and the threshold is reset. If that request fails, then the circuit opens again for another time-out period before retrying.

This will continue until the downstream dependency has recovered, or something has intervened in the process (e.g. another instance is stood up or the service is cycled).
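
The open, wait, retry cycle described above can be sketched in plain Java. This is a minimal, single-threaded illustration and not how Hystrix implements it: the class name `SimpleCircuitBreaker`, its constructor parameters, and the use of a consecutive-failure count as the threshold are all assumptions made for the example.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Illustrative only: a hypothetical, single-threaded circuit breaker.
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;    // consecutive failures that trip the circuit (assumed policy)
    private final long sleepWindowMillis;  // how long the circuit stays open before one trial request
    private final LongSupplier clock;      // time source in milliseconds, injectable so it can be faked

    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    State state() {
        return state;
    }

    // Runs the dependency call through the breaker, or rejects it while the circuit is open.
    <T> T call(Supplier<T> dependency) {
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt < sleepWindowMillis) {
                throw new IllegalStateException("circuit open: call rejected");
            }
            state = State.HALF_OPEN; // time-out period has passed: allow a single trial request
        }
        try {
            T result = dependency.get();
            failures = 0;            // success: reset the threshold counter
            state = State.CLOSED;    // and close the circuit
            return result;
        } catch (RuntimeException e) {
            failures++;
            if (state == State.HALF_OPEN || failures >= failureThreshold) {
                state = State.OPEN;  // trip, or re-open after a failed trial request
                openedAt = clock.getAsLong();
            }
            throw e;                 // the caller decides how to handle the failure
        }
    }
}
```

With `new SimpleCircuitBreaker(3, 5000, System::currentTimeMillis)`, three consecutive failures open the circuit, calls are rejected for five seconds, and the next call after that acts as the trial request.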


The Circuit Breaker is able to achieve these results by leveraging some other common design patterns. First, it wraps each call to a downstream dependency in a command, using the Command pattern, and runs it on a thread separate from the caller. Second, it keeps a count of the errors and timeouts for each command so that every dependency is managed in isolation. If the circuit opens for one of the commands, then that individual call just emits an exception that can be handled by the application, and does not cause the calling method to crash.

The implementation of the Circuit Breaker pattern we have chosen to review is Hystrix-Javanica as packaged by the Spring Cloud project, which is based on the Netflix Hystrix project. Hystrix-Javanica lets us leverage the declarative nature of annotations: with very little configuration or manual coding, we can add Hystrix functionality directly to individual methods and to our project as a whole.

Fallback Method / Chain of Responsibility

The Hystrix project also allows us to implement a fallback method that gets triggered whenever a command fails or times out, even before the circuit opens, as well as while the circuit is open. This allows the application to recover from latency issues. The fallback method is an implementation of the Chain of Responsibility pattern, which allows a series of commands to hand off a task when one of them is unable to complete it.

The fallback method is an alternative method with the same signature as the original command, and it usually returns a null result or an empty object of the same shape as the original’s return value. If the original method does not respond within a configurable time span, the fallback method is invoked so the application can triage requests and respond in a reasonable timeframe, albeit with a null result. This is a preventative measure that can help keep the downstream dependency from being overrun with requests, and it can be implemented to include a retry.
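
To make the timeout-then-fallback idea concrete, here is a minimal sketch in plain Java. It is not Hystrix code: the `FallbackRunner` helper, its method name, and its parameters are hypothetical, and it only shows the general technique of running a call on another thread, waiting a bounded time, and answering from a fallback when the call fails or is too slow.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Illustrative only: a hypothetical helper, not part of Hystrix.
class FallbackRunner {

    // Runs `primary` on the pool; returns the fallback's value if it fails or exceeds the timeout.
    static <T> T callWithFallback(ExecutorService pool, Callable<T> primary,
                                  Supplier<T> fallback, long timeoutMillis) {
        Future<T> future = pool.submit(primary);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return fallback.get();
        } catch (Exception e) {                 // ExecutionException or TimeoutException
            future.cancel(true);                // stop waiting on the slow or failed call
            return fallback.get();
        }
    }
}
```

The caller always gets an answer within roughly `timeoutMillis`, which is the property the fallback is meant to provide; what that answer should contain is a design decision for each application.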


The fallback method along with the circuit breaker allows us to build very durable, resilient services that run until they are stopped.

Spring – let there be Hystrix!

If you have ever used Spring you will know that it can favor annotations and Java configuration to take a lot of the pain out of setting up your projects. So without further delay, let’s show the config!

In order to leverage Hystrix you must do the following:

Add the following dependency to your pom.xml:

       <dependency>
           <groupId>org.springframework.cloud</groupId>
           <artifactId>spring-cloud-starter-hystrix</artifactId>
           <version>1.0.0.RELEASE</version>
       </dependency>

Add the following annotations to your AppConfig.java on the AppConfig class:
@EnableHystrix
@EnableCircuitBreaker

Alright, that was tough but we made it through! 🙂 By adding these annotations and dependencies to your application, you allow Spring and Maven to handle all the dependencies that are needed to make this happen. If it seems too easy, remember this is Spring and this is one of its core strengths.

To help with additional config, check out these two sources I used for reference. They’re helpful.

Hystrix in code – for reals

So we saw how easy it is for us to configure our app to use Hystrix, but how do we actually implement the circuit breaker and fallback methods in our code? Let’s start with implementing the circuit breaker.

For example, let’s say we have a user service that looks up users across multiple systems. Currently in our UserService, we have one internal repository that we use to look up the users that we control directly and we have API calls to three other services that look up the users for their respective systems. Our service then aggregates these into a single list and returns that list. Sounds pretty straightforward? Assuming you answered yes, here is what we have to do to implement the Circuit Breaker.

First, here is our fake code as it stands, emphasis on fake to help illustrate the concept:

public class UserService{

	public List<User> getUsers(){
		List<User> users = userRepository.findAll();
		List<User> otherUsers = externalService.returnAll();
		List<User> moreUsers = anotherExternalService.returnAll();
		List<User> stillMoreUsers = yetAnotherExternalService.returnAll();
		users.addAll(otherUsers);
		users.addAll(moreUsers);
		users.addAll(stillMoreUsers);

		return users;
	}

}

The first thing we do to leverage Hystrix is take all of those calls and wrap them in commands. (Again this is contrived code, in better practice we would separate out the commands in their own files):

public class UserService{

	public List<User> getUsers(){
		List<User> users = getUsersCommand();
		users.addAll(getOtherUsersCommand());
		users.addAll(getMoreUsersCommand());
		users.addAll(getStillMoreUsersCommand());

		return users;
	}

	// Each downstream call now lives in its own method so it can be wrapped individually
	private List<User> getUsersCommand(){
		return userRepository.findAll();
	}

	private List<User> getOtherUsersCommand(){
		return externalService.returnAll();
	}

	private List<User> getMoreUsersCommand(){
		return anotherExternalService.returnAll();
	}

	private List<User> getStillMoreUsersCommand(){
		return yetAnotherExternalService.returnAll();
	}

}

Now, we have to add a new annotation above any methods that we want to implement the behavior:

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;

public class UserService{

	public List<User> getUsers(){
		List<User> users = getUsersCommand();
		users.addAll(getOtherUsersCommand());
		users.addAll(getMoreUsersCommand());
		users.addAll(getStillMoreUsersCommand());

		return users;
	}

	// Contrived: Spring applies @HystrixCommand through a proxy, so real commands
	// should be public methods on a separate bean rather than private helpers.
	@HystrixCommand
	private List<User> getUsersCommand(){
		return userRepository.findAll();
	}

	@HystrixCommand
	private List<User> getOtherUsersCommand(){
		return externalService.returnAll();
	}

	@HystrixCommand
	private List<User> getMoreUsersCommand(){
		return anotherExternalService.returnAll();
	}

	@HystrixCommand
	private List<User> getStillMoreUsersCommand(){
		return yetAnotherExternalService.returnAll();
	}

}

By adding that and separating the calls into commands, we leverage a couple things from Hystrix – the circuit breaker and the command pooling. Each method decorated with the annotation will now have a circuit breaker implemented on it and if the downstream call to the service begins to fail, the circuit will open and the failure will not cascade up past the implementing method.

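Second, the command pooling that is built into Hystrix runs commands on a bounded thread pool that is separate from the calling thread (10 threads by default; the pool size and the grouping of commands into pools can all be configured). A slow dependency can therefore only tie up the threads in its pool. Once that pool is saturated, further calls are rejected immediately, and handed to the fallback if one is defined, instead of queuing up behind the slow ones.

Right off the bat, these two pieces give our sample application durability by preventing cascading failures and by keeping latency in one dependency from tying up the rest of the application. Wow. That is a lot of power for not a lot of code.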
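
One way to picture the pooling is a small, bounded thread pool placed in front of a dependency. The sketch below uses only `java.util.concurrent` and is not Hystrix’s implementation; the `DependencyPool` class and its method names are hypothetical, and the choice to reject rather than queue extra calls is an assumption made to keep the behavior easy to see.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Future;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative only: a hypothetical bounded pool guarding a single dependency.
class DependencyPool {
    private final ThreadPoolExecutor pool;

    DependencyPool(int maxConcurrentCalls) {
        // SynchronousQueue holds no tasks, so a call is either handed to a free thread or rejected.
        this.pool = new ThreadPoolExecutor(maxConcurrentCalls, maxConcurrentCalls,
                0L, TimeUnit.MILLISECONDS, new SynchronousQueue<>());
    }

    // Throws RejectedExecutionException when every thread in the pool is busy.
    <T> Future<T> submit(Callable<T> call) {
        return pool.submit(call);
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

Because a saturated pool fails fast, a dependency that has slowed to a crawl costs the caller at most `maxConcurrentCalls` blocked threads, and every rejected call can be answered from a fallback right away.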

Let’s take things one step deeper for our sample application and provide a fallback method to help return a “busy signal” to our calling client so that they can handle this information and display a message to the end user for example.

So, to implement this we need to add some more detail to our annotation and provide our fallback method. Also, for our example, we will just be returning null and assuming our front end will handle this and display the proper message to our user. We will implement a fallback method on our internal repository call like so:

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;

public class UserService{

	public List<User> getUsers(){
		List<User> users = getUsersCommand();
		if (users == null) {
			// The fallback fired: pass the "busy signal" up to the caller
			return null;
		}
		users.addAll(getOtherUsersCommand());
		users.addAll(getMoreUsersCommand());
		users.addAll(getStillMoreUsersCommand());

		return users;
	}

	@HystrixCommand(fallbackMethod = "_getUsersFallback")
	private List<User> getUsersCommand(){
		return userRepository.findAll();
	}

	// Same return type and parameters as the command it backs up
	private List<User> _getUsersFallback(){
		return null;
	}

	@HystrixCommand
	private List<User> getOtherUsersCommand(){
		return externalService.returnAll();
	}

	@HystrixCommand
	private List<User> getMoreUsersCommand(){
		return anotherExternalService.returnAll();
	}

	@HystrixCommand
	private List<User> getStillMoreUsersCommand(){
		return yetAnotherExternalService.returnAll();
	}

}

Notice that even though we are returning null, the fallback method declares the same return type as the command method it backs up. This convention is required for the fallback functionality to work.
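
Returning null keeps the example short, but a caller cannot tell a null that means “busy” from any other null. One alternative is for the command and its fallback to return a response object with a common shape that carries a status and a message. The sketch below is only an illustration of that idea: the `ServiceResponse` type, its factory methods, and the 503 status code are assumptions and are not part of Hystrix.

```java
import java.util.Collections;
import java.util.List;

// Illustrative only: a hypothetical response wrapper shared by the command and its fallback.
final class ServiceResponse<T> {
    final int status;      // HTTP-style status code to pass upstream
    final String message;  // text the front end can show to the user
    final List<T> items;   // payload; empty when the fallback answered

    private ServiceResponse(int status, String message, List<T> items) {
        this.status = status;
        this.message = message;
        this.items = items;
    }

    // What the command method returns when the downstream call succeeds.
    static <T> ServiceResponse<T> ok(List<T> items) {
        return new ServiceResponse<>(200, "OK", items);
    }

    // What the fallback method returns instead of null.
    static <T> ServiceResponse<T> busy(String message) {
        return new ServiceResponse<>(503, message, Collections.<T>emptyList());
    }

    boolean isFallback() {
        return status != 200;
    }
}
```

With this shape, the command and the fallback still share one return type, and the caller can check `isFallback()` before trusting an empty list.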

With that small adjustment to the annotation, we now have a very robust application that can handle downstream failures of dependent resources, latency in returns from downstream resources, and a fallback method to help the front end communicate a “busy signal” to our end users when things get overloaded.

All this with about 10 lines of code. Not to get too excited, but that is kind of like accidentally finding $100 in your jeans right before you put them in the wash! Although, it is probably worth a heck of a lot more. 🙂

Summary

So, to bring everything around full circle, here is what we covered in this post:

  • We reviewed some terminology, technology and design patterns used in Hystrix.
  • We discussed what Hystrix is and what the circuit breaker pattern is.
  • We reviewed some example code to demonstrate how we configure and implement Hystrix in Spring.

I feel good about the time we spent here together and hopefully you found it informative and intriguing enough to look into how you can integrate a circuit breaker or Hystrix into your current projects.

I would like to leave you with some implementation considerations after implementing this in some projects that are in production currently:

  • Convention on fallback method names and Hystrix command names can make logging and bug fixing activities much easier, especially if you are not the resource doing front-line support.
  • Several people ask if they have to implement Microservices to utilize Hystrix, and the answer is no. Anywhere that you have a downstream dependency you can leverage this pattern and implementation to get benefits and advantages.
  • You need to consider what “having a service that will not die unless it is killed off,” means for your ecosystem. The main challenge we have noticed with this is if a downstream dependency does fail, you need to be careful that your service does not take it down right away when it comes back online. Especially if this is a data resource, in some cases we have had to add explicit steps for support resources to turn off or cycle these services before bringing data resources back up from a failure. There are a lot more design and architecture decisions that can drive clarity in this area, but that is another post…


That is all I have to say, for real this time! Thank you and please let us know your comments and feedback.

Dallas Monson
Software and Architecture Consultant
Keyhole Software


About the Author
Dallas Monson

Dallas Monson is a performance and results-driven Software Architect with a passion for refactoring and troubleshooting legacy and poorly performing code. Enjoys learning new technologies to enhance development and drive teams toward shared goals.


Comments 2

  1. The suggestion that the fallback return null or empty with the same shape is a bad one (speaking from experience of having gone that direction)

    If you don’t have a legitimate fallback, it is better to respond upstream with something that indicates you have an error state (we settled on HTTP 429) and let callers know they should retry with back off or themselves figure out what to do with no response from the service.

    Empty as a fallback means your service responses just become untrustworthy and people will implement retry on empty to verify that it’s really empty and not fallback empty

  2. That is a very good point that I should have called out in the example, that this was just to show the functionality and flexibility of the fallback, not necessarily meant to be a best practice. In our actual implementation we return a common response object shape with the correct http code and a more descriptive message that can be displayed to the end user – so very similar to what you suggest. I was more trying to emphasize the pattern and the conventions for the fallback, not the correct return – also because I felt that the returns could be more subjective – depending on the application. Thanks for the feedback though it was a good catch and needed more detail.
