Hystrix Fallbacks – Protect yourself and wrap them in a lambda

When building microservices, it’s always a good idea to build in some capability to circuit break on your downstream dependencies whenever they are experiencing issues, or perhaps networking problems that cause delayed response times.

One fairly common utility for this is Hystrix, built by Netflix. This post is not going to detail how to use it, nor how to configure it. However, we recently found some interesting behaviour: our HTTP requests were taking longer than we expected for calls that were failing on our downstreams.

This post is an attempt to explain how you should write your HystrixCommands, and in particular, your fallback methods, and why you should care.

TL;DR

When creating HystrixCommands, wrapping your fallbacks in another HystrixCommand isn’t a silver bullet.

Example code can be found on GitHub.

If you’re running IntelliJ IDEA, I’d recommend the GrepConsole plugin if you haven’t already installed it. It’ll make correlating the log output and the diagram a little easier. Further details are in the README.

Setting the scene:

We noticed unexpectedly long response times from our microservice whenever our downstreams were failing due to slow network traffic. For any failure, we’d written our HystrixCommands to look up previously stored backups from the database. Upon inspection, we found that the fallback methods in our HystrixCommands were making direct database calls via our DAO (and subsequently through the driver). As it turns out, doing so presents two major problems:

1. Non-deterministic execution time in the fallback

Once the fallback method is executing, it is not managed by a timer (although it could very well execute on a timer thread), so it can affect the response times for the initiating caller; i.e. fallbacks are not bound to be interrupted the way the run() method is. Wrapping the fallback call in a HystrixCommand seemed to us like the right thing to do, but from our testing, this isn’t necessarily the case. More on this later…

2. Slower and slower calls when timing out

When the fallback is executed because of a timeout, it runs on a HystrixTimer thread. A little gotcha is that there are only as many HystrixTimer threads as there are CPU cores, so on a CPU with 4 cores you’ll only have 4 HystrixTimer threads to execute fallbacks in the event of timeouts. I stress the timeouts because in other circumstances the fallback will execute either on the initiating thread or on a Hystrix worker thread, both of which come from a configurable thread pool – but your timer threads do not! So, if a timer thread is responsible for executing your fallback and it becomes held up with a lengthy execution, it’s unable to service any other timeouts.

This will of course only really become apparent when you have many concurrent requests with many downstream failures. You might be thinking this isn’t a great problem, because surely the circuit breaker will kick in and then all fallbacks will execute on the initiating thread, so it’s only an edge case? You’d be correct, but consider the case where you have many downstreams, all of which are timing out because of slow network issues: the problem is compounded until your circuit breakers trip!
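To make the starvation effect a little more concrete, here is a minimal sketch that does not use Hystrix at all. It assumes a plain ScheduledExecutorService as a stand-in for the timer pool, and the class and method names (TimerStarvationSketch, measureDelayMillis) are made up for illustration. Every "timer" thread is tied up by a slow fallback, and a timeout that is due after 50 ms can only fire once one of them is free again.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a small, fixed-size timer pool; this is NOT Hystrix code.
class TimerStarvationSketch {

    /**
     * Occupies every timer thread with a slow "fallback", schedules one more
     * timeout due after 50 ms, and returns how many ms late that timeout fired.
     */
    static long measureDelayMillis(int timerThreads, long slowFallbackMillis) {
        ScheduledExecutorService timers = Executors.newScheduledThreadPool(timerThreads);
        try {
            long start = System.nanoTime();
            // Tie up every timer thread with a blocking fallback (e.g. a slow DB call).
            for (int i = 0; i < timerThreads; i++) {
                timers.schedule(() -> sleepQuietly(slowFallbackMillis), 0, TimeUnit.MILLISECONDS);
            }
            // This timeout is due after 50 ms, but no timer thread is free to run it.
            long dueMillis = 50;
            AtomicLong firedAtMillis = new AtomicLong();
            CountDownLatch fired = new CountDownLatch(1);
            timers.schedule(() -> {
                firedAtMillis.set(TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));
                fired.countDown();
            }, dueMillis, TimeUnit.MILLISECONDS);
            fired.await();
            return firedAtMillis.get() - dueMillis;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the timeout task", e);
        } finally {
            timers.shutdownNow();
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With, say, 4 timer threads and 1000 ms fallbacks, the reported delay should be close to the fallback duration rather than close to zero, which is the same shape of problem we saw with the HystrixTimer threads.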

Below is my attempt at trying to articulate the threads that your command is executed on in particular circumstances.

Green = Initiating thread
Red = HystrixTimer thread
Blue = Hystrix worker thread

[Diagram: hystrix-thread-execution]

Now I’d like to revisit the comment I made earlier about our assumption that wrapping the fallback into a HystrixCommand call would solve our problems. Unfortunately, it doesn’t really. In fact, I think judging by the sample tests, it probably makes things a little worse! As you’ll see, there are many cases where the result returned to the client was:

Should have timed out by this point!

The reason we get this as the return result is because the HystrixTimer thread never got around to interrupting the running execution because it was held up by other tasks.

The conclusion

I think by now, it’s become pretty clear that when your fallback executes, it should do so as quickly as possible – so that it can release the HystrixTimer thread to do its stuff. One pitfall is to assume that wrapping the fallback in a HystrixCommand would save the day, but in reality, it actually uses up two timer threads which will block until the call stack unrolls.

There is, however, another interesting alternative that avoids blocking the HystrixTimer thread. This is where lambdas come to the rescue – well, partially at least. If you look at the SlowRunAndFallbackExecutedOnCallingThreadCommand, you’ll notice that instead of returning the standard String type like the other commands, it returns a Result<String>, which is how we’re going to leverage the deferred fallback behaviour. By returning a Result<T> object we’re standardising the response we get, whether it came from the run() or the fallback() method. In the fallback, we create a DeferredResult<String> (a specialisation of Result<T>) from a Java 8 Supplier lambda. This Supplier forms the unit of work which needs to complete, and it will only be invoked when the result is actually needed, which in this example is when we’re back on the initiating thread and out of the Hystrix “world” of threads.
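For anyone who doesn’t have the sample project to hand, the idea can be sketched roughly as follows. The shapes of Result<T> and DeferredResult<T> here are my own guess at a minimal version, and ImmediateResult and DeferredFallbackSketch are hypothetical names; the real classes in the sample project may well look different.

```java
import java.util.function.Supplier;

// Hypothetical minimal shapes; the real classes live in the sample project.
interface Result<T> {
    T get();
}

// A result that is already available, e.g. the value returned from run().
final class ImmediateResult<T> implements Result<T> {
    private final T value;

    ImmediateResult(T value) {
        this.value = value;
    }

    @Override
    public T get() {
        return value;
    }
}

// A result whose work is postponed until get() is called.
final class DeferredResult<T> implements Result<T> {
    private final Supplier<T> work;

    DeferredResult(Supplier<T> work) {
        this.work = work;
    }

    @Override
    public T get() {
        // Runs on whichever thread calls get(), not on the thread that built the result.
        return work.get();
    }
}

class DeferredFallbackSketch {
    // Stand-in for a fallback method: it returns immediately and does no slow work itself.
    static Result<String> fallback(StringBuilder log) {
        return new DeferredResult<>(() -> {
            // Record which thread really did the work.
            log.append(Thread.currentThread().getName());
            return "backup from database";
        });
    }
}
```

The point of the design is that creating the DeferredResult costs next to nothing, so whichever thread runs the fallback is released straight away, and the expensive lookup only happens when the caller asks for the value.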

The downside, of course, is that if your fallback is a long blocking call, there is no way to deterministically time it out other than doing that yourself, either inside the DeferredResult object or by wrapping the DeferredResult<String>.get() call in something like a Future which can then be timed out. Neither is the most elegant IMHO. So this isn’t a silver bullet, but it certainly avoids tying up the HystrixTimer threads, and brings control back onto your initiating thread.
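As a rough illustration of the Future option, the sketch below time-boxes any Supplier using only the JDK. TimeBoxedSupplierSketch and getWithTimeout are hypothetical names, and creating an executor per call is purely to keep the example short; a real service would presumably share a pool.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical helper, not part of Hystrix or the sample project.
class TimeBoxedSupplierSketch {

    /** Runs the supplier on another thread and gives up after timeoutMillis. */
    static <T> T getWithTimeout(Supplier<T> slowWork, long timeoutMillis, T onTimeout) {
        // One executor per call keeps the sketch short; share a pool in real code.
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Callable<T> task = slowWork::get;
            Future<T> future = executor.submit(task);
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // ask the worker to stop; it must honour interruption
                return onTimeout;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                future.cancel(true);
                return onTimeout;
            } catch (ExecutionException e) {
                throw new IllegalStateException("Deferred work failed", e.getCause());
            }
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Something like getWithTimeout(deferred::get, 500, "default") would then bound the wait on the initiating thread, at the cost of yet another thread, which is exactly why I don’t find it very elegant.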

Don’t fall into the trap of thinking you can do something like xxx.queue().get(1, TimeUnit.SECONDS), for example. All that will do is duplicate the work that the HystrixTimer thread is effectively going to do anyway; the only difference is that you’re being more specific, and it will still have no effect on the time taken in the fallback!

I certainly cannot take the credit for this approach. In fact, this solution had never really occurred to me until it was mentioned to me by a rather smart Paul McGuire. I think this is a pretty neat solution – but go check it out for yourself!

Hopefully this has gone a little way to shedding some light on which threads Hystrix chooses to execute commands on, as I couldn’t find any mention of this sort of behaviour in the docs. I’ll certainly be more thoughtful about the next HystrixCommands I write, to take this sort of behaviour into account.

Lessons Learnt This Week…

I still surprise myself sometimes. I come across a bit of weird behaviour in code, or some strange anomaly which I’m attempting to debug, and only when I find the root cause do I think “Oh, hang on a moment, I knew that!”

I had two such moments this week:

  1. Never create a Unique Index over nullable columns in a database – particularly MySQL/Postgres. It won’t function as you might expect it to. You could end up with duplicates, because the underlying database engine doesn’t regard NULLs as actual values and so treats them as distinct from one another in the unique index. Consider the following table, with a Unique Index over all the columns:
    FIRST_NAME (NOT NULL)   LAST_NAME (NULL)   AGE (NULL)
    John                    NULL               NULL
    John                    NULL               NULL

    These would be allowed values, even though you’re expecting NULLs to be included in the unique index. Beware of this kind of assumption – it will bite you later. If you’d like to know more about MySQL, then head over here, otherwise, here is the snippet from their manual:

    A UNIQUE index creates a constraint such that all values in the index must be distinct. An error occurs if you try to add a new row with a key value that matches an existing row. For all engines, a UNIQUE index permits multiple NULL values for columns that can contain NULL.

  2. Never expect BigDecimal.ZERO.equals(anotherBigDecimal) to return true, even if you’re comparing 0 with 0.00. The reason is that the equals method takes into consideration the scale of the number you are trying to compare. So even 0 is NOT equal to 0.00! In cases where the scale size might be different, but the actual value itself
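The scale sensitivity described above is easy to see in a small sketch. The class and method names are made up for illustration; equals, compareTo and signum are the standard JDK BigDecimal methods, and compareTo and signum look at the numeric value only, ignoring scale.

```java
import java.math.BigDecimal;

// Hypothetical helper names, standard JDK calls.
class BigDecimalZeroSketch {

    // equals() compares value AND scale, so 0 and 0.00 are considered different.
    static boolean equalsZero(BigDecimal value) {
        return BigDecimal.ZERO.equals(value);
    }

    // compareTo() compares the numeric value only, so 0 and 0.00 are the same.
    static boolean isNumericallyZero(BigDecimal value) {
        return BigDecimal.ZERO.compareTo(value) == 0;
    }

    // signum() is another scale-independent way to test for zero.
    static boolean hasZeroSignum(BigDecimal value) {
        return value.signum() == 0;
    }
}
```

For new BigDecimal("0.00"), equalsZero returns false while isNumericallyZero and hasZeroSignum both return true.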

Domain Driven Design – [Part 3]

Continuing on in the series of me trying to unearth an approach to integrating my domain to make use of external services, I’ve decided to come up with what I so elegantly dub the “Approach 2”. If you’ve forgotten about what we’re trying to do here, then head on over to my first post to refresh the grey matter.

Approach 2: Implement a Service Locator

So given approach 1, we want something that isn’t as fastidious in the requirement of passing in interfaces for the things we want done. It’s at this point I began digging, and came across the Service Locator Pattern.

This approach abandons the previous requirement to pass in the interfaces. Instead, the domain object uses a Service Locator to look up the appropriate service and call the appropriate method(s) on it.
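A bare-bones sketch of what this could look like is below. It is not the code from the GitHub sample: ServiceLocator, SupplierService, NotificationService and the string states are all hypothetical names chosen to match the order scenario from the earlier posts, and a real locator would more likely delegate to something like a Spring application context than to a static map.

```java
import java.util.HashMap;
import java.util.Map;

// All names here are hypothetical and only illustrate the pattern.
interface SupplierService {
    String place(String sku); // returns "ORDERED" or "PENDING"
}

interface NotificationService {
    void notifyCustomer(String message);
}

final class ServiceLocator {
    private static final Map<Class<?>, Object> SERVICES = new HashMap<>();

    static <T> void register(Class<T> type, T service) {
        SERVICES.put(type, service);
    }

    static <T> T lookup(Class<T> type) {
        Object service = SERVICES.get(type);
        if (service == null) {
            throw new IllegalStateException("No service registered for " + type.getName());
        }
        return type.cast(service);
    }
}

final class OrderItem {
    private final String sku;
    private String state = "NEW";

    OrderItem(String sku) {
        this.sku = sku;
    }

    String state() {
        return state;
    }

    // No services in the signature: the domain object looks them up itself.
    void placeOrder() {
        state = ServiceLocator.lookup(SupplierService.class).place(sku);
        if ("ORDERED".equals(state)) {
            ServiceLocator.lookup(NotificationService.class).notifyCustomer(sku + " is on its way");
        }
    }
}
```

Notice that nothing in the placeOrder() signature tells the caller which services are involved, which is both the convenience and the hidden cost of this approach.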

Advantages

  1. Callers of this method do not need to concern themselves with what interfaces they need to pass in.
  2. If in the future we decide that this method needs to interact with other services, then we can change the method body to include the look-up, and there will be no changes to existing callers because the method signature will stay the same.

Disadvantages

  1. The very strength of this approach is its weakness too. There is no real way of knowing what external services the order.placeOrder() is going to call on – if any at all! You may not really even care, but the Service Locator pattern approach could get really abused, and you could soon find yourself using the Service Locator to look-up infrastructure services and call them, and in so doing, muddying your bounded context of the domain model. Careful thought needs to be given if this approach is to be adopted…
  2. Testing proves to be challenging. If you’re using something like Spring, the right thing to do would be to inject the application context into this Service Locator, and let the Service Locator delegate service look-ups to the Spring-loaded application context. You’ll also find that you need a test-equivalent Service Locator for running your tests, because you don’t want to load the entire Spring application context just to ensure that a particular service class was looked up and called correctly in your unit test.
  3. Another fly in the ointment about this approach is the fact that your domain object is now responsible for more than just itself. It has to look up the relevant service(s) it wants to use, which implies that it has outside knowledge of what services exist, which is arguably not something it should really know about. This also begins to break the rules of single responsibility.

Now for some, this approach may look quite clean and simple. I have my own reservations about it, and quite frankly, I’m not satisfied with it. In fact, I’m on another mission to find a better approach. I think I have one, but it only came after understanding much better what I actually needed to achieve, and after a fair amount of trial and error, to give what I think is a reasonable enough approach to tackling external service calls from your domain objects.

Code for this blog post can be found on GitHub here under the module of sample2

Join me in my next post to find out how I begin to re-arrange a few things, to gain some better insight into what it is I’m actually trying to do here.

Until next time!

Domain Driven Design – [Part 2]

If you remember from the previous post, we are trying to model a simple order/order item scenario to illustrate a recent challenge I had: when external services need to be called, how do I integrate those calls/interactions with my domain?

Approach 1: Pass the interfaces into the method call.

This requires passing the interfaces of the external services into the method call which will update the order item state.
The code for this example can be found here: ##LINK. The module is Sample1
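In case the linked sample isn’t to hand, a minimal sketch of this approach might look like the following. The interface names, the string states and the method bodies are all my own illustration rather than the actual sample code.

```java
// All names here are hypothetical and only illustrate the approach.
interface SupplierService {
    String place(String sku); // returns "ORDERED" or "PENDING"
}

interface NotificationService {
    void notifyCustomer(String message);
}

final class OrderItem {
    private final String sku;
    private String state = "NEW";

    OrderItem(String sku) {
        this.sku = sku;
    }

    String state() {
        return state;
    }

    // The external services are explicit in the signature, so callers must supply them.
    void placeOrder(SupplierService supplier, NotificationService notifications) {
        state = supplier.place(sku);
        if ("ORDERED".equals(state)) {
            // Business rule: notify the customer as soon as the item is ORDERED.
            notifications.notifyCustomer(sku + " is on its way");
        }
    }
}
```

Because both collaborators are plain interfaces, a test can pass in lambdas or mocks and check that the notification was only sent for the ORDERED result.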

Advantages

  1. It is very descriptive, and there’s no mistaking that the action requires external services.
  2. It’s easy to test. You could pass in mocks of the services into this method call and verify they were called as is expected.

Disadvantages

  1. Every time we introduce additional external concepts to the domain objects, we may need to add another interface to the method signature. In all fairness, we would probably have to rethink how the model is designed, but in terms of disrupting the existing callers of this method, we’d be changing how they interact with the domain object – and should they even really care? Why shouldn’t the callers just be able to call order.placeOrder() without passing in all the interfaces needed? For example, what if we needed to raise an event to another subsystem whenever the customer is notified? With this approach, you would need to add another interface, or re-use the existing notification service interface. However, doing the latter would mean that the service takes on multiple responsibilities – sending emails and raising events.

Domain Driven Design – [Part 1]

I’ve recently been reading Eric Evans’s Domain Driven Design book, and it’s made me seriously reflect on my development experiences. All too often, in the majority of the projects I’ve been involved with, there is an almost automatic approach of developing software in a very technically layered manner, without any real regard for domain modelling and enriching that model with business functionality; i.e. you have many layers in your application (Facade, Service, Repository, Managers etc.), with one main “layer” responsible for most of the domain-type logic, and your “domain” classes are nothing but anaemic, glorified data containers!

I don’t think that we as developers just automatically adopted this kind of development practice. I think we might have been surreptitiously conned into it by past languages and practices which possibly reinforced bad design. For example, I remember back in the day when Visual Basic came onto the scene and introduced the notion of property sheets and public getters/setters, followed by the advent of JavaBeans, which took the same sort of approach. In the end, I feel this could have influenced software developers over the years to accept that anaemic objects with exposed internals were OK. Quite frankly, after reading about DDD, this is just plain wrong! 🙂

I’ve discovered that there is a growing community of DDD advocates out there, and I can’t thank those of you enough who have shared your experiences with me and helped me to learn more about DDD and how to effectively introduce it into a project. As with most software development concepts, when learning them there is always an abundance of sample code and opinions out there, and quite frankly, most of the time the samples don’t convey enough complexity to fully understand the design challenges or pitfalls. When I’m in this position, StackOverflow is usually my friend, but sometimes it can just be a source of conflicting information and advice – not the greatest when you’re trying to learn the correct approach.

When I began to actively introduce DDD into my projects, I came up against a problem of allowing access to outside application services from within my domain model. Huh?! I hear you say. Well, I mean I had arrived at a point where my domain objects – which are now enriched with business logic and rules (data and behaviour) – needed to access outside services which do other related operations on the domain objects’ behalf. My question was: how do I go about accomplishing this in the strictest of DDD terms?

This series of blog posts will try to answer that by means of samples showing my journey through the many different approaches I’ve attempted in solving the problem. All these approaches will be available on Github. These samples are just a journal of my evolving understanding of DDD. Feel free to take from them what you will, but I do not profess them to be the silver bullet when faced with the same challenges. I have tried out each of these sample approaches, and I have experienced their pain and their joy.

The “real-world” problem

For the purpose of discussion, let’s assume we have the concept of an order made up of multiple order items, which basically represents a customer’s order of items from a shop. To make this problem a little more digestible, I’ll attempt to express it as a user story.

As a customer
I want to place my order, with the respective supplier(s)
and be notified whenever any order item is ORDERED
So that I know when the order item is on its way.

Acceptance Criteria

  1. When placing the order item with the supplier, if the supplier responds with ORDERED, then the customer must be notified immediately via email.

Obviously in this contrived sample we’re missing other blatantly obvious business actions, like taking payment, but we want the focus to remain on how the action of placing an order makes calls outside of the entity itself, and how it raises notifications. All this story is about is making notifications based on results from external calls, but how does one design this? That is what we’ll see evolve over each of the samples.

When each line item is sent to the respective supplier, the supplier may choose to fulfil the order immediately and will respond with the result of ORDERED. In such a case we need to notify the customer of this state via e-mail. The supplier may not fulfil the order immediately, in which case they would respond with the result of PENDING. The supplier may choose to tell us at a later date that it is fulfilled – but we won’t be covering that situation here. The business requirement is that when the order item transitions to the ORDERED state, then we need to notify the customer.

Now, to highlight the problem, let’s say I have a method on my order item which will place the order on my behalf. However, this has to occur through an external service. On receipt of the result of that action, we need to determine additional operations, i.e. possibly sending the email to the customer. How does the domain object here (the order item) gain access to the services responsible for carrying out this work? I’m talking about both the placing of the order item with the supplier, and the sending of the email to the customer. There are a few approaches, and so begins my journey to find the best solution…

First Ruby Experience…

So I’ve had this itch I’ve needed to scratch in the form of learning a little Ruby and writing something in it that has a little more substance than just “Hello World!”. Well, that time has come, and I’m pleased to say I’ve got something! Not necessarily something that will solve world hunger, but I think it’s neat. I was inspired by some work I found over on Coder Wall, where Victor Martinez shared a nifty idea about taking a photo every time you perform a git commit. I decided to take it one step further, and have these photos uploaded to a G+ Photo Album. And here begins my Ruby experience…

Preamble

When running a Ruby app/script that requires a gem, as mine does (curb and nokogiri), you also need require 'rubygems' at the top of your Ruby file to tell your app to load the gems it needs, or else nothing will work. This is only relevant for those running Ruby pre v1.9. I’m developing on my Mac, where the default is v1.8.7. There are some great docs on Ruby, and especially here (See section on Post Install) about dealing with Ruby gems in your project.

Dependencies

So let’s talk about what my project depends on in order to work.

  1. ImageSnap – Without this, it’s pretty useless. Unfortunately, it’s not all pure ruby. You’ll have to download it and install it somewhere on your machine where it’ll be universally available. There is an ImageSnap Ruby gem, but all that is, is a wrapper around calling imagesnap on the command line. If you’re interested, you can see it on Github
  2. Curb – really neat library for making HTTP requests, and dealing with responses.
  3. Nokogiri – XML library to search and build XML docs
  4. Ruby’s built-in OptionParser – used to parse my Ruby app’s command-line arguments

Configuration

Once you’ve downloaded the source, you can install the gem locally. Sorry, it’s not available publicly – yet! Once you have the source downloaded, just run

gem build picasa-photo-uploader.gemspec
gem install picasa-photo-uploader

If you get an error about trying to load the picasa-photo-uploader gem, then you may very well have an issue with which Ruby installation the script is trying to use. A good source on this can be found at http://docs.rubygems.org/read/chapter/19. On my Mac I have the default 1.8 installed, but somehow I was actually using a different version of Ruby than I thought. Some changes to symlinks fixed the issue for me 🙂

Git Commit Hook

Finally, once everything was installed, it was time to edit my git commit hook script. I gleaned all I needed to know from the git docs. My commit script looks a little like this:

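#!/usr/bin/env ruby
require 'rubygems' # needed on Ruby < 1.9 so that gems can be loaded
require 'picasa-photo-uploader'

# Skip the photo while a rebase is in progress (.git/rebase-merge exists).
unless File.directory?(File.expand_path("../../rebase-merge", __FILE__))
    # Expand "~" explicitly; Ruby does not do this for plain strings.
    file = File.expand_path("~/.gitshots/#{Time.now.to_i}.jpg")
    # Arguments must be comma-separated; adjacent string literals would be concatenated into one.
    PicasaPhotoUploader.snapAndUpload "GMAIL_EMAIL_ADDRESS", "PASSWORD", "ALBUM_NAME", file
end
exit 0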

Conclusion

Learning Ruby was a rocky journey. I am by no means an expert, but I think I have a fair grounding in what it is capable of. One thing I seriously battled with was getting different versions of Ruby to play nice. One tool to make this easier for you is called RVM. I strongly suggest you check it out if you haven’t already. I feel like my rocky journey wasn’t helped by the fact that I had the default v1.8.7 installed on my Mac – there seem to be endless issues with using <= 1.8.7 and installing gems etc. I found the last hurdle, the git post-commit script, a long and arduous battle to get working. I had endless issues with the dreaded LoadError.

Source

If you’re interested, the source can be found here.

Ruby Gem install – mkmf.rb can’t find header files for ruby problem

Just a REALLY quick post. I’ve been seriously wanting to write a small Ruby project for a long time now, and have finally been able to get a start on it. However, as I suspected, not everything is always smooth. I tried to do a gem install for some dependent libraries I’ll need, and got the following error:

/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby extconf.rb
mkmf.rb can't find header files for ruby at /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/ruby.h

If you’re looking for a solution, then try the following:

  1. Load up Xcode (assuming you have it installed)
  2. Once open, click on Xcode -> Preferences -> Downloads
  3. In the list presented, click on the Command Line Tools, and install
  4. Restart
  5. Job Done!

This problem disappeared for me, and I hope it helps anyone else experiencing this annoying problem!

Incidentally, I am using:

Mac OS X 10.8.2
Ruby Version: ruby 1.8.7 (2012-02-08 patchlevel 358) [universal-darwin12.0]

Watch this space for my first steps into the ruby world!