From Paris to Berlin — Creating Circuit-Breakers in Kotlin

What are circuit-breakers? They stop a circuit when something wrong is happening, right? But how can we control that?

1. Introduction

Circuit-breakers are used nowadays to avoid an excessive number of requests being made to a non-responsive service. When a service shuts down, for whatever reason, a circuit-breaker should, as its name states, break the circuit. In other words, in a situation where 1 million requests are being made at the same time to get the results of a horse race, we want these requests to be redirected to another service that can handle them. This other service can be a replica of the original, or it can purely be used to perform other operations related to the failure of the original service. The end goal is always to break unnecessary calls and conduct the flow somewhere else.

In 2007, Michael Nygard brought the Circuit Breaker design pattern to the forefront of software development design in his publication Release It!: Design and Deploy Production-Ready Software (Pragmatic Programmers), 1st Edition. The circuit breaker design pattern is inspired by actual electronic and electrical circuits. As a general concept, however, the circuit-breaker idea was actually invented in 1879 by Thomas Edison. Just like in that time, an overflowing current needs to be handled. In very, very simple terms, this is what we are applying to software architecture. The main goal is to make sure that the system is resilient enough. How resilient and how fault-tolerant it must be is really in the eyes of the engineers responsible for implementing this pattern.

The idea behind it is that, under certain conditions, we may want to seamlessly redirect the flow of a certain request to another, more available flow behind the same endpoint. Let’s say we want to perform a request from A to B. From time to time B fails, and C is always available. If B fails randomly, we want to reach C in order to keep our service fully available. However, before we send requests back to B, we want to make sure that B doesn’t fail that much again.

We can then configure our system to make occasional requests to B and only fully get back to B once the failure rate has gone below a certain level. We may want to make requests to C on error, but also on latency. If B is very slow, we may want to re-send all requests to C. There are many other possible configurations, like trying to reach C after a defined number of tries, or depending on request types, concurrent threads, and many other options. This is also known as short-circuiting, and it is mostly a temporary move.

State Machine


In order to better understand what a circuit breaker actually is, we have to understand that a circuit breaker works as an entity in our application. There are three main statuses for a circuit-breaker. It can be closed, open, or half-open.

A closed status means that our application flows run normally. We can safely make requests to service A, knowing that all requests will go to service B.

An open state means that all requests to service B will fail. The rules we have defined to represent a failure have been met, and we no longer reach service B. In this case, an exception is always returned.

A half-open state is when our circuit-breaker is instructed to perform test requests to service B to see if it is operational again. Every successful request is handled normally, but the circuit-breaker will continue to make requests to C. If B behaves as expected according to the verification rules we have set in place, our circuit-breaker will return to a closed state, and service A will start making requests exclusively to service B.

In most applications, a circuit-breaker follows the decorator design pattern. It can, however, be implemented manually. We will have a look at three programmatic ways of implementing circuit-breakers and, finally, at an AOP-based implementation. The code is available on GitHub.
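To make the three states more concrete, here is a minimal sketch of such a state machine in plain Kotlin. It is not how Hystrix or Resilience4J implement it. The class name SimpleCircuitBreaker and the parameters failureThreshold, openMillis, and halfOpenTrials are assumptions made only for this illustration, and for simplicity it counts consecutive failures instead of a failure rate and sends every half-open call to B as a trial.

```kotlin
enum class State { CLOSED, OPEN, HALF_OPEN }

// Hypothetical, simplified circuit breaker used only to illustrate the states.
class SimpleCircuitBreaker(
    private val failureThreshold: Int = 3,   // consecutive failures before opening
    private val openMillis: Long = 1_000,    // how long to stay open before testing B again
    private val halfOpenTrials: Int = 2,     // successful trials needed to close again
    private val clock: () -> Long = System::currentTimeMillis
) {
    var state: State = State.CLOSED
        private set
    private var failures = 0
    private var trialSuccesses = 0
    private var openedAt = 0L

    // primary stands for the call to B, fallback for the call to C.
    fun <T> call(primary: () -> T, fallback: () -> T): T {
        if (state == State.OPEN) {
            // While open, B is not called at all.
            if (clock() - openedAt < openMillis) return fallback()
            state = State.HALF_OPEN
            trialSuccesses = 0
        }
        return try {
            val result = primary()
            onSuccess()
            result
        } catch (e: Exception) {
            onFailure()
            fallback()
        }
    }

    private fun onSuccess() {
        if (state == State.HALF_OPEN) {
            trialSuccesses++
            if (trialSuccesses >= halfOpenTrials) {
                state = State.CLOSED
                failures = 0
            }
        } else {
            failures = 0
        }
    }

    private fun onFailure() {
        // A failed trial in half-open reopens the circuit immediately.
        if (state == State.HALF_OPEN || ++failures >= failureThreshold) {
            state = State.OPEN
            openedAt = clock()
        }
    }
}
```

Calling breaker.call({ requestToB() }, { requestToC() }) with two hypothetical request functions would then route traffic according to the current state. The injectable clock is only there to make the open-state wait easy to test.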

Diagram

2. Car checks

In the last point of this article, we will have a look at a car race game. However, before we get there, I’d like to guide you through some of the aspects of building an application running with a circuit breaker.

2.1. Kystrix (from-paris-to-berlin-kystrix-runnable-app)

Kystrix, as a small DSL, is an amazing library invented and created by Johan Haleby. The library provides a lot of possibilities, including integration with Spring and Spring WebFlux. It is interesting to have a look at it and play around a bit:

<dependency>
    <groupId>se.haleby.kystrix</groupId>
    <artifactId>kystrix-core</artifactId>
</dependency>
<dependency>
    <groupId>se.haleby.kystrix</groupId>
    <artifactId>kystrix-spring</artifactId>
</dependency>

I have created an example, and it is located in the module from-paris-to-berlin-kystrix-runnable-app on GitHub. First, we take a look at the code:

@GetMapping("/{id}")
private fun getCars(@PathVariable id: Int): Mono<Car> {
    return if (id == 1)
        Mono.just(Car("Jaguar")) else {
        hystrixObservableCommand<Car> {
            groupKey("Test2")
            commandKey("Test-Command2")
            monoCommand {
                webClient.get().uri("/cars/carros/1").retrieve().bodyToMono<Car>()
                    .delayElement(Duration.ofSeconds(1))
            }
            commandProperties {
                withRequestLogEnabled(true)
                withExecutionTimeoutInMilliseconds(5000)
                withExecutionTimeoutEnabled(true)
                withFallbackEnabled(true)
                withCircuitBreakerEnabled(false)
                withCircuitBreakerForceClosed(true)
            }
            fallback { Observable.just(Car("Tank1")) }
        }.toMono()
    }
}

This code represents command 2 of the example. Check the code for command 1. What is happening here is that we are defining the command we want with monoCommand. Here, we define the method we need to call. In commandProperties, we define the rules that decide when the call is short-circuited to the fallback. Note that in this example the circuit-breaker itself is disabled (withCircuitBreakerEnabled(false)), so the timeout is the only trigger. We explicitly delay our call so that it lasts about 1 second. At the same time, we define a timeout of 5000 milliseconds. This means that we will never reach a timeout.

In this example, we can make calls with an Id. Since this is just a test, we assume Id=1 to be the Id of a car, Jaguar, with no need for a circuit-breaker. For any other Id the timeout is never reached, which means that we will never get Tank1 as defined in the fallback method.

If you haven’t noticed yet, take a close look at the fallback method. This method is using an Observable. Although WebFlux is implemented according to the Observer design pattern, Flux is not exactly an Observable. However, Hystrix supports both. Please run the application and open your browser on http://localhost:8080/cars/2 to confirm this.

It is important to understand that if you start making calls very early in the startup of Spring Boot, you may eventually get a Tank1 message. This is because the startup delay can surpass 5 seconds very easily, depending on how you are running this process. In the second example, we are going to short-circuit our example to Tank2:

@GetMapping("/timeout/{id}")
private fun getCarsTimeout(@PathVariable id: Int): Mono<Car> {
    return if (id == 1)
        Mono.just(Car("Jaguar")) else {
        hystrixObservableCommand<Car> {
            groupKey("Test3")
            commandKey("Test-Command3")
            monoCommand {
                webClient.get().uri("/cars/carros/1").retrieve().bodyToMono<Car>()
                    .delayElement(Duration.ofSeconds(1))
            }
            commandProperties {
                withRequestLogEnabled(true)
                withExecutionIsolationThreadInterruptOnTimeout(true)
                withExecutionTimeoutInMilliseconds(500)
                withExecutionTimeoutEnabled(true)
                withFallbackEnabled(true)
                withCircuitBreakerEnabled(false)
                withCircuitBreakerForceClosed(true)
            }
            fallback { Observable.just(Car("Tank2")) }
        }.toMono()
    }
}

In this example, the call is short-circuited and Tank2 is returned as a response. This is because we are also causing a 1s delay here, but we specify that the timeout triggers after the 500ms mark. If you know how Hystrix works, you’ll find that Kystrix isn’t anything different moving forward. What Hystrix didn’t provide for me at this point was a seamless, effortless way to provide what I needed to make the game. Kystrix seems to work on a client basis. This means that we have to declare our commands in code before making requests to services behind our main service.

2.2. Resilience4J

Resilience4J seems to be referenced by many as a very complete implementation of a circuit-breaker. My first trials went about exploring some important aspects of circuit-breakers. Namely, I wanted to see a circuit-breaker that could work on the basis of timeouts and frequency of successful requests. Resilience4J allows for different types of short-circuiting modules to be configured. The ones we look at here fall into 5 different categories: CircuitBreaker, Bulkhead, RateLimiter, Retry, and TimeLimiter. All of these are also names of design patterns.

The CircuitBreaker module provides a complete implementation of this pattern. We have lots of parameters we can configure, but essentially, the CircuitBreaker module allows us to configure what we recognize as a failure, how many requests we allow in a half-open state, and a sliding window, which can be configured by time or count, where we keep the count of requests occurring in a closed state. This is important to calculate the error frequency. We could say that the CircuitBreaker module helps us with the rate of requests, but that isn’t necessarily true. It depends on how you interpret it. It seems better to think of it as simply a way to deal with faults. Whether they come from a timeout or an exception, this is where they are dealt with and where the requests can be seamlessly redirected somewhere else.

The Bulkhead module is designed to deal with concurrent requests. It is not a rate limiter. Instead, it implements the Bulkhead design pattern, which is used to prevent too much processing from occurring in one single endpoint. In this case, Bulkhead allows us to process our requests in a way that they get distributed across all available endpoints. The name Bulkhead comes from the different sealed compartments a large ship usually has to avoid being sunk should an accident occur. Like in the case of ships, we need to define how many threads will be available in the thread pool and their lease time.

The RateLimiter module is designed to handle the rate of requests. The difference between this and the Bulkhead module is essentially that we want to be tolerant of rates up to a certain point. This means that we don’t need to cause a failure for that. We just say, in the design, that we don’t tolerate a rate above a certain value. In addition, we can either redirect a request or keep it on hold until permission to perform the request is granted.

The Retry module is probably the easiest to understand since it does not have much in common with the other modules. We essentially declare the number of retries to a certain endpoint explicitly, and keep retrying until we reach our defined threshold.

The TimeLimiter module can be seen as a simplification of the CircuitBreaker module in that they both share the possibility to configure timeouts. However, TimeLimiter does not depend on other parameters like sliding windows, and it also does not have a built-in failure threshold calculation. So, if we are purely interested in handling timeouts when calling a certain service, and we don’t factor in other possible faults, then we are probably better off with TimeLimiter.
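Of the modules above, Retry is the easiest to picture in code. The sketch below is a hand-written helper in plain Kotlin and not the Resilience4J Retry API. The function name retry and the parameters maxAttempts and waitMillis are assumptions chosen for this illustration.

```kotlin
// Hypothetical helper, not the Resilience4J Retry API: calls block up to
// maxAttempts times and rethrows the last exception if every attempt fails.
fun <T> retry(maxAttempts: Int, waitMillis: Long = 0, block: (attempt: Int) -> T): T {
    require(maxAttempts >= 1) { "maxAttempts must be at least 1" }
    var lastError: Exception? = null
    for (attempt in 1..maxAttempts) {
        try {
            return block(attempt)
        } catch (e: Exception) {
            lastError = e
            // Wait between attempts, but not after the final one.
            if (attempt < maxAttempts && waitMillis > 0) Thread.sleep(waitMillis)
        }
    }
    throw lastError!!
}
```

The attempt number is passed to the block only so that callers can log it or, as in a test, simulate a service that recovers after a few tries.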

2.2.1. Resilience4J with Kotlin and No Spring framework (from-paris-to-berlin-resilience4j-runnable-app)

In this module, I’ve decided to use just the resilience4j kotlin library:

<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-kotlin</artifactId>
</dependency>
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-retry</artifactId>
</dependency>
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-circuitbreaker</artifactId>
</dependency>
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-ratelimiter</artifactId>
</dependency>
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-timelimiter</artifactId>
</dependency>

This implementation is available in the repo on GitHub. We’ll first take a look at the TimeLimiter pattern:

var timeLimiterConfig: TimeLimiterConfig = TimeLimiterConfig.custom()
    .timeoutDuration(Duration.ofMillis(100))
    .build()

var timeLimiter: TimeLimiter = TimeLimiter.of("backendName", timeLimiterConfig)

private suspend fun getPublicCar(): Car {
    return timeLimiter.decorateSuspendFunction {
        getPrivateCar()
    }.let { suspendFunction ->
        try {
            suspendFunction()
        } catch (exception: Exception) {
            Car("Opel Corsa")
        }
    }
}

private suspend fun getPrivateCar(): Car {
    delay(10000)
    return Car("Lancya")
}

In this case, we are decorating our function getPrivateCar with the TimeLimiter functionality using the function decorateSuspendFunction. What this will do is cause a timeout: if the function we call takes too long, we get an Opel Corsa instead of a Lancya. To try this, we can just run the application and open http://localhost:8080/cars/timelimiter/normal/1. Looking into the implementation, we see that we can never get a Lancya. This is because we purposely wait 10s before we return it. Our TimeLimiter has a much lower timeout, and so this will never work. A TimeLimiter is fairly simple to understand. A CircuitBreaker, on the other hand, can be a different story. This is an example of how that can be done:

val circuitBreakerConfig =
    CircuitBreakerConfig.custom()
        .failureRateThreshold(20f)
        .slowCallRateThreshold(50f)
        .slowCallDurationThreshold(Duration.ofMillis(1000))
        .waitDurationInOpenState(Duration.ofMillis(1000))
        .maxWaitDurationInHalfOpenState(Duration.ofMillis(1000))
        .permittedNumberOfCallsInHalfOpenState(500)
        .minimumNumberOfCalls(2)
        .slidingWindowSize(2)
        .slidingWindowType(COUNT_BASED)
        .build()


val circuitBreaker = CircuitBreakerRegistry.of(circuitBreakerConfig).circuitBreaker("TEST")

private suspend fun getPublicCar(id: Long): Car {
    return circuitBreaker.decorateSuspendFunction {
        getPrivateCar(id)
    }.let { suspendFunction ->
        try {
            suspendFunction()
        } catch (exception: Exception) {
            Car("Opel Corsa")
        }
    }
}

private fun getPrivateCar(id: Long): Car {
    if (id == 2L) {
        throw RuntimeException()
    }
    return Car("Lancya")
}

In this case, we are saying that we want our circuit breaker to keep the circuit closed only while the failure rate is lower than 20%, with the failureRateThreshold property. Slow calls also have a threshold, slowCallRateThreshold, and in this case the rate of slow calls must stay below 50%. With slowCallDurationThreshold, we say that a call must last longer than 1s to be considered slow. We are also specifying that the duration of a half-open state should be 1s. This means, in practice, that we will have either an open state, a half-open state, or a closed state. We also say that we allow a maximum of 500 half-open state requests, with permittedNumberOfCallsInHalfOpenState.

For error calculations, the circuit breaker needs to know how many calls it must see before it can calculate a rate. This is important to determine when to close the circuit. We say that a minimum of 2 requests is necessary for this calculation, with the minimumNumberOfCalls property. Remember that half-open is when we keep trying to get the circuit to close if the requests reach a safe failure threshold. In this configuration, it means that we need to make at least 2 requests, within the sliding window, to calculate the error frequency and determine whether to get back to a closed state or not. This is the accurate reading of all the variables we have configured.

In general, what this means is that our application will probably make several calls to the alternative service, should there be one. It will not switch very easily from open to closed states, given that the success rate to do that must be more than 80% during half-open states and the wait time for the open state must have elapsed. There are many ways to specify such timeouts. In our example, we say that maxWaitDurationInHalfOpenState is 1s. This means that our CircuitBreaker will keep status open only if our check does not satisfy the closed state condition or if this timeout has not yet occurred.
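The interplay between slidingWindowSize, minimumNumberOfCalls, and failureRateThreshold can be sketched in plain Kotlin. This is a simplified model and not the internals of Resilience4J. The names CountBasedWindow and shouldOpen are assumptions made for this illustration, and slow calls are left out.

```kotlin
// Hypothetical count-based sliding window: keeps only the last `size` outcomes.
class CountBasedWindow(private val size: Int, private val minimumNumberOfCalls: Int) {
    private val outcomes = ArrayDeque<Boolean>() // true = failed call

    fun record(failed: Boolean) {
        if (outcomes.size == size) outcomes.removeFirst()
        outcomes.addLast(failed)
    }

    // Failure rate in percent, or null while too few calls have been recorded.
    fun failureRate(): Float? {
        if (outcomes.size < minimumNumberOfCalls) return null
        return 100f * outcomes.count { it } / outcomes.size
    }
}

// The circuit opens once the rate can be calculated and reaches the threshold.
fun shouldOpen(window: CountBasedWindow, failureRateThreshold: Float): Boolean {
    val rate = window.failureRate() ?: return false
    return rate >= failureRateThreshold
}
```

With a window of 2 and a threshold of 20%, as configured above, a single failure among the last two calls already gives a 50% failure rate, which is why one call to the failing endpoint is enough to open the circuit.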
The behavior defined in this CircuitBreaker configuration may be difficult to follow and to predict, purely because specific downtimes, rates, and other features of requests are just not possible to replicate exactly. But if we perform several requests to this endpoint, we’ll see that the behavior described above matches our experience. So let’s try and perform several requests to the endpoints http://localhost:8080/cars/circuit/1 and http://localhost:8080/cars/circuit/2. Ending in 1 is the endpoint of a successful car retrieval, and ending in 2 is the endpoint of a failure in getting a specified car. Looking at the code, we see that anything other than a 2 means that we get a Lancya as a response. A 2 means that we immediately throw a runtime exception, which means that we end up getting an Opel Corsa as a response. If we just make requests to endpoint 1, we’ll keep seeing Lancya as a response. If the system starts failing, that is, when you make requests to 2, you’ll see that getting back to Lancya will not be a constant after a while. The system will inform that it is in an open state and that no more requests are permitted.

2021-10-20 09:56:50.492 ERROR 34064 --- [ctor-http-nio-2] .f.c.b.r.r.c.CarControllerCircuitBreaker : io.github.resilience4j.circuitbreaker.CallNotPermittedException: CircuitBreaker 'TEST' is OPEN and does not permit further calls

Our circuit breaker will then go to a half-open state after a successful request, and this means that we’ll need to perform a few requests back to 1 before it normalizes. We’ll be switching from Lancya to Opel Corsa a couple of times before we just get Lancya again. We defined this number to be 2. This is the minimum for error calculation. If we only cause one failure and keep on calling the non-fail endpoint, we can get a clearer picture of what is happening:

2021-10-20 11:53:29.058 ERROR 34090 --- [ctor-http-nio-4] .f.c.b.r.r.c.CarControllerCircuitBreaker : java.lang.RuntimeException
2021-10-20 11:53:41.102 ERROR 34090 --- [ctor-http-nio-4] .f.c.b.r.r.c.CarControllerCircuitBreaker : io.github.resilience4j.circuitbreaker.CallNotPermittedException: CircuitBreaker 'TEST' is OPEN and does not permit further calls

This open-status message, while true, was logged after I made 2 requests to the non-fail endpoint. This is why the state is said to be half-open.

2.2.2. Resilience4J with Spring Boot and No AOP (from-paris-to-berlin-resilience4j-spring-app)

In the previous segment, we have seen how to implement circuit-breakers in a very programmatic way, without the use of any Spring technology. We did use Spring, but only to provide a WebFlux MVC type of service. Further, we didn’t change anything about the services themselves. In the following application, we’ll explore the following libraries:

<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-spring-boot2</artifactId>
</dependency>
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-all</artifactId>
</dependency>
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-reactor</artifactId>
</dependency>

When looking into how the code is done, we can see quite a big difference:

@RestController
@RequestMapping("/cars")
class CarController(
    private val carService: CarService,
    timeLimiterRegistry: TimeLimiterRegistry,
    circuitBreakerRegistry: CircuitBreakerRegistry,
    bulkheadRegistry: BulkheadRegistry
) {
    private var timeLimiter: TimeLimiter = timeLimiterRegistry.timeLimiter(CARS)
    private var circuitBreaker = circuitBreakerRegistry.circuitBreaker(CARS)
    private var bulkhead = bulkheadRegistry.bulkhead(CARS)

    @GetMapping("/{id}")
    private fun getCars(@PathVariable id: Int): Mono<Car> {
        return carService.getCar()
            .transform(TimeLimiterOperator.of(timeLimiter))
            .transform(CircuitBreakerOperator.of(circuitBreaker))
            .transform(BulkheadOperator.of(bulkhead))
            .onErrorResume(TimeoutException::class.java, ::fallback)
    }


    @GetMapping("/test/{id}")
    private fun getCarsTest(@PathVariable id: Int): Mono<Car> {
        return carService.getCar()
            .transform(TimeLimiterOperator.of(timeLimiter))
            .transform(CircuitBreakerOperator.of(circuitBreaker))
            .transform(BulkheadOperator.of(bulkhead))
            .onErrorResume(TimeoutException::class.java, ::fallback)
    }

    @GetMapping("/carros/{id}")
    private fun getCarros(@PathVariable id: Long): Mono<Car> {
        return Mono.just(Car("Laborghini"))
    }

    private fun fallback(ex: Throwable): Mono<Car> {
        return Mono.just(Car("Rolls Royce"))
    }
}

In this example and the following, we are going to look mostly at timeout properties. The intricacies of the CircuitBreaker itself are less relevant because this is an introductory article. What is important to realize here is how easily we can implement this with the decorators provided for Spring by Resilience4J. Although still in a programmatic fashion, we can easily decorate our initial publisher, the one we get from carService.getCar(), with the short-circuit types we want. In this example, we register a TimeLimiter, a Bulkhead, and a CircuitBreaker. Finally, we define the fallback function to be triggered once a TimeoutException has occurred. What we still need to see is how this is all configured. We configure Resilience4J in Spring just like any other configurable module. We use application.yml:

resilience4j.circuitbreaker:
  configs:
    default:
      registerHealthIndicator: true
      slidingWindowSize: 10
      minimumNumberOfCalls: 5
      permittedNumberOfCallsInHalfOpenState: 3
      automaticTransitionFromOpenToHalfOpenEnabled: true
      waitDurationInOpenState: 5s
      failureRateThreshold: 50
      eventConsumerBufferSize: 10
      recordExceptions:
        - org.springframework.web.client.HttpServerErrorException
        - java.util.concurrent.TimeoutException
        - java.io.IOException
      ignoreExceptions:
#        - io.github.robwin.exception.BusinessException
    shared:
      slidingWindowSize: 100
      permittedNumberOfCallsInHalfOpenState: 30
      waitDurationInOpenState: 1s
      failureRateThreshold: 50
      eventConsumerBufferSize: 10
      ignoreExceptions:
#        - io.github.robwin.exception.BusinessException
  instances:
    cars:
      baseConfig: default
    roads:
      registerHealthIndicator: true
      slidingWindowSize: 10
      minimumNumberOfCalls: 10
      permittedNumberOfCallsInHalfOpenState: 3
      waitDurationInOpenState: 5s
      failureRateThreshold: 50
      eventConsumerBufferSize: 10
#      recordFailurePredicate: io.github.robwin.exception.RecordFailurePredicate
resilience4j.retry:
  configs:
    default:
      maxAttempts: 3
      waitDuration: 100
      retryExceptions:
        - org.springframework.web.client.HttpServerErrorException
        - java.util.concurrent.TimeoutException
        - java.io.IOException
      ignoreExceptions:
#        - io.github.robwin.exception.BusinessException
  instances:
    cars:
      baseConfig: default
    roads:
      baseConfig: default
resilience4j.bulkhead:
  configs:
    default:
      maxConcurrentCalls: 100
  instances:
    cars:
      maxConcurrentCalls: 10
    roads:
      maxWaitDuration: 10ms
      maxConcurrentCalls: 20

resilience4j.thread-pool-bulkhead:
  configs:
    default:
      maxThreadPoolSize: 4
      coreThreadPoolSize: 2
      queueCapacity: 2
  instances:
    cars:
      baseConfig: default
    roads:
      maxThreadPoolSize: 1
      coreThreadPoolSize: 1
      queueCapacity: 1

resilience4j.ratelimiter:
  configs:
    default:
      registerHealthIndicator: false
      limitForPeriod: 10
      limitRefreshPeriod: 1s
      timeoutDuration: 0
      eventConsumerBufferSize: 100
  instances:
    cars:
      baseConfig: default
    roads:
      limitForPeriod: 6
      limitRefreshPeriod: 500ms
      timeoutDuration: 3s

resilience4j.timelimiter:
  configs:
    default:
      cancelRunningFuture: false
      timeoutDuration: 2s
  instances:
    cars:
      baseConfig: default
    roads:
      baseConfig: default

This file is an example file taken from their repo and modified for my example. As we have seen before, instances of the different types of limiters and short-circuits have a name. The name is very important if you have many different registries and different limiters. For our example, and just like mentioned before, we are interested in the timelimiter. We can see that it is limited to 2s. If we look at the way we have the service implemented, we see that we are forcing a timeout to happen:

@Component
open class CarService {

    open fun getCar(): Mono<Car> {
        return Mono.just(Car("Fiat")).delayElement(Duration.ofSeconds(10));
    }

}

Let’s start the application and in the browser go to: http://localhost:8080/cars/test/2. Instead of getting a Fiat, we’ll get a Rolls Royce. This is because the service takes 10s to respond, while the timeout is defined as 2s. In the same way, we can easily create a CircuitBreaker.

3. Case

Up until now, we have seen three essential ways to implement CircuitBreakers and related limiters. Further, we will have a look at my favorite way of implementing circuit breakers by going through an application I’ve made, which is a very simple game where we just click on squares in order to get from Paris to Berlin. The game is made to show how to implement circuit breakers. It does not say much about where to implement them. It is just a case I’ve designed to share the know-how with you. The know-when I leave to you to decide later.

Essentially, we want to create a number of cars and establish a route to Berlin. In different locations on this route, we’ll get to cities where, randomly, we’ll create problems. Our circuit breakers will decide how much time we’ll have to wait before we are allowed to move on. The other cars have no problems, and we just need to pick the right route.

We are allowed to check a timetable where it is registered when a certain problem will happen in a city on a certain minute mark. The minute mark refers to the last digit of the minute on the clock. This means that a mark of 2 will create the problem at every 2, 12, 22, 32, 42, and 52-minute mark on the clock. Problems will be of 2 sorts: ERROR and TIMEOUT. An error will give you a delay of 20 seconds. A timeout will give you a delay of 50 seconds. For every city change, everyone has to wait 10 seconds. When the move is handled in the fallback methods, however, the car is already at the entry of the following city before waiting. In this case, the next city is chosen randomly.
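The timetable rule can be sketched in a few lines of Kotlin. This is only an illustration of the rule as described above and not the code of the game. The names Problem, ProblemType, activeProblem, and penaltySeconds are assumptions made for this sketch.

```kotlin
import java.time.LocalDateTime

// Hypothetical types that model one timetable entry of a city.
enum class ProblemType { ERROR, TIMEOUT }

data class Problem(val minuteMark: Int, val type: ProblemType)

// A problem is active when the last digit of the current minute
// matches the last digit of its minute mark.
fun activeProblem(timeTable: List<Problem>, now: LocalDateTime): Problem? =
    timeTable.find { it.minuteMark % 10 == now.minute % 10 }

// Delay in seconds for each kind of problem, as described in the rules.
fun penaltySeconds(type: ProblemType): Long = when (type) {
    ProblemType.ERROR -> 20L
    ProblemType.TIMEOUT -> 50L
}
```

Passing the current time as a parameter, instead of calling LocalDateTime.now() inside the function, keeps the rule easy to check for any minute on the clock.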

Game Page

4. Implementation

We have seen before how to configure our resilience4j registry using application.yml. Having done this, let’s have a look at some examples of how to decorate our functions:

@TimeLimiter(name = CarService.CARS, fallbackMethod = "reportTimeout")
@CircuitBreaker(name = CarService.CARS, fallbackMethod = "reportError")
@Bulkhead(name = CarService.CARS)
open fun moveToCity(id: Long): Mono<RoadRace> {
    val myCar = roadRace.getMyCar()
    if (!myCar.isWaiting()) {
        val destination = myCar.location.forward.find { it.id == id }
        val blockage = destination?.blockageTimeTable?.find {
            it.minute.toString().last() == LocalDateTime.now().minute.toString().last()
        }
        blockage?.let { roadBlockTime ->
            when (roadBlockTime.blockageType) {
                BlockageType.TIMEOUT -> return Mono.just(roadRace).delayElement(Duration.ofSeconds(10))
                BlockageType.ERROR -> return Mono.create { it.error(BlockageException()) }
                BlockageType.UNKNOWN -> return listOf(Mono.create { it.error(BlockageException()) },
                    Mono.just(roadRace).delayElement(Duration.ofSeconds(10))).random()
                else -> print("Nothing to do here!")
            }
        }

        destination?.let {
            myCar.delay(10)
            myCar.location = it
            myCar.formerLocations.add(myCar.location)
        }
    }
    return Mono.just(roadRace)
}

private fun reportError(exception: Exception): Mono<RoadRace> {
    logger.info("---- **** error reported")
    roadRace.getMyCar().delay(20L)
    roadRace.getMyCar().randomFw()
    roadRace.errorReports.add("Error reported! at ${LocalDateTime.now()}")
    return Mono.create { it.error(BlockageException()) }
}

private fun reportTimeout(exception: TimeoutException): Mono<RoadRace> {
    logger.info("---- **** timeout reported!")
    roadRace.getMyCar().delay(50L)
    roadRace.getMyCar().randomFw()
    roadRace.errorReports.add("Timeout reported! at ${LocalDateTime.now()}")
    return Mono.just(roadRace)
}

As we can see, the original service calls are directly decorated using annotations! This can only be done due to the presence of the AOP module in the package:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-aop</artifactId>
</dependency>

AOP, or Aspect-Oriented Programming, is another programming paradigm based on OOP. It is considered a complement to OOP, and it is precisely how many annotations work. This allows for the triggering of other functions around, before, or after the original method at precise pointcuts. As you can see from the example, we are either generating timeouts or errors. The BlockageException is also generated inside the fallback method. This does not represent a problem, except in the response. However, the application is running on WebSockets, and therefore this error will not be seen in the application.

So far, this was the game. I implemented it with the focus on showing how using annotations can make our life much easier when implementing a resilient application. We not only have CircuitBreakers implemented, but also other technologies, like WebSockets, Spring WebFlux, Docker, NGINX, TypeScript, and several others. This has all been made to see how CircuitBreakers would play out in an application. If you want to play with this application, please go to the root of the project and run:

make docker-clean-build-start

Then run this command:

curl -X POST http://localhost:8080/api/fptb/blockage -H "Content-Type: application/json" --data '{"id":1,"name":"Paris","forward":[{"id":2,"name":"Soissons","forward":[{"id":5,"name":"Aken","forward":[{"id":8,"name":"Berlin","forward":[],"blockageTimeTable":[]}],"blockageTimeTable":[]},{"id":6,"name":"Heerlen","forward":[{"id":8,"name":"Berlin","forward":[],"blockageTimeTable":[]}],"blockageTimeTable":[]},{"id":7,"name":"Düren","forward":[{"id":8,"name":"Berlin","forward":[],"blockageTimeTable":[]}],"blockageTimeTable":[]}],"blockageTimeTable":[]},{"id":3,"name":"Compiègne","forward":[{"id":5,"name":"Aken","forward":[{"id":8,"name":"Berlin","forward":[],"blockageTimeTable":[]}],"blockageTimeTable":[]},{"id":6,"name":"Heerlen","forward":[{"id":8,"name":"Berlin","forward":[],"blockageTimeTable":[]}],"blockageTimeTable":[]},{"id":7,"name":"Düren","forward":[{"id":8,"name":"Berlin","forward":[],"blockageTimeTable":[]}],"blockageTimeTable":[]}],"blockageTimeTable":[]},{"id":4,"name":"Reims","forward":[{"id":5,"name":"Aken","forward":[{"id":8,"name":"Berlin","forward":[],"blockageTimeTable":[]}],"blockageTimeTable":[]},{"id":6,"name":"Heerlen","forward":[{"id":8,"name":"Berlin","forward":[],"blockageTimeTable":[]}],"blockageTimeTable":[]},{"id":7,"name":"Düren","forward":[{"id":8,"name":"Berlin","forward":[],"blockageTimeTable":[]}],"blockageTimeTable":[]}],"blockageTimeTable":[]}],"blockageTimeTable":[]}'

The payload of this request is generated using the module from-paris-to-berlin-city-generator. If you look into this module, you’ll see that it is quite simple to understand and that you can generate your own map for the game! Finally, go to http://localhost:9000 and your application should be running! Now you should just click on the right squares to play the game. Just don’t click on the red ones if you want to win. If you want to see the circuit-breaker in action, though, then please follow the application logs:

docker logs from_paris_to_berlin_web -f

And explicitly click on the red squares in order to cause a failure.

5. How Kystrix and Resilience4J Differ

Kystrix is ideal in cases where your application is small, and you want to make sure to keep the usage of the DSL really low. The only downside, it seems, is that it does not offer an easy way to decorate methods to be affected by a circuit-breaker. Resilience4J seems to be a great option for enterprise work with circuit-breakers. It provides annotation-based programming, uses all the benefits of AOP, and its modules are separated. In a way, it can also be used strategically for critical points in the application. It can also be used as a complete framework to cover many aspects of an application.

6. Conclusion

Regardless of the brand we choose, the goal is always to have a resilient application. In this article, I showed some examples of how I personally experienced investigating circuit breakers and my findings on a very high level. This means that this article is really written for people who want to know what circuit breakers are and what Limiters can do.


The possibilities are quite frankly endless when thinking about improving our applications with resilience mechanisms like circuit breakers. This pattern does allow fine-tuning of an application in order to make better use of the resources we have available. Especially in the cloud, it is still very important to optimize our costs and how many resources we actually need to allocate.


Configuring CircuitBreakers is not as trivial a task as configuring the Limiters, and we really need to understand all the configuration possibilities in order to reach optimum levels of performance and resilience. This is the reason I didn’t want to go into details in this introductory article about circuit breakers.


Circuit-breakers can be applied to many different types of applications. Most messaging, streaming types of applications will need this. For applications that handle large volumes of data that need to also be highly available, we can and should implement some form of the circuit-breaker. Large online retail stores need to handle massive amounts of data on a daily basis and in the past, Hystrix was widely used. Currently, we appear to be moving in the direction of Resilience4J which encompasses a lot more than this.

Thank you!

I hope you enjoyed this article as much as I did making it!
Please leave a review, comments or any feedback you want to give on any of the socials in the links below.
I’m very grateful if you want to help me make this article better.
I have placed all the source code of this application on GitHub.
Thank you for reading!