City Library — An advanced guide to Circuit Breakers in Kotlin

1. Introduction

I’ve always been very interested in circuit breakers and how they work in software architecture. I’ve also worked on projects where an implementation of this design pattern was used. For large online distribution stores in particular, many companies love to use Hystrix. However, Hystrix is reaching the end of the road, and a replacement called Resilience4j is available. Regardless, I’ve seen Hystrix used in many ways. The possibilities are quite extensive, but in my professional life I’ve seen that not even half of the potential of a circuit-breaker implementation is used in practice. Besides Resilience4j, there are many other implementations of this design pattern, such as FailSafe, GoBreaker, Circuit-Breaker-For-Scala, and many others.

It’s always good to remind ourselves that the idea of a circuit breaker has existed since the 19th century. Back then, it referred to a switch that breaks an electrical circuit should the current rise to unacceptable levels. Fast-forward to 2007, when Michael Nygard published Release It! Design and Deploy Production-Ready Software (Pragmatic Programmers), where he brought the circuit-breaker design pattern to software development.

In this article, I’m assuming that you already have an idea of what circuit breakers are and what this design pattern is fundamentally about. The pattern allows many types of configuration, and we are going to have a look at the most important ones as we go along. I’m also assuming that you are quite familiar with how Spring works and that you have some idea of how the performance test tool Locust works. It is also important to have a general idea of how reactive programming is used in Spring WebFlux. The implementation we are going to see is located on GitHub.

2. Case

For demonstration purposes, we’ll make a city library! We are essentially going to add one or more books to a library and make them available to users online for reading. Before we actually build our library’s software, we are concerned right off the bat with whether the library can live up to resilience expectations. As many users are going to use the library, it must be constantly available to them. The Library already has a perfect service working in the cloud using reactive Spring WebFlux technologies. The only drawback is that the cloud provider goes into maintenance mode from time to time. The Library has a fallback service for this, though. A very old server machine in the attic has been set up for high availability as well. Although it is an on-premise installation, it works well enough for the usual library user. It is expected that library users will experience some latency should this fallback be needed.

Generic Diagram

3. Circuit breakers

In this section, we’ll go through unit tests I’ve designed in order to analyze the specific properties of circuit breakers. We will check how the circuit-breaker status changes and understand how to see that via the Spring health endpoint. Finally, we will make a thorough analysis of the circuit-breakers created working together and make a graphic analysis of the results.

Detailed Generic Diagram

Before we continue with our example, it is important that we first take a look at some implementation basics. In the Kotlin code, practically nothing changes between test cases with regard to the setup of our circuit breakers, so we’ll just have a look at the circuit-breaker implementation for test 1.

@Service
open class AlmG1BookService(
    private val webClientInterface: WebClientInterface
) {
    private val logger = KotlinLogging.logger {}

    @CircuitBreaker(name = ALMR_TC_1, fallbackMethod = "getBookByIdJPA")
    open fun getBookCBById(id: Long): Mono<BookDto> =
        webClientInterface.getBookViaReactiveServiceById(id)

    open fun getBookByIdJPA(id: Long, exception: Exception): Mono<BookDto> {
        logger.info("Current Exception:", exception)
        return webClientInterface.getBookViaJpaServiceById(id)
    }

    open fun getBookByIdJPA(id: Long, exception: WebClientRequestException): Mono<BookDto> {
        logger.info("Current Exception:", exception)
        return webClientInterface.getBookViaJpaServiceById(id)
    }

    open fun getBookByIdJPA(id: Long, exception: ReactiveAccessException): Mono<BookDto> {
        logger.info("Current Exception:", exception)
        return webClientInterface.getBookViaJpaServiceById(id)
    }

    open fun getBookByIdJPA(id: Long, exception: CallNotPermittedException): Mono<BookDto> {
        logger.info("Current Exception:", exception)
        return webClientInterface.getBookViaJpaServiceById(id)
    }

    open fun getBookByIdJPA(id: Long, exception: TimeoutException): Mono<BookDto> {
        logger.info("Current Exception:", exception)
        return webClientInterface.getBookViaJpaServiceById(id)
    }

    open fun getBookByIdJPA(id: Long, exception: IgnoredException): Mono<BookDto> {
        logger.info("Current Exception:", exception)
        return webClientInterface.getBookViaJpaServiceById(id)
    }

    @CircuitBreaker(name = ALMR_TC_1, fallbackMethod = "createBookByIdJPA")
    open fun createBook(bookDto: BookDto): Mono<BookDto> {
        return webClientInterface.sendBookViaReactiveService(bookDto)
    }

    open fun createBookByIdJPA(bookDto: BookDto, exception: WebClientRequestException): Mono<BookDto> {
        logger.info("Current Exception:", exception)
        return webClientInterface.sendViaJpaServiceBook(bookDto)
    }

    open fun createBookByIdJPA(bookDto: BookDto, exception: ReactiveAccessException): Mono<BookDto> {
        logger.info("Current Exception:", exception)
        return webClientInterface.sendViaJpaServiceBook(bookDto)
    }

    open fun createBookByIdJPA(bookDto: BookDto, exception: CallNotPermittedException): Mono<BookDto> {
        logger.info("Current Exception:", exception)
        return webClientInterface.sendViaJpaServiceBook(bookDto)
    }

    open fun createBookByIdJPA(bookDto: BookDto, exception: TimeoutException): Mono<BookDto> {
        logger.info("Current Exception:", exception)
        return webClientInterface.sendViaJpaServiceBook(bookDto)
    }

    open fun createBookByIdJPA(bookDto: BookDto, exception: IgnoredException): Mono<BookDto> {
        logger.info("Current Exception:", exception)
        return webClientInterface.sendViaJpaServiceBook(bookDto)
    }

    companion object {
        const val ALMR_TC_1 = "almr_testcase_1"
    }
}

Looking at this class, we see that the name of the circuit breaker for test case 1 is almr_testcase_1. The implementation is quite easy to follow. We only need to see that the fallback method is able to receive the input parameter of the original method and that it can use that parameter in the alternative method call. For example, if we try to get a Book with Id=1 and we get a TimeoutException, we’ll fall into the matching overload of getBookByIdJPA, which then tries to make a call to our JPA service. In our example, we assume that the JPA service is always available. This is all possible because we have annotated our original method, getBookCBById, with CircuitBreaker. That way, if we call getBookViaReactiveServiceById and it succeeds, we don’t have to use the fallback method. If it fails, however, then we fall into the CircuitBreaker algorithm, which is the subject of this article. In order to understand what goes on behind the scenes of this project, it is also important to understand what our WebClientInterface does. In this case, we’ll check the implementation that is used while running the containers.

@Component
@Profile("prod", "docker")
class WebClient(
    @Value("\${org.jesperancinha.management.reactive.host}")
    val reactiveHost: String,
    @Value("\${org.jesperancinha.management.mvc.host}")
    val mvcHost: String
) : WebClientInterface {

    private val webClientReactive: WebClient = create("http://$reactiveHost:8081")
    private val webClientMvc: WebClient = create("http://$mvcHost:8082")

    override fun getBookViaReactiveServiceById(id: Long): Mono<BookDto> = webClientReactive.get()
        .uri("/api/alm/reactive/books/$id").retrieve().bodyToMono()

    override fun getBookViaJpaServiceById(id: Long): Mono<BookDto> = webClientMvc.get()
        .uri("/api/alm/mvc/books/$id").retrieve().bodyToMono()

    override fun sendBookViaReactiveService(bookDto: BookDto): Mono<BookDto> = webClientReactive.post()
        .uri("/api/alm/reactive/books/create")
        .header(CONTENT_TYPE, APPLICATION_JSON_VALUE)
        .body(Mono.just(bookDto), BookDto::class.java)
        .retrieve().bodyToMono()

    override fun sendViaJpaServiceBook(bookDto: BookDto): Mono<BookDto> = webClientMvc.post()
        .uri("/api/alm/mvc/books/create")
        .header(CONTENT_TYPE, APPLICATION_JSON_VALUE)
        .body(Mono.just(bookDto), BookDto::class.java)
        .retrieve().bodyToMono()
}

As we have seen before, this is where we create our publishers, or Monos. They make calls to the reactive service or the non-reactive service depending on the interactions with the CircuitBreaker algorithm. Now it’s time to check what happens in the tests. The unit tests are quite extensive, so I will not show the full class in this article as I usually do. Only the important pieces of the code are shown for illustration purposes. The complete code can be checked on GitHub.
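Before moving on to the tests, here is a minimal, library-free sketch of the fallback contract described in this section: the fallback receives the same argument as the original call, plus the exception that caused it. The names Book and callWithFallback are hypothetical, and this is not how Resilience4j implements its annotation. It only illustrates the idea.

```kotlin
import java.util.concurrent.TimeoutException

// Hypothetical stand-in for the BookDto used in the article.
data class Book(val id: Long, val title: String)

// Calls `primary`; if it throws, hands the same argument and the exception to `fallback`.
fun <A, R> callWithFallback(arg: A, primary: (A) -> R, fallback: (A, Exception) -> R): R =
    try {
        primary(arg)
    } catch (e: Exception) {
        fallback(arg, e)
    }

fun main() {
    // Simulates the reactive service being unavailable.
    val reactive: (Long) -> Book = { throw TimeoutException("reactive service is down") }
    // Simulates the JPA service, which we assume is always available.
    val jpa: (Long, Exception) -> Book = { id, e ->
        println("Falling back because of ${e::class.simpleName}")
        Book(id, "From JPA")
    }
    println(callWithFallback(1L, reactive, jpa))
}
```

In the real service, the circuit breaker additionally decides whether the primary call is attempted at all, which is what the tests in the next sections explore.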

3.1. Gate 1 test — almr_testcase_1 — General properties

In this test, we check the registerHealthIndicator, slidingWindowSize, slidingWindowType, minimumNumberOfCalls, failureRateThreshold, automaticTransitionFromOpenToHalfOpenEnabled and waitDurationInOpenState:

registerHealthIndicator: true
slidingWindowSize: 10
slidingWindowType: "COUNT_BASED"
minimumNumberOfCalls: 5
failureRateThreshold: 50
slowCallRateThreshold: 50
automaticTransitionFromOpenToHalfOpenEnabled: true
waitDurationInOpenState: 1s
recordExceptions:
  - org.springframework.web.client.HttpServerErrorException
  - java.util.concurrent.TimeoutException
  - java.io.IOException
  - org.jesperancinha.management.gate.exception.ReactiveAccessException
ignoreExceptions:
  - org.jesperancinha.management.gate.exception.IgnoredException

The first step is to register the health indicator. This feature is essential for checking the status of the CircuitBreaker. For it to work, we need to expose the right management endpoints provided by Spring Boot Actuator:

management:
  endpoints.web.exposure.include: "*"
  endpoint.health.show-details: always
  health:
    circuitbreakers.enabled: true
    ratelimiters.enabled: true

This will make our actuator endpoint available at:

http://localhost:8080/api/almg/actuator/health

And this will give a result like this:

{
  "status": "UP",
  "components": {
    "circuitBreakers": {
      "status": "UP",
      "details": {
        "almr_testcase_3": {
          "status": "UP",
          "details": {
            "failureRate": "-1.0%",
            "failureRateThreshold": "50.0%",
            "slowCallRate": "-1.0%",
            "slowCallRateThreshold": "50.0%",
            "bufferedCalls": 3,
            "slowCalls": 0,
            "slowFailedCalls": 0,
            "failedCalls": 0,
            "notPermittedCalls": 0,
            "state": "CLOSED"
          }
        },
        (...)
      }
    },
    "diskSpace": {
      "status": "UP",
      "details": {
        "total": 62725623808,
        "free": 56340504576,
        "threshold": 10485760,
        "exists": true
      }
    },
    "ping": {
      "status": "UP"
    },
    "rateLimiters": {
      "status": "UNKNOWN"
    }
  }
}

This is just a shortened form of the circuit-breaker health status. Normally we see the statuses of all circuit breakers; here only the almr_testcase_3 CircuitBreaker is shown. Notice that a status of UP corresponds to the CLOSED state. To make our unit tests work, and this is valid for all our test cases, I created a domain model for the health status so that we can read the status property of each CircuitBreaker. It is located on GitHub.
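The relationship between the health status strings and the circuit-breaker states can be sketched as a small mapping. This is not the domain model from the repository, which is not reproduced here. The enum and function names are hypothetical, and the status strings are the ones asserted in the tests later in this article.

```kotlin
// Hypothetical enum mirroring the three circuit-breaker states discussed in this article.
enum class BreakerState { CLOSED, OPEN, HALF_OPEN }

// Maps the health status strings used in the tests to a circuit-breaker state.
fun toBreakerState(healthStatus: String): BreakerState? = when (healthStatus) {
    "UP" -> BreakerState.CLOSED
    "CIRCUIT_OPEN" -> BreakerState.OPEN
    "CIRCUIT_HALF_OPEN" -> BreakerState.HALF_OPEN
    else -> null // any other status, such as "UNKNOWN", is not handled by this sketch
}

fun main() {
    listOf("UP", "CIRCUIT_OPEN", "CIRCUIT_HALF_OPEN", "UNKNOWN").forEach { status ->
        println("$status -> ${toBreakerState(status)}")
    }
}
```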

We can now have a look at the remaining properties. The slidingWindowSize, slidingWindowType, minimumNumberOfCalls, automaticTransitionFromOpenToHalfOpenEnabled, failureRateThreshold, and waitDurationInOpenState are all essential properties we need to configure. The documentation provided by the Resilience4j team is very extensive and covers them really well. It is nonetheless important to quickly go through these properties.

Perhaps the first thing we need to look at is how the CircuitBreaker determines when to go to an open state. It does so once the rate of failed calls reaches a configured threshold. The logical question that follows is: how does it know when to go back from an open state to a closed state? The answer is, it doesn’t. It first needs to go to a half-open state. This happens after a timeout, which is determined by our first property: waitDurationInOpenState. Once it elapses, the state goes from open to half-open. There, the CircuitBreaker lets calls through to the original service again. From here it can either go back to a closed state or return to an open state. This is determined by our next property, failureRateThreshold. Because the failure rate is evaluated on every recorded call, the transition from closed to open can in some cases even be triggered by a successful request.

The failure rate is calculated using other configured parameters. The minimumNumberOfCalls is the minimum number of calls needed before this calculation is performed. The slidingWindowSize is expressed either in seconds or in a number of calls, depending on slidingWindowType. All calls that fall within this sliding window are considered when calculating the rate of failing requests. Up until now, we know how to determine only one type of failing request: calls that throw an exception we have not excluded with the ignoreExceptions property.

Finally, we see that the property automaticTransitionFromOpenToHalfOpenEnabled is true. This only means we do not need to wait for a request in order to make the switch from an open state to a half-open state. It is used for corner cases. Now it’s time to have a look at the first test. This is a simple test for the properties recordExceptions and ignoreExceptions:

@Test
fun testGetBookByIdTestWhenIgnoredExceptionThenNull() {
    every { webClient.getBookViaReactiveServiceById(100L) } returns Mono.error(IgnoredException())
    every { webClient.getBookViaJpaServiceById(100L) } returns Mono.just(BookDto(0L, "Solution"))
    val bookById = almG1BookService.getBookCBById(100L)
    repeat(10) {
        val bookById = almG1BookService.getBookCBById(100L)
        bookById.shouldNotBeNull()
        bookById.blockOptional().ifPresent { book ->
            book.title.shouldBe("Solution")
            getCBStatus().shouldBe("UP")
        }
    }
    getCBStatus().shouldBe("UP")
    bookById.shouldNotBeNull()
    bookById.blockOptional().ifPresent { book -> book.title.shouldBe("Solution") }
}

In this case, we see that the CircuitBreaker status remains UP, meaning the circuit stays closed, even though we threw an exception. Our IgnoredException has of course been ignored. We can now have a look at our first complete test. Note that all the following tests use this one as a reference:

every { webClient.getBookViaReactiveServiceById(100L) } returns Mono.error(ReactiveAccessException())
every { webClient.getBookViaJpaServiceById(100L) } returns Mono.just(BookDto(0L, "SolutionOpen"))
getCBStatus().shouldBe("UP")
repeat(4) {
    val bookById = almG1BookService.getBookCBById(100L)
    bookById.shouldNotBeNull()
    bookById.blockOptional().ifPresent { book ->
        book.title.shouldBe("SolutionOpen")
    }
}

In the first instance, we create 4 requests that fail against the reactive service. We get the predicted response from the non-reactive service: SolutionOpen.

getCBStatus().shouldBe("UP")
every { webClient.getBookViaReactiveServiceById(100L) } returns Mono.just(BookDto(0L, "SolutionClosed"))
runBlocking {
    val bookById = almG1BookService.getBookCBById(100L)
    bookById.shouldNotBeNull()
    bookById.blockOptional().ifPresent { book ->
        book.title.shouldBe("SolutionClosed")
    }
}

At this point, we make a request that succeeds via the reactive service. In case you haven’t noticed yet, we have just made 4 unsuccessful requests followed by a successful one. Our minimumNumberOfCalls is set to 5 and failureRateThreshold is set to 50%. With 5 calls recorded, the failure rate can now be calculated: 4 out of 5 calls failed, which is 80% and clearly above the threshold.
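The arithmetic behind this step can be sketched in a few lines. This is a simplified model with a hypothetical function name, not Resilience4j’s code. It returns null while the rate cannot be calculated yet, which corresponds to the -1.0% shown in the health output earlier.

```kotlin
// Simplified model: failure rate in percent over the recorded outcomes (true = failed),
// or null while fewer than `minimumNumberOfCalls` outcomes have been recorded.
fun failureRate(outcomes: List<Boolean>, minimumNumberOfCalls: Int): Double? {
    if (outcomes.size < minimumNumberOfCalls) return null
    val failures = outcomes.count { failed -> failed }
    return failures * 100.0 / outcomes.size
}

fun main() {
    val failed = true
    val ok = false
    // Only 4 calls recorded: below minimumNumberOfCalls, so no rate yet.
    println(failureRate(listOf(failed, failed, failed, failed), 5))     // null
    // The 5th call succeeds, but 4 out of 5 calls failed.
    println(failureRate(listOf(failed, failed, failed, failed, ok), 5)) // 80.0
}
```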

getCBStatus().shouldBe("CIRCUIT_OPEN")
repeat(3) {
    val bookById = almG1BookService.getBookCBById(100L)
    bookById.shouldNotBeNull()
    bookById.blockOptional().ifPresent { book ->
        book.title.shouldBe("SolutionOpen")
    }
}

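This is why the circuit is now open, and it stays open for the next 3 calls. They never reach the reactive service: the fallback answers them, which is why we get SolutionOpen. One might expect that sending more calls would eventually bring us back to a closed state.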

repeat(40) {
    val bookById = almG1BookService.getBookCBById(100L)
    bookById.shouldNotBeNull()
    bookById.blockOptional().ifPresent { book ->
        book.title.shouldBe("SolutionOpen")
    }
}

We do not get back to a closed state after 40 calls because they all happen in less than 1 second, which is what we have configured for waitDurationInOpenState.

sleep(1000)
getCBStatus().shouldBe("CIRCUIT_HALF_OPEN")
repeat(4) {
    val bookById = almG1BookService.getBookCBById(100L)
    bookById.shouldNotBeNull()
    bookById.blockOptional().ifPresent { book ->
        book.title.shouldBe("SolutionClosed")
    }
}

After waiting 1 second, we are now in a half-open state, and because the state has changed, the sliding window is reset. The 4 successful calls above are not enough to reach minimumNumberOfCalls, so we need one more request to get back to a closed state.

getCBStatus().shouldBe("CIRCUIT_HALF_OPEN")
runBlocking {
    val bookById = almG1BookService.getBookCBById(100L)
    bookById.shouldNotBeNull()
    bookById.blockOptional().ifPresent { book ->
        book.title.shouldBe("SolutionClosed")
    }
}
getCBStatus().shouldBe("UP")

This test shows how to use all of these simple properties. We are now going to dive into other less-known but equally important properties.
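To tie the steps of this test together, here is a deliberately simplified, count-based model of the state machine. It is not Resilience4j’s implementation, and the class and parameter names are hypothetical. Time is passed in explicitly to keep the example deterministic, and the model only leaves the open state when a call arrives (as in test case 2 below), not automatically.

```kotlin
enum class State { CLOSED, OPEN, HALF_OPEN }

// Toy circuit breaker using the values configured for test case 1 as defaults.
class ToyBreaker(
    private val slidingWindowSize: Int = 10,
    private val minimumNumberOfCalls: Int = 5,
    private val failureRateThreshold: Double = 50.0,
    private val waitDurationInOpenStateMs: Long = 1000,
) {
    var state = State.CLOSED
        private set
    private val outcomes = mutableListOf<Boolean>() // true = failed call
    private var openedAtMs = 0L

    // Returns the primary result, or the fallback result if the call fails or is not permitted.
    fun <T> call(nowMs: Long, primary: () -> T, fallback: () -> T): T {
        if (state == State.OPEN) {
            if (nowMs - openedAtMs < waitDurationInOpenStateMs) return fallback() // short-circuited
            transitionTo(State.HALF_OPEN, nowMs)
        }
        return try {
            primary().also { record(false, nowMs) }
        } catch (e: Exception) {
            record(true, nowMs)
            fallback()
        }
    }

    private fun record(failed: Boolean, nowMs: Long) {
        outcomes += failed
        if (outcomes.size > slidingWindowSize) outcomes.removeAt(0) // keep only the newest calls
        if (outcomes.size < minimumNumberOfCalls) return            // not enough data yet
        val rate = outcomes.count { it } * 100.0 / outcomes.size
        when {
            rate >= failureRateThreshold -> transitionTo(State.OPEN, nowMs)
            state == State.HALF_OPEN -> transitionTo(State.CLOSED, nowMs)
        }
    }

    private fun transitionTo(next: State, nowMs: Long) {
        state = next
        outcomes.clear() // a state change starts a fresh window
        if (next == State.OPEN) openedAtMs = nowMs
    }
}

fun main() {
    val breaker = ToyBreaker()
    val failing: () -> String = { throw IllegalStateException("reactive service down") }
    val working: () -> String = { "SolutionClosed" }
    val fallback: () -> String = { "SolutionOpen" }

    repeat(4) { breaker.call(0, failing, fallback) }    // 4 failures: below minimumNumberOfCalls
    println(breaker.state)                              // CLOSED
    breaker.call(0, working, fallback)                  // 5th call: rate is 80%, the circuit opens
    println(breaker.state)                              // OPEN
    println(breaker.call(500, working, fallback))       // SolutionOpen: still inside the 1s wait
    repeat(4) { breaker.call(1000, working, fallback) } // wait elapsed: probing in HALF_OPEN
    println(breaker.state)                              // HALF_OPEN
    breaker.call(1000, working, fallback)               // 5th successful probe closes the circuit
    println(breaker.state)                              // CLOSED
}
```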

Sliding Window Properties

3.2. Gate 2 test — almr_testcase_2 — automaticTransitionFromOpenToHalfOpenEnabled

By comparing this test to test one, we just want to test the effect of the automaticTransitionFromOpenToHalfOpenEnabled property:

registerHealthIndicator: true
slidingWindowSize: 10
slidingWindowType: "COUNT_BASED"
minimumNumberOfCalls: 5
failureRateThreshold: 50
slowCallRateThreshold: 50
automaticTransitionFromOpenToHalfOpenEnabled: false
waitDurationInOpenState: 1s
recordExceptions:
  - org.springframework.web.client.HttpServerErrorException
  - java.util.concurrent.TimeoutException
  - java.io.IOException
  - org.jesperancinha.management.gate.exception.ReactiveAccessException
ignoreExceptions:
  - org.jesperancinha.management.gate.exception.IgnoredException

This test case is exactly the same as case 1, except that we do not automatically move from open to half-open.

repeat(40) {
    val bookById = almG2BookService.getBookCBById(100L)
    bookById.shouldNotBeNull()
    bookById.blockOptional().ifPresent { book ->
        book.title.shouldBe("SolutionOpen")
    }
}
sleep(1000)
getCBStatus().shouldBe("CIRCUIT_OPEN")
repeat(4) {
    val bookById = almG2BookService.getBookCBById(100L)
    bookById.shouldNotBeNull()
    bookById.blockOptional().ifPresent { book ->
        book.title.shouldBe("SolutionClosed")
    }
}

After we send the 40 calls in less than one second, we wait one second as in the previous case. However, this time the circuit is still in an open state: without the automatic transition, the state only changes once new calls arrive.

getCBStatus().shouldBe("CIRCUIT_HALF_OPEN")
runBlocking {
    val bookById = almG2BookService.getBookCBById(100L)
    bookById.shouldNotBeNull()
    bookById.blockOptional().ifPresent { book ->
        book.title.shouldBe("SolutionClosed")
    }
}

Following one or more requests, it finally jumps to a half-open state.
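The difference between test cases 1 and 2 can be condensed into one boolean expression. The following is a simplified model with hypothetical names, not the library’s implementation. It describes the state we would observe after the circuit has been open for a given time.

```kotlin
enum class Phase { OPEN, HALF_OPEN }

// Simplified model of the state observed some time after the circuit opened.
// With the automatic transition, elapsed time alone is enough to reach HALF_OPEN.
// Without it, the state is only re-evaluated when a call arrives.
fun observedPhase(elapsedMs: Long, waitMs: Long, automatic: Boolean, callArrived: Boolean): Phase =
    if (elapsedMs >= waitMs && (automatic || callArrived)) Phase.HALF_OPEN else Phase.OPEN

fun main() {
    println(observedPhase(1000, 1000, automatic = true, callArrived = false))  // HALF_OPEN (test case 1)
    println(observedPhase(1000, 1000, automatic = false, callArrived = false)) // OPEN (test case 2)
    println(observedPhase(1000, 1000, automatic = false, callArrived = true))  // HALF_OPEN
}
```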

3.3. Gate 3 test — almr_testcase_3 — Slow calls

In this test, we check properties slowCallDurationThreshold and slowCallRateThreshold:

registerHealthIndicator: true
slowCallRateThreshold: 50f
slowCallDurationThreshold: 100
slidingWindowSize: 10
slidingWindowType: "COUNT_BASED"
minimumNumberOfCalls: 5
failureRateThreshold: 50
automaticTransitionFromOpenToHalfOpenEnabled: true
waitDurationInOpenState: 1s
recordExceptions:
  - org.springframework.web.client.HttpServerErrorException
  - java.util.concurrent.TimeoutException
  - java.io.IOException
  - org.jesperancinha.management.gate.exception.ReactiveAccessException
ignoreExceptions:
  - org.jesperancinha.management.gate.exception.IgnoredException

The other type of error we can count is based on slow calls. This is managed through slowCallRateThreshold and slowCallDurationThreshold, which are measured as a percentage and in milliseconds, respectively.

every { webClient.getBookViaReactiveServiceById(100L) } returns Mono.just(BookDto(0L, "SolutionSlow"))
    .delayElement(
        Duration.ofMillis(200)
    )
every { webClient.getBookViaJpaServiceById(100L) } returns Mono.just(BookDto(0L, "SolutionOpen"))
getCBStatus().shouldBe("UP")
repeat(4) {
    val bookById = almG3BookService.getBookCBById(100L)
    bookById.shouldNotBeNull()
    bookById.blockOptional().ifPresent { book ->
        book.title.shouldBe("SolutionSlow")
    }
}

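Other than this, this test is no different from test case 1. The only difference is that this one is based on slow calls rather than on exceptions. A slow call still returns its result (SolutionSlow), but it counts towards the slow call rate, and once that rate reaches the threshold, the circuit opens just as it does for the recorded exceptions.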
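As a minimal sketch of the calculation, assuming that a call counts as slow when it takes longer than slowCallDurationThreshold, the slow call rate for this test could be modelled as follows. The function name is hypothetical and the model ignores the sliding window.

```kotlin
import java.time.Duration

// Simplified model: percentage of calls whose duration exceeds the threshold.
fun slowCallRate(durations: List<Duration>, slowCallDurationThreshold: Duration): Double =
    if (durations.isEmpty()) 0.0
    else durations.count { it > slowCallDurationThreshold } * 100.0 / durations.size

fun main() {
    val threshold = Duration.ofMillis(100)         // slowCallDurationThreshold: 100
    val calls = List(5) { Duration.ofMillis(200) } // every call is delayed by 200 ms, as in the test
    val rate = slowCallRate(calls, threshold)
    println(rate)         // 100.0
    println(rate >= 50.0) // true: slowCallRateThreshold is reached
}
```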

3.4. Gate 4 test — almr_testcase_4 — permittedNumberOfCallsInHalfOpenState

In comparison with the Gate 1 test, we now look at the permittedNumberOfCallsInHalfOpenState property:

registerHealthIndicator: true
slidingWindowSize: 10
slidingWindowType: "COUNT_BASED"
minimumNumberOfCalls: 5
failureRateThreshold: 50
slowCallRateThreshold: 50
automaticTransitionFromOpenToHalfOpenEnabled: true
waitDurationInOpenState: 1s
permittedNumberOfCallsInHalfOpenState: 2
recordExceptions:
  - org.springframework.web.client.HttpServerErrorException
  - java.util.concurrent.TimeoutException
  - java.io.IOException
  - org.jesperancinha.management.gate.exception.ReactiveAccessException
ignoreExceptions:
  - org.jesperancinha.management.gate.exception.IgnoredException

The property permittedNumberOfCallsInHalfOpenState sets how many calls are let through while the circuit is half-open. The half-open state lasts, at most, for that number of calls.

repeat(40) {
    val bookById = almG4BookService.getBookCBById(100L)
    bookById.shouldNotBeNull()
    bookById.blockOptional().ifPresent { book ->
        book.title.shouldBe("SolutionOpen")
    }
}
sleep(1000)
getCBStatus().shouldBe("CIRCUIT_HALF_OPEN")
repeat(4) {
    val bookById = almG4BookService.getBookCBById(100L)
    bookById.shouldNotBeNull()
    bookById.blockOptional().ifPresent { book ->
        book.title.shouldBe("SolutionClosed")
    }
}
getCBStatus().shouldBe("UP")

In our case, we have two calls configured. Since both succeed, the circuit goes directly to a closed state after those 2 calls, without waiting for the 5 calls of minimumNumberOfCalls that test case 1 needed. This property should be used carefully because it can potentially bypass certain calculations that we may be counting on.
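A simplified model of this decision, with hypothetical names and without any claim to match the library’s internals, could look like this: once the permitted number of probe calls has completed, their failure rate decides where the circuit goes next.

```kotlin
enum class Decision { STAY_HALF_OPEN, CLOSE, OPEN }

// Simplified model: `probeFailed` holds the outcomes of the calls made while half-open
// (true = failed). Nothing is decided until `permitted` calls have completed.
fun decideInHalfOpen(probeFailed: List<Boolean>, permitted: Int, failureRateThreshold: Double): Decision {
    if (probeFailed.size < permitted) return Decision.STAY_HALF_OPEN
    val evaluated = probeFailed.take(permitted)
    val rate = evaluated.count { it } * 100.0 / permitted
    return if (rate >= failureRateThreshold) Decision.OPEN else Decision.CLOSE
}

fun main() {
    println(decideInHalfOpen(listOf(false), 2, 50.0))        // STAY_HALF_OPEN: only 1 of 2 probes done
    println(decideInHalfOpen(listOf(false, false), 2, 50.0)) // CLOSE: both probes succeeded
    println(decideInHalfOpen(listOf(true, false), 2, 50.0))  // OPEN: 50% failures reaches the threshold
}
```

With only 2 probe calls, a single failure is already enough to reopen the circuit at a 50% threshold, which is one reason to choose this value with care.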

3.5. Gate 5 test — almr_testcase_5 — TIME_BASED

Tests the TIME_BASED value of the slidingWindowType:

registerHealthIndicator: true
slidingWindowSize: 1
slidingWindowType: "TIME_BASED"
minimumNumberOfCalls: 5
failureRateThreshold: 50
slowCallRateThreshold: 50
automaticTransitionFromOpenToHalfOpenEnabled: true
waitDurationInOpenState: 1s
recordExceptions:
  - org.springframework.web.client.HttpServerErrorException
  - java.util.concurrent.TimeoutException
  - java.io.IOException
  - org.jesperancinha.management.gate.exception.ReactiveAccessException
ignoreExceptions:
  - org.jesperancinha.management.gate.exception.IgnoredException

If the slidingWindowType is TIME_BASED, we cannot easily predict what is going to happen. The window now covers the calls made during the last slidingWindowSize seconds (1 second here) instead of a fixed number of calls. Time is a variable that is not easy to control in a test, and so the unit test for case 5 allows several variations in the expected results.

getCBStatus() shouldBeIn listOf("UP", "CIRCUIT_OPEN")
repeat(40) {
    val bookById = almG5BookService.getBookCBById(100L)
    bookById.shouldNotBeNull()
    bookById.blockOptional().ifPresent { book ->
        book.title shouldBeIn listOf("SolutionClosed", "SolutionOpen")
    }
}
sleep(1000)
getCBStatus() shouldBeIn listOf("UP", "CIRCUIT_OPEN", "CIRCUIT_HALF_OPEN")
repeat(4) {
    val bookById = almG5BookService.getBookCBById(100L)
    bookById.shouldNotBeNull()
    bookById.blockOptional().ifPresent { book ->
        book.title.shouldBe("SolutionClosed")
    }
}
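Why the outcome depends on timing can be illustrated with a simplified model of a time-based window. The names are hypothetical, timestamps are passed in explicitly, and the real library may aggregate calls differently. The point is only that the same recorded calls can yield a different result depending on when we look.

```kotlin
// One recorded call: when it happened and whether it failed.
data class Outcome(val atMs: Long, val failed: Boolean)

// Simplified model of a time-based window: only outcomes recorded within the last
// `windowMs` milliseconds count towards the failure rate. Returns null while fewer
// than `minimumNumberOfCalls` outcomes are inside the window.
fun failureRateInWindow(outcomes: List<Outcome>, nowMs: Long, windowMs: Long, minimumNumberOfCalls: Int): Double? {
    val recent = outcomes.filter { nowMs - it.atMs < windowMs }
    if (recent.size < minimumNumberOfCalls) return null
    return recent.count { it.failed } * 100.0 / recent.size
}

fun main() {
    // Five failures recorded at 0, 100, 200, 300 and 400 ms.
    val outcomes = List(5) { Outcome(atMs = it * 100L, failed = true) }
    // At 500 ms all five calls are inside the 1s window.
    println(failureRateInWindow(outcomes, nowMs = 500, windowMs = 1000, minimumNumberOfCalls = 5))  // 100.0
    // At 1200 ms only two calls are left in the window, so no rate can be calculated.
    println(failureRateInWindow(outcomes, nowMs = 1200, windowMs = 1000, minimumNumberOfCalls = 5)) // null
}
```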

4. Running the example

To make things easy, I’ve created some scripts to run the whole application. They build everything in one go, and then all you have to do is start Locust.

Start docker-compose

make docker-clean-build-start

Start locust

make locust-start

Go to the locust main page on http://0.0.0.0:8089/

Locust Main Screen

Make a choice. For me, it works better to have 10 users and a spawn rate of 10 users per second.
You can optionally create a book or several. Check the repo for more examples.

curl -X POST -H "Content-Type: application/json" --data '{"id": 1,
    "title": "Wild",
    "authors": [
      "Cheryl Strayed"
    ],
    "year": 2012,
    "publisher": "Uitgeverij Rainbow B.V."
  }' http://localhost:8080/api/almg/books/g1

5. Locust Graphics

Locust is quite an easy tool to learn, especially for this example. Please look into the Locust documentation for the installation notes. I’ve performed some tests and the results are available on GitHub.

Graphic 1 - Circuit Breakers

Our default examples run with a waitDurationInOpenState of 1s. This doesn’t provide a great result, purely because 1 second in the open state is very difficult to visualize. The interval where we don’t get any requests is the time during which I took down the reactive service followed by the MVC-JPA service. I’ve created some scripts to stop these services. If you want to stop the reactive service, you can use make docker-stop-reactive. If you want to stop the MVC service, you can run make docker-stop-mvc. These results are the reason why I’ve increased that value to 10s in the docker Spring profile.

Graphic 2 - Circuit Breakers

This graph is broken into three very visible sections. The left section is where I started the docker images and stopped the reactive service. As you can see, at the start of the test there are intervals of roughly 10 seconds with a higher rate, separated by short intervals with a low rate. All 5 circuit breakers are still quite synchronized with each other, and they all remain in the open state for 10 seconds before going to a half-open state, where they try to reach a successful rate. That never happens, and instead they reach the timeout. This contributes to the rate of responses in the following cycle. Because all of these circuit breakers have slightly different rules, their individual contributions to the overall rate of successful requests get mixed and turn quite homogeneous between 11:10.15 and 11:11.07.

This is where I re-activated the reactive service. This causes the CircuitBreaker to go back to a closed state. The rate of successful messages goes up for two reasons: we are now getting responses through the reactive service, and the circuit is closed. If the circuit is closed, we do not have to pay any cost for testing whether the service is back.

At some point around 11:12.37, I stop the reactive service again. The overwhelming number of messages sent induces a behaviour in the gate where it appears to become unresponsive for about 20 seconds. At timestamp 11:13.02, the rate goes up and down with the same pattern as in the beginning. The CircuitBreaker is again switching from open to half-open and from half-open to open. As we have seen before, these are costly operations that are needed in order to get the gate working as efficiently as possible once the reactive service is back online.

Type	Name	Request Count	Failure Count	Median Response Time	Average Response Time	Min Response Time	Max Response Time	Average Content Size	Requests/s	Failures/s	50%	66%	75%	80%	90%	95%	98%	99%	99.9%	99.99%	100%
GET	/api/almg/books/g1/1	8109	0	23	58.675751992970994	6.021648000000823	18190.172146000008	54.0	28.879575696069523	0.0	23	27	31	34	42	53	77	2200	3100	18000	18000
GET	/api/almg/books/g2/1	8168	0	23	68.61528426885404	6.179668000015681	18268.55734	54.0	29.089699628252045	0.0	23	28	32	35	45	58	91	2200	3100	18000	18000
GET	/api/almg/books/g3/1	8172	0	22	83.3162769492164	5.762567000033414	29997.531783999988	54.0	29.103945318569508	0.0	22	27	30	33	43	55	86	2200	3100	30000	30000
GET	/api/almg/books/g4/1	8028	0	23	41.57623488938724	6.1160610000001725	18286.493133999982	54.0	28.59110046714097	0.0	23	27	31	33	42	52	68	80	3100	18000	18000
GET	/api/almg/books/g5/1	8159	0	22	87.60661377055992	6.197885999995378	29991.914000999997	54.0	29.057646825037764	0.0	22	27	30	33	43	55	82	2300	18000	30000	30000
Aggregated		40636	0	23	68.0595606234371	5.762567000033414	29997.531783999988	54.0	144.7219679350698	0.0	23	27	31	34	43	54	77	2200	3100	30000	30000

This table is probably the most surprising of our results. As we can see, although we’ve seen enormous variations in the rate of successful requests to the gate, NONE of them failed.

Graphic 3 - Circuit Breakers

And as a control test, we do see that requests do fail if the gate is running, but both reactive and MVC services are down.

6. Conclusion

What we’ve seen in this article are the multiple ways in which a circuit breaker can work in our favor using different configurations. What this example has in common with real-life scenarios is a short-circuit to other services and data sources. A circuit breaker can give you the idea that it works pretty much like a load balancer. However, the differences are quite extensive even though there are some similarities. Both of these solutions can provide an alternative path to data sources that we know provide the same information. This is where the similarities end.

In the case of a circuit breaker, it really depends on the requirements, but the alternative data source does not necessarily have to provide the same data as the original source. It is just there to be activated when things go wrong in the main source. The main source can, for example, be a cluster of services. If somehow a connection goes wrong, it is the job of the circuit breaker to detect that and redirect all traffic to the new source. This is when we say that our circuit breaker goes to the open state.

As we have seen, we also have a detection algorithm to go back to the main source once it is available again. That can, in turn, depend on the frequency of errors caused by slow requests or just by errors on the back end. We can also define the errors that count and the ones we can ignore. We can also configure it to simply work as a timeout. Using timeouts, we can force the circuit breaker to try to connect to the original source. This of course has a cost, but it can also make it faster to go back to a closed state from a half-open state. The way our circuit breaker detects that can range from very simple to extremely sophisticated. In that case, we need to use tools like Locust, Gatling, JMeter or others to study precisely how that configuration works best in our favor.


Thank you!

I hope you enjoyed this article as much as I did making it!
Please leave a review, comments, or any other feedback on any of the socials in the links below.
I’m very grateful if you want to help me make this article better.
I have placed all the source code of this application on GitHub.
Thank you for reading!