Docker and Test Container Waiting Forever Fix

Fixing an anomaly that does not always occur with port checks vs. health checks

For many developers, testing is treated as something less relevant, an area that most people would rather not think about. There are always many reasons offered to support this, and hardly any of them is a good one. When we think about tests, whether unit tests, regression tests, E2E tests, integration tests or "jupiter testing" (as I once heard someone say), we are trying our best to keep our application reliable through current and future changes. More often than not, the business side of our company has a lot of trouble understanding why investing in automated testing saves money in the long run. You only need one team member to break the balance and convince business that "testing is good, but a waste of time and money". In most cases business does not see the automated tests, and for them it is quite an abstract concept.

Also, writing tests allows you to shine for your IT colleagues (if they are good), but not really for business. If your intention is purely to show how amazing you are, then of course you may just want to present functionality to business. If you are that good and make no mistakes, then why not? You'll quickly show them how much work you can do in a short amount of time, you'll be seen as a genius in your company, and your company will pay you more money in the long run. But are you that perfect, immaculate developer who makes no mistakes?

All of these dynamics play out in many teams, and sometimes we have no choice but to go along with them. As engineers, we like to be humble and admit that we cannot foresee all problems in the long run. This is why we invest in automated testing. Sometimes the argument of "wasting time" and "losing money" is actually valid, but only when decisions take too long and discussions drag on.
The case I'm about to present happened in an environment (I cannot say when or where, and I'm not using the exact setting we had) built with a combination of docker-compose and Testcontainers. Testcontainers was allowed to start the containers in the pipeline, but locally it had been decided to start docker-compose manually and only afterwards start the integration tests, with the Testcontainers code commented out... Yikes! 😱 Let me just start by saying that if you run into this issue as I did, you should feel no qualms about saying that this is wrong. Commenting code out in order to be able to run tests is very error-prone, involves manual steps and costs a lot of money in the long run.

Here is the list of arguments presented for why this was done:

  • The processes take too long to start. It is a very heavy project

  • The containers themselves need to take a lot of resources and Testcontainers is doing something crazy

  • The reason why it takes so long is because of the bean startup.

  • I don't understand why Testcontainers do that, so it's much better to start them locally. It costs no time and you just need to know how to do it. Once you learn it, you just do it out of instinct.

Let me break down all of these arguments right off the bat.


The processes take too long to start. It is a very heavy project

Usually, if you have this issue, you will run into it whether you start the containers manually or through Testcontainers. The only benefit you get from using already running containers is that you can re-run your tests multiple times without restarting them. There are severe drawbacks to doing this. If you leave your containers running locally, you have to remember to stop them when you switch to a different project, and this is also error-prone. You are also bound to do something in the code to manage this, or you'll have to keep commenting code out. If you let the framework detect which containers are running, you make the integration tests depend on whatever happens to be running outside them. But let me ask you: why are you running the integration tests so many times on your machine anyway? The need to run tests multiple times locally is usually associated with the need to implement some complex logic. The step to an integration test shouldn't be that sophisticated, and if it is, you may need to rethink what your project is actually doing, or perhaps run a different kind of test instead, like E2E tests and/or system tests.


The containers themselves need to take a lot of resources and Testcontainers is doing something crazy

In my experience, words like "magic" and "crazy" have more to do with not understanding the framework well enough, or simply not wanting to be bothered to open an issue with the maintainers. On the other hand, the resource consumption may come from the startup of the containers or, in the case of enterprise frameworks for example, from the actual initialisation of the beans. If they are being initialised eagerly, then you may have an issue in the way you start up the beans. Maybe you need to reduce a connection retry interval, just to give an example. It could well be that this alone is an issue you would also face in production.


The reason why it takes so long is because of the bean startup

Yes! It could be! So that is probably, as I mentioned above, an issue you have to tackle and improve. It may also be related to the way you start your containers and how you define when the tests actually start. We'll see below what this actually means and how you can control this with Testcontainers.


I don't understand why Testcontainers do that, so it's much better to start them locally. It costs no time, and you just need to know how to do it. Once you learn it, you just do it out of instinct

Every new colleague you get will probably run into this issue. As engineers, we like to automate things that are necessary, and we always strive to turn every unnecessary manual step into an automated one. A new colleague will learn how to do this just as easily as they will forget it after a while of not doing it. And it does cost time. You will always have to manage commenting code out, making mistaken commits with wrongly commented code, rolling back those same commits, and finally maintaining whatever you built to automate the detection of running containers.


Given everything written above, I want to share in this big IT bite two ways to configure the Testcontainers framework initialization strategy and show the pros and cons of both. We'll then dive a bit into the Testcontainers code itself, some bugs that have appeared, and what you can do about them. This way I can hopefully arm you with the knowledge to stay away from the "crazy" and "magic" argument.

In my project located at https://github.com/jesperancinha/jeorg-spring-master-5-test-drives, you'll find several modules related to the Spring Framework. My case is also related to the Spring Framework, which we'll use to illustrate. You can carry this over to Micronaut, Jakarta or any other enterprise framework you wish to test. The module name is docker-boxing. I chose this term because in the Netherlands people use "boxing" for the sport, but also in the context of an argument.

Inside docker-boxing, you'll find two submodules: docker-boxing-health and docker-boxing-port. Both have an extremely simple implementation of a Spring web service. They differ only in that one has a REST endpoint that works as a health endpoint and the other doesn't. I'm deliberately not using the Spring Actuator because this IT bite is a bite and not a tutorial about the Actuator.

In docker-boxing-health we'll find this class:

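import org.slf4j.Logger
import org.slf4j.LoggerFactory
import org.springframework.boot.ApplicationArguments
import org.springframework.boot.ApplicationRunner
import org.springframework.boot.autoconfigure.SpringBootApplication
import org.springframework.boot.runApplication
import org.springframework.web.bind.annotation.RequestMapping
import org.springframework.web.bind.annotation.RestController
import java.time.LocalDateTime
import java.time.temporal.ChronoUnit
import java.util.concurrent.TimeUnit

@SpringBootApplication
@RestController
class BoxingNewRunner : ApplicationRunner {
    companion object {
        @JvmStatic
        fun main(args: Array<String>) {
            logger.info("Starting server -> ${LocalDateTime.now()}")
            // Artificial 2-second delay before Spring Boot starts
            Thread.sleep(TimeUnit.SECONDS.toMillis(2))
            runApplication<BoxingNewRunner>(*args)
        }

        val logger: Logger = LoggerFactory.getLogger(BoxingNewRunner::class.java)
        // Captured at class initialization, right before the body of main runs
        val startup = LocalDateTime.now()
    }

    // Called by Spring Boot once the application context is up
    override fun run(args: ApplicationArguments?) {
        logger.info("Service started -> ${LocalDateTime.now()}")
        logger.info("Time Elapsed -> ${ChronoUnit.MILLIS.between(startup, LocalDateTime.now())} ms")
    }

    // Minimal health endpoint, queried by the Docker health check
    @RequestMapping
    fun health() = "OK"
}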

It starts the service and logs the time elapsed since the actual call of the main method up until the startup of the Spring container. The other module, docker-boxing-port, doesn't have the OK endpoint.

The Testcontainers framework supports several strategies for deciding when a container has started. In this bite, we'll only talk about waiting for a health check and waiting for a port to be opened.
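
As a rough sketch of these two options, the snippet below picks between a port-based and a health-check-based wait strategy and caps how long Testcontainers will wait. The `Readiness` enum, the `waitStrategyFor` helper and the 60-second default are illustrative assumptions and not part of the docker-boxing modules; `Wait.forListeningPort()`, `Wait.forHealthcheck()` and `withStartupTimeout` are Testcontainers calls.

```kotlin
import org.testcontainers.containers.wait.strategy.Wait
import org.testcontainers.containers.wait.strategy.WaitStrategy
import java.time.Duration

// Hypothetical helper type: the two readiness signals covered in this bite.
enum class Readiness { PORT, HEALTHCHECK }

// Hypothetical helper: maps a readiness signal to a Testcontainers wait strategy
// and bounds the wait, so a container that never becomes ready fails the test
// instead of blocking it.
fun waitStrategyFor(
    readiness: Readiness,
    timeout: Duration = Duration.ofSeconds(60) // assumed value, tune per project
): WaitStrategy {
    val strategy: WaitStrategy = when (readiness) {
        Readiness.PORT -> Wait.forListeningPort()      // ready when the port accepts connections
        Readiness.HEALTHCHECK -> Wait.forHealthcheck() // ready when Docker reports "healthy"
    }
    return strategy.withStartupTimeout(timeout)
}
```

The result could then be passed as the third argument of `withExposedService`, for example `withExposedService("adopt-health-1_1", 8080, waitStrategyFor(Readiness.HEALTHCHECK))`.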

Port Check Based Test Containers

For the port check we have this docker-compose.yml file:

version: '3.1'

services:
    adopt-old-1:
        build: .
        entrypoint: java -jar ~/docker-boxing-old.jar && tail -f /dev/null

    adopt-old-2:
        build: .
        entrypoint: java -jar ~/docker-boxing-old.jar && tail -f /dev/null

It starts two Docker containers isolated from each other, each with its service on port 8080. The ports are isolated too, because they are not published to the host. Testcontainers will use internal commands in both containers to check when they are ready for the integration tests. Since we have no health checks in this setup, we can only resort to port checking:

@SpringBootTest
@ContextConfiguration(initializers = [BoxerInitializer::class])
internal class BoxingOldRunnerTest {

    @BeforeEach
    fun setUp() {
    }

    @Test
    fun `should start test Containers`() {

    }

    private class DockerCompose(files: List<File>) : DockerComposeContainer<DockerCompose>(files)

    class BoxerInitializer : ApplicationContextInitializer<ConfigurableApplicationContext> {
        private val dockerCompose by lazy {
            DockerCompose(listOf(File("docker-compose.yml")))
                .withExposedService("adopt-port-1_1", 8080, Wait.defaultWaitStrategy())
                .withExposedService("adopt-port-2_1", 8080, Wait.defaultWaitStrategy())
                .withLocalCompose(true)
                .also { it.start() }
        }

        override fun initialize(applicationContext: ConfigurableApplicationContext) {
            logger.info("Starting IT -> ${LocalDateTime.now()}")
            logger.info("Starting service 1 at ${dockerCompose.getServiceHost("adopt-port-1_1", 8080)}")
            logger.info("Starting service 2 at ${dockerCompose.getServiceHost("adopt-port-2_1", 8080)}")
            logger.info("End IT -> ${LocalDateTime.now()}")
            logger.info("Time Elapsed IT -> ${ChronoUnit.MILLIS.between(startup, LocalDateTime.now())} ms")

        }

        companion object {
            val logger: Logger = LoggerFactory.getLogger(BoxingOldRunnerTest::class.java)
            val startup = LocalDateTime.now()
        }
    }
}

In this case, defaultWaitStrategy will, as its name suggests, fall back to the default way of detecting whether the containers have started, which is to check whether port 8080 has been opened. The containers are started via docker-compose lazily, so that we can measure the elapsed time via the logs. Note that in this particular test I'm using a Spring initializer instead of an abstract test class as I normally do. It doesn't matter; it's just a different style. If you are interested, please check the code of another integration test made with Testcontainers at: https://github.com/jesperancinha/vma-archiver/blob/master/vma-service-backend/src/test/kotlin/org/jesperancinha/vma/utils/AbstractVmaTest.kt. If we run these integration tests, we'll see something like this:

[Figure: Docker Boxing]

We see that it took about 12 seconds for the containers to start. This happens because we need to take into account the time the containers took to start plus the time our two applications took to start. If you didn't notice, I programmed both services to wait 2 seconds before the Spring Boot process actually starts. This means that there is at least a 4-second delay before anything can start. Then we need to consider the amount of time the services take to start. If you run this on your machine and you have some experience, you may notice that the Spring Boot processes appear to have started somewhat too quickly. This is because the jars in the containers had not fully started at all. Although port 8080 was available, the applications themselves were stopped in their tracks, because we are only testing for port listening.
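
The gap between "the port is open" and "the application answers" can be reproduced without Docker. The sketch below is an illustration using only the JDK, not the Testcontainers implementation: `isPortOpen`, `isHealthy` and the two `probe...` functions are hypothetical names, and the 500 ms timeouts are assumed values.

```kotlin
import com.sun.net.httpserver.HttpServer
import java.net.HttpURLConnection
import java.net.InetAddress
import java.net.InetSocketAddress
import java.net.ServerSocket
import java.net.Socket
import java.net.URI

private const val HOST = "127.0.0.1"

// Port check: true as soon as something accepts TCP connections on the port.
fun isPortOpen(port: Int, timeoutMs: Int = 500): Boolean =
    try {
        Socket().use { it.connect(InetSocketAddress(HOST, port), timeoutMs) }
        true
    } catch (e: Exception) {
        false
    }

// Health check: true only when an HTTP GET on "/" answers with status 200.
fun isHealthy(port: Int, timeoutMs: Int = 500): Boolean =
    try {
        val connection = URI("http://$HOST:$port/").toURL().openConnection() as HttpURLConnection
        connection.connectTimeout = timeoutMs
        connection.readTimeout = timeoutMs
        try {
            connection.responseCode == 200
        } finally {
            connection.disconnect()
        }
    } catch (e: Exception) {
        false
    }

// A socket that is bound and listening but never answers: it stands in for a
// service whose port is already open while the application is still booting.
fun probeListeningButSilent(): Pair<Boolean, Boolean> =
    ServerSocket(0, 50, InetAddress.getByName(HOST)).use { silent ->
        isPortOpen(silent.localPort) to isHealthy(silent.localPort)
    }

// A tiny HTTP server answering "OK": it stands in for the fully started service.
fun probeStartedService(): Pair<Boolean, Boolean> {
    val server = HttpServer.create(InetSocketAddress(HOST, 0), 0)
    server.createContext("/") { exchange ->
        val body = "OK".toByteArray()
        exchange.sendResponseHeaders(200, body.size.toLong())
        exchange.responseBody.use { it.write(body) }
    }
    server.start()
    return try {
        isPortOpen(server.address.port) to isHealthy(server.address.port)
    } finally {
        server.stop(0)
    }
}
```

Calling `probeListeningButSilent()` should give `(true, false)`: the port check passes while the health check does not. `probeStartedService()` should give `(true, true)`.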

Health Check Based Test Containers

For this case, the setup is a little different. We implement the health checks:

version: '3.1'

services:

    adopt-health-1:
        build: .
        entrypoint: java -jar ~/docker-boxing-health.jar && tail -f /dev/null
        healthcheck:
            test: curl --fail http://localhost:8080 || exit 1
            interval: 1s
            retries: 40
            start_period: 0s
            timeout: 10s

    adopt-health-2:
        build: .
        entrypoint: java -jar ~/docker-boxing-health.jar && tail -f /dev/null
        healthcheck:
            test: curl --fail http://localhost:8080 || exit 1
            interval: 1s
            retries: 40
            start_period: 0s
            timeout: 10s
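
To get a feeling for what these numbers mean, the sketch below estimates how long the application has to start answering before Docker reports the container as unhealthy. It assumes that probes run one after another, that Docker waits `interval` between probes, that a probe counts as failed once it exceeds `timeout`, and that `retries` consecutive failures mark the container unhealthy. The `Healthcheck` class and both function names are hypothetical.

```kotlin
import java.time.Duration

// Mirrors the healthcheck keys used in the docker-compose.yml above.
data class Healthcheck(
    val interval: Duration,
    val timeout: Duration,
    val retries: Int,
    val startPeriod: Duration
)

// Lower bound: every probe fails immediately (for example "connection refused"),
// so only the start period and the intervals add up.
fun fastestUnhealthy(hc: Healthcheck): Duration =
    hc.startPeriod.plus(hc.interval.multipliedBy(hc.retries.toLong()))

// Upper bound: every probe hangs until its timeout before it counts as a failure.
fun slowestUnhealthy(hc: Healthcheck): Duration =
    hc.startPeriod.plus(hc.interval.plus(hc.timeout).multipliedBy(hc.retries.toLong()))
```

With the values above (interval 1s, timeout 10s, retries 40, start_period 0s), this gives a window of roughly 40 to 440 seconds, under the stated assumptions.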

With this setup, our tests are only allowed to run once the health checks confirm that the services are up:

@SpringBootTest
@ContextConfiguration(initializers = [BoxingHealthRunnerTest.BoxerInitializer::class])
internal class BoxingHealthRunnerTest {

    @BeforeEach
    fun setUp() {
    }

    @Test
    fun `should start test Containers`() {
    }

    private class DockerCompose(files: List<File>) : DockerComposeContainer<DockerCompose>(files)

    class BoxerInitializer : ApplicationContextInitializer<ConfigurableApplicationContext> {
        private val dockerCompose by lazy {
            DockerCompose(listOf(File("docker-compose.yml")))
                .withExposedService("adopt-health-2_1", 8080, forHealthcheck())
                .withExposedService("adopt-health-1_1", 8080, forHealthcheck())
                .withLocalCompose(true)
                .also { it.start() }
        }

        override fun initialize(applicationContext: ConfigurableApplicationContext) {
            logger.info("Starting IT -> ${LocalDateTime.now()}")
            logger.info("Starting service 1 at ${dockerCompose.getServiceHost("adopt-health-1_1", 8080)}")
            logger.info("Starting service 2 at ${dockerCompose.getServiceHost("adopt-health-2_1", 8080)}")
            logger.info("End IT -> ${LocalDateTime.now()}")
            logger.info("Time Elapsed IT -> ${ChronoUnit.MILLIS.between(startup, LocalDateTime.now())} ms")
        }

        companion object {
            val logger: Logger = LoggerFactory.getLogger(BoxingHealthRunnerTest::class.java)
            val startup = LocalDateTime.now()
        }
    }
}

With the forHealthcheck method, we check whether the containers are healthy. Now it takes about 17 seconds to start up. This happens because we are waiting for the whole application to start, and that means having the Docker containers fully running.
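
Conceptually, waiting for a container to become healthy is a polling loop with a deadline. The sketch below is not Testcontainers code. It is a minimal illustration with a hypothetical `waitUntilReady` function, and it shows why a bounded wait returns a clear failure instead of waiting forever.

```kotlin
import java.time.Duration

// Polls isReady until it returns true or the timeout elapses.
// Returns false on timeout instead of blocking indefinitely.
fun waitUntilReady(
    timeout: Duration,
    pollInterval: Duration,
    isReady: () -> Boolean
): Boolean {
    val deadline = System.nanoTime() + timeout.toNanos()
    while (true) {
        if (isReady()) return true
        if (System.nanoTime() >= deadline) return false
        Thread.sleep(pollInterval.toMillis())
    }
}
```

The `isReady` lambda is where the difference between the two setups lives: it can test for an open port or ask for a health status, and the loop around it stays the same.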

Based on the investigation and conclusions above and on my experience, my best advice is to make sure you know what you are doing when using Testcontainers. It is a great framework, but we need to understand it well to avoid the "magic" and "crazy" trap. In my real-life case I came across issues with port detection. Making a whole example of it for a simple bite would be fairly complicated, but it is related to issues reported around the use of the internal class InternalCommandPortListeningCheck. If you look into this class, you'll see that it makes mandatory use of grep, nc and bash. This means that listening and waiting for a port to be opened may be a bit flaky. In fact, there are issues reported by people whose containers cannot make use of these tools: https://github.com/testcontainers/testcontainers-java/issues/3835, and this one https://github.com/testcontainers/testcontainers-java/issues/3317. Listening to a port therefore isn't always a guarantee.

In my real-life example, the port would be open but would reply to Testcontainers much later. In some cases, the level of processing involved was such that my computer would shut down before Testcontainers even started. And this is where the whole problem came about. Changing this to a health check solved my real-life problem. We have seen that listening to a port may speed up the startup of the containers, but then the application may not have started correctly. By listening to a health check, we are sure that our application has started. We lose speed but gain predictability. If we need a container exclusively for its port, it may be more complicated to work out how to speed up the process.
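
To make the limits of an in-container port check more concrete, here is a sketch of one plausible grep-style check: scanning the Linux `/proc/net/tcp` table for a socket in the LISTEN state on the wanted port. This is an assumption about how such a check can work, not a copy of InternalCommandPortListeningCheck, and `isListeningInProcNetTcp` is a hypothetical name. In that table, ports are written in hexadecimal and the state `0A` means LISTEN.

```kotlin
// Returns true if the given /proc/net/tcp content contains a socket that is
// in LISTEN state (st == "0A") on the given local port.
fun isListeningInProcNetTcp(procNetTcp: String, port: Int): Boolean {
    val hexPort = "%04X".format(port) // e.g. 8080 -> "1F90"
    return procNetTcp.lineSequence()
        .map { it.trim().split(Regex("\\s+")) }
        // Data rows start with an index such as "0:", which skips the header line
        .filter { it.size >= 4 && it[0].endsWith(":") }
        .any { cols ->
            val localPort = cols[1].substringAfter(':') // local_address is "ADDR:PORT"
            localPort.equals(hexPort, ignoreCase = true) && cols[3] == "0A"
        }
}
```

A match only proves that a socket is listening. It says nothing about whether the application behind it is ready to answer, which is exactly the gap a health check closes.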

My whole point with this experiment was to show, with an example, some issues opened on GitHub for the Testcontainers framework and my own personal experience, that investing in more performant and predictable testing is an investment that can very well pay off. You know you are right when you see such a thing, and you want to make a difference. But remember: your social skills can either overshadow or show off your brilliance. A good technical background combined with a good understanding of social skills can go a long way and bring you and your project or company to a very successful and lucrative relationship.

Thank you!

I hope you enjoyed this article as much as I did making it!
Please leave a review, comments or any feedback you want to give on any of the socials in the links below.
I'd be very grateful if you want to help me make this article better.
I have placed all the source code of this application on GitHub.
Thank you for reading!