Monitoring Secure Coroutines and WebFlux Reactive applications with Prometheus, Grafana, and InfluxDB — A webcams example

1. Introduction

For the past few years, we’ve seen an increasing interest in building safe applications and making them as reactive as possible. Many technologies have been developed for this over the years, and today the most popular ones seem to be Spring WebFlux and Kotlin Coroutines. As I’ll show in this article, both technologies complement each other and can live in the same JVM ecosystem. As you have probably noticed already, there are frequent discussions over which of them offers the best performance, but essentially they both follow the same principle, which is to avoid blocking as much as possible. WebFlux takes advantage of the Observer Pattern, using Flux and Mono to deliver collections and single objects respectively, and Coroutines use Flow and suspend functions to deliver the same.

In this article, we will look at one of the many ways to keep an application’s behavior closely under watch: we will use Prometheus as a data scraper, InfluxDB as a persistence mechanism, and Grafana to visualize our data. The project for this article is located on GitHub.

When an application gets developed, the assumption that we can move straight to production because the unit tests (TDD, Test Driven Development) and the integration tests did not detect any problem can still be very risky. Even when an experienced tester has made their best effort to pen test the application and has made all the necessary business logic checks, we are still relying on a human factor, and we still haven’t considered system or network issues. BDD (Behaviour Driven Development) often comes into the equation as a means to prevent errors in production. It adds an extra element of reliability, given that BDD checks the behavior of a running environment. A well-known framework used to achieve this is Cucumber.

Consider a scenario where, looking only at the numbers that give the business a sense of how safe it is to go to production, everything looks good: we have about 90% code coverage, we have covered the logic that the IDE (Integrated Development Environment) did not detect, the BDD tests pass, and we have a superb and ultra-professional DDD (Domain Driven Design) expert who assures us that the architecture and boundaries match all the expectations of the PO (Product Owner). We are also very proud to have a team where everyone knows all the software design patterns by heart, uses them, and thinks about them in their daily work. Furthermore, everyone knows how to apply the SOLID (Single responsibility, Open-Closed, Liskov substitution, Interface Segregation, Dependency Inversion) and ACID (Atomicity, Consistency, Isolation, Durability) principles. Everything seems fine, we go to production, and then we realize that the application isn’t really doing what it was supposed to do. Furthermore, it seems to display all sorts of unexpected behavior.

In these scenarios, the problem may well be that we did not pay any attention to the CPU, GPU, memory usage, GC (Garbage Collection), security, and the list goes on. We didn’t check the downtime of the application, and we didn’t check its resilience, availability, and robustness. We also didn’t figure out what to do in case of concurrency and multiple users, or how much data actually needs to flow in and out of our application. Monitoring can offer a lot of help in preventing these situations.

2. Technical Requirements

For this article, some experience with Spring and Spring Boot is helpful, albeit not essential. We will go through the complete details of the implementation, so if you don’t know some of the basics, don’t worry: I will try my best to keep it easy. What’s really important is to have the following installed on your machine:

  • Java 17 (any Java 17 distribution will do). My best advice is to use SDKMAN! to install it and to pick 17-open.

  • A recent Gradle (at least version 7.5, in order to be compatible with Java 17)

  • A good IDE (IntelliJ or Eclipse for example)

  • Docker Desktop, but only if you are using an office-oriented machine with a system like macOS or Windows. My best advice is to just use a Linux machine with 12 cores if possible.

3. Article Goals

When talking about monitoring, we mostly make a few assumptions: it’s easy, it doesn’t take too much time to set up, and we just want to check if the system is OK. Further, there is a generalized assumption that we shouldn’t complicate it too much. These are all valid, up to a point. On the other hand, if we think about what actually needs to be done to make monitoring a reality in our environment, we quickly realize that there are quite a few moving parts involved. Let’s think about what we need on a very high level:

  1. We want metrics

  2. We want to have those metrics in our system

  3. We need to fetch metrics

  4. We need to store metrics

  5. We need to visualize these metrics

  6. We need an alert system

  7. Our system should not be impaired in any way in terms of performance

Looking at point 1, we are mostly sure that we need metrics. Now we need to do the hard work of thinking about which metrics we need. Do we need to monitor the Garbage Collection? Do we need to monitor resources? Do we need to monitor memory usage? Maybe we need all of these metrics, or maybe we already know firsthand what we want to monitor. It’s good to have an idea beforehand of what we want.

For points 2 and 3, we need to decide between a push and a pull mechanism. In this article, we are going to look at the pull mechanism. We will see this in both Prometheus and InfluxDB. The biggest advantage of a pull mechanism is that, once our application is running, we can control at any time what we want to receive, by scraping the data at specific and configurable points in time. A push mechanism implies that the application is constantly sending metrics through the wire, even when we want to scale down our metrics fetching. In other words, monitoring will always make a small dent in the application performance, but we can scale that down if we use a pull mechanism. What’s important is to make that decrease in performance completely negligible.

Next, in point 4, we need to store our data. Fetching mechanisms often come with ephemeral storage engines. These engines are so called because they can be reset by a restart, they are only kept in memory, they have a rolling log, or they are simply not designed for long-term persisted data. This is what happens with Prometheus. Although it does have provisions to persist data, these are still very basic and come with some limitations, in their own words:

If you run the rule backfiller multiple times with the overlapping start/end times, blocks containing the same data will be created each time the rule backfiller is run. All rules in the recording rule files will be evaluated. If the interval is set in the recording rule file that will take priority over the eval-interval flag in the rule backfill command. Alerts are currently ignored if they are in the recording rule file. Rules in the same group cannot see the results of previous rules. Meaning that rules that refer to other rules being backfilled is not supported. A workaround is to backfill multiple times and create the dependent data first (and move dependent data to the Prometheus server data dir so that it is accessible from the Prometheus API).

in prometheus.io/docs/prometheus/latest/storage

For these reasons, we need another element in our system that makes sure that the longevity of our data is determined by us and not by the system. We will achieve this in our example with InfluxDB. Once our data is stored, we need a clean and easy way to visualize it. That is point 5. In many systems, the visualization software also allows us to configure our requirements for point 6. Alert systems usually go hand in hand with the visualization interface we are using. This is where we will see Grafana in action.

Point 7 can be a fallacy in itself. There is always a very small decrease in performance, which should always be negligible, as we discussed above. We can also say that if that decrease isn’t perceptible, then for practical purposes it does not exist. It is a fair point, and it makes complete sense, but it’s important to think about it. Misconfigured monitoring solutions can cause latency issues, make the application unresponsive, or even bring a system down.
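To make the pull idea from points 2 and 3 a bit more tangible, here is a minimal sketch in plain Java. It is not Prometheus, Micrometer, or Actuator code: the class PullMetricsSketch, its Registry, and the metric names are all made up for illustration, and the output is only a simplified "name value" text rendering. The point it tries to show is that the application only updates counters in memory, and that text is produced only when someone asks for it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;

// Hypothetical names throughout; this only illustrates the pull model.
final class PullMetricsSketch {

    // Application side: counters live in memory and nothing is sent over the wire.
    static final class Registry {
        // A sorted map keeps the rendered output in a stable order.
        private final Map<String, AtomicLong> counters = new ConcurrentSkipListMap<>();

        void increment(String name) {
            counters.computeIfAbsent(name, key -> new AtomicLong()).incrementAndGet();
        }

        // Called by whoever scrapes; renders one "name value" line per counter.
        String render() {
            return counters.entrySet().stream()
                    .map(entry -> entry.getKey() + " " + entry.getValue().get())
                    .collect(Collectors.joining("\n"));
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        registry.increment("requests_total");
        registry.increment("requests_total");
        registry.increment("errors_total");
        // The scraper decides when this happens; the application never pushes.
        System.out.println(registry.render());
    }
}
```

With a design like this, how often the metrics are rendered is entirely up to the scraper, which is why scaling monitoring down does not require touching the application.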

4. Domain

Our service is a simulation of a real system. Ideally, we would surveil objects in a museum, for example. We could monitor the location of boats passing along a river, check the weather at airports, the attendance at a concert, etc. If we have a camera, we can imagine thousands of things we can do with it. In this project, we will keep this very simple. We will surveil vegetables on a wooden floor! This way nothing else will matter but the goal of our project:

One service will provide two resource endpoints. One endpoint will give you the image captured at a given time by a certain camera. The other endpoint provides information about the object we are filming. For our application, we are not interested in having this information separate. Instead, we want to access this information in an aggregated way. We also do not want the front end to do all the hard aggregation work. That processing should be left to a new service that wraps this “raw data” service and provides the desired aggregated information. Once we have this service available, we want to be able to visualize our data with a front-end application. So now we know that we need at least three services: one for raw data, one for the aggregated data, and another to provide a GUI (Graphical User Interface) to the user.

Consequently, we also need to be able to monitor these services. This means that we’ll need at least three different metrics providers working as sidecars along these three services. With our 3 services and their 3 metrics providers, we need a scraper to collect this data. Once we have the scraper, we also want to be able to visualize this data and to store it. So this means that we’ll need another 3 services: one scraper, one visualizer, and a persistence service. Here is a rough sketch of how this works:

Moving Objects project

5. Implementation

Let’s start building our application! The case we are going to look at is a system that surveils the movement of objects. In a previous version of this article, the cameras were provided by two REST interfaces of RapidAPI. One provided an airport list and the other the public cameras around the airport. These were the Airport Finder API and the Webcams Travel API. However, because these services are external, I had no control over them. I also couldn’t control changes in the model, or whether a service would for one reason or another become private, subscription-based, trial-only, or restricted in any other way. At some point, the original example of this article became effectively non-functional, non-compilable, and of course impossible to demo. This means that now, instead of surveilling locations around airports, we have an example of surveilling moving objects on a grid of locations. I simply took a few photos of these vegetables to create an idea of movement. In order to make things a bit more interesting, I’ve created 2 REST applications:

  • moving-objects-jwt-service: A secured, JWT-capable application providing two separate endpoints. One delivers the object/vegetable information and the other the camera location in a 2-dimensional grid. This application listens on port 8081. It has been implemented with Spring Boot, WebFlux, and Kotlin Coroutines. It uses an R2DBC connection to a PostgreSQL database on port 5432.

  • moving-objects-rest-service: This one serves the front-end application and faces the user. We can optionally start it with OAuth2 Okta authentication. It authenticates in order to communicate with the JWT service. It runs on port 8082. This application has been implemented with Spring Boot and WebFlux.

I have also created a GUI (Graphical User Interface). In this case, I am using Angular, and for the UX (User eXperience) design implementation I am using Angular Material. The reason for creating the 2 REST services is that I find it to be a great way of showing monitoring in action. We make one service depend on the other, where the goal of the edge service is to aggregate both sources of information. Therefore, we can analyze a wider variety of data behavior cases. I have also implemented OAuth2 for the same reasons. In other words, I wanted one service to behave differently from the other. We will see a glimpse of that at the end of this article.

6. Application Setup

In this section, we’ll have a look at the application overview, how everything is set up, and the containerized environment. Let’s start with the application overview:

Application Overview

In this simplified overview, we can see that, before we get any data, we go through 3 services. We have an Angular application served by NGINX to provide the front end, one REST server with Okta authentication to provide access to an aggregation of data, and finally, a service that provides the raw data and uses JWT to secure access to it. Let’s have a closer look at what the containerized environment looks like. In our previous requirement enumeration, we saw that we need application environments for our three services, a data source, a permanent persistence mechanism, and a visualization environment for the monitored data. In our case, these are:

  • JRE 17: We’ll use this runtime in two services. As we have seen before, one runs in the Data Source domain and the other in the aggregator domain. In each of the environments, we’ll be running not only the applications but also the metrics service as a sidecar to the Spring Boot applications. Spring Boot can pack external libraries like the ones provided by Prometheus and use them in its actuator endpoints.

  • NGINX: With NGINX, we can deploy our front end as a static deliverable. It runs in the UX domain.

  • Node: Node.js enables us to run a Prometheus service that provides metrics to the Prometheus scraper. It runs in the Metrics domain. The runtime does not have to share the same domain as the machine it’s running on. With Node, we can start a service that reports metrics not necessarily about NGINX, but about the machine where it runs.

  • Prometheus: This is the scraper. A scraper in this case is just a process that runs at fixed time intervals to retrieve important metrics from target machines. It runs in the Metrics domain.

  • InfluxDB: This is our persistence mechanism. InfluxDB can also configure scrapers. They run the same way as in Prometheus, but we can store the data. It includes its own metric visualization application. It runs both in the Metrics domain and in the UX domain.

  • Grafana: This is our visualization tool. We can visualize all the needed metrics that are scraped by Prometheus. It runs, of course, in the UX domain.

We can run everything separately on our local machine, but we can also speed up the development process by using Docker images and Docker Compose. Besides speed, the other big advantage is that we can run the whole environment with one single command on a single Docker host.

Let’s have a look at a more detailed version of our system:

Localhost configuration using Docker

As seen in the diagram, all our moving parts are containerized on one single machine, on localhost. No extra configuration is needed when using Docker. It’s important to notice from the image that the only ports that are supposed to be part of the public domain are 3000, 9090, and 8080. In a local environment, we will expose all ports. Further down this document, we will see how the secured version of this project works. For that one, we will also need port 8082. All other ports are inside the Docker domain. Containers can reach each other by the names we have given them. We’ll see how to translate this system to docker-compose further on in the article.

7. Data model

Our data model is exceptionally simple. We only have two tables. One with the Object Info and the other one with the Moving Object info:

create schema if not exists mos;
drop table if exists mos.moving_object;
drop table if exists mos.info_object;
create table if not exists mos.moving_object(
    id UUID,
    code VARCHAR ( 50 ) NOT NULL,
    folder VARCHAR ( 255 ) UNIQUE NOT NULL,
    uri VARCHAR ( 255 ) UNIQUE NOT NULL,
    x INT,
    y INT,
    PRIMARY KEY (id)
);
create table if not exists mos.info_object(
    id UUID,
    "name" VARCHAR(50) UNIQUE NOT NULL,
    "size" INT,
    color VARCHAR ( 255 ) NOT NULL,
    code VARCHAR ( 50 ) NOT NULL,
    x INT,
    y INT,
    PRIMARY KEY (id)
);

Both tables share a code. This code is what ties the data together. However, this data isn’t given together to the front end. Instead, we provide it separately. The table moving_object can be interpreted as the location where an object with a certain code is being filmed. The uri column holds the location of the camera that provides the latest snapshot taken. The folder column is the actual location where the image is saved. This latest snapshot will be delivered to the client using a streaming endpoint. Here, x and y are the coordinates of the object. The table info_object simply contains extra information about the object being filmed. An object, for this project, has information regarding its name, size, color, and location. The location is here also expressed in coordinates x and y. The reason for this is to distinguish the location of the camera from the location of the object. Having said that, we can also argue that the coordinates of the object we are filming don’t really matter that much. At startup, the project loads the following data into the database:

truncate mos.mos.moving_object;
insert into mos.mos.moving_object (id,  code, folder, uri,x,y)
values (gen_random_uuid(), 'GAR', 'garlic', '/aggregator/webcams/camera/GAR',1,2);
insert into mos.mos.moving_object (id,  code, folder, uri,x,y)
values (gen_random_uuid(), 'LAU', 'laurel', '/aggregator/webcams/camera/LAU',2,2);
insert into mos.mos.moving_object (id,  code, folder, uri,x,y)
values (gen_random_uuid(), 'LEM', 'lemon', '/aggregator/webcams/camera/LEM',3,4);
insert into mos.mos.moving_object (id,  code, folder, uri,x,y)
values (gen_random_uuid(), 'ONI', 'onion', '/aggregator/webcams/camera/ONI',7,2);
insert into mos.mos.moving_object (id,  code, folder, uri,x,y)
values (gen_random_uuid(), 'PUM', 'pumpkin', '/aggregator/webcams/camera/PUM',8,1);
insert into mos.mos.moving_object (id,  code, folder, uri,x,y)
values (gen_random_uuid(), 'RED', 'red-onion', '/aggregator/webcams/camera/RED',3,6);
insert into mos.mos.moving_object (id,  code, folder, uri,x,y)
values (gen_random_uuid(), 'SNA', 'snail', '/aggregator/webcams/camera/SNA',1,7);
insert into mos.mos.moving_object (id,  code, folder, uri,x,y)
values (gen_random_uuid(), 'TOM', 'tomato', '/aggregator/webcams/camera/TOM',0,1);
insert into mos.mos.info_object(id, "name", size, color, code,x,y)
values (gen_random_uuid(), 'Garlic', 2, 'white', 'GAR', 1, 2);
insert into mos.mos.info_object(id, "name", size, color, code,x,y)
values (gen_random_uuid(), 'Laurel', 4, 'green', 'LAU', 2, 2);
insert into mos.mos.info_object(id, "name", size, color, code,x,y)
values (gen_random_uuid(), 'Lemon', 2, 'lemon', 'LEM', 3, 4);
insert into mos.mos.info_object(id, "name", size, color, code,x,y)
values (gen_random_uuid(), 'Onion', 4, 'gold', 'ONI', 7, 2);
insert into mos.mos.info_object(id, "name", size, color, code,x,y)
values (gen_random_uuid(), 'Pumpkin', 10, 'orange', 'PUM', 8, 1);
insert into mos.mos.info_object(id, "name", size, color, code,x,y)
values (gen_random_uuid(), 'Red Onion', 5, 'purple', 'RED', 3, 6);
insert into mos.mos.info_object(id, "name", size, color, code,x,y)
values (gen_random_uuid(), 'Snail', 1, 'brown', 'SNA', 1, 7);
insert into mos.mos.info_object(id, "name", size, color, code,x,y)
values (gen_random_uuid(), 'Tomato', 2, 'red','TOM', 0, 1);
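
Since the shared code column is the only link between the two tables, any consumer that wants both pieces of information has to join them on it. Here is a minimal sketch of that idea in plain Java (Java 17), with no database and no reactive types involved. The records and the aggregate method are hypothetical names that only mirror the columns above; they are not the project’s entities.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical types mirroring the two tables; not the project's entities.
final class JoinByCodeSketch {
    record MovingObject(String code, String folder, String uri, int x, int y) {}
    record InfoObject(String name, int size, String color, String code, int x, int y) {}
    record Aggregated(InfoObject info, List<MovingObject> cameras) {}

    // Groups the camera rows by code, then attaches them to the matching info row.
    static List<Aggregated> aggregate(List<InfoObject> infos, List<MovingObject> cameras) {
        Map<String, List<MovingObject>> camerasByCode =
                cameras.stream().collect(Collectors.groupingBy(MovingObject::code));
        return infos.stream()
                .map(info -> new Aggregated(info, camerasByCode.getOrDefault(info.code(), List.of())))
                .toList();
    }

    public static void main(String[] args) {
        List<InfoObject> infos = List.of(new InfoObject("Garlic", 2, "white", "GAR", 1, 2));
        List<MovingObject> cameras = List.of(
                new MovingObject("GAR", "garlic", "/aggregator/webcams/camera/GAR", 1, 2),
                new MovingObject("LEM", "lemon", "/aggregator/webcams/camera/LEM", 3, 4));
        // Only the GAR camera is attached to the Garlic info row.
        System.out.println(aggregate(infos, cameras));
    }
}
```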

8. Code

If you have already had a sneak peek at the code on GitHub, you’ve probably seen that the code base is quite small per module, but that all the code taken together may be a bit complicated to understand. This section is here so that we can see together how the code is built and what we want to achieve. Maybe the best way to start is to have a look at the data domain in the moving-objects-jwt-service:

Dao UML for moving-objects-jwt-service

There isn’t much that is special in this UML chart. Essentially, we just have two different types, which we call MovingObject and InfoObject. Cameras have a location in an x, y grid in the same way that objects do. We consider the MovingObject to be what the camera is filming. We can and will use these terms interchangeably to refer to the same thing. InfoObject is just some information about the object. In this case, it’s just the size, the color of the object, and its latest location. The latest location never changes in this project.

Although the JWT service is mostly implemented using Coroutine Flows, I wanted to use WebFlux in the same project to show that both can work together. Their reactive implementations are different, but they can coexist, and we can make use of the different features each of them offers. In my case, I really wanted to make use of the collectSortedList function to create a reactive Mono from a list of objects of the type MovingObject. Many people prefer to use the JpaSpecificationExecutor to get a full page in one go. Unfortunately, if we do that, then we get a non-reactive Page in return. It probably doesn’t matter much in the end, because effectively a Page is just one record given back to the client. However, I wanted this service to be completely reactive regardless. And so, instead of returning a Page, I return a Mono<Page>, which allows the application to remain completely reactive. Now that we’ve looked at the model, let’s have a look at the controller part:

REST UML for moving-objects-jwt-service

We convert the data into DTOs in very specific ways, but we never join the object info with the cameras. In our case, we’ll associate a location with all the cameras contained within a specific radius of it. This gives us the full list of cameras for a specific search, but not the objects’ info, such as their size and color. That is one endpoint. With the other endpoint, we can get the object’s info: its location, color, and size. What this REST service does not offer is an aggregation of the cameras and the object info.
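
The radius search mentioned here can be pictured with a small sketch in plain Java (Java 17). I am assuming that "within a radius" means a Euclidean distance on the x, y grid that is smaller than or equal to the radius; the actual service may define it differently. The Camera record and the withinRadius method are hypothetical names used only for this illustration.

```java
import java.util.List;

// Hypothetical sketch of a radius search on the x, y grid.
final class RadiusSearchSketch {
    record Camera(String code, int x, int y) {}

    // Assumption: a camera matches when its Euclidean distance to (x, y) is <= radius.
    static List<Camera> withinRadius(List<Camera> cameras, int x, int y, long radius) {
        return cameras.stream()
                .filter(camera -> {
                    long dx = camera.x() - x;
                    long dy = camera.y() - y;
                    // Comparing squared values avoids a square root and rounding issues.
                    return dx * dx + dy * dy <= radius * radius;
                })
                .toList();
    }

    public static void main(String[] args) {
        List<Camera> cameras = List.of(
                new Camera("GAR", 1, 2),
                new Camera("LAU", 2, 2),
                new Camera("LEM", 3, 4),
                new Camera("ONI", 7, 2));
        // Searching around (1, 2) with radius 1 keeps GAR (distance 0) and LAU (distance 1).
        System.out.println(withinRadius(cameras, 1, 2, 1));
    }
}
```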

This service also needs to be secured. For this project, we keep it very simple, and when we run it, we’ll get a warning that we are using a user/password combination configured in the clear in application.properties. Normally, services have a much more elaborate way of providing security. Let’s have a quick look at the code:

@Bean
fun securityFilterChain(http: ServerHttpSecurity): SecurityWebFilterChain =
    http
        .logout { logout ->
            logout
                .requiresLogout(ServerWebExchangeMatchers.pathMatchers(HttpMethod.GET, "/logoff"))
        }
        .authorizeExchange { authorize ->
            authorize
                .pathMatchers("/webjars/**")
                .permitAll()
                .pathMatchers("/info/jwt/open/**")
                .permitAll()
                .pathMatchers("/webcams/jwt/open/**")
                .permitAll()
                .pathMatchers("/v3/**")
                .permitAll()
                .pathMatchers("/actuator/**")
                .permitAll()
                .anyExchange()
                .authenticated()
        }
        .csrf().disable()
        .httpBasic(Customizer.withDefaults())
        .oauth2ResourceServer { oAuth2ResourceServerSpec -> oAuth2ResourceServerSpec.jwt() }
        .build()

What we are doing here is first making all the endpoints that match the listed patterns publicly accessible. These patterns refer to 3 important aspects of the application: Swagger, Actuator, and test endpoints. Swagger allows us to interact with the application via a GUI, so that we can perform requests without needing curl or any other command-line tool like wget. The Actuator endpoint is vital for our project because it provides metrics, including the ones Prometheus understands and uses. CSRF protection is disabled in our application. The Spring documentation is very clear about why this is appropriate in most such cases:

If you are only creating a service that is used by non-browser clients, you will likely want to disable CSRF protection.

in docs.spring.io/spring-security/site/docs/5...

The token is requested via a specific endpoint that we’ll see later on. For now, I just want you to get the idea that, with our configuration, we can access any of our endpoints either with a username/password combination or with a JWT token. The JWT token, however, is only issued via the token endpoint. We protect our application with the OAuth2ResourceServerSpec. This allows us to configure several types of security settings, but for the purpose of this article, we just enable the standard JWT support. Reactive programming also implies reactive security, because security is inherently bound to any request we make to our services. The principle, though, is exactly the same as in non-reactive applications. We first configure our AuthenticationManager:

@Bean
fun authenticationManager(): ReactiveAuthenticationManager {
    return UserDetailsRepositoryReactiveAuthenticationManager(userDetailsService())
}

We then need to configure encoders and decoders for the JWT token itself and also for the password we are using to get the initial JWT request for the application:

@Bean
fun jwtDecoder(): ReactiveJwtDecoder = NimbusReactiveJwtDecoder.withPublicKey(rsaPublicKey).build()

@Bean
fun jwtEncoder(): JwtEncoder =
    NimbusJwtEncoder(ImmutableJWKSet(JWKSet(RSAKey.Builder(rsaPublicKey).privateKey(rsaPrivateKey).build())))

@Bean
fun passwordEncoder(): PasswordEncoder {
    return BCryptPasswordEncoder()
}

Codecs for the JWT rest service
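
To get a feeling for what an encoder and a decoder built on an RSA key pair are responsible for, here is a minimal sketch in plain Java that signs and verifies a token using only the JDK. I am assuming the RS256 algorithm here (RSA with SHA-256). This is not how Nimbus or Spring Security implement it: the class JwtRs256Sketch and its methods are names I made up for illustration, and the sketch deliberately skips claim validation such as expiry.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;
import java.util.Base64;

// Hypothetical sketch: header.payload.signature, signed with the private key
// and verified with the public key. No claim validation is performed.
final class JwtRs256Sketch {
    private static final Base64.Encoder ENCODER = Base64.getUrlEncoder().withoutPadding();
    private static final Base64.Decoder DECODER = Base64.getUrlDecoder();

    static String sign(String payloadJson, PrivateKey privateKey) {
        String header = "{\"alg\":\"RS256\",\"typ\":\"JWT\"}";
        String signingInput = base64Url(header) + "." + base64Url(payloadJson);
        try {
            Signature signer = Signature.getInstance("SHA256withRSA");
            signer.initSign(privateKey);
            signer.update(signingInput.getBytes(StandardCharsets.US_ASCII));
            return signingInput + "." + ENCODER.encodeToString(signer.sign());
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns true only when the signature matches the header and payload.
    static boolean verify(String token, PublicKey publicKey) {
        String[] parts = token.split("\\.");
        if (parts.length != 3) {
            return false;
        }
        try {
            Signature verifier = Signature.getInstance("SHA256withRSA");
            verifier.initVerify(publicKey);
            verifier.update((parts[0] + "." + parts[1]).getBytes(StandardCharsets.US_ASCII));
            return verifier.verify(DECODER.decode(parts[2]));
        } catch (GeneralSecurityException e) {
            return false; // in this sketch, any failure counts as an invalid token
        }
    }

    static KeyPair newKeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    private static String base64Url(String json) {
        return ENCODER.encodeToString(json.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        KeyPair keys = newKeyPair();
        String token = sign("{\"sub\":\"admin\"}", keys.getPrivate());
        System.out.println(verify(token, keys.getPublic()));
    }
}
```

The asymmetry is the useful part: only the holder of the private key can issue tokens, while anything that has the public key can check them.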

Once we have the codecs, we can now configure our UserDetailsService:

@Bean
@Suppress("DEPRECATION") // withDefaultPasswordEncoder() is deprecated because it is meant for demos only
fun userDetailsService(): MapReactiveUserDetailsService {
    val user = withDefaultPasswordEncoder()
        .username(username)
        .password(password)
        .roles(*roles)
        .build()
    return MapReactiveUserDetailsService(user)
}

UserDetailsService

We can already see that this UserDetailsService is different from its Servlet counterpart (commonly referred to as MVC). The difference is that this service is implemented in a reactive way. We are using Netty here, and not Tomcat, as the Spring Boot embedded server. Netty works in a reactive way, which means that we can implement our calls to the service in a reactive way. This in turn improves the capacity of our machine, given that it will be able to hold more requests at the same time, even when security is involved. In our case, our service is safeguarded with the username/password combination admin/admin. It is always good to be reminded that security can be implemented much better than this. Our very unsafe admin/admin combination suffices, though, for the purpose of this exercise.

Let’s now have a look at the service that faces the front end. In the fully functional version of this service, we’ll use Okta to perform OAuth authentication. For the end-to-end tests, Okta will be disabled, but we will go through the whole application. Our application is called moving-objects-rest-service. Its first contact with the previous application starts, naturally, at the data layer level:

DAO Layer of the Okta project

Looking closely at these types, we can see how directly they relate to the DTOs created in the JWT project.

Service Layer of the Okta Project

In the service layer above, we see that we have a dedicated service for each endpoint and, next to them, an ObjectAggregatorService. This service creates the payload that we are interested in. The aggregation relies on a few interesting techniques related to reactive programming. Let’s have a look at one of its methods:

public Flux<MovingObjectDto> getObjectsAndCamsBySearchTermAndRadius(String term, Long radius) {
    return objectsService.getObjectsByTerm(term)
            .flatMap(objectDto -> {
                CoordinatesDto coordinates = objectDto.coordinates();
                return webCamService.getCamsByLocationAndRadius(coordinates.x(), coordinates.y(), radius)
                        .collectList().map(webCamDto -> Pair.of(objectDto, webCamDto));
            })
            .map(webCamDtoPackage -> {
                final MovingObjectDto movingObjectDto = webCamDtoPackage.getFirst();
                webCamDtoPackage.getSecond().forEach(webCam -> movingObjectDto.webCams().add(webCam));
                return movingObjectDto;
            }).distinct();

}

Creating the MovingObjectDto
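
One thing that is easy to miss in a method like this is that calling it does not run any of the lambdas: they only run once the resulting pipeline is consumed. As a rough analogy only, plain java.util.stream pipelines are lazy in a similar way, and the sketch below (Java 17) uses them to make the order of events visible. It does not use Reactor, and the class LazyPipelineSketch is a made-up name for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Analogy only: java.util.stream is lazy, a bit like a Flux that nobody has subscribed to yet.
final class LazyPipelineSketch {

    static List<String> run() {
        List<String> log = new ArrayList<>();
        // Declaring the pipeline does not run the mapping lambda.
        Stream<String> pipeline = Stream.of("GAR", "LAU")
                .map(code -> {
                    log.add("mapped " + code);
                    return code.toLowerCase();
                });
        log.add("pipeline assembled");
        // The terminal operation is what finally triggers the mapping.
        List<String> result = pipeline.toList();
        log.add("result " + result);
        return log;
    }

    public static void main(String[] args) {
        // Prints:
        // pipeline assembled
        // mapped GAR
        // mapped LAU
        // result [gar, lau]
        run().forEach(System.out::println);
    }
}
```

The "pipeline assembled" line shows up before any "mapped" line, which is the same kind of ordering to expect from the reactive method above.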

In the getObjectsAndCamsBySearchTermAndRadius method, we are using purely Spring WebFlux. We first search for the object information by a search term. In the second step, we search for the cameras related to it by radius. In other words, we retrieve all the cameras contained within a specific radius of an object that we have found through a search word. We must never forget that when using Coroutine Flows or WebFlux, we are not doing imperative programming, and we are not doing OOP (Object Oriented Programming). We are using Functional Programming, implementing the Observer Pattern, and using Reactive Principles. What this means is that the code will not execute immediately. Instead, the method only assembles the pipeline, and it is executed once it is subscribed to and Netty is free to run it.

It is probably good at this point to remind ourselves that, by only having the Spring WebFlux libraries in our dependencies, we are no longer using Tomcat by default. We are using Netty. The reason I want to state this here is that we very often see people treating Tomcat and Spring Boot as interchangeable. However, Spring Boot isn’t necessarily run by Tomcat. It can be run with Undertow and Jetty, and in the case of reactive services, it can be run with Netty. It is important to make this distinction in advanced types of service implementation. We can now have a quick look at what the Controller layer looks like:

Controller Layer of the Okta Service

This layer is what allows us to work with our payload which has the following form:

Payload example to serve the front-end application

As we can see, we are providing details about our object, but also the details about the camera. We have thus aggregated two endpoints into one, in a reactive way that will be managed by Netty. Finally, we can have a look at how the optional security is implemented with Okta. The only thing we need to do first is to add the Okta libraries and a few extra helpful libraries:

implementation("com.okta.spring:okta-spring-boot-starter:3.0.7")
implementation("com.fasterxml.jackson.module:jackson-module-kotlin:2.18.2")
implementation("me.paulschwarz:spring-dotenv:4.0.0")
implementation("org.springframework.security:spring-security-web:6.4.2")
implementation("org.jesperancinha.objects:moving-objects-security-dsl:1.0.0")

Dependencies used to provide security

The dependency that really matters is, of course, the okta-spring-boot-starter library. The others play a supporting role. They are not necessary for Okta per se, but they are necessary for the current version of the project and its library versions. The last library, moving-objects-security-dsl, is a very small library I made so that the Okta security configuration is only applied when the library is present. Trying to disable Okta via Spring profiles turned out to be an odd thing to do, and it just wasn’t helpful for this project. So the library contains just one single class:

@Configuration
class SecurityConfiguration(
    @Value("\${objects.jwt.public.key}")
    var rsaPublicKey: RSAPublicKey,

    @Value("\${objects.jwt.private.key}")
    var rsaPrivateKey: RSAPrivateKey,

    @Value("\${objects.jwt.username}")
    var username: String,

    @Value("\${objects.jwt.password}")
    var password: String,

    @Value("\${objects.jwt.roles}")
    var roles: Array<String>,
) {

    @Bean
    fun securityFilterChain(http: ServerHttpSecurity): SecurityWebFilterChain =
        http
            .logout { logout ->
                logout
                    .requiresLogout(ServerWebExchangeMatchers.pathMatchers(HttpMethod.GET, "/logoff"))
            }
            .authorizeExchange { authorize ->
                authorize
                    .pathMatchers("/webjars/**")
                    .permitAll()
                    .pathMatchers("/info/jwt/open/**")
                    .permitAll()
                    .pathMatchers("/webcams/jwt/open/**")
                    .permitAll()
                    .pathMatchers("/v3/**")
                    .permitAll()
                    .pathMatchers("/actuator/**")
                    .permitAll()
                    .anyExchange()
                    .authenticated()
            }
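            // Lambda-based DSL: the no-argument csrf() and jwt() variants are deprecated since Spring Security 6.1
            .csrf { csrf -> csrf.disable() }
            .httpBasic(Customizer.withDefaults())
            .oauth2ResourceServer { oAuth2ResourceServerSpec -> oAuth2ResourceServerSpec.jwt(Customizer.withDefaults()) }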
            .build()

    @Bean
    fun authenticationManager(): ReactiveAuthenticationManager {
        return UserDetailsRepositoryReactiveAuthenticationManager(userDetailsService())
    }

    @Bean
    fun jwtDecoder(): ReactiveJwtDecoder = NimbusReactiveJwtDecoder.withPublicKey(rsaPublicKey).build()

    @Bean
    fun jwtEncoder(): JwtEncoder =
        NimbusJwtEncoder(ImmutableJWKSet(JWKSet(RSAKey.Builder(rsaPublicKey).privateKey(rsaPrivateKey).build())))

    @Bean
    fun passwordEncoder(): PasswordEncoder {
        return BCryptPasswordEncoder()
    }

    @Bean
    @Suppress("DEPRECATION") // withDefaultPasswordEncoder() is deprecated because it is only meant for demos
    fun userDetailsService(): MapReactiveUserDetailsService {
        val user = withDefaultPasswordEncoder()
            .username(username)
            .password(password)
            .roles(*roles)
            .build()
        return MapReactiveUserDetailsService(user)
    }
}

Secure configuration
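The authorization rules in this configuration boil down to a short list of open path patterns, with everything else requiring authentication. The following is only a simplified sketch of that decision in plain Java. It does not use Spring Security’s real pattern matcher, it handles only the "/prefix/**" form that appears in the configuration, and the class name and the example paths in main are invented for illustration:

```java
import java.util.List;

// Simplified illustration only: Spring Security has its own path pattern
// matching. This sketch supports just the "/prefix/**" form used above.
class OpenPathRules {

    // The patterns that the configuration marks with permitAll().
    static final List<String> OPEN_PATTERNS = List.of(
            "/webjars/**",
            "/info/jwt/open/**",
            "/webcams/jwt/open/**",
            "/v3/**",
            "/actuator/**");

    // True when the path equals the prefix or lies anywhere below it.
    static boolean matches(String pattern, String path) {
        String prefix = pattern.substring(0, pattern.length() - "/**".length());
        return path.equals(prefix) || path.startsWith(prefix + "/");
    }

    // Mirrors anyExchange().authenticated(): whatever is not open needs a login.
    static boolean requiresAuthentication(String path) {
        return OPEN_PATTERNS.stream().noneMatch(pattern -> matches(pattern, path));
    }

    public static void main(String[] args) {
        // Hypothetical request paths, used only to exercise the rules.
        System.out.println(requiresAuthentication("/actuator/prometheus"));
        System.out.println(requiresAuthentication("/webcams/jwt/secure"));
    }
}
```

Seen this way, it becomes clear why the monitoring tools discussed later can read the metrics without credentials: everything under /actuator is on the open list.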

An alternative would have been to create one Bean with profile prod and another Bean with profile test, and then make the test Bean a completely open website with something like pathMatchers("/**").permitAll(). However, doing so would still imply initializing Okta anyway, with a level of unpredictability that I found unreasonable for the purposes of this project. Since Gradle is great at configuring assemblies the way we want, I created a build that is configurable by a parameter instead:

if (project.hasProperty("prod")) {
    implementation("com.okta.spring:okta-spring-boot-starter:3.0.7")
    implementation("com.fasterxml.jackson.module:jackson-module-kotlin:2.18.2")
    implementation("me.paulschwarz:spring-dotenv:4.0.0")
    implementation("org.springframework.security:spring-security-web:6.4.2")
    implementation("org.jesperancinha.objects:moving-objects-security-dsl:1.0.0")
}

Configurable build with Gradle

The configuration in Spring Boot is as simple as this:

# Okta
okta.oauth2.issuer=${ISSUER}
okta.oauth2.client-id=${CLIENT_ID}
okta.oauth2.client-secret=${CLIENT_SECRET}
#okta.oauth2.redirect-uri=http://localhost:8082/aggregator/login/oauth2/code/okta
okta.oauth2.post-logout-redirect-uri=http://localhost:8082/aggregator/signout
.env.filename=.okta.env

Okta configuration

With this in place, if we want to run the secure version of the application, we can create our build with the following command line:

gradle -Pprod clean build

Finally, the moving-objects-gui module provides a GUI where we can visualize our applications. We won’t go into details about it, but it is important to understand that we will be deploying it in NGINX, and that we will also want to monitor that machine.

9. Monitoring Endpoints

In previous steps, we have looked into how the architecture is designed, and we have looked at our domain. We have also had a look at the code and dived into some interesting aspects of the implementation. We should now have a very clear view of what the system is doing and how all the moving parts work together. We will now go step by step through all the elements of the monitoring system. In this part of the article, it’s important to bear in mind that this system can be run on a local machine. All moving parts are part of a docker-compose setup. I wanted to make an example that would be easy to understand and modify, and this is why every single component has its own custom Docker image. Let’s also keep in mind that the examples I’m showing happen after we have successfully run docker-compose. Further in this article, we’ll go through the details. For now, it is enough to know that we can run two versions of our system. For the secure version, we can run:

make dcup-full-action-secure

or if we want to run it without Okta, we can then run:

make dcup-full-action

For now, let’s have a look at how the metrics can be configured in the different containers. In our example, three containers need metric configuration: moving-objects-jwt-service, moving-objects-rest-service, and moving-objects-gui. For the first two services, we need to configure Prometheus for Spring Boot:

implementation("org.springframework.boot:spring-boot-starter-actuator")
implementation("io.micrometer:micrometer-core:1.14.2")
implementation("io.micrometer:micrometer-registry-prometheus:1.14.2")

Micrometer dependencies

These are the Micrometer dependencies needed to have all metrics written and exposed in the Prometheus format. In order to activate them, we also need Spring Boot’s Actuator. This will still not open the endpoints though. These are just the libraries that allow the endpoints to be generated. To open the endpoints in our project, we need to explicitly configure Spring Boot for that:

management.endpoints.web.exposure.include=*
management.endpoint.shutdown.enabled=true
management.endpoint.metrics.enabled=true
management.endpoint.prometheus.enabled=true
management.endpoint.httptrace.enabled=true
management.metrics.export.prometheus.enabled=true
management.trace.http.enabled=true

Enabling Metrics including Prometheus

Both REST services contain the same configuration. The Actuator is needed because it provides important metrics that come out of the box in Spring Boot and can already be indirectly read by Prometheus. The Micrometer libraries, micrometer-core and micrometer-registry-prometheus, provide metrics in the same format that Prometheus uses. This makes it much easier for the Prometheus processes to read the data from the metrics endpoints. With these simple configuration lines, we are enabling endpoints to be created, and we are also activating other tracking mechanisms. All of these management points are part of the Spring Boot Actuator. However, the Prometheus endpoint is only autoconfigured when both the Actuator and the Micrometer Prometheus registry are present, which is why we need the Micrometer libraries and the Actuator libraries simultaneously. Prometheus has its own scrapers. They run against the Spring Boot applications and scrape all of those metrics periodically. We can also make our own metrics and make them available through the Actuator. There are many different ways of doing this, and one example is to play a bit with trace. In our metrics we can find this endpoint:

Looking at the result we find:

Unchanged payload to trace requests

Here we have COUNT, TOTAL_TIME, and MAX. These are, in essence, the number of requests made to the application, the total time they took to respond, and the maximum time it took to get a response to a request. These are all important metrics. The count gives us an idea of how many people are using our web application, which makes it a very handy metric for ratings. We also want to monitor the performance of our application, and total time can be an option for that. With Prometheus we have much better options than this, but since we have count and total time, we can already get the average response time for a certain endpoint by dividing TOTAL_TIME by COUNT. As shown, we can also read the maximum time an endpoint took to respond. This matters because an endpoint that exceeds a certain response time may have a latency issue, and some sort of action must then be taken to improve it. What we don’t have in these metrics is the minimum amount of time it took for an endpoint to reply. We just don’t have data for that, because at no point have we registered the time of each individual request. The alternative is to implement our own HttpExchangeRepository, which is the name Spring Boot 3 gives to the former HttpTraceRepository:

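import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;

import org.springframework.boot.actuate.web.exchanges.HttpExchange;
import org.springframework.boot.actuate.web.exchanges.HttpExchangeRepository;
import org.springframework.stereotype.Repository;

@Repository
public class CustomTraceRepository implements HttpExchangeRepository {
    // Maximum number of exchanges kept in memory.
    private static final int MAX_TRACES = 10;

    // FIFO queue with the most recent GET exchanges. Both methods are
    // synchronized because LinkedList is not thread-safe.
    private final Queue<HttpExchange> lastTraces = new LinkedList<>();

    @Override
    public synchronized List<HttpExchange> findAll() {
        // Return a copy so that callers never see later modifications.
        return new ArrayList<>(lastTraces);
    }

    @Override
    public synchronized void add(HttpExchange trace) {
        // Only GET requests are recorded.
        if (!"GET".equals(trace.getRequest().getMethod())) {
            return;
        }
        // Drop the oldest entries so that at most MAX_TRACES remain after adding.
        while (lastTraces.size() >= MAX_TRACES) {
            lastTraces.poll();
        }
        lastTraces.add(trace);
    }
}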

CustomTraceRepository

This creates our own implementation of the HTTP tracing repository. Tracing was enabled by the HTTP tracing flag we saw before, management.trace.http.enabled=true (from Spring Boot 3 onwards, the equivalent property is management.httpexchanges.recording.enabled). Simply speaking, I just created a FIFO (First In/First Out) queue that keeps 10 elements in the repository at a time. This code only records GET requests. We can change it to perform the same or another operation on other requests. The point here is that we can change metrics the way we please and reprogram some of them to report customized data directly into Prometheus or any other monitoring tool. As an example, traces carry the following format:

Renewed Trace Metric Request
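Each recorded trace carries a timeTaken value, so statistics that the built-in metric lacks, such as a minimum, can be derived from the traces that are kept in memory. Here is a minimal sketch, assuming the values have already been extracted as milliseconds. The class name and the sample numbers are invented for illustration and are not part of the project:

```java
import java.util.List;
import java.util.LongSummaryStatistics;

// Hypothetical helper, not part of the project: summarises the timeTaken
// values (assumed to be in milliseconds) of the traces kept in memory.
class TimeTakenStats {

    static LongSummaryStatistics summarise(List<Long> timeTakenMillis) {
        return timeTakenMillis.stream()
                .mapToLong(Long::longValue)
                .summaryStatistics();
    }

    public static void main(String[] args) {
        // Invented sample values, one per recorded GET request.
        List<Long> samples = List.of(42L, 17L, 230L, 58L);
        LongSummaryStatistics stats = summarise(samples);
        System.out.println("MIN=" + stats.getMin()
                + " MAX=" + stats.getMax()
                + " AVG=" + stats.getAverage());
    }
}
```

With at most 10 traces in the queue, recomputing the summary on every request is cheap. Note that for an empty list getMin() returns Long.MAX_VALUE, so a caller would want to check getCount() first.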

If we look at the last element of each trace, we find timeTaken. This is the response time of each GET request. Our MIN could in this case be implemented as the minimum timeTaken in this array. It’s important to notice that in this case we are measuring milliseconds. Why is the minimum time needed? Perhaps our intention is to find the peaks in our performance instead of the peaks in memory usage or CPU. If we know when our requests were fastest, we can also identify where we potentially had better code or a better system as a whole. This endpoint is accessible in this project at http://localhost:8080/aggregator/actuator/httptrace (from Spring Boot 3 onwards, the same data is exposed under /actuator/httpexchanges).

We just went through the essentials of the Spring Boot back-end support for Prometheus. Let’s have a look at how NodeJS can provide the same kind of endpoint for Prometheus. In our example, we have a fully developed application in Angular. We could have used NGINX to deploy it directly. In that case, however, NGINX itself would have had to be configured to provide the endpoints to Prometheus. NodeJS provides a very interesting solution by means of the libraries prometheus-api-metrics and prom-client. There is a server.ts example in the code, but at this point our focus will only be on two crucial lines of it:

import "es6-shim";
import moment from "moment";
import "reflect-metadata";

import apiMetrics from "prometheus-api-metrics";

import express from "express";

class Server {

    public static bootstrap(): Server {
        return new Server();
    }

    public app: any;
    private port = 4000;

    constructor() {
        this.app = express();
        this.app.use(apiMetrics());
        // tslint:disable-next-line:no-console
        this.app.listen(this.port, () => console.log(`Express http is started ${this.port}`));
        this.app.on("error", (error: any) => {
            // tslint:disable-next-line:no-console
            console.error(moment().format(), "An error has occurred!", error);
        });
        process.on("uncaughtException", (error: any) => {
            // tslint:disable-next-line:no-console
            console.log(moment().format(), error);
        });
    }
}

const server = Server.bootstrap();
export default server.app;

server.ts metrics supplier

The lines we are interested in, or should I just say the one line we care about, is this.app.use(apiMetrics()). The whole server.ts is an implementation of a simple NodeJS service that runs on Express. In our running example, the Prometheus metrics can be seen on these endpoints:

The metric format will be something like this:

Example metrics exposed by Prometheus enabled
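Both the Spring Boot and the NodeJS endpoints answer in the Prometheus text exposition format: one sample per line, written as a series name (with optional labels) followed by a value, plus comment lines starting with # HELP and # TYPE. As a rough sketch of how a consumer could read such a body, the following plain Java parser collects the samples into a map. The class name and the sample lines are invented for illustration, and the sketch assumes that lines carry no trailing timestamp:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical reader for the Prometheus text format. Assumption: every
// sample line is "<series> <value>" without an optional trailing timestamp.
class ExpositionParser {

    // Prometheus writes infinities as +Inf and -Inf, which Java does not parse.
    static double parseValue(String raw) {
        switch (raw) {
            case "+Inf":
                return Double.POSITIVE_INFINITY;
            case "-Inf":
                return Double.NEGATIVE_INFINITY;
            default:
                return Double.parseDouble(raw); // also accepts "NaN"
        }
    }

    static Map<String, Double> parse(String body) {
        Map<String, Double> samples = new LinkedHashMap<>();
        for (String line : body.split("\n")) {
            String trimmed = line.trim();
            // Skip blank lines and the # HELP / # TYPE comments.
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // The value is the last token; label values may contain spaces.
            int separator = trimmed.lastIndexOf(' ');
            if (separator < 0) {
                continue;
            }
            String series = trimmed.substring(0, separator).trim();
            samples.put(series, parseValue(trimmed.substring(separator + 1)));
        }
        return samples;
    }

    public static void main(String[] args) {
        // Illustrative payload, not copied from the running project.
        String body = String.join("\n",
                "# HELP http_request_duration_seconds Duration of HTTP requests",
                "# TYPE http_request_duration_seconds histogram",
                "http_request_duration_seconds_sum{method=\"GET\",route=\"/\"} 0.84",
                "http_request_duration_seconds_count{method=\"GET\",route=\"/\"} 12");
        parse(body).forEach((series, value) -> System.out.println(series + " = " + value));
    }
}
```

Prometheus, InfluxDB scrapers, and Telegraf all do a more complete version of this, which is why a single metrics endpoint can feed all three tools.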

11. Setting up Prometheus

To get Prometheus running, we finally need to configure all of what we have discussed into a prometheus.yml file. Our file looks like this:

global:
  scrape_interval:     15s
  evaluation_interval: 30s

scrape_configs:

  - job_name: aggregator-reactive-rest
    honor_timestamps: true
    metrics_path: /aggregator/actuator/prometheus
    static_configs:
      - targets:
          - moving-objects-rest-service:8082

  - job_name: objects-jwt-reactive-rest
    honor_timestamps: true
    metrics_path: /objects/actuator/prometheus
    static_configs:
      - targets:
          - moving-objects-jwt-service:8081

  - job_name: moving-objects-gui
    honor_timestamps: true
    metrics_path: /metrics
    static_configs:
      - targets:
          - nginx:4000

Full Prometheus configuration file

The configuration used to point to our InfluxDB at the end of the file. However, the lines that did this are no longer usable. InfluxDB has grown into its own scraper framework, which opens the debate of whether Prometheus is still necessary. Here, however, we are just looking at ways to configure our monitoring application. In the global section, scrape_interval makes Prometheus scrape every target every 15 seconds, and evaluation_interval makes it evaluate its rules every 30 seconds. What is important, though, is that we can configure our routes in a YAML fashion with the static_configs property. In total, we configure three different jobs that scrape the data we need. Scraping in this case means that Prometheus pulls the data from these endpoints periodically, so that it becomes accessible to any external visualization tool like Grafana, which we’ll see further on.

Prometheus Dashboard
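To make the three jobs more tangible: for every target, Prometheus requests the scheme (http unless configured otherwise), followed by the target and the metrics_path. The small sketch below only reproduces that composition for the values of our prometheus.yml. The class and method names are invented for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that rebuilds the URLs Prometheus would scrape,
// assuming the default "http" scheme (no "scheme" key in the job).
class ScrapeUrls {

    static String scrapeUrl(String target, String metricsPath) {
        return "http://" + target + metricsPath;
    }

    public static void main(String[] args) {
        // Job name -> URL, using the targets and paths from prometheus.yml.
        Map<String, String> jobs = new LinkedHashMap<>();
        jobs.put("aggregator-reactive-rest",
                scrapeUrl("moving-objects-rest-service:8082", "/aggregator/actuator/prometheus"));
        jobs.put("objects-jwt-reactive-rest",
                scrapeUrl("moving-objects-jwt-service:8081", "/objects/actuator/prometheus"));
        jobs.put("moving-objects-gui",
                scrapeUrl("nginx:4000", "/metrics"));
        jobs.forEach((job, url) -> System.out.println(job + " -> " + url));
    }
}
```

The host names are the docker-compose service names, so these URLs only resolve from inside the docker-compose network.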

12. Setting up InfluxDB

One of the greatest hurdles when working with any architecture is the persistence layer. Prometheus does not provide an easy-to-configure, durable, and persistent storage system. This is to be expected, given that Prometheus is mostly used as a scraper, and as we have seen before, it is not easy to turn it from an ephemeral data storage system into a persistent one. This is why we have databases such as InfluxDB, a storage system specifically designed to store metrics. Nowadays it provides many more features, and it has its own data scrapers that work pretty much in the same way as those of Prometheus. Let’s have a look at how we can connect to it. InfluxDB does not need to be connected to Prometheus to store metrics anymore. In the old days, we would configure Prometheus to push the data to InfluxDB by means of jobs. Nowadays we just need metric endpoints. InfluxDB can then scrape those metrics and even provide visualization graphics and dashboards, much in the way Grafana does. We can have it running and just use it within our system. To check that it’s actually up and running, we can run the following command:

influx ping

For this, we need to have the influx client installed. When starting the demo locally, Cypress will run through the current InfluxDB installation and create buckets and scrapers. These will be configured to store data in different buckets for different endpoints. InfluxDB itself also comes with Prometheus metrics endpoints. Configuring InfluxDB from the GUI is very easy, and I don’t think it is relevant to show how these buckets and scrapers get configured. However, it is important to visualize what we get after the run completes successfully. When we run Cypress, the first thing it creates is the buckets where our scrapers save the data:

Once the buckets are created, we can create the scrapers:

Scrapers in InfluxDB

And finally we can create an Application Token:

Application Token in InfluxDB

This application token is needed for something very interesting in InfluxDB, which has made it an even more independent platform. As we saw above, InfluxDB can actively scrape metrics from our applications. This is all good, but we may want to do this the other way around. If we want InfluxDB to passively receive data, we need an external tool to do that. Prometheus still has that functionality, but because of the security requirements, it is quite cumbersome to achieve. InfluxDB, however, allows external engines to connect to it quite easily, including the ones provided by Prometheus. The tool for this is called Telegraf. Telegraf is essentially an agent that loads these engines as plugins, actively collects data from a Prometheus payload, and pushes it to InfluxDB. We do not need to download or install anything other than Telegraf to do this, and to configure it, we can edit the telegraf.conf file. For our project, it looks like this:

## Collect Prometheus formatted metrics
[[inputs.prometheus]]
  urls = [
  "http://localhost:8080/objects/actuator/prometheus",
  "http://localhost:8080/aggregator/actuator/prometheus",
  "http://localhost:8080/metrics"
  ]
  metric_version = 2

## Write Prometheus formatted metrics to InfluxDB
[[outputs.influxdb_v2]]
  urls = ["http://localhost:8086"]
  token = "$INFLUX_TOKEN"
  organization = "Moving Objects"
  bucket = "mos"

InfluxDB

The important thing for now is to focus a bit on the self-explanatory parameters of this file. Essentially, we need urls in the input to let Telegraf know where to scrape our metrics from, and we need to specify the destination of those metrics. For the destination, we provide the location of InfluxDB with urls, the token we’ve generated before, the organization, and the bucket. It is important to understand that organization and bucket are what uniquely identify our location in InfluxDB. They are mandatory. The section headers specify the type of the metrics source with inputs.prometheus and the type of the destination with outputs.influxdb_v2. When running the demo, the build will start, then Cypress will run and create the token at /docker-files/telegraf, and finally the Telegraf container will start in the same docker-compose network. In other words, this will all be done automatically for us, so that we can focus on how the setup is made. If the demo started by make start-demo runs successfully, we’ll then be able to create graphs like this one:

Dashboard visualization for InfluxDB
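The destination parameters of the Telegraf output map directly onto the InfluxDB v2 write API, where organization and bucket travel as query parameters of /api/v2/write and the token is sent in an Authorization: Token header. The sketch below only builds that URL from the values in our telegraf.conf. It is an illustration of how the parameters fit together rather than what Telegraf does internally, and the class name is invented:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: composes the InfluxDB v2 write URL for a given
// organization and bucket. The token is not part of the URL.
class InfluxWriteUrl {

    static String writeUrl(String baseUrl, String organization, String bucket) {
        return baseUrl + "/api/v2/write"
                + "?org=" + URLEncoder.encode(organization, StandardCharsets.UTF_8)
                + "&bucket=" + URLEncoder.encode(bucket, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Values taken from the telegraf.conf shown above.
        System.out.println(writeUrl("http://localhost:8086", "Moving Objects", "mos"));
    }
}
```

Because the organization name contains a space, it has to be URL-encoded, which is one more reason to let Telegraf handle these details for us.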

If you notice that data isn’t coming in, it could be that Telegraf hasn’t started yet. We can start Telegraf again by running:

make start-telegraf-container

13. Setting up Grafana

Grafana is configured differently from Prometheus. Here we don’t need any code. Grafana relies purely on the contract it has with Prometheus. Before continuing, let’s recap what we have so far:

  • 2 Spring Boot applications and 1 NodeJS application

  • Persistent Metrics Database InfluxDB

  • Metrics scraping and management with Prometheus

We will use Grafana to generate graphs from Prometheus only. A detailed discussion on how graphs are built with Grafana is a topic of its own and would take the focus away from this article. What’s important to understand is how we can make our Grafana configuration persistent as well, so that we can use it in any Grafana environment. When we look at Grafana for the first time, we become immediately aware that we have to create a data source. Along with it, we also create a dashboard provider. Dashboard providers are kept in YAML files located in /etc/grafana/provisioning/dashboards/ in Grafana. Let’s look at our example, dashboard.yml:

apiVersion: 1

providers:
  - name: 'Prometheus'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards

Grafana Dashboard configuration

With this configuration, we are telling Grafana that Prometheus is the name of its first dashboard provider. At the same time, we are letting Grafana know that it should load all dashboards located in /etc/grafana/provisioning/dashboards. Although we have already established a provider for our dashboards, we still need the actual endpoint for Prometheus. This is defined in our datasource.yml file:

apiVersion: 1

deleteDatasources:
  - name: Prometheus
    orgId: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    orgId: 1
    url: http://prometheus:9090
    basicAuth: false
    version: 1

datasource.yml

After defining the data source and the dashboard provider, we finally need to place all our dashboards in /etc/grafana/provisioning/dashboards. Grafana provides ways to save these dashboards in JSON format. They do not need to be referenced individually. As long as they reside in that folder, Grafana will read all of them. The dashboards themselves are quite big files, and they are available in the project in the folder /docker-files/grafana. The two JSON files there are the setup files for the NGINX machine (node-grafana.json) and for the JVMs (jvm-grafana.json). Grafana seems to have some compatibility issues with Firefox. I do not consider them critical, and I was never able to notice them in terms of UX (User eXperience). However, they do exist, and the Cypress tests detected a lot of them. This is why, if we take a look at commands.ts, we’ll see a few exception rules for certain thrown errors:

Cypress.on('uncaught:exception', (err, runnable) => {
    if (err.message && err.message.trim().length > 0 && err.name && err.name.trim().length > 0) {
        if (err.message.indexOf('setting getter-only property "data"') >= 0) {
            return false;
        }
        if (err.message.indexOf('Cannot read properties of null') >= 0) {
            return false;
        }
        if (err.message.indexOf('too much recursion') >= 0) {
            return false;
        }
        if (err.message.indexOf('The operation was aborted') >= 0) {
            return false;
        }
        if (err.message.indexOf('undefined') >= 0) {
            return false;
        }
    } else {
        return false;
    }
    return true;
})

After running Grafana, we should be getting graphs like these:

Dashboard examples for Grafana

14. Docker Compose orchestration

The docker-compose file located at the root of the project contains the main setup to start the application. This is how we implement the docker-compose orchestration seen in our first diagram. I mentioned before a few ways to start this application locally. We can also try to run it all in one go. The make target available to do this is start-demo:

make start-demo

If this is successful, we will have started the insecure version of our application. If we want to start the secure application, we can run:

make start-demo-secure

This is the look of our page at port 8080:

Moving Objects application

15. Conclusion

We have finally reached the end of this article. It is quite long, but I hope to have been able to convey the essential points that I found inspiring in this sort of monitoring architecture. I didn’t want it to become too complicated, but I also wanted to make it a fun article with a few surprises. In this article, we have seen how to set up a very basic configuration of Prometheus, Grafana, and InfluxDB applied to Spring Boot and NodeJS applications. We have seen images as examples of how the dashboards look and feel. Looking back at all three types of dashboards and the three different brands, we can immediately get the idea that these are essentially just different monitoring tools and that they can all be based on simple Prometheus metrics endpoints. We have also seen that we can make our own metrics and that those can be programmed in the code.

It can be overwhelming to make a decision when using these mechanisms. For example, Grafana seems to be a great tool to visualize data, and although we have seen in this article how to integrate it with the Prometheus service, we can also integrate it with InfluxDB. But now we may be a bit taken aback, because InfluxDB itself has dashboards, so why do we need Grafana anyway? Maybe there are functionalities in Grafana that you cannot find in InfluxDB. Or perhaps, in the same way that Telegraf can lower the CPU usage of InfluxDB, so can Grafana. In fact, Grafana can be considered just a visualization framework, Prometheus the source of all metrics, InfluxDB the persistence database, and Telegraf the scraper we need. If we do it this way, we get a completely decoupled system where each part has its own responsibility. We thereby minimize performance issues and the influence of metrics on our application.

In regard to security, using JWT and Okta as authentication/authorization mechanisms, we can see that in order to be able to monitor, we can make the endpoints open. But should we? We probably should not leave them open, but they frequently are. What usually happens is that they remain open to the on-premise network, while to the outside world they are kept private. This means that nobody outside that network can reach these endpoints, not even someone who otherwise has full access to the application. The techniques to do this are beyond the scope of this article, but solutions such as NGINX, Kong, and other gateways are available to provide this functionality, and they usually offer integrations with Cloud providers.

16. References

Thank you!

I hope you enjoyed this article as much as I did making it!
Please leave a review, comments or any feedback you want to give on any of the socials in the links below.
I’m very grateful if you want to help me make this article better.
I have placed all the source code of this application on GitHub.
Thank you for reading!