One of the challenges I have encountered with event-driven distributed architectures was reconciling the data processed by the various services.
The problem was compounded by integrating the services with an external component, Salesforce Marketing Cloud, and by the variety of user input sources: the desktop web site, iOS devices, and Android devices.
When the Marketing VP noticed a consistent drop in the emails Salesforce Marketing Cloud was sending to Android customers, gathering and understanding the large amount of data processed by the various services became crucial.
This article shows how to trace the messages exchanged between services in a distributed architecture using Spring Cloud Sleuth and a Zipkin server.
Together they collect and correlate logs across components by instrumenting the entire communication path, and they provide a visual graph of the calls (see below).
Spring Cloud Sleuth provides Spring Boot auto-configuration for distributed tracing.
Zipkin is an open-source implementation of Google's Dapper that was further developed at Twitter, and it has instrumentation libraries for JavaScript, PHP, C#, Ruby, Go, and Java.
This sample project has five microservices: an HTTP request triggers the Publisher service to produce an event to the Kafka cluster, which the Subscriber service consumes. Consuming the event triggers additional requests to the rest of the services; the last web service simulates a slow operation by sleeping for a randomized interval.
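The slow service's delay logic can be sketched in plain Java. This is a minimal illustration, not the sample project's actual code; the class and method names are hypothetical:

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of the "slow" service's randomized timer.
public class SlowOperationSimulator {

    // Sleeps for a random duration between minMs (inclusive) and maxMs
    // (exclusive) and returns the delay that was chosen, so the caller
    // can log or tag it on the current span.
    public static long simulateSlowCall(long minMs, long maxMs) throws InterruptedException {
        long delay = ThreadLocalRandom.current().nextLong(minMs, maxMs);
        Thread.sleep(delay);
        return delay;
    }

    public static void main(String[] args) throws InterruptedException {
        long delay = simulateSlowCall(100, 500);
        System.out.println("Simulated a slow operation taking " + delay + " ms");
    }
}
```

A delay in this range is long enough to stand out clearly in the Zipkin trace view.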
A docker-compose.yaml file is used to start the Kafka cluster and the Zipkin server.
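Such a compose file typically has roughly the following shape. This is a sketch, not the project's actual file: the image versions and listener settings are illustrative assumptions for a single-broker local setup:

```yaml
version: "3.8"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.3.0   # illustrative version
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.3.0       # illustrative version
    depends_on: [zookeeper]
    ports: ["9092:9092"]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
  zipkin:
    image: openzipkin/zipkin
    ports: ["9411:9411"]                     # dashboard at localhost:9411
```

Running `docker-compose up -d` brings up the broker and the Zipkin collector before the Spring services start.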
All the services are started in VS Code, and upon executing the first request the log captures the communication:
Opening the Zipkin dashboard at http://localhost:9411/zipkin, you can query by service, request, or a particular span or tag.
You can also create your own spans in code to mark a slow-running operation, or add custom data (events) to the log, which can be exported as JSON from the top-right of the page.
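With Spring Cloud Sleuth 3.x, a custom span can be created through the auto-configured `Tracer`. The sketch below assumes a Spring context; the service class, span name, tag, and event are hypothetical examples, not code from this project:

```java
import org.springframework.cloud.sleuth.Span;
import org.springframework.cloud.sleuth.Tracer;
import org.springframework.stereotype.Service;

@Service
public class ReportService {

    private final Tracer tracer; // auto-configured by Spring Cloud Sleuth

    public ReportService(Tracer tracer) {
        this.tracer = tracer;
    }

    public void generateReport() {
        // Create a child span so the slow section shows up separately in Zipkin.
        Span span = tracer.nextSpan().name("generate-report");
        try (Tracer.SpanInScope ws = tracer.withSpan(span.start())) {
            span.tag("report.type", "daily"); // custom data, searchable in the UI
            runSlowQuery();
            span.event("query-finished");     // custom event on the span timeline
        } finally {
            span.end();                       // always close the span
        }
    }

    private void runSlowQuery() { /* slow work goes here */ }
}
```

The tag and event then appear on the span's detail panel in the Zipkin dashboard.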
Looking at the exported log file, you can see the global trace ID and the correlation IDs for each operation.
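The exported file follows the Zipkin v2 span format: every span carries the shared `traceId`, its own `id`, and a `parentId` pointing at the span that caused it. The values below are illustrative, not taken from a real export:

```json
[
  {
    "traceId": "5f8a3c2b9d1e4f60",
    "id": "5f8a3c2b9d1e4f60",
    "name": "get /publish",
    "localEndpoint": { "serviceName": "publisher" }
  },
  {
    "traceId": "5f8a3c2b9d1e4f60",
    "parentId": "5f8a3c2b9d1e4f60",
    "id": "a1b2c3d4e5f60718",
    "name": "on-message",
    "localEndpoint": { "serviceName": "subscriber" }
  }
]
```

Matching `parentId` to `id` across entries reconstructs the call tree for one request.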
If the data needs to be stored for later investigation, this JSON can be converted to a table with a simple SQL query.
By default Sleuth exports 10 spans per second, but you can set the spring.sleuth.sampler.probability property to allow only a percentage of messages to be logged.
Another customization that can be made is to exclude certain patterns of API calls from the trace.
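Both customizations are plain configuration properties. A sketch of an application.properties file, with illustrative values:

```properties
# Sample all requests (the probability default is 0.1, i.e. 10%)
spring.sleuth.sampler.probability=1.0

# Regex of endpoints to exclude from tracing, e.g. health probes
spring.sleuth.web.skipPattern=/actuator.*|/health
```

In production, a low sampling probability keeps the tracing overhead and the Zipkin storage volume manageable.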
I will continue this article with a few details about the code changes required.
An interesting follow-up to explore is the monitoring capability available in Azure for Spring Cloud apps (see link and image below):