Microservices Monitoring
Monitoring used to be a fairly passive activity. You used tools to watch the application process and its logs, and perhaps sent an alert if something seemed wrong, but mostly it was hands-off. When you move to a microservices architecture, things change.
User Experience and Microservices Monitoring
Because microservices are released more often, you can try new features and see how they affect usage patterns. With this feedback, you can improve your application. It is common to employ A/B testing and multivariate testing to try out new combinations of features. Monitoring is more than just watching for failure. With big data, data science, and microservices, monitoring runtime stats is how you get to know your application's users. You want to know what your users like and dislike, and react appropriately.
Debugging and Microservices Monitoring
Runtime statistics and metrics are critical for distributed systems, since a microservices architecture relies on a lot of remote calls. Metrics worth monitoring include requests per second, available memory, thread count, connection count, failed authentications, expired tokens, and so on. These parameters are important for understanding and debugging your code. Working with distributed systems is hard. Working with distributed systems without reactive monitoring is crazy. Reactive monitoring allows you to react to failure conditions and ramp up services for higher loads.
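To make the metrics above concrete, here is a minimal sketch of an in-process collector for a few of them. The class and method names (RuntimeStats, recordRequest, and so on) are hypothetical and are not part of any framework; a real service would more likely rely on a metrics library.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper that tracks a few of the runtime stats named above. */
final class RuntimeStats {
    private final AtomicLong requests = new AtomicLong();
    private final AtomicLong failedAuth = new AtomicLong();
    private final long startNanos = System.nanoTime();

    void recordRequest() { requests.incrementAndGet(); }

    void recordFailedAuth() { failedAuth.incrementAndGet(); }

    long requestCount() { return requests.get(); }

    long failedAuthCount() { return failedAuth.get(); }

    /** Average requests per second since this object was created. */
    double requestsPerSecond() {
        double elapsedSeconds = (System.nanoTime() - startNanos) / 1_000_000_000.0;
        return elapsedSeconds <= 0 ? 0.0 : requests.get() / elapsedSeconds;
    }

    /** Approximate number of bytes the JVM can still allocate. */
    long availableMemoryBytes() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return rt.maxMemory() - used;
    }

    /** Number of live threads in this JVM, daemon threads included. */
    int liveThreads() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }
}
```

A request handler would call recordRequest() on every call and recordFailedAuth() when authentication fails; a reporter thread could then publish the values periodically.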
Circuit Breaker and Microservices Monitoring
You can employ the Circuit Breaker pattern to prevent a catastrophic cascade, and reactive microservices monitoring can be the trigger. Downstream services can be registered with a service discovery registry so that you can mark nodes as unhealthy and react by rerouting in the case of outages. The reaction can be serving up an older or degraded version of the data or service, but the key is to avoid cascading failure. You don’t want your services falling over like dominoes.
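As an illustration of the pattern, here is a minimal circuit breaker sketch. It is not taken from any framework: the class name, the consecutive-failure threshold, and the injectable clock are all assumptions made for the example. After enough consecutive failures the breaker opens and serves the fallback without touching the remote service; once the open period has passed, it lets a trial call through.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical, minimal circuit breaker; not production-ready. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures before opening
    private final long openMillis;        // how long to stay open before a trial call
    private final LongSupplier clock;     // injectable time source in milliseconds
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Returns the current state, moving OPEN to HALF_OPEN once the open period has elapsed. */
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    /** Runs the remote call unless the breaker is open; uses the fallback on failure or when open. */
    <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get();
        }
        try {
            T result = remoteCall.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        // A failed trial call, or too many failures in a row, (re)opens the breaker.
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }
}
```

In a service, the clock would typically be System::currentTimeMillis, and the fallback might return cached data. Note that this sketch does not limit the half-open state to a single concurrent trial call, which a hardened implementation usually would.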
Cloud Orchestration and Microservices Monitoring
Reactive microservices monitoring enables you to detect heavy loads and spin up new instances with the cloud orchestration platform of your choice (Kubernetes, EC2, CloudStack, OpenStack, Rackspace, etc.).
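One way to turn a load measurement into a scaling decision is a proportional rule, similar in spirit to the formula documented for the Kubernetes Horizontal Pod Autoscaler: scale the replica count by the ratio of observed load to target load, then round up. The sketch below assumes a hypothetical helper and a per-instance load figure such as requests per second; the min and max bounds are also assumptions of the example.

```java
/** Hypothetical helper that computes how many instances a load level calls for. */
final class ScalingDecision {
    /**
     * @param currentReplicas instances running now
     * @param currentLoad     observed load per instance (for example requests per second)
     * @param targetLoad      load per instance you planned capacity for
     * @param min             lower bound on the result
     * @param max             upper bound on the result
     */
    static int desiredReplicas(int currentReplicas, double currentLoad,
                               double targetLoad, int min, int max) {
        if (targetLoad <= 0) {
            throw new IllegalArgumentException("targetLoad must be positive");
        }
        // Proportional rule: ceil(currentReplicas * currentLoad / targetLoad).
        int desired = (int) Math.ceil(currentReplicas * (currentLoad / targetLoad));
        // Clamp to the configured bounds.
        return Math.max(min, Math.min(max, desired));
    }
}
```

For example, four instances each seeing 900 requests per second against a target of 600 would call for six instances.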
Public Microservices and Microservices Monitoring
Microservices monitoring of runtime statistics can be used to rate limit a partner’s application Id/client Id. You don’t want partners to consume all of your well-tuned, high-performance microservices resources. It is okay to trust your partners, but use microservices monitoring to verify.
Monitoring public microservices is your way to verify. Once you make microservices publicly available or partner available, you have to monitor and rate limit.
This is not a new concept. If you have ever used a public REST API, from Google for example, you are well aware of rate limiting. A rate limit can do things like cap the number of connections you’re allowed to make. It is common for rate limits to cap the number of requests of a certain kind that a client id or partner id is allowed to make in a given time period. This is protection.
Deploying public or partner-accessible microservices without this protection is lunacy and a recipe for disaster, unless you like failing when someone decides to hit your endpoints ten times harder than your capacity planning allowed for. Avoid long nights and tears. Monitor the microservices that you publish, and limit access to them.
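The per-client limit described above can be sketched as a fixed-window counter: each client id gets a budget of requests per time window, and requests beyond the budget are rejected until the next window starts. This is a minimal illustration with hypothetical names, not the rate limiter of any particular framework.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical fixed-window rate limiter keyed by client id. */
final class FixedWindowRateLimiter {
    private final int maxRequests;       // requests allowed per window
    private final long windowMillis;     // window length in milliseconds
    private final LongSupplier clock;    // injectable time source in milliseconds
    // clientId -> {window start time, requests counted in that window}
    private final Map<String, long[]> windows = new HashMap<>();

    FixedWindowRateLimiter(int maxRequests, long windowMillis, LongSupplier clock) {
        this.maxRequests = maxRequests;
        this.windowMillis = windowMillis;
        this.clock = clock;
    }

    /** Returns true if the client may proceed, false if it has used up its budget. */
    synchronized boolean tryAcquire(String clientId) {
        long now = clock.getAsLong();
        long[] window = windows.computeIfAbsent(clientId, id -> new long[] {now, 0L});
        if (now - window[0] >= windowMillis) {
            // The previous window has expired: start a fresh one.
            window[0] = now;
            window[1] = 0L;
        }
        if (window[1] >= maxRequests) {
            return false;
        }
        window[1]++;
        return true;
    }
}
```

A gateway or middleware would call tryAcquire with the caller's client id and answer with HTTP 429 (Too Many Requests) when it returns false. Fixed windows are simple but allow short bursts at window boundaries; sliding windows or token buckets smooth that out.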
Microservices Framework and Microservices Monitoring
Light-4J is a microservices framework that comes with runtime metrics which can be used for microservices monitoring.
- You can query the /server/health endpoint to detect whether the service is available and healthy.
- The framework collects metrics and pushes them into InfluxDB, and dashboards can be viewed in Grafana.
- Rate limiting can be enabled at the client_id level or at the IP address/user level.
- Kubernetes monitors the load of each pod and can start new instances on demand.
- TraceabilityId and CorrelationId are written to the logs, so requests can be traced with tools like Logstash, Graylog and Splunk.
- Specifically designed error codes can be monitored, and an alert sent if any of them show up in the logs.
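As an example of the first point, a health probe against /server/health might look like the sketch below. It uses the JDK's built-in HTTP client (Java 11 or later). The class name and the two-second timeouts are hypothetical, and treating HTTP 200 as "healthy" is an assumption of this example; check what your service actually returns.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical probe for a service's /server/health endpoint. */
final class HealthProbe {
    /** Builds the health URL from a base URL such as http://localhost:8080. */
    static URI healthUri(String baseUrl) {
        return URI.create(baseUrl).resolve("/server/health");
    }

    /** Returns true only if the endpoint answers with HTTP 200 within the timeout. */
    static boolean isHealthy(String baseUrl) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
        HttpRequest request = HttpRequest.newBuilder(healthUri(baseUrl))
                .timeout(Duration.ofSeconds(2))
                .GET()
                .build();
        try {
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            return response.statusCode() == 200;
        } catch (IOException e) {
            // Connection refused, timeout, and similar failures count as unhealthy.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }
}
```

A monitor could call isHealthy on a schedule and mark the node as unhealthy in service discovery after a few consecutive failures.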
Reactive Microservices Monitoring
Reactive microservices monitoring is an essential ingredient of a microservices architecture. You need it for debugging, knowing your users, working with partners, and building reactive systems that respond to load and failures without cascading outages. Reactive microservices monitoring cannot be an afterthought. Build your microservices with monitoring in mind from the start, and make sure that the microservices library you use has monitoring of runtime statistics built in as a core part of the library.
Codahale Metrics (now Dropwizard Metrics) allows you to gather metrics in a standard way. Tools like InfluxDB, Grafana, and Kibana help you understand the data and build dashboards. Light-4J, the Java microservices framework, includes a metrics middleware which feeds into Codahale Metrics. Light-4J also provides a rate limiting middleware to limit access per client_id or IP address/user. A container orchestration tool like Kubernetes can also spin up new nodes/pods. With big data, data science, and microservices, monitoring runtime stats is required to know your application users, know your partners, know what your system will do under load, and so on.
Microservice Logging
Every instance of a service has a unique identifier, most commonly the Docker container name, or the hostname if the service is not deployed in a Docker container. The code to retrieve the Docker container name and the hostname is the same.
Along with the Docker container name, the traceabilityId and correlationId are logged as context info with each logging statement. Once log files are aggregated in ELK, users can trace a particular transaction by its traceabilityId or correlationId.
Because microservices may be deployed across multiple geographic regions, logged timestamps must be in UTC so that logs can be easily ordered in ELK.
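The logging conventions above can be sketched as follows. This is an illustration, not the framework's own code: the class name, the line layout, and the tId/cId labels are assumptions. It relies on two facts about the JDK: the host lookup returns the container's hostname when running inside Docker (which, by default, Docker sets to the container id), and Instant.toString() always renders an ISO-8601 timestamp in UTC.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.time.Instant;

/** Hypothetical helper that builds log lines carrying instance and trace context. */
final class ContextLog {
    /** Returns the hostname, which inside a Docker container identifies the container. */
    static String instanceId() {
        String fromEnv = System.getenv("HOSTNAME");
        if (fromEnv != null && !fromEnv.isEmpty()) {
            return fromEnv;
        }
        try {
            return InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            return "unknown-host";
        }
    }

    /** Formats one log line; the timestamp is rendered in UTC (ISO-8601, trailing Z). */
    static String format(Instant when, String instanceId, String traceabilityId,
                         String correlationId, String message) {
        return when + " [" + instanceId + "] tId=" + traceabilityId
                + " cId=" + correlationId + " " + message;
    }
}
```

With a logging framework you would normally put the ids into the logging context (for example an MDC) once per request instead of formatting strings by hand, but the fields that end up in each line are the same.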
Microservice Alerting
Logstash has features to send out alerts when a certain error code is spotted in the log files.
The framework has a component called status, which defines all errors in a YAML file that can be externalized. All error codes follow the format ERRXXXXX, and selected error codes can be set up in the alert to send out emails or notify support teams through other channels.
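To show the idea behind code-based alerting, here is a small sketch that scans a log line for error codes and reports the ones on an alert list. It assumes that the XXXXX part of ERRXXXXX is five digits; the class name and the sample codes are hypothetical. In practice this matching would usually live in the log pipeline rather than in application code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical scanner that picks alert-worthy error codes out of log lines. */
final class ErrorCodeScanner {
    // Assumes codes look like ERR followed by exactly five digits.
    private static final Pattern CODE = Pattern.compile("\\bERR\\d{5}\\b");

    /** Returns the codes found in the line that are also in the alert set, in order of appearance. */
    static List<String> alertsFor(String logLine, Set<String> alertCodes) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = CODE.matcher(logLine);
        while (matcher.find()) {
            if (alertCodes.contains(matcher.group())) {
                hits.add(matcher.group());
            }
        }
        return hits;
    }
}
```

A non-empty result would then trigger whatever channel you have configured, such as an email to the support team.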