Marek Koniew: Software Architect, Product owner at co.brick
Data is the fuel for AIOps
In this blog post, we describe how we use the Grafana stack to automate metric collection and ML model training. The first milestone for any project based on Machine Learning is access to data. In AIOps, that means collecting metrics and logs. The most obvious choice, and the de facto monitoring standard for Kubernetes, is the Grafana Labs stack.
The Grafana Labs stack delivers its main components integrated with Kubernetes out of the box:
- Prometheus: a standard tool to collect metrics.
- Loki: highly scalable log collection.
- Tempo: a tool to collect application traces.
- Grafana: visualizes data from all of the above.
AIOps hosts all the above-mentioned components in the cloud. The only component that must be installed on the client side is a metric exporter; for that purpose, we install the small Grafana Agent. Customers can access their metrics through services provided by co.brick.
In AIOps, we first focus on collecting a large number of metrics from different sources and building Machine Learning models from them. We believe metrics describe the system best, but this comes at a cost: a vast number of time series and massive storage for historical data. Data size is even more critical for AIOps because Machine Learning methods are data-hungry. For that reason, we decided to employ Cortex, another project in the Grafana Labs ecosystem. We use it as highly scalable, long-term metrics storage for Prometheus.
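To illustrate how historical metrics stored this way might be pulled out for model training, here is a minimal sketch. It assumes the storage exposes the Prometheus-compatible `query_range` HTTP endpoint; the base URL, the `up` query, and the helper names `build_range_query_url` and `parse_matrix` are hypothetical and only serve the illustration. The sketch builds the request URL and parses a response body, and leaves the actual HTTP call to whatever client you prefer.

```python
import json
from urllib.parse import urlencode


def build_range_query_url(base_url, promql, start, end, step_seconds):
    """Build a Prometheus-style range query URL (base_url is an assumption)."""
    params = urlencode(
        {"query": promql, "start": start, "end": end, "step": step_seconds}
    )
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def parse_matrix(payload):
    """Turn a query_range JSON response into {labels: [(timestamp, value), ...]}."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    data = body["data"]
    if data["resultType"] != "matrix":
        raise ValueError(f"expected a matrix result, got {data['resultType']}")
    series = {}
    for item in data["result"]:
        # Labels identify the series; sort them so the key is stable.
        labels = tuple(sorted(item["metric"].items()))
        # Sample values arrive as strings and are converted to floats.
        series[labels] = [(float(ts), float(v)) for ts, v in item["values"]]
    return series
```

The parsed series can then be fed into feature extraction or stored as a training set; a real pipeline would also need authentication, pagination over long time ranges, and error handling.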
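Now we are prepared to collect metrics over a long period. There are two classes of metrics that we want to track and monitor at a high level: request-level metrics (throughput, error rate, latency, and body size) and resource-level metrics (CPU, memory, network, disk, and instance count):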
- Request throughput.
- Error rate.
- Request latencies.
- Request and Response body size.
- CPU utilization.
- Memory utilization.
- Network data transfer.
- Disk I/O.
- The number of service instances.
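Several of the metrics listed above, such as request throughput and error rate, are typically derived from monotonically increasing counters. As a rough sketch of that derivation, the hypothetical helpers below compute a per-second rate from raw counter samples and an error ratio from two counters. This is a simplified calculation that only handles counter resets; it is not a reimplementation of PromQL's `rate()`, which also extrapolates to the window boundaries.

```python
def counter_rate(samples):
    """Average per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    A drop in the value is treated as a counter reset (restart from zero).
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset, the whole current value counts as new increase.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


def error_ratio(total_samples, error_samples):
    """Share of failed requests: error rate divided by total request rate."""
    total = counter_rate(total_samples)
    return counter_rate(error_samples) / total if total > 0 else 0.0
```

For example, a request counter that grows from 0 to 120 over 60 seconds yields a throughput of 2 requests per second; if the error counter grows by 12 in the same window, the error ratio is 0.1.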
Where do these metrics come from? Apart from standard metrics exporters like the Node Exporter, the easiest way to collect system metrics on Kubernetes is to use a service mesh. There are two strong competitors in the service mesh world. The first is Linkerd, which is very lightweight and secure. The other is Istio, a feature-rich system for heavy-duty use cases. AIOps is going to support both of them. We will help automate service mesh deployment and metrics collection.
Finally comes the dull task: observing metrics. Machine Learning methods can make it easier. There are many techniques and algorithms for watching metrics. In the beginning, we focus on the tasks that will most improve the lives of DevOps engineers:
- Anomaly detection: see the anomalies in the time series from Prometheus metrics.
- Incident detection and Root cause analysis: identify the issue automatically and recommend mitigation.
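To make the anomaly detection task more concrete, here is a deliberately simple baseline: a rolling z-score over a metric time series. It is only an illustrative sketch, not the Machine Learning method used in AIOps; the function name `zscore_anomalies` and the `window`, `threshold`, and `min_sigma` parameters are assumptions chosen for the example.

```python
from statistics import mean, stdev


def zscore_anomalies(values, window=5, threshold=3.0, min_sigma=1e-9):
    """Return indices of points that deviate strongly from recent history.

    Each point is compared with the mean and sample standard deviation of
    the `window` points before it. min_sigma avoids division by zero when
    the history is perfectly flat.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = max(stdev(history), min_sigma)
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies
```

A baseline like this reacts to sudden spikes but ignores seasonality and trends, which is one reason more capable models are worth building.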
We are creating Machine Learning models that observe Prometheus metrics and take automatic actions such as alerting, optimization, and remediation. But that is all for today. Stay tuned for more details about the Machine Learning methods we are going to use, and more!
Want to become a part of the team working on this product? Take a look at our job offers at cobrick.com/jobs