AIOps+ new smart monitoring solution based on AI

Author

Marek Koniew: Software Architect, Product owner at co.brick

If you are a DevOps engineer or a developer who maintains microservices, you know better than anyone just how time-consuming this task can become.

Integrating your services with metrics, logs, traces and alerts is a necessary evil that is going to keep distracting you from the real work. And if that wasn’t enough, once you’ve got all the monitoring set in place, the situation seems no better. Browsing through endless dashboards and logs to spot the issue turns out to be absolutely tedious work (even if you are lucky enough to have Grafana integrated with Loki).

At co.brick, we have been through the same ordeal, which ultimately led us to ask: what should monitoring look like in the future?

We believe that in times to come, metrics will be collected automatically from your services. Charts for the most common metrics will be available out of the box. The monitoring system will know when the system behaves correctly; when it does not, it will trigger an alert. Plain and simple!

And by no means do we consider these ideas a distant future! We believe the Kubernetes cluster already holds all the information about your applications and backing services.

All necessary metrics are available out of the box: resource consumption, response times, error rates, etc. These metrics are just waiting to be analyzed by state-of-the-art algorithms such as anomaly detection.
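To make the idea of anomaly detection on such metrics a bit more concrete, here is a minimal sketch in Python. It is not the algorithm used in AIOps+; it is a plain rolling z-score, chosen only because it is easy to follow. The function name `find_anomalies`, the `window` and `threshold` parameters and the sample response times are all assumptions made for illustration.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate strongly from the preceding window.

    Illustrative only: a rolling z-score, not the AIOps+ algorithm.
    """
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change at all as a deviation.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        # Flag the sample if it lies more than `threshold` standard
        # deviations away from the mean of the preceding window.
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical response times in milliseconds with one obvious spike.
response_times_ms = [120, 118, 125, 121, 119, 122, 480, 123, 120]
print(find_anomalies(response_times_ms))  # prints [6]
```

A real system would probably need something more robust, since a single spike inflates the standard deviation of the following windows and can hide later anomalies.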

But there is more. We think that Artificial Intelligence and Machine Learning algorithms can take automatic actions to repair the most common issues without keeping you awake at night.

Just imagine simple tasks such as restarting an application after a crash or automatically scaling up to handle an unexpected traffic spike. We focus on services deployed in a Kubernetes cluster, which provides a great set of APIs that can easily be employed to automate DevOps tasks.
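The two tasks mentioned above can be sketched as a small decision function. This is a hedged illustration, not our implementation: `ServiceState`, `plan_remediation`, the 60% CPU target and the replica cap are hypothetical names and values. The scaling rule is the proportional formula documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current replicas × current metric / target metric)). The sketch only decides what to do; actually applying the decision would go through the Kubernetes API and is left out here.

```python
import math
from dataclasses import dataclass


@dataclass
class ServiceState:
    """Hypothetical snapshot of one service, as a monitoring system might see it."""
    name: str
    crashed: bool
    replicas: int
    cpu_utilization: float  # average across replicas, 0.0 to 1.0


def plan_remediation(state, target_cpu=0.6, max_replicas=10):
    """Decide on a remediation action; returns a tuple describing it."""
    if state.crashed:
        # A crashed application is restarted before anything else is considered.
        return ("restart", state.name)
    if state.cpu_utilization > target_cpu:
        # Proportional scaling rule, capped at max_replicas.
        desired = math.ceil(state.replicas * state.cpu_utilization / target_cpu)
        desired = min(desired, max_replicas)
        if desired > state.replicas:
            return ("scale", state.name, desired)
    return ("none", state.name)


# Hypothetical service under a traffic spike: 3 replicas at 90% CPU.
print(plan_remediation(ServiceState("checkout", False, 3, 0.9)))
# prints ('scale', 'checkout', 5)
```

Keeping the decision separate from the code that calls the cluster makes the policy easy to test without a running cluster.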

The project's uniqueness lies in connecting all the best practices from the field and leveraging progress in the machine learning sphere. What’s more, we work closely with dedicated scientists to improve state-of-the-art approaches and discover new solutions to the most challenging problems.

The goals we have set for AIOps at co.brick:

  • Set up monitoring and alerting. Our goal is to automate setting up telemetry on the system. We aim to follow well-established standards such as OpenTelemetry.
  • Proactive issue identification. We aim to identify incidents proactively. Moreover, we will try to predict potential incidents based on current system behavior.
  • Troubleshooting. We will automatically detect the incident and find its root cause by correlating logs, metrics and traces.
  • Reduce alert fatigue. We will automatically suppress alerts that have no impact on system stability or performance.
  • Automated remediation. We will automatically take action to mitigate or resolve incidents.
  • Optimization. We will automatically suggest actions for improving the system's resiliency, response time and operational cost.

We are going to reveal more details soon. Stay tuned! And by the way - we are hiring!
If you find this interesting, please browse through our vacancies here: cobrick.com/jobs