https://github.com/robusta-dev/robusta: 0.10.17 dated June 1st 2023.
Application and infrastructure observability is fundamental. Engineering teams employ various platforms, stacks and individual tools to stay informed about the state of their hardware and software systems. All those tools offer analytical and, in some scenarios, near-real-time observability capabilities, such as monitoring, logging, tracing, alerting, profiling, debugging etc.
Alerting is particularly important because it gives teams timely notifications about changes in the state of their systems, whether those changes indicate unexpected failures or dangerous trends, or confirm that the systems are back on track.
It is equally important for engineers to understand the context of any given alert notification, as in many cases a detailed investigation is required, which may take a significant amount of time. Some of the tools try to correlate events from different sources and enrich notifications this way.
Lack of observability almost always leads to a major increase in maintenance costs. This broad category covers the extra time and effort engineering teams spend on resolving production issues, reduced capacity in other areas, degraded customer experience and possible reputation damage.
In this review we take a deep look into Robusta, an open-source Kubernetes observability rules engine. Its primary responsibility is to notify users about various changes happening in a given Kubernetes cluster. Before we dive into the system’s codebase, let’s briefly go over its main components, configuration options and features.
In Robusta, a Playbook is a rule that tells the system which Actions to perform upon certain Triggers. Users configure the engine with a custom set of Playbooks. The engine watches over one or multiple streams of events and matches them against the Triggers in the Playbooks. If an event conforms to a Trigger, the corresponding Actions are launched. Surprisingly, this simple workflow enables Robusta to offer many interesting features and approaches that help users conveniently build up flexible observability.
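The matching workflow described above can be sketched in a few lines of Python. This is a minimal illustration only: the names Event, Playbook, on_pod_crash, describe and run_engine are hypothetical and are not taken from Robusta's codebase.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types for illustration; these are not Robusta's classes.
Event = Dict[str, str]
Trigger = Callable[[Event], bool]
Action = Callable[[Event], str]


@dataclass
class Playbook:
    """A rule: run the actions when any of the triggers matches."""
    triggers: List[Trigger]
    actions: List[Action]


def on_pod_crash(event: Event) -> bool:
    """Trigger: fires for crash-looping Pods."""
    return event.get("kind") == "Pod" and event.get("reason") == "CrashLoopBackOff"


def describe(event: Event) -> str:
    """Action: produce a human-readable notification line."""
    return f"{event['kind']} {event['name']}: {event['reason']}"


def run_engine(playbooks: List[Playbook], events: List[Event]) -> List[str]:
    """Match each event against every playbook and run its actions on a match."""
    results: List[str] = []
    for event in events:
        for playbook in playbooks:
            if any(trigger(event) for trigger in playbook.triggers):
                results.extend(action(event) for action in playbook.actions)
    return results


playbooks = [Playbook(triggers=[on_pod_crash], actions=[describe])]
events = [
    {"kind": "Pod", "name": "api-0", "reason": "CrashLoopBackOff"},
    {"kind": "Pod", "name": "api-1", "reason": "Started"},
]
notifications = run_engine(playbooks, events)
```

Only the first event matches the trigger, so only one notification is produced; the second event is ignored.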
Events originate from the Kubernetes API and optionally from Prometheus (or Grafana) AlertManager and ElasticSearch. Additionally, as part of Actions, the system uses these sources to query auxiliary context information, such as resource definitions, logs, graphs and more.
Besides events from the mentioned sources, Robusta supports Triggers based on custom schedules (e.g. Cron) and Web-hook requests, which allow users to further customize and integrate the product into their infrastructure deployments.
If the system configuration contains a Playbook with a Trigger that matches a particular event from a stream, Robusta runs the Actions associated with the Playbook. The engine comes with an impressive set of ready-to-use Actions to address many use-cases, including:
Enriching event notifications with various context information, which may be obtained from related Kubernetes Nodes and Pods, Prometheus metrics and Grafana graphs and other sources.
Adding profiling and debugging capabilities and the corresponding instructions into event notifications.
Running automatic remediation processes, such as shell scripts within related Kubernetes Nodes and Pods and more.
When it comes to delivering notifications, Robusta supports many popular Sinks, including Slack, Microsoft Teams, Discord, PagerDuty etc., as well as more sophisticated integrations through Web-hooks, Apache Kafka and others.
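To make the split between Actions and Sinks more concrete, here is a small sketch in which an enrichment step attaches context to a notification before it is fanned out to several destinations. Notification, enrich_with_logs and RecordingSink are hypothetical names invented for this example, and the sink only records text instead of calling a real service.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Notification:
    """Hypothetical notification that Actions enrich before delivery."""
    title: str
    context: List[str] = field(default_factory=list)


def enrich_with_logs(notification: Notification, fetch_logs: Callable[[], str]) -> None:
    """Hypothetical enrichment Action: attach recent log output as context."""
    notification.context.append(f"logs: {fetch_logs()}")


class RecordingSink:
    """Stand-in for a real destination (chat, Web-hook, ...); it only records text."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.delivered: List[str] = []

    def write(self, notification: Notification) -> None:
        body = "; ".join(notification.context)
        self.delivered.append(f"[{self.name}] {notification.title} ({body})")


notification = Notification(title="Pod api-0 is crash-looping")
# The log fetcher is injected so the sketch needs no cluster access.
enrich_with_logs(notification, fetch_logs=lambda: "OOMKilled")

sinks = [RecordingSink("chat"), RecordingSink("webhook")]
for sink in sinks:
    sink.write(notification)
```

Every sink receives the same enriched notification, which is the property that makes enrichment independent of the delivery channel.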
It is its broad set of Actions that makes the engine quite an interesting tool, as users can, for example, add valuable context to their event notifications and even run some custom code to perform certain operations to investigate things further. The system also provides means to help profile and debug Python and Java processes without the need to additionally instrument the code in any way.
Now let’s jump right into the implementation details.
Forwarder
How does Robusta get events from the Kubernetes API? You might think that the system utilizes the watch API routes directly, but that is not the case. Instead, it relies on Kubewatch, a dedicated service that reads events and forwards them to one or more preconfigured target destinations, such as Slack or a Web-hook. But this sounds like what Robusta does, doesn’t it? Yes, but Robusta gives you much more with the various Actions it can perform upon certain noteworthy events. Let’s take a look at the Kubewatch configuration:
helm/robusta/templates/kubewatch-configmap.yaml#L1
As you can see, the only handler, that is, the only target destination that Kubewatch forwards its events to, is Robusta’s Runner component. On the diagram above the Kubewatch instance is called Forwarder. It used to be maintained by Bitnami / VMware, but now the project is under the robusta-dev organization on GitHub.
The other part of the Forwarder’s configuration is a set of flags to tell which Kubernetes resources to listen to.
helm/robusta/kubewatch.yaml#L1
Most of the resource types are enabled, allowing Robusta to granularly track changes within a Kubernetes cluster.
It may be worth noting that Kubewatch seems neither to guarantee event delivery nor to support any sort of persistence, including cursors to resume reading from after a failure. Although this may not be critical in the context of an observability system, it is still possible to miss important events.
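The failure mode can be illustrated with a toy forwarder that sends each event exactly once and keeps no cursor. This is not Kubewatch's code, only a sketch of the behavior assumed above; forward_all and flaky_handler are hypothetical names.

```python
from typing import Callable, Dict, List

Event = Dict[str, str]


def forward_all(events: List[Event], handler: Callable[[Event], None]) -> int:
    """Forward each event once, with no retry and no persisted cursor."""
    delivered = 0
    for event in events:
        try:
            handler(event)
            delivered += 1
        except ConnectionError:
            # Assumed behavior: the event is dropped and never replayed.
            pass
    return delivered


received: List[str] = []


def flaky_handler(event: Event) -> None:
    """Simulates a receiver that is briefly unavailable for one event."""
    if event["id"] == "2":
        raise ConnectionError("receiver unavailable")
    received.append(event["id"])


events = [{"id": "1"}, {"id": "2"}, {"id": "3"}]
delivered_count = forward_all(events, flaky_handler)
```

Event 2 is lost for good: nothing records that it was undelivered, so there is no way to replay it once the receiver is back.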
We briefly looked into the Forwarder component. Let’s find out where forwarded events end up.
Runner
Robusta’s Runner component is written in synchronous Python. It spawns an HTTP server using Flask, a popular Web application framework. Here’s its initialization sequence:
src/robusta/runner/main.py#L21
Some key objects are initialized, in particular instances of Registry, ConfigLoader, PlaybookEventHandlerImpl, and Web. We are going to cover each of these in detail.
First in line, Registry, is just a configuration class that holds references to certain collections of objects, such as Actions, Sinks and Playbooks. These collections are represented by the ActionsRegistry, SinksRegistry and PlaybooksRegistry classes respectively.
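As a rough mental model, such a holder class could look like the sketch below. The class names mirror the ones mentioned in the text, but the fields and constructor are assumptions made for illustration and do not reproduce the contents of src/robusta/model/config.py.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ActionsRegistry:
    # Assumed shape: action name -> callable.
    actions: Dict[str, Any] = field(default_factory=dict)


@dataclass
class SinksRegistry:
    # Assumed shape: sink name -> sink object.
    sinks: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PlaybooksRegistry:
    # Assumed shape: a flat list of playbook definitions.
    playbooks: List[Any] = field(default_factory=list)


@dataclass
class Registry:
    """Holds references to the three collections; contains no logic of its own."""
    actions: ActionsRegistry = field(default_factory=ActionsRegistry)
    sinks: SinksRegistry = field(default_factory=SinksRegistry)
    playbooks: PlaybooksRegistry = field(default_factory=PlaybooksRegistry)


registry = Registry()
# A loader can later swap in a freshly built collection as a whole.
registry.playbooks = PlaybooksRegistry(playbooks=["restart-on-crash"])
```

Replacing a whole collection at once, rather than mutating it in place, keeps readers from observing a half-updated collection.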
src/robusta/model/config.py#L152
The ConfigLoader class is responsible for reading the YAML configuration file, loading Playbooks from the file system, and reloading them in case they change. In general, classes with multiple responsibilities should be avoided, as they increase complexity and therefore decrease maintainability. This class may be thought of as a Registry factory, but the naming does not seem to match. Let’s look inside its implementation:
src/robusta/runner/config_loader.py#L42
We are deliberately skipping some of its parts, which mainly cover reading files from the underlying file system and importing Python modules from built-in and custom Playbooks. Specifically, the __load_runner_config and __load_playbook_repos methods are the ones that do that.
As you can see in the __init__ method, there are two FileSystemWatcher instances which receive references to the __reload_playbook_packages method. This method sets the key attributes of the associated Registry object. Since this is synchronous Python, we can assume that the FileSystemWatcher class employs threading to watch the underlying file system and dispatches calls to __reload_playbook_packages on any relevant changes.
Since the bound method is shared between the instances of FileSystemWatcher, there’s a reentrant lock under the reload_lock attribute that ultimately governs access to the Registry instance. Without the lock, a race condition could occur, as the threads could concurrently set those attributes with potentially inconsistent values.
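The locking pattern can be sketched as follows. Only the reload_lock attribute name comes from the text; Loader, the simplified Registry, reload and _record are hypothetical stand-ins, and plain threads play the role of the watcher instances.

```python
import threading
from typing import List, Tuple


class Registry:
    """Simplified stand-in with two attributes that must change together."""

    def __init__(self) -> None:
        self.playbooks = "v0"
        self.actions = "v0"


class Loader:
    def __init__(self, registry: Registry) -> None:
        self.registry = registry
        self.reload_lock = threading.RLock()
        self.snapshots: List[Tuple[str, str]] = []

    def reload(self, version: str) -> None:
        """Shared callback: updates both attributes as one critical section."""
        with self.reload_lock:
            self.registry.playbooks = version
            self.registry.actions = version
            self._record()

    def _record(self) -> None:
        # Re-acquiring the lock in the same thread works because RLock is
        # reentrant; a plain Lock would deadlock here.
        with self.reload_lock:
            self.snapshots.append((self.registry.playbooks, self.registry.actions))


loader = Loader(Registry())
# Each thread receives the same bound method, like the watcher instances do.
threads = [
    threading.Thread(target=loader.reload, args=(f"v{i}",)) for i in range(1, 9)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

consistent = all(playbooks == actions for playbooks, actions in loader.snapshots)
```

With the lock in place every snapshot pairs matching versions. Without it, one thread could record a snapshot between another thread's two assignments and observe mismatched values.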
Before we peek into FileSystemWatcher’s implementation, it is worth noting that __reload_playbook_packages has at least two areas that could be improved: