Writing Prometheus Alert Rules with Nickel

by Pim Snel on 20 Jun 2023

This blog post gives a short demonstration of how to convert Prometheus rules written in YAML into compact Nickel code that generates the YAML.

One of our clients gave us the task of setting up their Grafana Cloud with tens of dashboards and Prometheus alert rules. Similar to the Infrastructure as Code principle, we use the Dashboards as Code approach here and for Prometheus alerts, well, Alerts as Code.

Grafana defines dashboards in JSON, but we wrote the dashboard definitions in Jsonnet with the Grafonnet library. This spared us from writing long, redundant code, and it is the approach Grafana recommends.

Prometheus alert rules are written as YAML files, and they contain a lot of repetition. Writing that much code by hand is tedious and error-prone, and redundant configuration quickly leads to configuration drift. By generating the code, you work towards a single source of truth.

“Configuration drift occurs when a standardized group of IT resources, be they virtual servers, standard router configurations in VNF deployments, or any other deployment group that is built from a standard template, diverge in configuration over time. … The Infrastructure as Code methodology from DevOps is designed to combat Configuration Drift and other infrastructure management problems.”

Kemp Technologies on Configuration Drift

There are several ways to generate YAML code, but as a Nix and NixOS enthusiast, I decided to experiment with the new programming language Nickel. This language comes from the Nix community, and the 1.0 version of Nickel was officially announced by Tweag, the company that is developing it, in May 2023.

Let’s get started!

The old way

Below is an example of two Prometheus alert rules as they were originally written. We’ll work in a few steps towards a version in Nickel that generates comparable, or even better, YAML.

groups:
  - name: mycompany
    rules:
      - alert: aws_documentdb_freeable_memory_low
        expr: |-
            16 / (aws_docdb_freeable_memory_average{dbinstance_identifier="tf-created-AABBCCDD"}
            / (1024^3)) * 100 < 25
        for: 10m
        labels:
          team: devops
          platform: aws
        annotations:
          description: 'DocumentDB: the instance has less than 25% freeable memory.'
          summary: DocumentDB has low freeable memory
          jira_ticket: aws-devops
          severity: 2

      - alert: aws_documentdb_cpu_usage_too_high
        expr: aws_docdb_cpuutilization_average > 90
        for: 10m
        labels:
          team: devops
          platform: aws
        annotations:
          summary: "DocumentDB is reaching its cpu limit"
          description: "DocumentDB: the instance is using more than 90% of the cpu assigned"
          jira_ticket: aws-devops
          severity: 2

You can see a lot of duplicated information at first glance. This includes all the key names, as well as meta information, Jira tickets, labels, etc… Above, you see two alerts, but in the actual situation, there are dozens, if not hundreds, of alert definitions. There is definitely room for improvement here.

Now, we will take a few steps to convert this Prometheus YAML into smarter Nickel source code.

Step 1: YAML to Flat Nickel

Let’s start by simply converting the YAML to Nickel so that we can then export it back to YAML.

{
  groups = [
    {
      name = "mycompany",
      rules = [
        {
          alert = "aws_documentdb_freeable_memory_low",
          expr = m%"
            16 /
            (aws_docdb_freeable_memory_average{dbinstance_identifier="tf-created-AABBCCDD"} / (1024^3))
            * 100 < 25
          "%,
          for = "10m",
          labels = {
            team = "devops",
            platform = "aws"
          },
          annotations = {
            description = "DocumentDB: the instance has less than 25% freeable memory.",
            summary = "DocumentDB has low freeable memory",
            jira_ticket = "aws-devops",
            severity = 2,
          }
        },
        {
          alert = "aws_documentdb_cpu_usage_too_high",
          expr = "aws_docdb_cpuutilization_average > 90",
          for = "10m",
          labels = {
            team = "devops",
            platform = "aws"
          },
          annotations = {
            summary = "DocumentDB is reaching its cpu limit",
            description = "DocumentDB: the instance is using more than 90% of the cpu assigned",
            jira_ticket = "aws-devops",
            severity = 2
          }
        }
      ]
    }
  ]
}

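Above is a static version of the YAML in Nickel. It is easily recognizable: a bit like JSON, but with = instead of :. You also immediately see an example of how to spread a string over multiple lines using the m%"..."% multiline syntax. Multiline strings are raw, so a line break is written as an actual line break rather than as \n, and the indentation common to all lines is stripped.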
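As a small, hedged illustration of the two string forms: the field names and the shortened metric name free_bytes below are made up for this sketch and do not come from the client's configuration. Going by the Nickel manual's description of multiline strings, both fields should end up holding the same two-line PromQL text.

```nickel
# Sketch only: the field names and the metric name free_bytes are hypothetical.
{
  # A regular string, with an explicit \n escape for the line break.
  single_quoted = "free_bytes / (1024^3)\n< 4",

  # A multiline string: the line breaks are literal, and the indentation
  # shared by all lines is removed from the result.
  multi_line = m%"
    free_bytes / (1024^3)
    < 4
  "%,
}
```

The multiline form is the more comfortable choice for longer PromQL expressions, because the expression can be indented along with the surrounding code without that indentation leaking into the generated YAML.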

If you save this file as b01-prometheus-rules.ncl, you can export it to YAML using the command:

nickel -f b01-prometheus-rules.ncl export --format yaml

Of course, you can save the output in a file by adding > prometheus-rules.yml to the command line:

nickel -f b01-prometheus-rules.ncl export --format yaml > prometheus-rules.yml

Haven’t installed Nickel yet? Read the Getting Started page on nickel-lang.org
Are you a Mac user? brew install nickel.

Now we’ll take a big step by introducing a function.

Step 2: A function for the alert definition

We create a function that takes as input the name, the short and long texts describing the issue, the duration, the severity, and the expression. Then we call the function in the rules block. This code, too, can be converted to YAML using the nickel command.

let alert_rule = fun name problem_txt_short problem_txt_long duration sev expression => {
  alert = name,
  expr = expression,
  for = duration,
  labels = {
    team = "devops",
    platform = "aws"
  },
  annotations = {
    summary = "%{problem_txt_short}",
    description = "%{problem_txt_long}",
    severity = std.string.from_number sev,
  }
} in

{
  groups = [
    {
      name = "mycompany",
      rules = [
        alert_rule "documentdb_freeable_memory_low" "DocumentDB freeable memory low" "DocumentDB: the instance has less than 25% freeable memory." "10m" 2 m%"
          16 /
          (aws_docdb_freeable_memory_average{dbinstance_identifier="tf-created-AABBCCDD"} / (1024^3))
          * 100 < 25
        "%,

        alert_rule "documentdb_cpu_usage_too_high" "DocumentDB cpu usage too high" "DocumentDB: the instance is using more than 90% of the cpu assigned" "10m" 2 m%"
          aws_docdb_cpuutilization_average > 90
        "%
      ]
    }
  ]
}

Let’s puzzle out what’s happening here. The construct let alert_rule = fun ... => {...} in defines a function and binds it to the variable alert_rule. The function returns a record, which is Nickel’s term for what other languages call an object or a dictionary. Then the actual construction of the output begins. In the block rules = [...], the function is called twice, once for each alert definition.
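To see the let ... = fun ... => ... in pattern in isolation, here is a minimal sketch. The function make_labels, its parameters, and the values "backend" and "gcp" are hypothetical and only serve as illustration.

```nickel
# Sketch only: make_labels and its parameter names are made up for this example.
# The parameter names deliberately differ from the field names: Nickel records
# are recursive, so writing `team = team` would make the field refer to itself.
let make_labels = fun team_name platform_name => {
  team = team_name,
  platform = platform_name,
} in

{
  # A function is applied by writing its arguments after it, separated by spaces.
  first = make_labels "devops" "aws",
  second = make_labels "backend" "gcp",
}
```

This is also why alert_rule uses parameter names such as name, expression, and duration for the fields alert, expr, and for.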

The original YAML was 28 lines, the first Nickel conversion was 43 lines, and now we’re back to 33 lines. With one or two more alert definitions, our file will already be more compact than the originally written YAML.

The way functions are defined in Nickel takes some getting used to. Nickel is inspired by functional languages like Haskell and Nix. People with a strong sense of mathematics usually feel at home in functional languages. Personally, I’m more of a humanities person and fond of languages with a lot of syntactic sugar, like my favorite language, Ruby.

In the next step, we’ll make one final optimization to make the life of the DevOps engineer even easier.

Step 3: Dotting the i’s

If you look at the texts, you’ll notice that the alert name and the short and long descriptions have many recurring words. Actually, the alert name could be generated based on a service name plus the short description. The service name can also be reused in the short and long descriptions.

let normalize_string = fun instring => std.string.replace_regex "[%,.]+" "" (std.string.replace " " "_" (std.string.lowercase instring)) in

let alert_rule = fun service_name problem_txt_short problem_txt_long duration sev expression => {

  alert = "%{normalize_string service_name}_%{normalize_string problem_txt_short}",
  expr = expression,
  for = duration,
  labels = {
    team = "devops",
    platform = "aws"
  },
  annotations = {
    summary = "%{service_name} is unhealthy. %{problem_txt_short}",
    description = "%{service_name} is unhealthy. %{problem_txt_long} For at least %{duration}.",
    severity = std.string.from_number sev,
  }
} in

{
  groups = [
    {
      name = "mycompany",
      rules = [
        alert_rule "DocumentDB" "freeable memory low" "The instance has less than 25% freeable memory." "10m" 2 m%"
          16 /
          (aws_docdb_freeable_memory_average{dbinstance_identifier="tf-created-AABBCCDD"} / (1024^3))
          * 100 < 25
        "%,

        alert_rule "DocumentDB" "cpu usage too high" "The instance is using more than 90% of the cpu assigned." "10m" 2 m%"
          aws_docdb_cpuutilization_average > 90
        "%
      ]
    }
  ]
}

We have introduced a new function, normalize_string, which we use to generate the name of the alert: it makes the text lowercase, replaces spaces with underscores, and strips percent signs, commas, and periods. We also reuse the service name DocumentDB in the issue descriptions. These small improvements result in shorter function calls and more consistent and useful alert information.
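The following sketch applies normalize_string to a few inputs so the effect of each step is visible. The field names and the third input string are made up for illustration; the results in the comments follow from applying the three steps by hand.

```nickel
# Sketch only: the field names and the input "Disk full, 95%." are hypothetical.
let normalize_string = fun instring =>
  std.string.replace_regex "[%,.]+" "" (std.string.replace " " "_" (std.string.lowercase instring))
in

{
  service = normalize_string "DocumentDB",          # "documentdb"
  problem = normalize_string "freeable memory low", # "freeable_memory_low"
  stripped = normalize_string "Disk full, 95%.",    # "disk_full_95"

  # The same interpolation that alert_rule uses to build the alert name.
  alert = "%{service}_%{problem}",                  # "documentdb_freeable_memory_low"
}
```

Note that only the characters listed in the regular expression are removed, so other symbols in a description would still end up in the alert name.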

Conclusion

By rewriting our Prometheus alerts in a few steps to Nickel, we have achieved various improvements.

Our source code has become more compact and has almost no redundant characters or text.
We have largely eliminated duplicate source code.
Writing new alerts is much faster.
The source code is less prone to typing errors that can occur when copying and pasting previous definitions.
By using a function, we run less risk of configuration drift.

We could have also used Python or another general-purpose programming language. However, this requires a lot of boilerplate code and also requires quite a bit of documentation to get others up to speed in the workflow.

Nickel is specifically designed to generate JSON and YAML configuration files in an efficient manner. Conversion to YAML requires a single command and can easily be included in a Makefile or Pipeline.

Of course, we can still improve the above further. We can create a pipeline with a Nickel linter, we can import the functions from a shared library file so that other projects can also use them, etc.
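As a rough sketch of the shared-library idea: the helper functions can be collected in a record and accessed through it. To keep the example runnable as a single file, the record is bound with let here; in a shared setup it would live in its own file and be loaded with Nickel's import keyword. The file name alerts-lib.ncl and the name lib are assumptions made for this sketch.

```nickel
# Sketch only: in a shared setup, the record bound to `lib` would be the whole
# content of a separate file (say, a hypothetical alerts-lib.ncl) and this
# binding would become:  let lib = import "alerts-lib.ncl" in
let lib = {
  normalize_string = fun instring =>
    std.string.replace_regex "[%,.]+" "" (std.string.replace " " "_" (std.string.lowercase instring)),

  # Fields of the same record can refer to each other, so alert_rule can call
  # normalize_string directly.
  alert_rule = fun service_name problem_txt_short problem_txt_long duration sev expression => {
    alert = "%{normalize_string service_name}_%{normalize_string problem_txt_short}",
    expr = expression,
    for = duration,
    labels = {
      team = "devops",
      platform = "aws"
    },
    annotations = {
      summary = "%{service_name} is unhealthy. %{problem_txt_short}",
      description = "%{service_name} is unhealthy. %{problem_txt_long} For at least %{duration}.",
      severity = std.string.from_number sev,
    }
  },
} in

{
  groups = [
    {
      name = "mycompany",
      rules = [
        lib.alert_rule "DocumentDB" "cpu usage too high" "The instance is using more than 90% of the cpu assigned." "10m" 2 "aws_docdb_cpuutilization_average > 90"
      ]
    }
  ]
}
```

With this layout, each project only contains its own list of lib.alert_rule calls, while fixes to the helper functions are made in one place.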

What I wanted to show is that in many cases, it is wise to generate YAML or JSON configurations using a language like Nickel. Personally, I have become quite fond of the syntax of Nickel after playing with it, and I will definitely use it more in the future.

If you see an error or would like to respond for another reason, please use the form below.

You can download the files from this tutorial in the git-repo that belongs to this article.

Update with code improvements

update: July 5 2023

Super cool! Yann Hamdaoui, the lead developer of Nickel, responded to this blog article with a number of suggestions to improve the Nickel code examples. Yann’s suggestions showcase more advanced features of Nickel, such as using the reverse application operator |>, type casting, converting functions into bare records, using contracts, and he also suggests using a record as a function parameter instead of separate parameters.

Below, I present a final refactoring in which I apply the reverse application operator and pass a record as the function parameter instead of separate parameters. To keep this article readable for beginners, I will not implement Yann’s other suggestions in this blog post. I do highly recommend reading Yann’s response on the Nickel project page to learn about the other improvements.

# Using the "reverse application operator"
let normalize_string = fun instring =>
  instring
  |> std.string.lowercase
  |> std.string.replace " " "_"
  |> std.string.replace_regex "[%,.]+" ""
in

let alert_rule = fun {
  service_name,
  problem_txt_short,
  problem_txt_long,
  duration,
  sev,
  expression
} => {
  alert = "%{normalize_string service_name}_%{normalize_string problem_txt_short}",
  expr = expression,
  for = duration,
  labels = {
    team = "devops",
    platform = "aws"
  },
  annotations = {
    summary = "%{service_name} is unhealthy. %{problem_txt_short}",
    description = "%{service_name} is unhealthy. %{problem_txt_long} For at least %{duration}.",
    severity = std.string.from_number sev,
  }
} in

{
  groups = [
    {
      name = "mycompany",
      rules = [
        alert_rule {
          service_name = "DocumentDB",
          problem_txt_short = "freeable memory low",
          problem_txt_long = "The instance has less than 25% freeable memory.",
          duration = "10m",
          sev = 2,
          expression = m%"
            16 /
            (aws_docdb_freeable_memory_average{dbinstance_identifier="tf-created-AABBCCDD"} / (1024^3))
            * 100 < 25
          "%,
        },
        alert_rule {
          service_name = "DocumentDB",
          problem_txt_short = "cpu usage too high",
          problem_txt_long = "The instance is using more than 90% of the cpu assigned.",
          duration = "10m",
          sev = 2,
          expression = "aws_docdb_cpuutilization_average > 90",
        }
      ]
    }
  ]
}