Friday, June 20, 2025

Unified Alerting in Grafana: Codifying Your Monitoring Stack with Helm and GitFlow


Infrastructure has shifted. You don’t manually click your way to production anymore—everything’s code. But somehow, alerting often lags behind. Grafana’s Unified Alerting model, introduced in version 8, changes that. Now, with proper use of values.yaml and Helm, you can manage alert rules, routing logic, contact points, and data sources like you would any other critical system component: version-controlled, peer-reviewed, and auto-deployed.

This post walks through how to declaratively manage your entire Grafana alerting pipeline with Helm—tying it into a GitFlow CI/CD strategy to lock in reliability, reproducibility, and observability from day one.


Why Move Alerting Configs into values.yaml?

This isn’t about preference—it’s about control. Treating alerting as code gives you:

  • Consistency: No more config drift between staging and prod.

  • Version Control: Rollbacks, blame history, diffs—all standard Git workflows.

  • Automation: CI/CD pipelines apply changes the moment they’re merged.

If it’s important enough to alert on, it’s important enough to store in Git.


Build It First in Grafana UI—Then Export

Grafana’s GUI is still the fastest way to prototype alert logic. Create and test your alert in the UI, then export it:

  • Navigate to Alerts & IRM → Alert Rules

  • Click the alert or group

  • Choose Export → YAML

Drop the result into your Helm values.yaml, commit, and move on.
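
For reference, the export uses Grafana's file-provisioning format, which is what the Helm chart ultimately feeds to Grafana. A trimmed sketch of what comes out (the UID and the abbreviated data block are placeholders) looks roughly like this:

apiVersion: 1
groups:
  - orgId: 1
    name: rds_alerts_group
    folder: aws_alerts
    interval: 5m
    rules:
      - uid: cpu-utilization-alert   # placeholder; Grafana generates one on export
        title: CPU Utilization Alert
        condition: B
        data: []                     # query and expression nodes, trimmed here
        for: 5m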


Grafana Configuration Blocks in values.yaml

Below is a breakdown of the core blocks and how they map to Grafana's Unified Alerting model:


datasources — Where Grafana Pulls Data From

This block defines the backends Grafana queries for metrics—Prometheus, CloudWatch, etc.

datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus-server.prometheus.svc.cluster.local
        isDefault: true
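
The alert rule examples later in this post reference a CloudWatch datasource via datasourceUid: cloudwatch, so if you query AWS metrics you would provision that datasource as well, with a matching uid. A minimal sketch, assuming IAM-role authentication and us-east-1 (both are placeholders, not requirements):

datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: CloudWatch
        type: cloudwatch
        uid: cloudwatch              # must match the datasourceUid used in alert rules
        access: proxy
        jsonData:
          authType: default          # pick up credentials from the instance/IRSA role
          defaultRegion: us-east-1   # placeholder; set your own region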

contactPoints — Where Alerts Are Sent

Defines alert destinations: Slack, PagerDuty, email, etc.

contactPoints:
  - name: slack
    orgId: 1
    receivers:               # label matching happens in policies, not in contact points
      - uid: slack-notify
        type: slack
        settings:
          recipient: '#infoservices-alerts-prod'
          url: '<slack_webhook_url>'
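
Other receiver types follow the same shape; only the settings keys change. For example, an email contact point might look like this (the address is a placeholder):

contactPoints:
  - name: email-oncall
    orgId: 1
    receivers:
      - uid: email-notify
        type: email
        settings:
          addresses: oncall@example.com   # placeholder; separate multiple addresses with ';'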

policies — Routing Logic

Think of this like Alertmanager’s routes. It controls how alerts are grouped and where they go.

policies:
  - orgId: 1
    receiver: slack
    group_by:
      - grafana_folder
      - alertname
    routes:
      - receiver: pagerduty
        object_matchers:
          - - severity
            - =
            - critical
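
Policies also accept the familiar Alertmanager-style timing fields, and routes can match on any label you attach to a rule. A hedged example (the intervals and the team label are illustrative):

policies:
  - orgId: 1
    receiver: slack
    group_by:
      - grafana_folder
      - alertname
    group_wait: 30s        # delay before the first notification for a new group
    group_interval: 5m     # delay between notifications for an existing group
    repeat_interval: 4h    # re-notify while the alert keeps firing
    routes:
      - receiver: pagerduty
        object_matchers:
          - - severity
            - =
            - critical
      - receiver: slack
        object_matchers:
          - - team
            - =
            - sre_team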

groups — Bundling Related Alerts

Logical groupings of alert rules by function, service, or folder. These groups get evaluated at a set interval.

groups:
  - name: rds_alerts_group
    folder: aws_alerts
    interval: 5m
    orgId: 1
    rules: []
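
Since a group is just a folder, a name, and an evaluation cadence, it is natural to carve one out per service. For instance (the names and intervals are illustrative):

groups:
  - name: rds_alerts_group
    folder: aws_alerts
    interval: 5m          # every rule in this group is evaluated every 5 minutes
    orgId: 1
    rules: []
  - name: ec2_alerts_group
    folder: aws_alerts
    interval: 1m          # a tighter cadence for a different service
    orgId: 1
    rules: []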

rules — The Alert Logic

Here’s where the actual monitoring happens. Set up query logic, thresholds, durations, and labels.

rules:
  - title: CPU Utilization Alert
    condition: B   # must reference a refId that exists in data; B is the threshold expression
    data:
      - refId: A
        datasourceUid: cloudwatch
        model:
          metricName: CPUUtilization
          namespace: AWS/RDS
          period: "300"
      - refId: B
        datasourceUid: __expr__
        model:
          expression: A    # the query refId this threshold evaluates
          conditions:
            - evaluator:
                params: [80]
                type: gt
              operator:
                type: and
              query:
                params: [A]
          type: threshold
    for: 5m
    labels:
      team: sre_team
      alert_type: CPU_Utilization
    annotations:
      summary: High CPU Utilization Alert
      description: CPU usage > 80% for 5m on an RDS instance
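
Two optional fields are worth considering for production rules: noDataState and execErrState, which decide what the rule does when the query returns nothing or fails outright. A short sketch of where they sit:

rules:
  - title: CPU Utilization Alert
    condition: B
    data: []                 # query and expression nodes, same as above
    for: 5m
    noDataState: NoData      # or Alerting / OK, if missing data should fire the alert
    execErrState: Error      # or Alerting / OK, if a failed query should fire the alert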

CI/CD Workflow with GitFlow

Treat Grafana alerting like the rest of your stack:

  1. Feature Branch: Add or modify alert logic in a branch.

  2. Pull Request: Collaborate with the team. Get eyes on it.

  3. Merge: CI/CD picks up the change. Helm rolls it out automatically.

No click-ops. No config drift. No surprises.
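
What step 3 looks like depends on your CI system. As one possible shape, here is a minimal GitHub Actions sketch that rolls the chart out on merges to main; the workflow name, chart path, namespace, release name, and secret are assumptions, not anything the Grafana chart requires:

# .github/workflows/deploy-grafana-alerting.yaml (illustrative)
name: deploy-grafana-alerting
on:
  push:
    branches: [main]              # GitFlow: a merge to main triggers the prod rollout
jobs:
  helm-upgrade:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/setup-helm@v4
      - name: Configure cluster access
        run: echo "${{ secrets.KUBECONFIG_B64 }}" | base64 -d > kubeconfig   # assumption: base64-encoded kubeconfig stored as a secret
      - name: Deploy alerting config
        run: |
          # assumption: an umbrella chart that pulls in grafana/grafana, matching the grafana: key in values.yaml
          helm dependency update ./charts/monitoring
          helm upgrade --install monitoring ./charts/monitoring \
            --namespace monitoring --create-namespace \
            -f values.yaml \
            --kubeconfig kubeconfig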


TL;DR – Everything Declarative

Section         Role                    Highlights
datasources     Metric backends         Prometheus, CloudWatch, Elasticsearch, etc.
contactPoints   Alert receivers         Slack, PagerDuty, email, webhook endpoints
policies        Routing rules           Route by severity, service, labels
groups          Group alerts logically  Evaluated every N minutes, grouped by folder
rules           Alert conditions        Query + threshold + duration + metadata

Here’s a full values.yaml scaffolding for managing Grafana’s Unified Alerting via Helm. It includes example configurations for:

  • Datasources

  • Contact points

  • Notification policies

  • Alert groups

  • Alert rules

Each section is marked with a numbered comment to help guide edits inline.


grafana:
  enabled: true

  ## === 1. Datasources ===
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: Prometheus
          type: prometheus
          access: proxy
          url: http://prometheus-server.prometheus.svc.cluster.local
          isDefault: true

  ## === 2. Unified Alerting Config ===
  alerting:
    enabled: true
    unifiedAlerting: true

  ## === 3. Contact Points ===
  contactPoints:
    - name: slack
      orgId: 1
      receivers:
        - uid: slack-notify
          type: slack
          settings:
            recipient: '#infoservices-alerts-prod'
            url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'

    - name: pagerduty
      orgId: 1
      receivers:
        - uid: pagerduty-notify
          type: pagerduty
          settings:
            routing_key: '<pagerduty_integration_key>'

  ## === 4. Notification Policies ===
  policies:
    - orgId: 1
      receiver: slack
      group_by:
        - grafana_folder
        - alertname
      routes:
        - receiver: pagerduty
          object_matchers:
            - - severity
              - =
              - critical

  ## === 5. Alert Groups ===
  groups:
    - name: rds_alerts_group
      folder: aws_alerts
      interval: 5m
      orgId: 1
      rules:
        - title: CPU Utilization Alert
          condition: B   # references the threshold expression refId defined below
          data:
            - refId: A
              datasourceUid: cloudwatch
              model:
                namespace: AWS/RDS
                metricName: CPUUtilization
                region: us-east-1
                statistic: Average
                period: "300"
                dimensions:
                  DBInstanceIdentifier: my-rds-instance
            - refId: B
              datasourceUid: __expr__
              model:
                expression: A    # the query refId this threshold evaluates
                conditions:
                  - evaluator:
                      params: [80]
                      type: gt
                    operator:
                      type: and
                    query:
                      params: [A]
                type: threshold
          for: 5m
          labels:
            severity: critical
            team: sre_team
            alert_type: CPU_Utilization
          annotations:
            summary: RDS CPU > 80%
            description: CPU usage on RDS instance has exceeded 80% for 5 minutes.


Final Word

Grafana’s Unified Alerting engine lets you move fast without breaking observability. By managing alerts, contact points, and policies in Helm, then layering GitFlow on top, you get something most teams miss: repeatability. You know exactly what’s alerting, why, and who it notifies—because it’s all in code.

If your monitoring still lives in a UI somewhere, it’s time to promote it to the rest of your pipeline.

