Infrastructure has shifted. You don’t manually click your way to production anymore—everything’s code. But somehow, alerting often lags behind. Grafana’s Unified Alerting model, introduced in version 8, changes that. Now, with proper use of values.yaml
and Helm, you can manage alert rules, routing logic, contact points, and data sources like you would any other critical system component: version-controlled, peer-reviewed, and auto-deployed.
This post walks through how to declaratively manage your entire Grafana alerting pipeline with Helm—tying it into a GitFlow CI/CD strategy to lock in reliability, reproducibility, and observability from day one.
Why Move Alerting Configs into values.yaml?
This isn’t about preference—it’s about control. Treating alerting as code gives you:
- Consistency: No more config drift between staging and prod.
- Version Control: Rollbacks, blame history, diffs—all standard Git workflows.
- Automation: CI/CD pipelines apply changes the moment they’re merged.
If it’s important enough to alert on, it’s important enough to store in Git.
Build It First in Grafana UI—Then Export
Grafana’s GUI is still the fastest way to prototype alert logic. Create and test your alert in the UI, then export it:
- Navigate to Alerts & IRM → Alert Rules
- Click the alert or group
- Choose Export → YAML

Drop the result into your Helm values.yaml, commit, and move on.
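The exact export format varies slightly between Grafana versions, but a trimmed rule-group export generally looks like the sketch below. The UID is a placeholder, and the groups list is what ends up under the groups block described later:

# Hypothetical, trimmed example of an exported rule group
apiVersion: 1
groups:
  - orgId: 1
    name: rds_alerts_group
    folder: aws_alerts
    interval: 5m
    rules:
      - uid: cpu-utilization-alert   # placeholder UID
        title: CPU Utilization Alert
        condition: B
        data: []                     # query and expression nodes, trimmed here
        for: 5m
        labels:
          team: sre_team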
Grafana Configuration Blocks in values.yaml
Below is a breakdown of the core blocks and how they map to Grafana's Unified Alerting model:
datasources — Where Grafana Pulls Data From
This block defines the backends Grafana queries for metrics—Prometheus, CloudWatch, etc.
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus-server.prometheus.svc.cluster.local
        isDefault: true
contactPoints — Where Alerts Are Sent
Defines alert destinations: Slack, PagerDuty, email, etc. Matching on labels such as severity does not happen here; that belongs in policies, covered next.
contactPoints:
  - name: slack
    orgId: 1
    receivers:
      - type: slack
        settings:
          recipient: infoservices-alerts-prod
          url: '<slack_webhook_url>'
policies — Routing Logic
Think of this like Alertmanager’s routes. It controls how alerts are grouped and where they go.
policies:
  - orgId: 1
    receiver: slack
    group_by:
      - grafana_folder
      - alertname
    routes:
      - receiver: pagerduty
        object_matchers:
          # each matcher is a [label, operator, value] triple
          - - severity
            - =
            - critical
groups — Bundling Related Alerts
Logical groupings of alert rules by function, service, or folder. These groups get evaluated at a set interval.
groups:
  - name: rds_alerts_group
    folder: aws_alerts
    interval: 5m
    orgId: 1
    rules: []
rules — The Alert Logic
Here’s where the actual monitoring happens. Set up query logic, thresholds, durations, and labels.
rules:
  - title: CPU Utilization Alert
    condition: B                  # refId of the expression that decides the alert state
    data:
      - refId: A                  # CloudWatch query
        datasourceUid: cloudwatch
        model:
          metricName: CPUUtilization
          namespace: AWS/RDS
          period: "300"
      - refId: B                  # threshold expression evaluated against query A
        datasourceUid: __expr__
        model:
          conditions:
            - evaluator:
                params: [80]
                type: gt
              operator:
                type: and
              query:
                params: [A]
          expression: A
          type: threshold
    for: 5m
    labels:
      team: sre_team
      alert_type: CPU_Utilization
    annotations:
      summary: High CPU Utilization Alert
      description: CPU usage > 80% for 5m on an RDS instance
CI/CD Workflow with GitFlow
Treat Grafana alerting like the rest of your stack:
- Feature Branch: Add or modify alert logic in a branch.
- Pull Request: Collaborate with the team. Get eyes on it.
- Merge: CI/CD picks up the change. Helm rolls it out automatically.
No click-ops. No config drift. No surprises.
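As a sketch only, here is what that flow might look like as a GitHub Actions workflow: lint and render the chart on pull requests, then apply it once the merge lands. The chart path, release name, namespace, and branch names are assumptions for illustration, and it presumes the runner already has credentials for the target cluster:

# Hypothetical .github/workflows/grafana-alerting.yaml
name: grafana-alerting
on:
  pull_request:
    paths: ["charts/monitoring/**"]
  push:
    branches: [main]                  # or your GitFlow release branch
    paths: ["charts/monitoring/**"]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/setup-helm@v4

      # Validate the chart on every change
      - name: Lint and render
        run: |
          helm lint charts/monitoring
          helm template grafana charts/monitoring -f charts/monitoring/values.yaml > /dev/null

      # Apply only after the change is merged
      - name: Deploy
        if: github.event_name == 'push'
        run: |
          helm upgrade --install grafana charts/monitoring \
            -f charts/monitoring/values.yaml \
            --namespace monitoring --create-namespace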
TL;DR – Everything Declarative
Section | Role | Highlights |
---|---|---|
datasources | Metric backends | Prometheus, CloudWatch, Elasticsearch, etc. |
contactPoints | Alert receivers | Slack, PagerDuty, email, webhook endpoints |
policies | Routing rules | Route by severity, service, labels |
groups | Group alerts logically | Evaluated every N minutes, grouped by folder |
rules | Alert conditions | Query + threshold + duration + metadata |
Here’s a full values.yaml scaffold for managing Grafana’s Unified Alerting via Helm. It includes example configurations for:
- Datasources
- Contact points
- Notification policies
- Alert groups
- Alert rules
Comments mark each section and the key fields to help guide edits inline.
grafana:
  enabled: true

  ## === 1. Datasources ===
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: Prometheus
          type: prometheus
          access: proxy
          url: http://prometheus-server.prometheus.svc.cluster.local
          isDefault: true

  ## === 2. Unified Alerting Config ===
  alerting:
    enabled: true
    unifiedAlerting: true

  ## === 3. Contact Points ===
  contactPoints:
    - name: slack
      orgId: 1
      receivers:
        - uid: slack-notify
          type: slack
          settings:
            recipient: '#infoservices-alerts-prod'
            url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
    - name: pagerduty
      orgId: 1
      receivers:
        - uid: pagerduty-notify
          type: pagerduty
          settings:
            routing_key: '<pagerduty_integration_key>'

  ## === 4. Notification Policies ===
  policies:
    - orgId: 1
      receiver: slack                  # default route
      group_by:
        - grafana_folder
        - alertname
      routes:
        - receiver: pagerduty          # escalate critical alerts
          object_matchers:
            - - severity
              - =
              - critical

  ## === 5. Alert Groups ===
  groups:
    - name: rds_alerts_group
      folder: aws_alerts
      interval: 5m                     # evaluation interval
      orgId: 1
      rules:
        - title: CPU Utilization Alert
          condition: B                 # refId of the expression that decides the alert state
          data:
            - refId: A                 # CloudWatch query
              datasourceUid: cloudwatch
              model:
                namespace: AWS/RDS
                metricName: CPUUtilization
                region: us-east-1
                statistic: Average
                period: "300"
                dimensions:
                  DBInstanceIdentifier: my-rds-instance
            - refId: B                 # threshold expression evaluated against query A
              datasourceUid: __expr__
              model:
                conditions:
                  - evaluator:
                      params: [80]
                      type: gt
                    operator:
                      type: and
                    query:
                      params: [A]
                expression: A
                type: threshold
          for: 5m
          labels:
            severity: critical
            team: sre_team
            alert_type: CPU_Utilization
          annotations:
            summary: RDS CPU > 80%
            description: CPU usage on RDS instance has exceeded 80% for 5 minutes.
Final Word
Grafana’s Unified Alerting engine lets you move fast without breaking observability. By managing alerts, contact points, and policies in Helm, then layering GitFlow on top, you get something most teams miss: repeatability. You know exactly what’s alerting, why, and who it notifies—because it’s all in code.
If your monitoring still lives in a UI somewhere, it’s time to promote it to the rest of your pipeline.