I learned this lesson the hard way: discovering your Proxmox host is melting down shouldn’t be a surprise. Trust me, finding out your storage is 98% full at 2am is not fun.

This guide walks you through setting up proper monitoring (Grafana + Prometheus + AlertManager) in an LXC container that’ll ping you on Telegram before things catch fire. No agents to install on the host, just a clean API-based setup.

⏱️ Time to complete: 45-60 minutes (hands-on, grab coffee)

Why Bother Monitoring Proxmox?

Look, whether you’re running a homelab or production infrastructure, flying blind is asking for trouble. I’ve seen way too many “unexpected” crashes that would’ve been totally preventable with basic monitoring.

Here’s what actually happens when you don’t monitor:

The silent killers:

  • Disk failure / ZFS degraded (discovered when it’s too late)
  • Root filesystem at 100% (good luck SSH’ing in)
  • RAM/swap exhausted (everything grinds to a halt)
  • Some VM eating 100% CPU (but which one?)
  • Backups silently failing (for weeks… ask me how I know)
  • Node goes down after an update (at 3am, naturally)
  • Crypto-miner hijacked a container (yes, this happens)

The reality: These issues don’t announce themselves. Your storage doesn’t email you when it hits 90%. Your node doesn’t text you before it overheats. You find out when something breaks.

That’s why we monitor.

Designing the Monitoring System

graph TB
  subgraph proxmox["Proxmox VE Host<br/>192.168.100.4:8006"]
      vm1[VM/LXC]
      vm2[VM/LXC]
      vm3[VM/LXC]
      vmore[...]
  end

  subgraph grafana_stack["Grafana-Stack LXC<br/>192.168.100.40"]
      pve[PVE Exporter :9221<br/>Pulls metrics from Proxmox API]

      prometheus[Prometheus :9090<br/>- Scrapes metrics 15s interval<br/>- Stores data 15-day retention<br/>- Evaluates alert rules]

      grafana[Grafana :3000<br/>Dashboards]

      alertmanager[AlertManager :9093<br/>Notifications]
  end

  slack[Slack<br/>Critical]
  telegram[Telegram<br/>Operational]

  proxmox -->|HTTPS API<br/>read-only| pve
  pve -->|metrics| prometheus
  prometheus -->|queries| grafana
  prometheus -->|alerts| alertmanager
  alertmanager -->|critical alerts| slack
  alertmanager -->|operational alerts| telegram

  style proxmox fill:#e1f5ff,stroke:#0288d1,stroke-width:2px
  style grafana_stack fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
  style slack fill:#fff3e0,stroke:#ef6c00,stroke-width:2px
  style telegram fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
  style pve fill:#fff9c4,stroke:#f9a825
  style prometheus fill:#ffebee,stroke:#c62828
  style grafana fill:#e3f2fd,stroke:#1565c0
  style alertmanager fill:#fce4ec,stroke:#c2185b

How This Setup Works

I went with an LXC container instead of a full VM because why waste 4GB of RAM on another kernel? Here’s the architecture:

The Proxmox side: Uses a read-only API token (so even if someone somehow gets access, they can’t break anything). No agents to install, no kernel modules - just hit the HTTPS API on port 8006.

The monitoring LXC (192.168.100.40):

  • pve-exporter - Queries the Proxmox API each time Prometheus scrapes it (every 15 seconds), grabbing metrics for nodes, VMs, ZFS pools, backups, everything
  • Prometheus - Scrapes those metrics, keeps 15 days of history, checks if anything’s on fire
  • Grafana - Makes it all pretty with ready-made dashboards
  • AlertManager - Sends you notifications when things go sideways:
    • Critical stuff (node down, disk failure) → Slack
    • Operational warnings (high load, backup failed) → Telegram

What’s Good (and What’s Not)

Why I like this setup:

  • Simple - Just three services in one LXC
  • Lightweight - Uses maybe 2GB RAM total
  • Safe - Read-only API token can’t break anything
  • Free - Zero licensing costs
  • Beautiful dashboards out of the box

What could be better:

  • Single point of failure (the LXC goes down, monitoring’s gone - though you can enable HA)
  • Limited to 15 days history by default (fine for most cases, but you can extend it)
  • No built-in long-term storage (for that you’d need Thanos or VictoriaMetrics)
  • Exposed to your network (put it behind a reverse proxy if you’re paranoid)

For a homelab or small production setup? This is plenty. If you’re running 50+ nodes, you’ll want something beefier.

Prerequisites

This guide is based exactly on my local infrastructure:

  • Proxmox VE 9.1.1 installed and running
  • Debian 13 LXC template downloaded in Proxmox
  • Basic understanding of Linux commands
  • Telegram account (for alerts)
  • Slack workspace (optional, for critical alerts)

Step 1: Create LXC Container

Create an unprivileged Debian 13 LXC container for the Grafana stack.

Why LXC Instead of VM?

LXC containers are the optimal choice for this monitoring stack, offering significant resource efficiency without sacrificing functionality.

Key Benefits:

  • Lower overhead: LXC shares the host kernel, consuming ~50-70% less RAM than a VM (8GB LXC vs 12-14GB VM)
  • Faster performance: Near-native CPU performance without virtualization overhead
  • Quick startup: Container boots in 2-3 seconds vs 30-60 seconds for a VM
  • Smaller disk footprint: 50GB LXC vs 80-100GB VM (no separate OS kernel/modules)
  • Easy snapshots: Instant container snapshots for backup/rollback

No Feature Limitations:

  • All Grafana stack components (Grafana, Prometheus, Loki, AlertManager) run perfectly in LXC
  • Network services work identically to VMs
  • No Docker required (native package installations)
  • Full access to Proxmox API for metrics collection

Specifications:

  • VMID: 140
  • Hostname: grafana-stack
  • Template: Debian 13 standard
  • CPU: 4 cores
  • RAM: 8GB
  • Disk: 50GB (local-zfs)
  • Network: Static IP 192.168.100.40/24

# On Proxmox host
# Notes (inline comments after a trailing backslash would break the command,
# so they live up here):
#   --unprivileged 1       safer unprivileged container
#   --features nesting=0   nesting OFF - no Docker inside this container (no container-in-a-container)
#   --sshkeys              injects your workstation's public key(s) from the given file
#   --start 1              start the container once created
pct create 140 local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst \
  --hostname grafana-stack \
  --cores 4 \
  --memory 8192 \
  --swap 2048 \
  --rootfs local-zfs:50 \
  --net0 name=eth0,bridge=vmbr0,ip=192.168.100.40/24,gw=192.168.100.1 \
  --nameserver 8.8.8.8 \
  --unprivileged 1 \
  --features nesting=0 \
  --sshkeys /root/.ssh/your-key \
  --start 1
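
If you'd rather confirm from the shell before opening the UI, two quick checks (same VMID as above):

# Still on the Proxmox host
pct status 140   # should report: status: running
pct config 140   # shows the settings passed to pct create above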

Here is what you will see when you check the Proxmox UI:

grafana-lxc on proxmox

Step 2: Configure LXC Network

LXC containers don’t use cloud-init, so network configuration must be done manually.

# Access container console via Proxmox UI or:
pct enter 140

# Configure network
cat > /etc/network/interfaces << 'EOF'
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
    address 192.168.100.40
    netmask 255.255.255.0
    gateway 192.168.100.1
    dns-nameservers 8.8.8.8 8.8.4.4
EOF

# Restart networking
systemctl restart networking

# Verify
ip addr show eth0
ping -c 3 8.8.8.8

Step 3: Install Grafana Stack

Install all monitoring components using the automated bash script. You can download the grafana-stack-setup.sh script.

# Update system
apt update && apt upgrade -y

# Install dependencies
apt install -y apt-transport-https wget curl gnupg2 ca-certificates \
  python3 python3-pip unzip

# Download installation script
wget https://gist.githubusercontent.com/sule9985/fabf9e4ebcd9bd93019bd0a5ada5d827/raw/8c7c3f8bf5aa28bba4585142ec876a001b18f63a/grafana-stack-setup.sh
chmod +x grafana-stack-setup.sh

# Run installation
./grafana-stack-setup.sh

The script installs:

  • Grafana 12.3.0 - Visualization platform
  • Prometheus 3.7.3 - Metrics collection and storage
  • Loki 3.6.0 - Log aggregation
  • AlertManager 0.29.0 - Alert routing and notifications
  • Proxmox PVE Exporter 3.5.5 - Proxmox metrics collector

Installation takes about 5-10 minutes, and when it finishes you should see output like this in the terminal:

=============================================
VERIFYING INSTALLATION
=============================================


[STEP] Checking service status...

 grafana-server: running
 prometheus: running
 loki: running
 alertmanager: running
  ! prometheus-pve-exporter: not configured

[STEP] Checking network connectivity...

 Port 3000 (Grafana): listening
 Port 9090 (Prometheus): listening
 Port 3100 (Loki): listening
 Port 9093 (AlertManager): listening

[SUCCESS] All services verified successfully!

[SUCCESS] Installation completed successfully in 53 seconds!
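
You can re-run a quick health check at any time, independent of the script's own verification. A small sketch (service names assume the script's defaults listed above; the PVE exporter will show as inactive/failed until we configure it in Step 5):

# Check the state of each service the script installs
for svc in grafana-server prometheus loki alertmanager prometheus-pve-exporter; do
  printf '%-28s %s\n' "$svc" "$(systemctl is-active "$svc")"
done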

Step 4: Create Proxmox Monitoring User

Create a read-only user on Proxmox for the PVE Exporter to collect metrics.

# SSH to Proxmox host
ssh root@192.168.100.4

# Create monitoring user
pveum user add grafana-user@pve --comment "Grafana monitoring user"

# Assign read-only permissions
pveum acl modify / --users grafana-user@pve --roles PVEAuditor

# Create API token
pveum user token add grafana-user@pve grafana-token --privsep 0

# Save the token output!
# Example: 8a7b6c5d-1234-5678-90ab-cdef12345678

Important: Save the full token value - it’s only shown once!
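
Before leaving the Proxmox host, it's worth confirming the role assignment actually took (the output should show the PVEAuditor role on path /):

# Still on the Proxmox host: list effective permissions for the monitoring user
pveum user permissions grafana-user@pve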

Step 5: Configure PVE Exporter

Now we tell the exporter how to talk to Proxmox. SSH into your LXC:

# On grafana-stack LXC
ssh -i PATH_TO_YOUR_KEY root@192.168.100.40

# Edit PVE exporter configuration
nano /etc/prometheus-pve-exporter/pve.yml

Here’s the config - pay attention to the token part, this tripped me up the first time:

default:
  user: grafana-user@pve
  # Read-only monitoring user + API token created on the Proxmox host in Step 4.
  # Add the token details here:
  token_name: 'grafana-token'
  token_value: 'TOKEN_VALUE' # ⚠️ This only shows ONCE when you create it. If you lost it, make a new one.
  # OR use password:
  # password: "CHANGE_ME"
  verify_ssl: false # Self-signed cert? Set this to false or you'll get TLS errors

# Target Proxmox hosts
pve1:
  user: grafana-user@pve
  token_name: 'grafana-token'
  token_value: 'TOKEN_VALUE'
  verify_ssl: false
  target: https://192.168.100.4:8006 # Your Proxmox host IP + port 8006 (HTTPS, not HTTP!)

Pro tip: The token only appears once when you create it in Proxmox. If you closed the window without copying it… yeah, you’ll need to create a new one. Ask me how I know.

Fire it up:

# Start service
systemctl start prometheus-pve-exporter

# Verify it's actually running (not just "enabled")
root@grafana-stack:~# systemctl status prometheus-pve-exporter.service
 prometheus-pve-exporter.service - Prometheus Proxmox VE Exporter
     Loaded: loaded (/etc/systemd/system/prometheus-pve-exporter.service; enabled; preset: enabled)
     Active: active (running) since Sun 2025-11-23 11:22:06 +07; 4 days ago
 Invocation: 1c35a29336b346e8b553b74a4d8fc533
       Docs: https://github.com/prometheus-pve/prometheus-pve-exporter
   Main PID: 10509 (pve_exporter)
      Tasks: 4 (limit: 75893)
     Memory: 44.4M (peak: 45.2M)
        CPU: 27min 52.526s
     CGroup: /system.slice/prometheus-pve-exporter.service
             ├─10509 /usr/bin/python3 /usr/local/bin/pve_exporter --config.file=/etc/prometheus-pve-exporter/pve.yml --web.listen-address=0.0.0.0:9221
             └─10550 /usr/bin/python3 /usr/local/bin/pve_exporter --config.file=/etc/prometheus-pve-exporter/pve.yml --web.listen-address=0.0.0.0:9221

See “active (running)”? Good. If you see “failed” or errors about TLS, check your verify_ssl setting and make sure the Proxmox IP is correct.
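
You can also curl the exporter directly, mirroring the scrape Prometheus will perform in the next step. A quick check, assuming the exporter listens on 0.0.0.0:9221 as in the unit above and using the same Proxmox host/port as the rest of this guide:

# Ask the exporter for metrics exactly the way Prometheus will
curl -s "http://localhost:9221/pve?target=192.168.100.4:8006" | grep -m 5 '^pve_'
# Expect lines like: pve_up{id="node/pve"} 1.0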

Step 6: Configure Prometheus Scraping

Now we tell Prometheus where to grab the metrics from:

# Edit Prometheus config
nano /etc/prometheus/prometheus.yml

Add this job config to your scrape_configs section:

scrape_configs:
  # ──────────────────────────────────────────────────────────────
  # Proxmox VE monitoring via pve-exporter (runs inside the LXC)
  # ──────────────────────────────────────────────────────────────
  - job_name: 'proxmox' # Friendly name shown in Prometheus/Grafana
    metrics_path: '/pve' # Endpoint where pve-exporter serves Proxmox metrics
    params:
      target:
        ['192.168.100.4:8006'] # Your Proxmox node (or cluster) + GUI port
        # For multiple Proxmox nodes, add a separate scrape job per node (or use relabel_configs)
    static_configs:
      - targets: ['localhost:9221'] # Where pve-exporter is listening inside this LXC
        labels:
          service: 'proxmox-pve' # Custom label – helps filtering in Grafana
          instance: 'pve-host' # Logical name for your cluster/node

What’s happening here: Prometheus scrapes localhost:9221/pve (the exporter), which then queries your Proxmox API at 192.168.100.4:8006. It’s a proxy setup - Prometheus never talks directly to Proxmox.
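
Before restarting, it's worth letting promtool confirm the edited file still parses (promtool ships alongside Prometheus and is used again later in this guide):

# Validate the Prometheus configuration before reloading
promtool check config /etc/prometheus/prometheus.yml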

Kick Prometheus to pick up the new config:

# Restart Prometheus
systemctl restart prometheus

Sanity check time. Open http://192.168.100.40:9090/targets in your browser. You should see your proxmox target showing UP in green:

prometheus targets

If it’s DOWN, check the exporter service and your firewall rules. Don’t skip this step - if Prometheus can’t reach the exporter, nothing else will work.

One more quick test: Click Graph in the Prometheus UI, type pve_cpu_usage_limit in the query box, hit Execute. You should see actual CPU metrics:

prometheus query pve cpu usage limit

Seeing numbers? Perfect. Your Proxmox API is talking to the exporter, and Prometheus is scraping it correctly.
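
The same sanity check, scripted against Prometheus' HTTP API instead of the UI (standard /api/v1/query endpoint; python3 is only used here for pretty-printing):

# Query the "is the Proxmox host reporting in?" metric from the shell
curl -s 'http://localhost:9090/api/v1/query?query=pve_up' | python3 -m json.tool
# A result with "value": [ <timestamp>, "1" ] means metrics are flowing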

Quick inventory check (what we’ve got so far):

  • Proxmox host (v9.1.1) - Read-only user grafana-user@pve with API token
  • Monitoring LXC (192.168.100.40) running:
    • Grafana (v12.3.0)
    • Prometheus (v3.7.3) + PVE Exporter (v3.5.5)
    • AlertManager (v0.29.0)
    • Loki (v3.6.0)

Zero agents on the Proxmox host. Everything queries the API remotely.

Security Best Practices

  • Change default password: Immediately change Grafana’s default admin/admin credentials on first login
  • Configure firewall: Restrict access to ports 3000 (Grafana), 9090 (Prometheus), 9093 (AlertManager) to your internal network only (see the sketch below)
  • Use reverse proxy: For external access, deploy a reverse proxy (Nginx/Traefik) with TLS and authentication
  • Update API token permissions: The Proxmox API token has read-only access (PVEAuditor role), limiting exposure if compromised
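
A minimal firewall sketch for that second bullet, assuming ufw (not installed by anything above) and the 192.168.100.0/24 LAN used throughout this guide. Unprivileged LXCs can be picky about firewalls, so treat this as a starting point rather than a definitive setup:

# Allow SSH and the monitoring ports from the LAN only; drop everything else inbound
apt install -y ufw
ufw default deny incoming
ufw default allow outgoing
ufw allow from 192.168.100.0/24 to any port 22 proto tcp
for port in 3000 9090 3100 9093 9221; do
  ufw allow from 192.168.100.0/24 to any port "$port" proto tcp
done
ufw --force enable
ufw status numbered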

Step 7: Import Grafana Dashboard

Time for the fun part - actually seeing your data. Head to Grafana:

  1. Open http://192.168.100.40:3000 in your browser
  2. Login with admin/admin (seriously, change this password on first login)
  3. Navigate to DashboardsNewImport
  4. Enter Dashboard ID: 10347 (the popular “Proxmox via Prometheus” community dashboard - it’s excellent)
  5. Click Load
  6. Select Prometheus as the datasource
  7. Click Import

Boom. You should see something like this:

grafana dashboard pve host

Step 8: Set Up Alerting

Pretty dashboards are nice, but what you really need is something to wake you up at 3am when your node’s on fire. Let’s set up notifications via Telegram (and optionally Slack).

Step 8a: Create Notification Channels

This takes like 2 minutes:

  1. Open Telegram, search for @BotFather
  2. Send /newbot
  3. Pick a name and username for your bot
  4. Save the Bot Token (you’ll need this in a minute)
  5. Start a chat with your new bot (send /start)
  6. Get your Chat ID by visiting: https://api.telegram.org/bot<YOUR_TOKEN>/getUpdates
    • Look for "chat":{"id":123456789} in the JSON response
    • That number is your Chat ID

Pro tip: Keep this browser tab open. You’ll paste both values into AlertManager config shortly.
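
If you'd rather do step 6 from the terminal, a one-liner can pull the chat ID out of the getUpdates response (assuming curl and jq are available; substitute your own bot token, and make sure you've sent the bot a message first):

# Extract the chat ID from the messages your bot has received
curl -s "https://api.telegram.org/bot<YOUR_TOKEN>/getUpdates" \
  | jq '.result[].message.chat.id'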

Slack Webhook (Optional, but nice for team alerts)

If you’ve got a team Slack:

  1. Go to https://api.slack.com/apps
  2. Create New AppFrom scratch
  3. Enable Incoming Webhooks
  4. Add New Webhook to Workspace
  5. Pick your channel (e.g., #infrastructure-alerts)
  6. Copy the Webhook URL (starts with https://hooks.slack.com/...)

I use Telegram for “wake me up” alerts and Slack for “FYI the team should know” stuff.

Step 8b: Configure Prometheus Alert Rules

How Alerting Works (The Quick Version)

Alerting has two parts, and mixing them up is where most people get confused:

Prometheus Alert Rules = What fires alerts

  • Checks metrics every 30 seconds: “Is CPU > 85%? Is disk > 90%?”
  • Adds labels like severity: critical or notification_channel: telegram
  • Sends matching alerts to AlertManager

AlertManager = Where alerts go

  • Reads the labels Prometheus sent
  • Routes to Telegram, Slack, email, whatever
  • Groups similar alerts (so you don’t get 50 spam messages)
  • Deduplicates and silences repeats

Why split it up? Because Prometheus is good at math (evaluating metrics), and AlertManager is good at logistics (routing notifications). Keeps things clean.

What we’re building:

  • Telegram gets everything (CPU warnings, disk full, node down)
  • Slack optional for team notifications
  • Two-tier alerts: warning (80-85%) and critical (90-95%)
  • Smart suppression - if critical fires, warning shuts up

Monitoring Scope

This setup focuses on monitoring the Proxmox host infrastructure only, not individual VMs/LXCs.

  • Why this approach? It simplifies the monitoring stack and reduces complexity
  • For VM/LXC monitoring: Deploy dedicated exporters (Node Exporter, application-specific exporters) inside each VM/LXC for more accurate, granular metrics
  • Separation of concerns: Host-level monitoring (this guide) + VM-level monitoring (separate exporters) provides better visibility than a single solution trying to do everything
flowchart TD
  subgraph Prometheus["📊 Prometheus"]
      A[Metrics Collection<br/>PVE Exporter] --> B[Alert Rules Evaluation<br/>Every 30s-1m]
      B --> C{Condition<br/>Met?}
  end

  C -->|"Yes"| D[Send Alert to AlertManager]
  C -->|"No"| E[Continue Monitoring]
  E --> A

  subgraph AlertManager["🔔 AlertManager"]
      D --> F[Receive Alerts]
      F --> G[Grouping & Deduplication<br/>group_wait: 10s-30s]
      G --> H{Route by<br/>Label}
      H --> I[Apply Inhibition Rules<br/>Suppress warnings if critical firing]
  end

  subgraph Examples["Example Alert Conditions"]
      J1["CPU > 85% (warning)<br/>label: notification_channel=telegram"]
      J2["Storage > 80% (warning)<br/>label: notification_channel=telegram"]
  end

  subgraph Notifications["📬 Notification Channels"]
      K["🔴 Telegram<br/>Operational Alerts<br/>• Host CPU/Memory/Disk<br/>• Storage Alerts<br/>• Repeat every 1-2h"]
      L["💬 Slack<br/>(Optional)<br/>"]
  end

  I -->|"notification_channel:<br/>telegram"| K
  I -->|"notification_channel:<br/>slack"| L

  style Prometheus fill:#e1f5ff,stroke:#0066cc,stroke-width:2px
  style AlertManager fill:#fff4e1,stroke:#ff9900,stroke-width:2px
  style Examples fill:#f0f0f0,stroke:#666,stroke-width:1px,stroke-dasharray: 5 5
  style Notifications fill:#e8f5e9,stroke:#00aa00,stroke-width:2px
  style K fill:#4a90e2,color:#fff,stroke:#2563eb,stroke-width:2px
  style L fill:#0088cc,color:#fff,stroke:#0066aa,stroke-width:2px
  style C fill:#ffd700,stroke:#ff8800
  style H fill:#ffd700,stroke:#ff8800

Alright, let’s create the actual alert rules. SSH into your LXC and edit:

# Create alert rules file
nano /etc/prometheus/rules/proxmox.yml

Here’s the full config. I’ve included the essentials - CPU, memory, disk, storage. You can add more later:

groups:
  # ============================================================================
  # Group 1: Host Alerts (Critical Infrastructure)
  # ============================================================================
  - name: proxmox_host_alerts
    interval: 30s
    rules:
      # Proxmox Host Down
      - alert: ProxmoxHostDown
        expr: pve_up{id="node/pve"} == 0
        for: 1m
        labels:
          severity: critical
          component: proxmox
          alert_group: host_alerts
          notification_channel: telegram
        annotations:
          summary: '🔴 Proxmox host is down'
          description: "Proxmox host 'pve' is unreachable or down for more than 1 minute."

      # High CPU Usage
      - alert: ProxmoxHighCPU
        expr: pve_cpu_usage_ratio{id="node/pve"} > 0.85
        for: 5m
        labels:
          severity: warning
          component: proxmox
          alert_group: host_alerts
          notification_channel: telegram
        annotations:
          summary: '⚠️ High CPU usage on Proxmox host'
          description: 'CPU usage is {{ $value | humanizePercentage }} on Proxmox host (threshold: 85%).'

      # Critical CPU Usage
      - alert: ProxmoxCriticalCPU
        expr: pve_cpu_usage_ratio{id="node/pve"} > 0.95
        for: 2m
        labels:
          severity: critical
          component: proxmox
          alert_group: host_alerts
          notification_channel: telegram
        annotations:
          summary: '🔴 CRITICAL CPU usage on Proxmox host'
          description: 'CPU usage is {{ $value | humanizePercentage }} on Proxmox host (threshold: 95%).'

      # High Memory Usage
      - alert: ProxmoxHighMemory
        expr: (pve_memory_usage_bytes{id="node/pve"} / pve_memory_size_bytes{id="node/pve"}) > 0.85
        for: 5m
        labels:
          severity: warning
          component: proxmox
          alert_group: host_alerts
          notification_channel: telegram
        annotations:
          summary: '⚠️ High memory usage on Proxmox host'
          description: 'Memory usage is {{ $value | humanizePercentage }} on Proxmox host (threshold: 85%).'

      # Critical Memory Usage
      - alert: ProxmoxCriticalMemory
        expr: (pve_memory_usage_bytes{id="node/pve"} / pve_memory_size_bytes{id="node/pve"}) > 0.95
        for: 2m
        labels:
          severity: critical
          component: proxmox
          alert_group: host_alerts
          notification_channel: telegram
        annotations:
          summary: '🔴 CRITICAL memory usage on Proxmox host'
          description: 'Memory usage is {{ $value | humanizePercentage }} on Proxmox host (threshold: 95%).'

      # High Disk Usage
      - alert: ProxmoxHighDiskUsage
        expr: (pve_disk_usage_bytes{id="node/pve"} / pve_disk_size_bytes{id="node/pve"}) > 0.80
        for: 10m
        labels:
          severity: warning
          component: proxmox
          alert_group: host_alerts
          notification_channel: telegram
        annotations:
          summary: '⚠️ High disk usage on Proxmox host'
          description: 'Disk usage is {{ $value | humanizePercentage }} on Proxmox host (threshold: 80%).'

      # Critical Disk Usage
      - alert: ProxmoxCriticalDiskUsage
        expr: (pve_disk_usage_bytes{id="node/pve"} / pve_disk_size_bytes{id="node/pve"}) > 0.90
        for: 5m
        labels:
          severity: critical
          component: proxmox
          alert_group: host_alerts
          notification_channel: telegram
        annotations:
          summary: '🔴 CRITICAL disk usage on Proxmox host'
          description: 'Disk usage is {{ $value | humanizePercentage }} on Proxmox host (threshold: 90%).'

  # Group 2: Storage Alerts (Telegram - Operational Alerts)
  - name: proxmox_storage_alerts
    interval: 1m
    rules:
      # Storage Pool High Usage
      - alert: ProxmoxStorageHighUsage
        expr: (pve_disk_usage_bytes{id=~"storage/.*"} / pve_disk_size_bytes{id=~"storage/.*"}) > 0.80
        for: 10m
        labels:
          severity: warning
          component: proxmox
          alert_group: storage_alerts
          notification_channel: telegram
        annotations:
          summary: '⚠️ High usage on storage {{ $labels.storage }}'
          description: "Storage '{{ $labels.storage }}' usage is {{ $value | humanizePercentage }} (threshold: 80%)."

      # Storage Pool Critical Usage
      - alert: ProxmoxStorageCriticalUsage
        expr: (pve_disk_usage_bytes{id=~"storage/.*"} / pve_disk_size_bytes{id=~"storage/.*"}) > 0.90
        for: 5m
        labels:
          severity: critical
          component: proxmox
          alert_group: storage_alerts
          notification_channel: telegram
        annotations:
          summary: '🔴 CRITICAL usage on storage {{ $labels.storage }}'
          description: "Storage '{{ $labels.storage }}' usage is {{ $value | humanizePercentage }} (threshold: 90%)."

Step 8c: Configure AlertManager Routing

nano /etc/alertmanager/alertmanager.yml

global:
  resolve_timeout: 5m

# Routing tree - directs alerts to receivers based on labels
route:
  # Default grouping and timing
  group_by: ['alertname', 'severity', 'alert_group']
  group_wait: 10s # Wait before sending first notification
  group_interval: 10s # Wait before sending notifications for new alerts in group
  repeat_interval: 12h # Resend notification every 12 hours if still firing

  # Default receiver for unmatched alerts
  receiver: 'telegram-default'

  # Child routes - matched in order, first match wins
  routes:
    # Route 1: Slack - catches any alert labeled notification_channel: slack
    # (none of the rules above set that label by default; switch a rule's
    #  notification_channel to "slack" if you want it in Slack)
    - match:
        notification_channel: slack
      receiver: 'slack-channel'
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h # Repeat every hour for critical infrastructure
      continue: false # Stop matching after this route

    # Route 2: Telegram for telegram channel alerts (storage)
    - match:
        notification_channel: telegram
      receiver: 'telegram-operational'
      group_wait: 30s
      group_interval: 30s
      repeat_interval: 2h # Repeat every 2 hours for operational alerts
      continue: false

# Notification receivers
receivers:
  # Slack receiver for critical infrastructure (host alerts)
  - name: 'slack-channel'
    slack_configs:
      - api_url: 'SLACK_WEBHOOK'
        channel: '#alerts-test'
        username: 'Prometheus AlertManager'
        icon_emoji: ':warning:'
        title: '{{ .GroupLabels.alertname }} - {{ .GroupLabels.severity | toUpper }}'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Labels.alertname }}
          *Severity:* {{ .Labels.severity }}
          *Component:* {{ .Labels.component }}
          *Summary:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          {{ end }}
        send_resolved: true
        # Optional: Mention users for critical alerts
        # color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'

  # Telegram receiver for operational alerts (storage)
  - name: 'telegram-operational'
    telegram_configs:
      - bot_token: 'BOT_TOKEN'
        chat_id: CHAT_ID_NUMBERS
        parse_mode: 'HTML'
        message: |
          {{ range .Alerts }}
          <b>{{ .Labels.severity | toUpper }}: {{ .Labels.alertname }}</b>

          {{ .Annotations.summary }}

          <b>Details:</b>
          {{ .Annotations.description }}

          <b>Component:</b> {{ .Labels.component }}
          <b>Group:</b> {{ .Labels.alert_group }}
          <b>Status:</b> {{ .Status }}
          {{ end }}
        send_resolved: true

  # Default Telegram receiver (fallback)
  - name: 'telegram-default'
    telegram_configs:
      - bot_token: 'BOT_TOKEN'
        chat_id: CHAT_ID_NUMBERS
        parse_mode: 'HTML'
        message: |
          {{ range .Alerts }}
          <b>{{ .Labels.severity | toUpper }}: {{ .Labels.alertname }}</b>

          {{ .Annotations.summary }}
          {{ .Annotations.description }}

          <b>Component:</b> {{ .Labels.component }}
          {{ end }}
        send_resolved: true

# Inhibition rules - suppress alerts based on other alerts
inhibit_rules:
  # If critical alert is firing, suppress warning alerts for same component
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['component', 'alertname']

Before you reload anything, validate your configs. Trust me on this - a typo will break everything:

# Validate configs (do this FIRST!)
promtool check rules /etc/prometheus/rules/proxmox.yml
amtool check-config /etc/alertmanager/alertmanager.yml

# If validation passed, reload
# (the reload endpoint needs Prometheus started with --web.enable-lifecycle;
#  if it returns 403, just run: systemctl restart prometheus)
curl -X POST http://localhost:9090/-/reload
systemctl restart alertmanager

Sanity check: Open http://192.168.100.40:9090/alerts in your browser. You should see all your alert rules listed (even if they’re not firing yet):

prometheus alert

Step 9: Test Alerts

Don’t skip this step. You don’t want to find out your alerts don’t work when your node is actually on fire.

Fire off some test alerts to make sure routing works:

# Test Slack alert (if you set it up)
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "TestSlack",
      "notification_channel": "slack",
      "severity": "warning"
    },
    "annotations": {
      "summary": "Test Slack Alert",
      "description": "This is a test alert sent manually to verify Slack routing"
    }
  }]'

# Test Telegram alert
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[
  {
    "labels": {
      "alertname": "TestTelegram",
      "notification_channel": "telegram",
      "severity": "warning"
    },
    "annotations": {
      "summary": "Test Telegram Alert",
      "description": "This is a test"
    }
  }
]'

Check your phone/Slack. Within 10-30 seconds you should see messages:

slack receives alerts telegram receives alerts
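
You can also confirm AlertManager actually accepted the test alerts from the CLI (amtool ships alongside AlertManager; the URL flag points it at the local instance):

# List the alerts AlertManager is currently holding
amtool alert query --alertmanager.url=http://localhost:9093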

Troubleshooting Common Issues

Here are the issues that drove me nuts when I first set this up. Save yourself some time.

PVE Exporter Won’t Start / Shows “Not Configured”

This one got me for like 20 minutes the first time. The service starts, but when you check status it says “not configured” or just dies.

What’s probably wrong:

  1. Config file doesn’t exist or is in the wrong place

    cat /etc/prometheus-pve-exporter/pve.yml
    # If you get "No such file", well... there's your problem
  2. API token format is wrong - The format is picky:

    • Should be: token_name: "grafana-token" and token_value: "8a7b6c5d-1234-5678..."
    • NOT the full PVEAPIToken=user@pve!token=value string
    • If you copied the wrong thing, the exporter will silently fail
  3. Wrong Proxmox IP or can’t reach it

    curl -k https://192.168.100.4:8006/api2/json/nodes \
      -H "Authorization: PVEAPIToken=grafana-user@pve!grafana-token=YOUR_TOKEN"
    # Should return JSON. If timeout/connection refused, check your network/firewall
  4. Check the actual error in logs

    journalctl -u prometheus-pve-exporter -f
    # Usually tells you exactly what's broken

Prometheus Target Shows “Context Deadline Exceeded”

Translation: Prometheus can’t scrape the exporter in time. Usually means network issues or SSL problems.

Quick fixes:

  1. Firewall blocking port 8006 - Can the LXC reach Proxmox?

    # From inside LXC
    curl -k https://192.168.100.4:8006
    # If this times out, your firewall's blocking it
  2. SSL certificate problems - Self-signed cert on Proxmox? Set verify_ssl: false in /etc/prometheus-pve-exporter/pve.yml (already in our config)

  3. Scraping too slow - Increase timeout in /etc/prometheus/prometheus.yml:

    scrape_configs:
      - job_name: 'proxmox'
        scrape_timeout: 30s # Bump from default 10s

Grafana Shows “No Data” on Dashboard

Dashboard imported fine, but all the panels are empty. Frustrating.

Debug steps:

  1. Is Prometheus actually working?

    • Grafana → Configuration → Data Sources → Prometheus
    • Click Test - should say “Data source is working”
    • If it fails, Prometheus isn’t running or wrong URL
  2. Are metrics actually being collected?

    • Open http://192.168.100.40:9090 (Prometheus UI)
    • Graph tab → query pve_up
    • Should show 1 if exporter is working
    • If nothing shows up, go back to fixing the exporter
  3. Time range issue - Dashboard looking at last 6 hours, but you just started collecting data? Change time range to “Last 15 minutes” and see if data appears

Alerts Configured But Nothing Happens

You set up all the alerts, but your Telegram/Slack is crickets even when you know CPU is maxed.

What to check:

  1. Are the alerts even firing in Prometheus?

    • Open http://192.168.100.40:9090/alerts
    • Green = OK (not firing)
    • Yellow = Pending (condition met, waiting for “for” duration)
    • Red = FIRING (should be sending to AlertManager)
  2. Is AlertManager receiving them?

    curl http://localhost:9093/api/v2/alerts
    # Should show active alerts if any are firing
  3. Check AlertManager logs - routing might be broken:

    journalctl -u alertmanager -f
    # Look for errors about failed receivers or routing
  4. Did you test manually? Go back to Step 9, fire a test alert. If that doesn’t work, your Telegram token or Slack webhook is wrong.

Permission Denied / 403 Errors from Proxmox API

The exporter’s hitting the API but getting rejected.

Usually one of these:

  1. Wrong permissions on the user

    # On Proxmox host
    pveum user permissions grafana-user@pve
    # Should show "PVEAuditor" role on path "/"
    # If not, go back to Step 4 and fix it
  2. Token got nuked somehow (happens after Proxmox updates sometimes)

    # Recreate it
    pveum user token remove grafana-user@pve grafana-token
    pveum user token add grafana-user@pve grafana-token --privsep 0
    # Update the token in pve.yml with the new value
  3. Token expired - Tokens don’t expire by default, but check Proxmox UI (Datacenter → Permissions → API Tokens) just in case someone set an expiration

Monitoring Metrics

Key metrics available:

Host Metrics:

  • pve_cpu_usage_ratio - CPU usage (0-1)
  • pve_memory_usage_bytes / pve_memory_size_bytes - Memory usage
  • pve_disk_usage_bytes / pve_disk_size_bytes - Disk usage
  • pve_up{id="node/pve"} - Host availability

Storage Metrics:

  • pve_disk_usage_bytes{id=~"storage/.*"} - Storage pool usage
  • pve_storage_info - Storage pool information
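
If you want to eyeball these from the shell rather than Grafana, promtool can run one-off queries against the local Prometheus (a quick sketch; adjust the expression as needed):

# Current usage ratio (0-1) for every storage pool Proxmox reports
promtool query instant http://localhost:9090 \
  '(pve_disk_usage_bytes{id=~"storage/.*"} / pve_disk_size_bytes{id=~"storage/.*"})'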

Alert Rules Summary

Alert                         Threshold   Duration   Channel    Severity
------------------------------------------------------------------------
Host Alerts (Telegram - Critical Infrastructure)
ProxmoxHostDown               == 0        1 min      Telegram   Critical
ProxmoxHighCPU                > 85%       5 min      Telegram   Warning
ProxmoxCriticalCPU            > 95%       2 min      Telegram   Critical
ProxmoxHighMemory             > 85%       5 min      Telegram   Warning
ProxmoxCriticalMemory         > 95%       2 min      Telegram   Critical
ProxmoxHighDiskUsage          > 80%       10 min     Telegram   Warning
ProxmoxCriticalDiskUsage      > 90%       5 min      Telegram   Critical
Storage Alerts (Telegram - Operational)
ProxmoxStorageHighUsage       > 80%       10 min     Telegram   Warning
ProxmoxStorageCriticalUsage   > 90%       5 min      Telegram   Critical

The Bottom Line

If you followed along, you now have:

  • Grafana dashboards showing exactly what’s happening on your Proxmox host
  • Telegram alerts that’ll ping you before things explode (not after)
  • Comprehensive monitoring without installing anything on the Proxmox host itself

All of this running in a single 8GB LXC container. No bloated VMs, no agents cluttering up your host, just clean API-based monitoring.

What you should do next:

  1. Tune those alert thresholds - 85% CPU might be fine for your workload, or way too high
  2. Add more storage pools if you have them (just copy the alert rules)
  3. Set up a reverse proxy with SSL if you’re exposing Grafana externally
  4. Maybe add VM/LXC monitoring later (different exporters, separate guide)

The real test: Will you actually notice when something breaks? Run those test alerts (Step 9) again in a week to make sure everything’s still working. AlertManager configs have a way of breaking silently.

And hey, when you do get woken up at 3am by a disk space alert, at least you’ll know before your backups fail and users start complaining. Ask me how I know that’s worth the setup time.

Resources