Prometheus & Grafana Monitoring Guide 2025

🚀 Why This Stack in 2025?

Prometheus + Grafana is the industry-standard open-source monitoring stack, with an estimated 75% adoption in Kubernetes environments. Prometheus collects and stores metrics and evaluates alert rules; Grafana handles visualization and notifications.

Quick Stats:

✅ 60% faster incident detection
✅ 45% reduction in production issues
✅ Salary Impact: Monitoring skills add ₹5-15 LPA
✅ 75% adoption in Kubernetes environments
✅ Open-source and completely free

📦 Quick Installation

Using Docker Compose:

# docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    restart: unless-stopped

volumes:
  prometheus_data: {}
  grafana_data: {}

# Start the stack (Compose v2 syntax; the standalone docker-compose v1 binary is end-of-life)
docker compose up -d

# Check status
docker compose ps

# View logs
docker compose logs -f

🌐 Access URLs:

🎨 Grafana: http://localhost:3000 (admin/admin)
📊 Prometheus: http://localhost:9090
💻 Node Exporter: http://localhost:9100/metrics

🎯 Core Concepts

📊 Prometheus:

✅ Metrics: Time-series data collection
✅ PromQL: Powerful query language
✅ Exporters: Collect metrics from servers
✅ AlertManager: Handle alerts
✅ Service Discovery: Auto-detect targets
✅ Pull Model: Scrapes metrics periodically
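
The pull model means each target serves its current metric values as plain text over HTTP, and Prometheus fetches that page on every scrape. As a rough illustration, the Python sketch below parses a few lines of that text format. The sample payload and the `parse_sample_line` helper are illustrative assumptions, and the parser only handles simple lines (no escaping, timestamps, or exemplars).

```python
import re

# Matches: metric_name{label="value",...} 12.5   (the label block is optional)
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample_line(line):
    """Parse one simple exposition-format line into (name, labels, value).

    Returns None for comments (# HELP / # TYPE) and blank lines.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    name, raw_labels, raw_value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(raw_value)


# Hypothetical scrape body, shaped like what an exporter serves at /metrics.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""

samples = [s for s in map(parse_sample_line, SAMPLE.splitlines()) if s]
for name, labels, value in samples:
    print(name, labels, value)
```

Each parsed line becomes one sample of one time series, identified by the metric name plus its label set.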

🎨 Grafana:

✅ Dashboards: Visualize metrics beautifully
✅ Panels: Graphs, stats, tables, heatmaps
✅ Data Sources: Connect multiple sources
✅ Alerts: Set notification rules
✅ Variables: Dynamic dashboards
✅ Plugins: Extend functionality

📊 Prometheus Configuration

prometheus.yml Configuration:

# Global configuration
global:
  scrape_interval: 15s      # Scrape targets every 15 seconds
  evaluation_interval: 15s  # Evaluate rules every 15 seconds
  external_labels:
    cluster: 'production'
    region: 'us-east-1'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Load rules once and periodically evaluate them
rule_files:
  - "alerts.yml"
  - "recording_rules.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter (System metrics)
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          env: 'production'

  # Application metrics
  - job_name: 'application'
    static_configs:
      - targets: ['app:8080']
        labels:
          service: 'web-api'

  # Docker containers
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
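
Once this configuration is loaded, it helps to confirm that every scrape target is actually reachable. The sketch below, in Python, reads the response of the Prometheus HTTP API endpoint `/api/v1/targets` and lists targets whose health is not "up". The helper names and the hand-written sample response are assumptions for illustration; `fetch_targets` is defined but not called, since it needs a running Prometheus.

```python
import json
import urllib.request


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the target list from the Prometheus HTTP API (/api/v1/targets)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def unhealthy_targets(payload):
    """Return (job, scrape URL, last error) for every target that is not 'up'."""
    problems = []
    for target in payload["data"]["activeTargets"]:
        if target["health"] != "up":
            job = target["labels"].get("job", "?")
            problems.append((job, target["scrapeUrl"], target.get("lastError", "")))
    return problems


# Hand-written sample response, trimmed to the fields used above.
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "node_exporter"}, "health": "up",
             "scrapeUrl": "http://node-exporter:9100/metrics", "lastError": ""},
            {"labels": {"job": "application"}, "health": "down",
             "scrapeUrl": "http://app:8080/metrics", "lastError": "connection refused"},
        ]
    },
}

for job, url, error in unhealthy_targets(SAMPLE):
    print(f"{job}: {url} is not up ({error})")
```

Against a live server, `unhealthy_targets(fetch_targets())` would report the same information shown on the Status → Targets page of the Prometheus UI.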

🖥️ Essential Exporters

1. Node Exporter (System Metrics)

Collects hardware and OS metrics: CPU, memory, disk, network

# Docker
docker run -d \
  --name=node-exporter \
  -p 9100:9100 \
  prom/node-exporter

# Verify
curl http://localhost:9100/metrics

2. cAdvisor (Container Metrics)

Monitors Docker containers: CPU, memory, network, filesystem

# Docker
docker run -d \
  --name=cadvisor \
  -p 8080:8080 \
  -v /:/rootfs:ro \
  -v /var/run:/var/run:ro \
  -v /sys:/sys:ro \
  -v /var/lib/docker/:/var/lib/docker:ro \
  gcr.io/cadvisor/cadvisor:latest  # google/cadvisor on Docker Hub is deprecated; pin a release tag in production

3. Blackbox Exporter (Endpoint Probing)

Probes HTTP, HTTPS, DNS, TCP, ICMP endpoints

# Docker
docker run -d \
  --name=blackbox-exporter \
  -p 9115:9115 \
  prom/blackbox-exporter

# Probe HTTP endpoint
curl "http://localhost:9115/probe?target=https://example.com&module=http_2xx"

4. MySQL Exporter (Database Metrics)

Monitors MySQL/MariaDB performance

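# Docker (mysqld-exporter 0.15+; older releases used DATA_SOURCE_NAME instead)
docker run -d \
  --name=mysql-exporter \
  -p 9104:9104 \
  -e MYSQLD_EXPORTER_PASSWORD="password" \
  prom/mysqld-exporter \
  --mysqld.address=mysql:3306 \
  --mysqld.username=user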

📈 Grafana Dashboard Setup

Step 1: Add Prometheus Data Source

1. Open Grafana: http://localhost:3000
2. Log in with admin/admin (change the password when prompted)
3. Go to: Connections → Data sources → Add data source (Configuration → Data Sources in older Grafana versions)
4. Select: Prometheus
5. URL: http://prometheus:9090
6. Click: Save & Test (should show green checkmark)
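
The same data source can also be created without the UI, through Grafana's HTTP API (`POST /api/datasources`). The following Python sketch builds that request using the URL and the default admin/admin credentials from the steps above; the helper names are illustrative assumptions, and the request is only constructed here, not sent.

```python
import base64
import json
import urllib.request


def datasource_payload(name="Prometheus", url="http://prometheus:9090"):
    """JSON body describing a Prometheus data source for Grafana."""
    return {
        "name": name,
        "type": "prometheus",
        "url": url,          # address Grafana uses to reach Prometheus
        "access": "proxy",   # Grafana's backend proxies the queries
        "isDefault": True,
    }


def build_request(grafana_url, user, password, payload):
    """Build (but do not send) the POST request with HTTP basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


request = build_request("http://localhost:3000", "admin", "admin", datasource_payload())
print(request.get_method(), request.full_url)
# To send it against a running Grafana: urllib.request.urlopen(request)
```

Scripting this step is mainly useful when the stack is recreated often; for a one-off local setup the UI steps are simpler.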

Step 2: Import Pre-built Dashboards

Popular Dashboard IDs:

🖥️ 1860 - Node Exporter Full (System metrics)
☸️ 315 - Kubernetes Cluster Monitoring
🐳 893 - Docker Container & Host Metrics
🌐 7587 - Nginx Monitoring
🗄️ 7362 - MySQL Overview
📊 3662 - Prometheus 2.0 Stats

Import Steps:

1. Go to Dashboards → New → Import (in older Grafana versions: + icon → Import)
2. Enter Dashboard ID (e.g., 1860)
3. Click Load
4. Select Prometheus data source
5. Click Import
6. Dashboard ready instantly! 🎉

🔔 Setting Up Alerts

Prometheus Alert Rules (alerts.yml):

groups:
- name: system_alerts
  rules:
  # High load average (tune the threshold to the host's CPU core count)
  - alert: HighCPULoad
    expr: node_load1 > 2
    for: 5m
    labels:
      severity: warning
      team: devops
    annotations:
      summary: "High CPU load on {{ $labels.instance }}"
      description: "CPU load is {{ $value }} (threshold: 2)"

  # High Memory Usage
  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High memory usage on {{ $labels.instance }}"
      description: "Memory usage is {{ $value }}% (threshold: 80%)"

  # Disk Space Low
  - alert: DiskSpaceLow
    expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Low disk space on {{ $labels.instance }}"

  # Service Down
  - alert: ServiceDown
    expr: up == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Service {{ $labels.job }} is down"
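
The `for:` clause in these rules is what keeps short spikes from paging anyone: an alert stays pending until its expression has been true continuously for that long, and only then fires. Here is a simplified Python sketch of that state machine, assuming one evaluation every 60 seconds and a 2-minute `for:` as in ServiceDown. The function name is hypothetical, and details such as `keep_firing_for` are left out.

```python
def alert_states(condition_results, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_results: one boolean per evaluation, True when the alert
    expression returned a result for this series.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for i, is_true in enumerate(condition_results):
        now = i * eval_interval_s
        if not is_true:
            active_since = None  # any gap resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# 'up == 0' observed at six consecutive evaluations, 60 s apart, with for: 2m.
results = [False, True, True, True, False, True]
print(alert_states(results, eval_interval_s=60, for_s=120))
```

Note how the single healthy evaluation in the fifth slot resets the timer, so the alert has to wait out the full `for:` period again.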

Grafana Alerting Setup:

1. Create Alert Rule:
   • Go to Alerting → Alert rules → New alert rule
   • Set query and condition (e.g., CPU > 80%)
   • Define evaluation interval
2. Configure Contact Points:
   • Alerting → Contact points → New contact point
   • Add Slack, Email, PagerDuty, Webhook
   • Test notification
3. Create Notification Policy:
   • Define routing based on labels
   • Set grouping and timing
   • Configure escalation
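
Label-based routing in step 3 boils down to comparing an alert's labels against each policy's matchers. The Python sketch below is a deliberately simplified model, assuming a flat list of policies with exact-match matchers where the first match wins; real notification policies form a tree and support more matcher types. The policy list and contact point names are hypothetical.

```python
def route_alert(labels, policies, default_contact_point):
    """Return the contact point of the first policy whose matchers all match."""
    for matchers, contact_point in policies:
        if all(labels.get(name) == value for name, value in matchers.items()):
            return contact_point
    return default_contact_point


# Hypothetical policies: (label matchers, contact point name).
POLICIES = [
    ({"severity": "critical"}, "pagerduty-oncall"),
    ({"team": "devops"}, "slack-devops"),
]

alerts = [
    {"alertname": "ServiceDown", "severity": "critical"},
    {"alertname": "HighCPULoad", "severity": "warning", "team": "devops"},
    {"alertname": "DiskSpaceLow", "severity": "warning"},
]
for alert in alerts:
    print(alert["alertname"], "->", route_alert(alert, POLICIES, "email-default"))
```

This is why consistent `severity` and `team` labels on alert rules matter: they are the only information the routing step has to work with.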

🐳 Kubernetes Monitoring

Install with Helm:

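helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm install prometheus prometheus-community/prometheus
helm install grafana grafana/grafana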

Monitor Pods:

# CPU usage by pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)

# Memory usage
container_memory_working_set_bytes{pod=~".+"}
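
These queries can also be run from a script through the Prometheus HTTP API (`/api/v1/query`). The Python sketch below builds the query URL and flattens an instant-vector response into a dictionary keyed by one label. The base URL, helper names, and sample response are assumptions for illustration; `run_query` is defined but not called, since it needs a reachable Prometheus (for example via a port-forward to the service).

```python
import json
import urllib.parse
import urllib.request


def query_url(promql, base_url="http://localhost:9090"):
    """URL for an instant query; the PromQL expression is URL-encoded."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def run_query(promql, base_url="http://localhost:9090"):
    """Execute the query against a running Prometheus and return the JSON body."""
    with urllib.request.urlopen(query_url(promql, base_url), timeout=10) as resp:
        return json.load(resp)


def vector_to_dict(payload, label):
    """Map one label's value to the sample value of an instant-vector result."""
    if payload["data"]["resultType"] != "vector":
        raise ValueError("expected an instant vector")
    return {
        sample["metric"].get(label, ""): float(sample["value"][1])
        for sample in payload["data"]["result"]
    }


# Hand-written sample response for the CPU-by-pod query.
SAMPLE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pod": "web-1"}, "value": [1700000000, "0.25"]},
            {"metric": {"pod": "web-2"}, "value": [1700000000, "0.5"]},
        ],
    },
}

print(query_url("sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)"))
print(vector_to_dict(SAMPLE, "pod"))
```

Sample values arrive as strings in the API response, which is why the sketch converts them with `float`.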

💡 Pro Tips for 2025

Best Practices:

✅ Use Recording Rules: Pre-compute expensive queries
✅ Label Consistently: Use standard naming conventions
✅ Set Retention: 15-30 days for metrics via --storage.tsdb.retention.time (balance storage vs history; the default is 15 days)
✅ Monitor Prometheus: Watch its own resource usage
✅ Use rate() for counters: Always use rate() or irate() for counter metrics
✅ Aggregate wisely: Use sum(), avg(), max() to reduce cardinality
✅ Avoid high cardinality: Don't use user IDs or timestamps as labels
✅ Use federation: For multi-cluster monitoring
✅ Implement service discovery: Auto-discover targets in dynamic environments
✅ Backup Grafana: Export dashboards regularly
✅ Use variables: Make dashboards reusable with template variables
✅ Set up AlertManager: Centralize alert routing and silencing
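
To see why counters should go through rate(), it helps to look at what the function does with raw samples. The Python sketch below is a simplified model: it sums the increases between consecutive samples, treats any drop as a counter reset, and divides by the elapsed time. The function name and sample values are made up for illustration, and the real rate() additionally extrapolates to the edges of the range window, so its results can differ slightly.

```python
def simple_rate(samples):
    """Per-second increase across (timestamp_s, value) counter samples.

    A drop in value is treated as a counter reset (for example a process
    restart), so the post-reset value counts as new increase from zero.
    """
    if len(samples) < 2:
        return None  # at least two samples are needed to compute a rate
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Counter scraped every 15 s; the process restarts between t=30 and t=45.
samples = [(0, 100), (15, 130), (30, 160), (45, 30), (60, 60)]
print(simple_rate(samples))  # 120 total increase over 60 s
```

Graphing the raw counter would show a misleading cliff at the restart, while the rate stays steady.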

Recording Rules Example:

# recording_rules.yml
groups:
- name: performance_rules
  interval: 30s
  rules:
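  # Pre-compute HTTP request rate per job (the "job:" prefix names the aggregation level)
  - record: job:http_requests:rate5m
    expr: sum by (job) (rate(http_requests_total[5m]))

  # Pre-compute 5xx error request rate per job
  - record: job:http_errors:rate5m
    expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))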

  # Pre-compute CPU usage
  - record: instance:node_cpu:avg_rate5m
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

🚨 Common Monitoring Scenarios

1. Website/API Monitoring

# Successful (HTTP 200) request rate, in requests per second
rate(http_requests_total{status="200"}[5m])

# Error rate percentage (share of requests answered with a 5xx status)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# Average response time
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

# 95th percentile latency (aggregate buckets by le across instances first)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
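
The percentile query is an estimate, not an exact measurement: histogram_quantile() finds the bucket that contains the requested rank and interpolates linearly inside it. The Python sketch below models that idea for classic cumulative buckets, under the assumption that the lowest bucket starts at 0; the bucket boundaries and counts are invented for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The last bucket is expected to be the +Inf bucket holding the total count.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls beyond the last finite bucket: report its bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Cumulative request counts per latency bucket (le label, in seconds).
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))  # lands in the 0.5-1.0 s bucket
```

Because of the interpolation, accuracy depends on bucket layout: a p95 that falls inside a wide bucket can be off by a large margin, so bucket boundaries should cluster around the latencies you care about.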

2. Database Monitoring (PostgreSQL, via postgres_exporter)

# Active connections
pg_stat_database_numbackends{datname="mydb"}

# Query performance (transactions per second)
rate(pg_stat_database_xact_commit[5m])

# Cache hit ratio
(sum(pg_stat_database_blks_hit) / (sum(pg_stat_database_blks_hit) + sum(pg_stat_database_blks_read))) * 100

# Slow queries (assumes pg_stat_statements data is exported under this metric name, e.g. via a custom query)
pg_stat_statements_mean_exec_time_seconds > 1

3. Container Monitoring

# Container CPU usage (percent of one CPU core)
rate(container_cpu_usage_seconds_total{name=~".+"}[5m]) * 100

# Container memory usage
container_memory_usage_bytes{name=~".+"}

# Container network I/O
rate(container_network_receive_bytes_total[5m])
rate(container_network_transmit_bytes_total[5m])

# Container restart count (Kubernetes only; requires kube-state-metrics)
kube_pod_container_status_restarts_total

4. System Monitoring

# CPU usage percentage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk I/O
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])

# Network traffic
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
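
The CPU usage query reads backwards at first glance, because it measures idle time and subtracts it from 100. The Python sketch below works through the same arithmetic by hand for a two-core host over a 5-minute window; the function name and counter values are made up for illustration.

```python
def cpu_usage_percent(idle_seconds_by_cpu, window_s):
    """Mirror of: 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100

    idle_seconds_by_cpu maps a CPU id to (idle counter at window start,
    idle counter at window end), both in seconds.
    """
    idle_rates = [
        (end - start) / window_s  # fraction of the window this core was idle
        for start, end in idle_seconds_by_cpu.values()
    ]
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Two cores over 300 s: cpu0 was idle for 270 s, cpu1 for 210 s.
idle = {"0": (1000.0, 1270.0), "1": (5000.0, 5210.0)}
print(round(cpu_usage_percent(idle, window_s=300), 1))  # average idle is 80%
```

Measuring idle time this way captures every kind of non-idle work (user, system, iowait, and so on) without having to add the modes up individually.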

📚 4-Week Learning Path

Week 1: Basics & Installation

✅ Install Prometheus & Grafana with Docker
✅ Understand metric types (Counter, Gauge, Histogram, Summary)
✅ Configure prometheus.yml
✅ Add Node Exporter
✅ Project: Monitor your local machine

Week 2: Metrics Collection & PromQL

✅ Learn PromQL basics (rate, sum, avg)
✅ Add multiple exporters (cAdvisor, Blackbox)
✅ Practice queries in Prometheus UI
✅ Understand labels and filtering
✅ Project: Monitor Docker containers

Week 3: Grafana Dashboards

✅ Import pre-built dashboards
✅ Create custom dashboards
✅ Master panel types (Graph, Stat, Table, Heatmap)
✅ Use variables for dynamic dashboards
✅ Project: Build application monitoring dashboard

Week 4: Alerting & Production

✅ Configure AlertManager
✅ Create alert rules
✅ Set up notification channels (Slack, Email, PagerDuty)
✅ Implement recording rules
✅ Production best practices
✅ Project: Complete monitoring stack with alerts

💼 Career Impact 2025

Junior DevOps Engineer

Salary: ₹10-15 LPA

Basic dashboard creation, alert setup

Mid-Level SRE/DevOps

Salary: ₹18-28 LPA

Advanced PromQL, custom exporters, complex alerting

Senior/Architect

Salary: ₹30-45 LPA

Multi-cluster monitoring, observability strategy, SLO/SLI design

✅ Quick Start Checklist

Week 1-2:

☐ Install Prometheus & Grafana
☐ Add Node Exporter
☐ Import pre-built dashboards
☐ Learn basic PromQL queries
☐ Monitor local system

Week 3-4:

☐ Create custom dashboards
☐ Set up basic alerts
☐ Add application metrics
☐ Configure AlertManager
☐ Build production monitoring

Begin Monitoring: Your first metric is just a docker compose up away!

Remember: You can't improve what you don't measure. Start monitoring today!