Prometheus & Grafana Monitoring Guide 2025

January 2025 · 10 min read

🚀 Why This Stack in 2025?

Prometheus and Grafana form a widely used open-source monitoring stack, with a reported 75% adoption in Kubernetes environments. Together they cover the metrics side of observability: collection, visualization, and alerting.

Quick Stats:

  • 60% faster incident detection
  • 45% reduction in production issues
  • Salary Impact: Monitoring skills add ₹5-15 LPA
  • 75% adoption in Kubernetes environments
  • Open-source and completely free

📦 Quick Installation

Using Docker Compose:

# docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    restart: unless-stopped

volumes:
  prometheus_data: {}
  grafana_data: {}

# Start the stack
docker-compose up -d

# Check status
docker-compose ps

# View logs
docker-compose logs -f

🌐 Access URLs:

  • 🎨 Grafana: http://localhost:3000 (admin/admin)
  • 📊 Prometheus: http://localhost:9090
  • 💻 Node Exporter: http://localhost:9100/metrics
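
Before opening the URLs above, it can help to confirm that each service is actually answering. The following is a minimal sketch, assuming the default port mappings from the Compose file. `/-/ready` and `/api/health` are the built-in health endpoints of Prometheus and Grafana; the helper names `http_ok` and `wait_for_stack` are made up for this example.

```python
import time
import urllib.request
from urllib.error import URLError

# Health endpoints for the three services (ports assume the default mappings).
ENDPOINTS = {
    "prometheus": "http://localhost:9090/-/ready",
    "grafana": "http://localhost:3000/api/health",
    "node-exporter": "http://localhost:9100/metrics",
}


def http_ok(url, timeout=3.0):
    """Return True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False


def wait_for_stack(endpoints=ENDPOINTS, check=http_ok, attempts=30, delay=2.0):
    """Poll every endpoint until all pass or attempts run out.

    Returns the sorted names of services that never became healthy.
    """
    pending = dict(endpoints)
    for _ in range(attempts):
        pending = {name: url for name, url in pending.items() if not check(url)}
        if not pending:
            break
        time.sleep(delay)
    return sorted(pending)
```

Calling `wait_for_stack()` after `docker-compose up -d` returns an empty list once everything is reachable, or the names of the services that are still failing.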

🎯 Core Concepts

📊 Prometheus:

  • Metrics: Time-series data collection
  • PromQL: Powerful query language
  • Exporters: Collect metrics from servers
  • Alertmanager: Routes, groups, and silences alerts
  • Service Discovery: Auto-detect targets
  • Pull Model: Scrapes metrics periodically
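
To make the pull model concrete: an exporter is just an HTTP endpoint that returns plain text, which Prometheus fetches on every scrape. The sketch below renders one metric family in the Prometheus text exposition format using only the standard library. The function names are illustrative; in a real application the official client library would normally do this for you.

```python
def escape_label_value(value):
    """Escape backslashes, double quotes, and newlines in a label value."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, metric_type, help_text, samples):
    """Render one metric family as exposition-format text.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

Serving the returned string at `/metrics` is all a target needs to do; Prometheus handles the scheduling and stores each line as a sample in a time series identified by the metric name and labels.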

🎨 Grafana:

  • Dashboards: Visualize metrics beautifully
  • Panels: Graphs, stats, tables, heatmaps
  • Data Sources: Connect multiple sources
  • Alerts: Set notification rules
  • Variables: Dynamic dashboards
  • Plugins: Extend functionality

📊 Prometheus Configuration

prometheus.yml Configuration:

# Global configuration
global:
  scrape_interval: 15s      # Scrape targets every 15 seconds
  evaluation_interval: 15s  # Evaluate rules every 15 seconds
  external_labels:
    cluster: 'production'
    region: 'us-east-1'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

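# Rule files, evaluated every evaluation_interval (relative paths resolve against
# this config file's directory, so mount them next to prometheus.yml)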
rule_files:
  - "alerts.yml"
  - "recording_rules.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter (System metrics)
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          env: 'production'

  # Application metrics
  - job_name: 'application'
    static_configs:
      - targets: ['app:8080']
        labels:
          service: 'web-api'

  # Docker containers
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
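
Several targets in this configuration (`app:8080`, `cadvisor:8080`, `alertmanager:9093`) are not part of the Compose file shown earlier, so they will show as down until those services exist. One way to check is the `/api/v1/targets` endpoint of the Prometheus HTTP API. The sketch below assumes Prometheus is reachable at `http://localhost:9090`; the function names are made up for this example.

```python
import json
import urllib.request


def unhealthy_targets(payload):
    """Return (job, scrape_url, last_error) for every active target that is not 'up'."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("labels", {}).get("job", ""), t.get("scrapeUrl", ""), t.get("lastError", ""))
        for t in targets
        if t.get("health") != "up"
    ]


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the target list from a running Prometheus server."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)
```

`unhealthy_targets(fetch_targets())` gives a quick list of jobs that need attention; the same information is available in the Prometheus UI under Status → Targets.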

🖥️ Essential Exporters

1. Node Exporter (System Metrics)

Collects hardware and OS metrics: CPU, memory, disk, network

# Docker
docker run -d \
  --name=node-exporter \
  -p 9100:9100 \
  prom/node-exporter

# Verify
curl http://localhost:9100/metrics

2. cAdvisor (Container Metrics)

Monitors Docker containers: CPU, memory, network, filesystem

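# Docker (the google/cadvisor image on Docker Hub is deprecated; releases are
# now published to gcr.io/cadvisor/cadvisor. v0.49.1 is an example tag, pin
# whichever release you have tested)
docker run -d \
  --name=cadvisor \
  -p 8080:8080 \
  -v /:/rootfs:ro \
  -v /var/run:/var/run:ro \
  -v /sys:/sys:ro \
  -v /var/lib/docker/:/var/lib/docker:ro \
  gcr.io/cadvisor/cadvisor:v0.49.1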

3. Blackbox Exporter (Endpoint Probing)

Probes HTTP, HTTPS, DNS, TCP, ICMP endpoints

# Docker
docker run -d \
  --name=blackbox-exporter \
  -p 9115:9115 \
  prom/blackbox-exporter

# Probe HTTP endpoint
curl "http://localhost:9115/probe?target=https://example.com&module=http_2xx"
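
When the probed target itself contains query parameters, it has to be URL-encoded or the exporter will misread it. Here is a small sketch that builds the probe URL safely and reads the `probe_success` metric from the response text. It assumes the exporter runs on `localhost:9115` with the default `http_2xx` module; the helper names are illustrative.

```python
from urllib.parse import urlencode


def probe_url(target, module="http_2xx", exporter="http://localhost:9115"):
    """Build a Blackbox Exporter probe URL with the target properly encoded."""
    return f"{exporter}/probe?{urlencode({'target': target, 'module': module})}"


def probe_succeeded(metrics_text):
    """Return True if the probe output contains 'probe_success 1'."""
    for line in metrics_text.splitlines():
        if line.startswith("probe_success "):
            return float(line.split()[1]) == 1.0
    return False
```

In Prometheus itself the same encoding is handled by relabeling the target into the `target` parameter, so a helper like this is mainly useful for ad-hoc checks and scripts.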

4. MySQL Exporter (Database Metrics)

Monitors MySQL/MariaDB performance

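# Docker (mysqld_exporter 0.15+ no longer reads DATA_SOURCE_NAME; pass the
# address and user as flags and the password via the environment)
docker run -d \
  --name=mysql-exporter \
  -p 9104:9104 \
  -e MYSQLD_EXPORTER_PASSWORD="password" \
  prom/mysqld-exporter \
  --mysqld.address=mysql:3306 \
  --mysqld.username=user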

📈 Grafana Dashboard Setup

Step 1: Add Prometheus Data Source

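  1. Open Grafana: http://localhost:3000
  2. Log in with admin/admin and change the password when prompted
  3. Go to: Connections → Data sources → Add data source (Configuration → Data Sources in older Grafana versions)
  4. Select: Prometheus
  5. URL: http://prometheus:9090 (the Compose service name, since Grafana reaches Prometheus over the Docker network)
  6. Click: Save & Test (should show a green checkmark)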
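
The same data source can also be created without the UI through Grafana's HTTP API (`POST /api/datasources`). The sketch below only builds the request, so nothing is sent until you pass it to `urllib.request.urlopen`. It assumes the default `admin/admin` credentials and the Compose service name `prometheus`; the function name is made up for this example.

```python
import base64
import json
import urllib.request


def datasource_request(grafana_url="http://localhost:3000", user="admin",
                       password="admin", prometheus_url="http://prometheus:9090"):
    """Build a POST request that registers Prometheus as the default data source."""
    body = {
        "name": "Prometheus",
        "type": "prometheus",
        "access": "proxy",  # Grafana's backend calls Prometheus, not the browser
        "url": prometheus_url,
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
        method="POST",
    )
```

Sending it with `urllib.request.urlopen(datasource_request())` should create the data source on a fresh install. For repeatable setups, Grafana's file-based provisioning is usually the better fit.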

Step 2: Import Pre-built Dashboards

Popular Dashboard IDs:

  • 🖥️ 1860 - Node Exporter Full (System metrics)
  • ☸️ 315 - Kubernetes Cluster Monitoring
  • 🐳 893 - Docker Container & Host Metrics
  • 🌐 7587 - Nginx Monitoring
  • 🗄️ 7362 - MySQL Overview
  • 📊 3662 - Prometheus 2.0 Stats

Import Steps:

  1. Click + icon → Import
  2. Enter Dashboard ID (e.g., 1860)
  3. Click Load
  4. Select Prometheus data source
  5. Click Import
  6. Dashboard ready instantly! 🎉

🔔 Setting Up Alerts

Prometheus Alert Rules (alerts.yml):

groups:
- name: system_alerts
  rules:
  # High load average (node_load1 is the 1-minute load average, not a CPU percentage)
  - alert: HighCPULoad
    expr: node_load1 > 2
    for: 5m
    labels:
      severity: warning
      team: devops
    annotations:
      summary: "High CPU load on {{ $labels.instance }}"
      description: "CPU load is {{ $value }} (threshold: 2)"

  # High Memory Usage
  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High memory usage on {{ $labels.instance }}"
      description: "Memory usage is {{ $value }}% (threshold: 80%)"

  # Disk Space Low
  - alert: DiskSpaceLow
    expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Low disk space on {{ $labels.instance }}"

  # Service Down
  - alert: ServiceDown
    expr: up == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Service {{ $labels.job }} is down"
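
The `for:` clause in these rules is easy to misread: an alert does not fire the moment its expression is true. It first becomes pending, and only fires once the expression has stayed true for the whole duration. The sketch below is a simplified model of that behaviour, not Prometheus's actual code. `for_steps` is the `for:` duration expressed in evaluation intervals, for example `for: 5m` with a 15s `evaluation_interval` is 20 steps.

```python
def alert_states(condition_by_step, for_steps):
    """Return the alert state after each rule evaluation.

    condition_by_step: booleans, one per evaluation (True = expr returned a result).
    for_steps: number of evaluation intervals covered by the `for:` duration.
    """
    states = []
    active = False
    held = 0  # intervals elapsed since the condition first became true
    for condition in condition_by_step:
        if not condition:
            # Any gap resets the alert completely.
            active, held = False, 0
            states.append("inactive")
            continue
        if active:
            held += 1
        else:
            active, held = True, 0
        states.append("firing" if held >= for_steps else "pending")
    return states
```

A single false evaluation resets the timer, which is why short spikes never page anyone when `for:` is set to a few minutes.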

Grafana Alerting Setup:

  1. Create Alert Rule:
    • Go to Alerting → Alert rules → New alert rule
    • Set query and condition (e.g., CPU > 80%)
    • Define evaluation interval
  2. Configure Contact Points:
    • Alerting → Contact points → New contact point
    • Add Slack, Email, PagerDuty, Webhook
    • Test notification
  3. Create Notification Policy:
    • Define routing based on labels
    • Set grouping and timing
    • Configure escalation
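
Label-based routing in a notification policy boils down to "send the alert to the first policy whose matchers all agree with its labels". The following toy model illustrates the idea only. It is not how Grafana implements it: real policies also support nesting, regex matchers, and continued matching. The contact point names are hypothetical.

```python
def route(alert_labels, policies, default="default-email"):
    """Return the contact point of the first policy whose matchers all match.

    policies is an ordered list of (matchers_dict, contact_point) pairs.
    """
    for matchers, contact_point in policies:
        if all(alert_labels.get(key) == value for key, value in matchers.items()):
            return contact_point
    return default
```

With policies such as `[({"severity": "critical"}, "pagerduty"), ({"team": "devops"}, "slack-devops")]`, a critical alert goes to PagerDuty even if it also carries `team: devops`, because order decides.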

🐳 Kubernetes Monitoring

Install with Helm:

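helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm install prometheus prometheus-community/prometheus
helm install grafana grafana/grafana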

Monitor Pods:

# CPU usage by pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)

# Memory usage
container_memory_working_set_bytes{pod=~".+"}  
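
Queries like these can also be run from a script through the Prometheus HTTP API (`/api/v1/query`). The sketch below builds the query URL and turns an instant-vector response into a plain dictionary. It assumes Prometheus is reachable at `http://localhost:9090`, which in a cluster typically means a `kubectl port-forward` to the Prometheus service; the function names are made up for this example.

```python
from urllib.parse import urlencode


def query_url(expr, base_url="http://localhost:9090"):
    """Build an instant-query URL with the PromQL expression URL-encoded."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def vector_to_dict(payload, label):
    """Map the chosen label to the sample value for each series in the result."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    # Each sample is [timestamp, "value"]; the value arrives as a string.
    return {s["metric"].get(label, ""): float(s["value"][1]) for s in data["result"]}
```

Fetching `query_url("sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)")` and passing the decoded JSON to `vector_to_dict(payload, "pod")` yields CPU cores used per pod.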

💡 Pro Tips for 2025

Best Practices:

  • Use Recording Rules: Pre-compute expensive queries
  • Label Consistently: Use standard naming conventions
  • Set Retention: 15-30 days for metrics (balance storage vs history); Prometheus defaults to 15 days, adjustable with --storage.tsdb.retention.time
  • Monitor Prometheus: Watch its own resource usage
  • Use rate() for counters: Always use rate() or irate() for counter metrics
  • Aggregate wisely: Use sum(), avg(), max() to reduce cardinality
  • Avoid high cardinality: Don't use user IDs or timestamps as labels
  • Use federation: For multi-cluster monitoring
  • Implement service discovery: Auto-discover targets in dynamic environments
  • Backup Grafana: Export dashboards regularly
  • Use variables: Make dashboards reusable with template variables
  • Set up Alertmanager: Centralize alert routing and silencing
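
The cardinality advice above is easier to appreciate with numbers. In the worst case, the number of time series for one metric is the product of the distinct values of each label. A rough estimate, with made-up label counts:

```python
from math import prod


def series_count(distinct_values_per_label):
    """Upper bound on time series for one metric: product of distinct label values."""
    return prod(distinct_values_per_label.values())


# A reasonable label set: 5 methods x 10 status codes x 20 instances.
bounded = series_count({"method": 5, "status": 10, "instance": 20})

# The same metric with a user_id label for 100,000 users.
unbounded = series_count({"method": 5, "status": 10, "instance": 20, "user_id": 100_000})
```

Here `bounded` is 1,000 series while `unbounded` is 100 million, which is why per-user or per-request values belong in logs or traces rather than in labels.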

Recording Rules Example:

# recording_rules.yml
groups:
- name: performance_rules
  interval: 30s
  rules:
  # Pre-compute per-job HTTP request rate
  - record: job:http_requests:rate5m
    expr: sum by (job) (rate(http_requests_total[5m]))

  # Pre-compute per-job 5xx error rate
  - record: job:http_errors:rate5m
    expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

  # Pre-compute CPU usage
  - record: instance:node_cpu:avg_rate5m
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
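
Since every rule here is built on `rate(...[5m])`, it is worth knowing roughly what that function computes: the per-second increase of a counter over the window, with counter resets (for example after a restart) accounted for. The sketch below is a simplified version that skips the extrapolation to window boundaries that Prometheus performs, so its results will differ slightly from the real function.

```python
def simple_rate(samples):
    """Per-second increase of a counter over the given samples.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns None when there are fewer than two samples.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # Counter reset: the process restarted and began counting from zero.
            increase += current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None
```

For samples `[(0, 100), (15, 130), (30, 160)]` the increase is 60 over 30 seconds, so the rate is 2 requests per second.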

🚨 Common Monitoring Scenarios

1. Website/API Monitoring

# Rate of successful requests (HTTP 200), in requests per second
rate(http_requests_total{status="200"}[5m])

# Error rate percentage (5xx responses as a share of all requests)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# Average response time
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

# 95th percentile latency (aggregate buckets by le before taking the quantile)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
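
The percentile query is an estimate, not an exact value: `histogram_quantile` finds the bucket that contains the requested rank and interpolates linearly inside it. Below is a simplified sketch of that calculation. It assumes positive bucket bounds and `0 < q <= 1`, and it leaves out the edge cases the real function handles.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound and
    ending with (math.inf, total_count), like Prometheus `le` buckets.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only supports 0 < q <= 1")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("the last bucket must have an upper bound of +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if upper == math.inf:
        # The rank lies beyond the last finite bucket: report its upper bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    return lower + (upper - lower) * (rank - below) / (count - below)
```

With buckets `[(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]`, the 95th percentile lands halfway through the 0.5 to 1.0 bucket, giving 0.75 seconds. The accuracy depends entirely on how well the bucket boundaries match your real latencies.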

2. Database Monitoring

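# The queries below assume PostgreSQL metrics exposed by postgres_exporter

# Active connections
pg_stat_database_numbackends{datname="mydb"}

# Committed transactions per second
rate(pg_stat_database_xact_commit[5m])

# Cache hit ratio
(sum(pg_stat_database_blks_hit) / (sum(pg_stat_database_blks_hit) + sum(pg_stat_database_blks_read))) * 100

# Slow queries (mean execution time above 1s; this metric name depends on your exporter version and query configuration)
pg_stat_statements_mean_exec_time_seconds > 1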

3. Container Monitoring

# Container CPU usage
rate(container_cpu_usage_seconds_total{name=~".+"}[5m]) * 100

# Container memory usage
container_memory_usage_bytes{name=~".+"}

# Container network I/O
rate(container_network_receive_bytes_total[5m])
rate(container_network_transmit_bytes_total[5m])

# Container restart count (Kubernetes only; exposed by kube-state-metrics, not cAdvisor)
kube_pod_container_status_restarts_total

4. System Monitoring

# CPU usage percentage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk I/O
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])

# Network traffic
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])

📚 4-Week Learning Path

Week 1: Basics & Installation

  • ✅ Install Prometheus & Grafana with Docker
  • ✅ Understand metrics types (Counter, Gauge, Histogram, Summary)
  • ✅ Configure prometheus.yml
  • ✅ Add Node Exporter
  • Project: Monitor your local machine

Week 2: Metrics Collection & PromQL

  • ✅ Learn PromQL basics (rate, sum, avg)
  • ✅ Add multiple exporters (cAdvisor, Blackbox)
  • ✅ Practice queries in Prometheus UI
  • ✅ Understand labels and filtering
  • Project: Monitor Docker containers

Week 3: Grafana Dashboards

  • ✅ Import pre-built dashboards
  • ✅ Create custom dashboards
  • ✅ Master panel types (Graph, Stat, Table, Heatmap)
  • ✅ Use variables for dynamic dashboards
  • Project: Build application monitoring dashboard

Week 4: Alerting & Production

  • ✅ Configure Alertmanager
  • ✅ Create alert rules
  • ✅ Set up notification channels (Slack, Email, PagerDuty)
  • ✅ Implement recording rules
  • ✅ Production best practices
  • Project: Complete monitoring stack with alerts

💼 Career Impact

Role              With Monitoring Skills
Junior DevOps     ₹10-15 LPA
Mid-level         ₹18-28 LPA
Senior/Architect  ₹30-45 LPA

💼 Career Impact 2025

Junior DevOps Engineer

Salary: ₹10-15 LPA

Basic dashboard creation, alert setup

Mid-Level SRE/DevOps

Salary: ₹18-28 LPA

Advanced PromQL, custom exporters, complex alerting

Senior/Architect

Salary: ₹30-45 LPA

Multi-cluster monitoring, observability strategy, SLO/SLI design

✅ Quick Start Checklist

Week 1-2:

  • ☐ Install Prometheus & Grafana
  • ☐ Add Node Exporter
  • ☐ Import pre-built dashboards
  • ☐ Learn basic PromQL queries
  • ☐ Monitor local system

Week 3-4:

  • ☐ Create custom dashboards
  • ☐ Set up basic alerts
  • ☐ Add application metrics
  • ☐ Configure Alertmanager
  • ☐ Build production monitoring

Begin Monitoring: Your first metric is just a docker-compose up away!

Remember: You can't improve what you don't measure. Start monitoring today!
