Prometheus & Grafana Monitoring Guide 2025
🚀 Why This Stack in 2025?
Prometheus + Grafana is a widely used open-source monitoring stack, especially in Kubernetes environments. Together they cover the core of metrics-based observability: collection, visualization, and alerting.
Quick Stats:
- ✅ 60% faster incident detection
- ✅ 45% reduction in production issues
- ✅ Salary Impact: Monitoring skills add ₹5-15 LPA
- ✅ 75% adoption in Kubernetes environments
- ✅ Open-source and completely free
📦 Quick Installation
Using Docker Compose:
# docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    restart: unless-stopped
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
    restart: unless-stopped
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    restart: unless-stopped
volumes:
  prometheus_data: {}
  grafana_data: {}

# Start the stack
docker-compose up -d
# Check status
docker-compose ps
# View logs
docker-compose logs -f

🌐 Access URLs:
- 🎨 Grafana: http://localhost:3000 (admin/admin)
- 📊 Prometheus: http://localhost:9090
- 💻 Node Exporter: http://localhost:9100/metrics
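Once the containers are up, a quick scripted check can confirm that each service answers. The sketch below is a minimal example using only the Python standard library. It assumes the default ports from the Compose file, Prometheus's `/-/healthy` endpoint, and Grafana's `/api/health` endpoint; the names `HEALTH_URLS` and `is_up` are illustrative and not part of either tool.

```python
import urllib.error
import urllib.request

# Assumed defaults from the Compose file; adjust hosts and ports for your setup.
HEALTH_URLS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "grafana": "http://localhost:3000/api/health",
    "node-exporter": "http://localhost:9100/metrics",
}


def is_up(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or a non-2xx response.
        return False


for name, url in HEALTH_URLS.items():
    print(f"{name}: {'up' if is_up(url) else 'DOWN'}")
```

If a service shows as DOWN, `docker-compose logs -f` for that container is usually the next place to look.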
🎯 Core Concepts
📊 Prometheus:
- ✅ Metrics: Time-series data collection
- ✅ PromQL: Powerful query language
- ✅ Exporters: Collect metrics from servers
- ✅ AlertManager: Handle alerts
- ✅ Service Discovery: Auto-detect targets
- ✅ Pull Model: Scrapes metrics periodically
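To make the pull model concrete: on each scrape, Prometheus sends an HTTP GET to the target and receives plain text with one sample per line. The sketch below parses a simplified subset of that text format using the Python standard library. It is an illustration only; the function name `parse_samples` is hypothetical, and it does not handle escaping, timestamps, or other details that the real parser covers.

```python
import re

# One sample per line: metric_name{label="value",...} 12.5
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse simplified exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line in this simplified subset
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


scraped = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
"""

for name, labels, value in parse_samples(scraped):
    print(name, labels, value)
```

Each parsed sample, combined with the scrape timestamp, becomes one point in a time series identified by the metric name plus its labels.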
🎨 Grafana:
- ✅ Dashboards: Visualize metrics beautifully
- ✅ Panels: Graphs, stats, tables, heatmaps
- ✅ Data Sources: Connect multiple sources
- ✅ Alerts: Set notification rules
- ✅ Variables: Dynamic dashboards
- ✅ Plugins: Extend functionality
📊 Prometheus Configuration
prometheus.yml Configuration:
# Global configuration
global:
  scrape_interval: 15s # Scrape targets every 15 seconds
  evaluation_interval: 15s # Evaluate rules every 15 seconds
  external_labels:
    cluster: 'production'
    region: 'us-east-1'
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
# Load rules once and periodically evaluate them
rule_files:
  - "alerts.yml"
  - "recording_rules.yml"
# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  # Node Exporter (System metrics)
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          env: 'production'
  # Application metrics
  - job_name: 'application'
    static_configs:
      - targets: ['app:8080']
        labels:
          service: 'web-api'
  # Docker containers
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

🖥️ Essential Exporters
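Every exporter in this section does the same basic job: it serves current metric values as plain text over HTTP so that Prometheus can scrape them. As a rough illustration of that contract (not of how Node Exporter or cAdvisor are implemented), here is a minimal standard-library Python sketch. The metric name `demo_requests_total`, port 8000, and the helper names are made up for the example; a real application would normally use an official client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counter values, keyed by HTTP status label.
REQUEST_COUNTS = {"200": 1027, "500": 3}


def render_metrics(counts: dict[str, int]) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP demo_requests_total Requests handled, by status code.",
        "# TYPE demo_requests_total counter",
    ]
    for status, value in sorted(counts.items()):
        lines.append(f'demo_requests_total{{status="{status}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answer GET /metrics with the current counter values."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Serve /metrics until interrupted (blocks the calling thread)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()


print(render_metrics(REQUEST_COUNTS))
```

Calling `serve()` and listing the host and port as a target under `scrape_configs` should be enough for Prometheus to start collecting the metric.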
1. Node Exporter (System Metrics)
Collects hardware and OS metrics: CPU, memory, disk, network
# Docker
docker run -d \
--name=node-exporter \
-p 9100:9100 \
prom/node-exporter
# Verify
curl http://localhost:9100/metrics

2. cAdvisor (Container Metrics)
Monitors Docker containers: CPU, memory, network, filesystem
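# Docker
# Note: the google/cadvisor image on Docker Hub is no longer updated;
# newer releases are published as gcr.io/cadvisor/cadvisor.
docker run -d \
  --name=cadvisor \
  -p 8080:8080 \
  -v /:/rootfs:ro \
  -v /var/run:/var/run:ro \
  -v /sys:/sys:ro \
  -v /var/lib/docker/:/var/lib/docker:ro \
  google/cadvisor:latest

3. Blackbox Exporter (Endpoint Probing)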
Probes HTTP, HTTPS, DNS, TCP, ICMP endpoints
# Docker
docker run -d \
--name=blackbox-exporter \
-p 9115:9115 \
prom/blackbox-exporter
# Probe HTTP endpoint
curl "http://localhost:9115/probe?target=https://example.com&module=http_2xx"

4. MySQL Exporter (Database Metrics)
Monitors MySQL/MariaDB performance
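# Docker
# mysqld_exporter 0.15.0+ no longer reads DATA_SOURCE_NAME;
# pass the password via env var and the rest via flags.
docker run -d \
  --name=mysql-exporter \
  -p 9104:9104 \
  -e MYSQLD_EXPORTER_PASSWORD="password" \
  prom/mysqld-exporter \
  --mysqld.address=mysql:3306 \
  --mysqld.username=user

📈 Grafana Dashboard Setup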
Step 1: Add Prometheus Data Source
- 1. Open Grafana: http://localhost:3000
- 2. Log in with admin/admin (change the password when prompted)
- 3. Go to: Connections → Data sources → Add data source (Configuration → Data Sources in older Grafana versions)
- 4. Select: Prometheus
- 5. URL: http://prometheus:9090
- 6. Click: Save & Test (should show a green checkmark)
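The same data source can also be created without the UI. The sketch below builds a request for Grafana's HTTP API (`POST /api/datasources`) using only the Python standard library. It assumes the default admin/admin credentials and the URLs from the steps above; the function name `build_datasource_request` is illustrative, and nothing is sent unless you pass the request to `urlopen` yourself.

```python
import base64
import json
import urllib.request


def build_datasource_request(
    grafana_url: str = "http://localhost:3000",
    prometheus_url: str = "http://prometheus:9090",
    user: str = "admin",
    password: str = "admin",
) -> urllib.request.Request:
    """Build (but do not send) a request that registers Prometheus in Grafana."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # Grafana's backend performs the queries
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


req = build_datasource_request()
print(req.get_method(), req.full_url)
```

To apply it against a running Grafana, call `urllib.request.urlopen(req)`. Scripting the step this way is mainly useful when the stack is rebuilt often and clicking through the UI becomes repetitive.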
Step 2: Import Pre-built Dashboards
Popular Dashboard IDs:
- 🖥️ 1860 - Node Exporter Full (System metrics)
- ☸️ 315 - Kubernetes Cluster Monitoring
- 🐳 893 - Docker Container & Host Metrics
- 🌐 7587 - Nginx Monitoring
- 🗄️ 7362 - MySQL Overview
- 📊 3662 - Prometheus 2.0 Stats
Import Steps:
- 1. Click + icon → Import
- 2. Enter Dashboard ID (e.g., 1860)
- 3. Click Load
- 4. Select Prometheus data source
- 5. Click Import
- 6. Dashboard ready instantly! 🎉
🔔 Setting Up Alerts
Prometheus Alert Rules (alerts.yml):
groups:
  - name: system_alerts
    rules:
      # High CPU Alert
      - alert: HighCPULoad
        expr: node_load1 > 2
        for: 5m
        labels:
          severity: warning
          team: devops
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
          description: "CPU load is {{ $value }} (threshold: 2)"
      # High Memory Usage
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}% (threshold: 80%)"
      # Disk Space Low
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
      # Service Down
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"

Grafana Alerting Setup:
- 1. Create Alert Rule:
- • Go to Alerting → Alert rules → New alert rule
- • Set query and condition (e.g., CPU > 80%)
- • Define evaluation interval
- 2. Configure Contact Points:
- • Alerting → Contact points → New contact point
- • Add Slack, Email, PagerDuty, Webhook
- • Test notification
- 3. Create Notification Policy:
- • Define routing based on labels
- • Set grouping and timing
- • Configure escalation
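Both the `for:` clause in the Prometheus rules and the pending period of a Grafana alert rule delay notification until the condition has held continuously, which filters out short spikes. The following Python sketch models that behavior in simplified form. The function name `alert_states` and the sample data are illustrative, and real evaluators track more state than this (resolved notifications, for example).

```python
def alert_states(samples, threshold, for_seconds):
    """Yield (timestamp, state) for each (timestamp, value) sample.

    States: "inactive" (condition false), "pending" (true, but not yet
    for `for_seconds`), "firing" (true for at least `for_seconds`).
    """
    active_since = None
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts  # condition just became true
            held = ts - active_since
            yield ts, "firing" if held >= for_seconds else "pending"
        else:
            active_since = None  # any false evaluation resets the timer
            yield ts, "inactive"


# node_load1 evaluated every 60s against: node_load1 > 2, for: 5m (300s)
load = [(0, 1.5), (60, 2.4), (120, 2.6), (180, 1.9), (240, 2.2),
        (300, 2.3), (360, 2.5), (420, 2.8), (480, 3.0), (540, 3.1)]

for ts, state in alert_states(load, threshold=2, for_seconds=300):
    print(ts, state)
```

In this run the dip at t=180 resets the timer, so the alert only reaches "firing" at t=540, five minutes after the condition became true again at t=240.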
🐳 Kubernetes Monitoring
Install with Helm:
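helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm install prometheus prometheus-community/prometheus
helm install grafana grafana/grafana

Monitor Pods:
# CPU usage by pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
# Memory usage
container_memory_working_set_bytes{pod=~".+"}

💡 Pro Tips for 2025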
Best Practices:
- ✅ Use Recording Rules: Pre-compute expensive queries
- ✅ Label Consistently: Use standard naming conventions
- ✅ Set Retention: 15-30 days for metrics (balance storage vs history)
- ✅ Monitor Prometheus: Watch its own resource usage
- ✅ Use rate() for counters: Always use rate() or irate() for counter metrics
- ✅ Aggregate wisely: Use sum(), avg(), max() to reduce cardinality
- ✅ Avoid high cardinality: Don't use user IDs or timestamps as labels
- ✅ Use federation: For multi-cluster monitoring
- ✅ Implement service discovery: Auto-discover targets in dynamic environments
- ✅ Backup Grafana: Export dashboards regularly
- ✅ Use variables: Make dashboards reusable with template variables
- ✅ Set up AlertManager: Centralize alert routing and silencing
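The advice to wrap counters in rate() is easier to follow once the calculation is visible. Counters only go up, except when a process restarts and the counter starts again from zero; rate() compensates for those resets and divides the total increase by the elapsed time. The Python sketch below shows a simplified version of that idea. The name `simple_rate` is hypothetical, and the real PromQL function additionally extrapolates to the edges of the range window, so its results will differ slightly.

```python
def simple_rate(samples):
    """Per-second increase over (timestamp, value) counter samples.

    A drop in value is treated as a counter reset: the counter is assumed
    to have restarted from zero, so the new value counts in full.
    """
    if len(samples) < 2:
        return None  # at least two samples are needed
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += curr - prev if curr >= prev else curr
    return increase / elapsed


# Scraped every 15s; the process restarts between t=30 and t=45.
samples = [(0, 100), (15, 130), (30, 160), (45, 20), (60, 50)]
print(simple_rate(samples))  # total increase of 110 over 60 seconds
```

Graphing the raw counter here would show a misleading cliff at the restart, while the rate stays steady, which is why dashboards and alerts should be built on rate() rather than on raw counter values.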
Recording Rules Example:
# recording_rules.yml
groups:
  - name: performance_rules
    interval: 30s
    rules:
      # Pre-compute HTTP request rate per job
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Pre-compute 5xx error rate per job
      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
      # Pre-compute CPU usage
      - record: instance:node_cpu:avg_rate5m
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

🚨 Common Monitoring Scenarios
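The queries in this section can be pasted into the Prometheus UI, or run programmatically through the HTTP API at `/api/v1/query`. Here is a small standard-library Python sketch of the latter. It assumes Prometheus is reachable at `http://localhost:9090`, as in the Docker Compose setup; `build_query_url` and `run_query` are illustrative helper names.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumed from the Compose setup


def build_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def run_query(promql: str) -> list:
    """Run an instant query and return the list of result series."""
    with urllib.request.urlopen(build_query_url(promql), timeout=5) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]


print(build_query_url("up == 0"))
```

With the stack running, `run_query("up")` should return one entry per scrape target, each carrying its labels and the latest value.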
1. Website/API Monitoring
# Successful (HTTP 200) request rate, in requests per second
rate(http_requests_total{status="200"}[5m])
# Error rate percentage (non-2xx responses as a share of all requests)
sum(rate(http_requests_total{status!~"2.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
# Average response time
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

2. Database Monitoring
# Active connections
pg_stat_database_numbackends{datname="mydb"}
# Query performance (transactions per second)
rate(pg_stat_database_xact_commit[5m])
# Cache hit ratio
(sum(pg_stat_database_blks_hit) / (sum(pg_stat_database_blks_hit) + sum(pg_stat_database_blks_read))) * 100
# Slow queries
pg_stat_statements_mean_exec_time_seconds > 1

3. Container Monitoring
# Container CPU usage
rate(container_cpu_usage_seconds_total{name=~".+"}[5m]) * 100
# Container memory usage
container_memory_usage_bytes{name=~".+"}
# Container network I/O
rate(container_network_receive_bytes_total[5m])
rate(container_network_transmit_bytes_total[5m])
# Container restart count (requires kube-state-metrics)
kube_pod_container_status_restarts_total

4. System Monitoring
# CPU usage percentage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk I/O
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])
# Network traffic
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])

📚 4-Week Learning Path
Week 1: Basics & Installation
- ✅ Install Prometheus & Grafana with Docker
- ✅ Understand metrics types (Counter, Gauge, Histogram, Summary)
- ✅ Configure prometheus.yml
- ✅ Add Node Exporter
- ✅ Project: Monitor your local machine
Week 2: Metrics Collection & PromQL
- ✅ Learn PromQL basics (rate, sum, avg)
- ✅ Add multiple exporters (cAdvisor, Blackbox)
- ✅ Practice queries in Prometheus UI
- ✅ Understand labels and filtering
- ✅ Project: Monitor Docker containers
Week 3: Grafana Dashboards
- ✅ Import pre-built dashboards
- ✅ Create custom dashboards
- ✅ Master panel types (Graph, Stat, Table, Heatmap)
- ✅ Use variables for dynamic dashboards
- ✅ Project: Build application monitoring dashboard
Week 4: Alerting & Production
- ✅ Configure AlertManager
- ✅ Create alert rules
- ✅ Set up notification channels (Slack, Email, PagerDuty)
- ✅ Implement recording rules
- ✅ Production best practices
- ✅ Project: Complete monitoring stack with alerts
💼 Career Impact 2025
| Role | Salary (with monitoring skills) | Typical skills |
|---|---|---|
| Junior DevOps Engineer | ₹10-15 LPA | Basic dashboard creation, alert setup |
| Mid-Level SRE/DevOps | ₹18-28 LPA | Advanced PromQL, custom exporters, complex alerting |
| Senior/Architect | ₹30-45 LPA | Multi-cluster monitoring, observability strategy, SLO/SLI design |
✅ Quick Start Checklist
Week 1-2:
- ☐ Install Prometheus & Grafana
- ☐ Add Node Exporter
- ☐ Import pre-built dashboards
- ☐ Learn basic PromQL queries
- ☐ Monitor local system
Week 3-4:
- ☐ Create custom dashboards
- ☐ Set up basic alerts
- ☐ Add application metrics
- ☐ Configure AlertManager
- ☐ Build production monitoring
Begin Monitoring: Your first metric is just a docker-compose up away!
Remember: You can't improve what you don't measure. Start monitoring today!