01 - Kubernetes Monitoring: Prometheus + Grafana

Monitoring Overview

┌─────────────────────────────────────────────────────┐
│         The Three Pillars of Observability          │
│                                                     │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  │
│  │ Metrics     │  │ Logging     │  │ Tracing     │  │
│  │             │  │             │  │             │  │
│  │ Prometheus  │  │ EFK / Loki  │  │ Jaeger /    │  │
│  │ + Grafana   │  │             │  │ Zipkin      │  │
│  └─────────────┘  └─────────────┘  └─────────────┘  │
└─────────────────────────────────────────────────────┘

Prometheus Architecture

┌───────────────────────────────────────────────────────────┐
│                                                           │
│  ┌────────────────┐    pull metrics    ┌───────────────┐  │
│  │ Prometheus     │ ◄───────────────── │ K8s Nodes     │  │
│  │ Server         │                    │ kubelet       │  │
│  │                │ ◄───────────────── │ cAdvisor      │  │
│  │ Time-series DB │                    └───────────────┘  │
│  │ PromQL         │                                       │
│  │ Alert rules    │ ◄───────────────── ┌───────────────┐  │
│  └───────┬────────┘    pull metrics    │ App Pods      │  │
│          │                             │ /metrics      │  │
│          │                             └───────────────┘  │
│          │                                                │
│   ┌──────▼──────┐     ┌───────────────┐                   │
│   │ Grafana     │     │ Alertmanager  │                   │
│   │ Dashboards  │     │ Notifications │                   │
│   └─────────────┘     └───────────────┘                   │
└───────────────────────────────────────────────────────────┘

Core Concepts

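Pull model: Prometheus scrapes metrics from its targets on a schedule, instead of targets pushing metrics to it (the push model).
Time series: every sample carries a timestamp, for example CPU utilization as it changes over time.
PromQL: the Prometheus query language.
Exporter: a process that exposes the metrics of an application or system in the Prometheus format.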
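
To make the pull model and the exporter idea concrete, here is a minimal sketch of the text a scrape target returns when Prometheus pulls its /metrics endpoint. It uses only the standard library; the helper names and sample values are hypothetical, and label-value escaping, HELP/TYPE lines, and timestamps are left out.

```python
"""Sketch: the body a scrape target serves in the Prometheus text format."""


def render_sample(name, labels, value):
    """Format one sample as `name{key="value",...} value`."""
    if labels:
        pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        return f"{name}{{{pairs}}} {value}"
    return f"{name} {value}"


def render_metrics(samples):
    """Join samples into the newline-terminated body of a /metrics response."""
    return "\n".join(render_sample(*sample) for sample in samples) + "\n"


# Hypothetical samples; names and values are made up for illustration.
SAMPLES = [
    ("http_requests_total", {"method": "GET", "status": "200"}, 42),
    ("process_open_fds", {}, 7),
]

if __name__ == "__main__":
    print(render_metrics(SAMPLES), end="")
```

Prometheus stores each scraped line as one sample of a time series identified by the metric name plus its label set, and stamps it with the scrape time.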

Installing kube-prometheus-stack

Install the whole stack with a single Helm command. The chart bundles Prometheus, Grafana, Alertmanager, and a set of exporters:

bash

# Add the Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install the chart as release "monitoring" (admin123 is a demo password; change it on real clusters)
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=admin123

# List the installed components
kubectl get all -n monitoring

Accessing Grafana

bash

# Forward local port 3000 to the Grafana service
kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80

# Open http://localhost:3000 in a browser
# Username: admin / Password: admin123

Built-in Dashboards

The chart ships with many preinstalled dashboards, including:

Kubernetes / Compute Resources / Cluster: cluster-wide overview
Kubernetes / Compute Resources / Namespace (Pods): resource usage per namespace
Kubernetes / Compute Resources / Pod: details for a single Pod
Node Exporter / Nodes: node-level metrics

Common PromQL Queries

promql

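# Node CPU utilization (%)
100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Pod memory usage (working set in bytes, one series per container)
container_memory_working_set_bytes{namespace="default"}

# Pod CPU usage (CPU cores, averaged over 5 minutes)
rate(container_cpu_usage_seconds_total{namespace="default"}[5m])

# HTTP request rate (requests per second, per series)
rate(http_requests_total[5m])

# HTTP error ratio (share of 5xx responses); both sides are summed so their label sets match
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Pod restart count (cumulative, per container)
kube_pod_container_status_restarts_total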
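
The rate() queries above turn an ever-growing counter into a per-second value. The following is a simplified sketch of that arithmetic, assuming hypothetical samples scraped every 15 seconds. It handles counter resets but deliberately skips the extrapolation to the window boundaries that Prometheus itself applies, so real results can differ slightly.

```python
"""Sketch: the arithmetic behind rate() and an error ratio, without extrapolation."""


def counter_increase(samples):
    """Total increase across (timestamp, value) samples, treating any drop as a reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the new value is the increase.
        total += cur - prev if cur >= prev else cur
    return total


def simple_rate(samples):
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed


# Hypothetical counters sampled every 15 seconds over one minute.
ALL_REQUESTS = [(0, 1000), (15, 1030), (30, 1060), (45, 1090), (60, 1120)]
ERRORS_5XX = [(0, 10), (15, 11), (30, 13), (45, 2), (60, 4)]  # process restarted before t=45

if __name__ == "__main__":
    total_rate = simple_rate(ALL_REQUESTS)
    error_rate = simple_rate(ERRORS_5XX)
    print(f"requests/s: {total_rate:.2f}")
    print(f"5xx ratio:  {error_rate / total_rate:.3f}")
```

Dividing the two rates mirrors the error-ratio query: both counters are reduced to one number each before the division.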

Alerting Rules

yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
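  namespace: monitoring
  labels:
    release: monitoring  # with the chart's default settings, rules are selected by the Helm release label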
spec:
  groups:
  - name: app
    rules:
    - alert: HighPodRestarts
      expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} restarted more than 5 times in the last hour"
    - alert: HighMemoryUsage
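      # Skip containers without a memory limit (limit 0), which would otherwise divide by zero and always fire
      expr: (container_memory_working_set_bytes{container!=""} / (container_spec_memory_limit_bytes{container!=""} > 0)) > 0.9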
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.pod }} is using more than 90% of its memory limit"
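
The `for` field in the rules above delays an alert until its expression has been true continuously for that long. Below is a minimal sketch of that pending-to-firing behavior, assuming a hypothetical series of rule evaluations one minute apart; it models a single alert instance and ignores everything Alertmanager does afterwards.

```python
"""Sketch: how a `for` duration turns a true expression into pending, then firing."""


def alert_states(evaluations, for_seconds):
    """Map (timestamp, condition_is_true) evaluations to alert states."""
    states = []
    active_since = None  # timestamp at which the condition last became true
    for ts, is_true in evaluations:
        if not is_true:
            # Any false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held_long_enough = ts - active_since >= for_seconds
        states.append("firing" if held_long_enough else "pending")
    return states


# Hypothetical evaluations every 60 seconds, checked against `for: 5m` (300 seconds).
EVALS = [(0, True), (60, True), (120, False), (180, True), (240, True),
         (300, True), (360, True), (420, True), (480, True)]

if __name__ == "__main__":
    for (ts, _), state in zip(EVALS, alert_states(EVALS, 300)):
        print(f"t={ts:>3}s {state}")
```

In this run the dip at t=120 restarts the timer, so the alert only fires at t=480, five minutes after the condition became true again at t=180.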

Custom Application Metrics

Expose a /metrics endpoint in the application:

python

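# Python example (prometheus_client)
from flask import Flask, Response
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

app = Flask(__name__)

REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency')

@app.route("/")
def index():
    # Time the handler and count the request so the metrics carry real data.
    with REQUEST_LATENCY.time():
        body = "ok"
    REQUEST_COUNT.labels(method="GET", endpoint="/", status="200").inc()
    return body

@app.route("/metrics")
def metrics():
    # Serve every registered metric in the Prometheus text format.
    return Response(generate_latest(), content_type=CONTENT_TYPE_LATEST)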

Configure Prometheus to discover the endpoint automatically with a ServiceMonitor:

yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
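  namespace: monitoring
  labels:
    release: monitoring  # with the chart's default settings, ServiceMonitors are selected by the Helm release label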
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: http  # name of the Service port, not a port number
    path: /metrics
    interval: 15s
  namespaceSelector:
    matchNames:
    - default
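
A ServiceMonitor like the one above selects Services by namespace and labels, then scrapes the Service port whose name matches `port`. The sketch below mimics that matching logic with plain dictionaries. It is an illustration of the selection rules only, not the Prometheus Operator's code, and the example Services are hypothetical.

```python
"""Sketch: which Services a ServiceMonitor's selectors would pick up."""


def service_matches(monitor, service):
    """Return True if the Service satisfies the namespace and label selectors."""
    if service["namespace"] not in monitor["namespaceSelector"]["matchNames"]:
        return False
    wanted = monitor["selector"]["matchLabels"]
    # Every selector label must be present with the same value; extra labels are fine.
    return all(service["labels"].get(key) == value for key, value in wanted.items())


def scrape_targets(monitor, service):
    """List (port_name, path) pairs for endpoints whose port name exists on the Service."""
    if not service_matches(monitor, service):
        return []
    return [(ep["port"], ep.get("path", "/metrics"))
            for ep in monitor["endpoints"] if ep["port"] in service["ports"]]


# Mirrors the ServiceMonitor spec shown in the text.
MONITOR = {
    "selector": {"matchLabels": {"app": "my-app"}},
    "namespaceSelector": {"matchNames": ["default"]},
    "endpoints": [{"port": "http", "path": "/metrics", "interval": "15s"}],
}

# Hypothetical Services, each reduced to namespace, labels, and named ports.
GOOD = {"namespace": "default", "labels": {"app": "my-app", "tier": "web"}, "ports": ["http"]}
WRONG_NAMESPACE = {"namespace": "staging", "labels": {"app": "my-app"}, "ports": ["http"]}
UNNAMED_PORT = {"namespace": "default", "labels": {"app": "my-app"}, "ports": []}

if __name__ == "__main__":
    for name, svc in [("good", GOOD), ("wrong namespace", WRONG_NAMESPACE),
                      ("unnamed port", UNNAMED_PORT)]:
        print(name, scrape_targets(MONITOR, svc))
```

The last case is a common reason for missing targets: the Service matches, but none of its ports is named `http`, so there is nothing to scrape.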

Next Steps

→ 02 - Log Management
