Joe Blog

日常我们对于Kubernetes环境下的系统及业务的监控会比较关注,有主机类针对宿主机Node的运行状态、有业务类针对应用性能的APM链路追踪等,往往很容易忽略证书类的有效期限的监控,这篇博客的内容就是针对随着时间的流逝,这些SSL证书很有可能会过期,从而影响Kubernetes系统的日常加密通信。

我们将采用一种简单的Prometheus和node-cert-exporter检查SSL证书的到期日期,并在Grafana中可视化这些日期数据,并配置相应的告警信息提醒,做到提前更新这些过期的SSL证书,保障系统及业务的正常运行。

前提

  1. Kubernetes v1.16+

  2. Prometheus

  3. Grafana

  4. AlertManager

    以上组件准备就绪

Node-cert-exporter

DaemonSet的形式在Kubernetes集群环境运行, node-cert-exporterserviceTypeClusterIP

---
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
spec:
  finalizers:
  - kubernetes
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-cert-exporter
  namespace: monitoring
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: node-cert-exporter
  name: node-cert-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-cert-exporter
  template:
    metadata:
      name: node-cert-exporter
      labels:
        app: node-cert-exporter
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '9117'
    spec:
      containers:
      - image: amimof/node-cert-exporter:latest
        args:
        - "--v=2"
        - "--logtostderr=true"
        - "--path=/host/etc/ssl,/host/etc/kubernetes/pki/,/host/etc/kubernetes/ssl"
        env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName
        name: node-cert-exporter
        ports:
        - containerPort: 9117
          name: http
          protocol: TCP
        resources:
          limits:
            cpu: 100m
            memory: 128Mi
          requests:
            cpu: 100m
            memory: 128Mi
        volumeMounts:
        - mountPath: /host/etc
          name: etc
          readOnly: true
      serviceAccount: node-cert-exporter
      serviceAccountName: node-cert-exporter
      volumes:
      - hostPath:
          path: /etc
          type: ""
        name: etc
  updateStrategy:
    type: RollingUpdate
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: node-cert-exporter
  name: node-cert-exporter
  namespace: monitoring
spec:
  ports:
  - name: http
    port: 9117
    protocol: TCP
    targetPort: 9117
  selector:
    app: node-cert-exporter
  type: ClusterIP
root@node1:~# kubectl get daemonset,svc,pods -n monitoring -l app=node-cert-exporter
NAME                                DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/node-cert-exporter   4         4         4       4            4           <none>          4h34m

NAME                         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/node-cert-exporter   ClusterIP   10.244.35.128   <none>        9117/TCP   5h36m

NAME                           READY   STATUS    RESTARTS   AGE
pod/node-cert-exporter-lspw7   1/1     Running   0          4h34m
pod/node-cert-exporter-pr62m   1/1     Running   0          4h34m
pod/node-cert-exporter-qrbs5   1/1     Running   0          4h34m
pod/node-cert-exporter-tnnrc   1/1     Running   0          4h34m

Prometheus

自定义serviceMonitor文件

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: node-cert-exporter
    release: prom
  name: node-cert-exporter
  namespace: monitoring
spec:
  endpoints:
  - path: /metrics
    port: http
  namespaceSelector:
    matchNames:
    - monitoring
  selector:
    matchLabels:
      app: node-cert-exporter

这边有一个注意事项,需要在.metadata,labels注释一个release:prom 这个是根据Prometheus的serviceMonitorSelector选择器的标签匹配有关

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
... 省略部分
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector:
    matchLabels:
      release: prom
...

说明node-cert-exporter/metrics接口能正常被Prometheus的抓取到,Prometheus这边的服务发现正常

创建Grafana仪表盘

导入Grafana仪表板

查找grafana Dashboard Id

导入Dashboard ID为9999的仪表盘

最终展示的Dashboard仪表盘

SSL证书到期警报

添加PrometheusRule 告警规则node-cert-exporter-prometheusrule.yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app: prometheus-operator
    chart: prometheus-operator-8.15.11
    heritage: Helm
    release: prom
  name: prom-prometheus-operator-node-cert-exporter
  namespace: monitoring
spec:
  groups:
  - name: node-cert-exporter
    rules:
    - alert: SSLCertExpiringSoon
      expr: sum(ssl_certificate_expiry_seconds{}) by (instance, path) < 86400 * 30
      for: 1h
      labels:
        severity: warning

设置Alert告警消息机制 (这边通过slack 消息发送)

global:
  resolve_timeout: 5m
receivers:
- name: "null"
- name: slack_webhook
  slack_configs:
  - api_url: https://hooks.slack.com/services/XXXXXX/XXXXXXXXXXXXXXXXXX
    channel: alertmanager
    color: 'dangergood'
    fallback: ''
    icon_emoji: ''
    icon_url: ''
    pretext: ''
    send_resolved: false
    text: |-
      
        *Alert:* `` - ``
        *Details:*
         • *:* ``
        
      
    title: ''
    title_link: ''
    username: ''
route:
  group_by:
  - alertname
  - cluster
  - service
  group_interval: 5m
  group_wait: 30s
  receiver: slack_webhook
  repeat_interval: 15m
  routes:
  - match:
      alertname: Watchdog
    receiver: "null"
  - match:
      alertname: CPUThrottlingHigh
    receiver: "null"
  - match:
      alertname: ^(host_cpu_usage|node_filesystem_free|host_down|TargetDown)$
    receiver: slack_webhook
  - match:
      alertname: SSLCertExpiringSoon
    receiver: slack_webhook
templates:
- /etc/alertmanager/config/*.tmpl

现在,当您SSL相关证书还有一个月以内的有效期就会通过Prometheus监控和AlertManager告警通过Slack消息通知到你。