Joe Blog

最近在Slack的prometheus 告警通道经常收到etcdHighNumberOfFailedGRPCRequests 这个条告警信息

登录到这个机器上查看log 相近时间的日志也没看到什么异常信息

就只能求助google 应该有人也会碰到一样的问题

查到redhat的bugzilla下的一个issus

https://bugzilla.redhat.com/show_bug.cgi?id=1701154

以及openshift下一个PR

https://github.com/openshift/cluster-monitoring-operator/pull/340/files#diff-1

提供一个短期解决方案

移除这条etcd告警的rules

由于Helm安装的Prometheus-operator 需要在template里找到相应的模板位置

.../templates/prometheus/rules/etcd.yaml

...
44     - alert: etcdHighNumberOfFailedGRPCRequests
 45       annotations:
 46         message: 'etcd cluster "`}}": `}}% of requests for `}} failed on etcd instance `}}.'
 47       expr: |-
 48         100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK"}[5m])) BY (job, instance, grpc_service, grpc_method)
 49           /
 50         sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) BY (job, instance, grpc_service, grpc_method)
 51           > 1
 52       for: 10m
 53       labels:
 54         severity: warning
 55     - alert: etcdHighNumberOfFailedGRPCRequests
 56       annotations:
 57         message: 'etcd cluster "`}}": `}}% of requests for `}} failed on etcd instance `}}.'
 58       expr: |-
 59         100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK"}[5m])) BY (job, instance, grpc_service, grpc_method)
 60           /
 61         sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) BY (job, instance, grpc_service, grpc_method)
 62           > 5
 63       for: 5m
 64       labels:
 65         severity: critical

...


然后

root@k8s-master-1:~/k8s_manifests/helm-prometheus-operator# helm upgrade prometheus-operator .