1. Common deployment-phase issues and solutions
a. Pod deployment errors
Symptoms:
ImagePullBackOff: image pull failed
Pending: insufficient resources or a storage problem
CrashLoopBackOff: the container keeps crashing on startup
Diagnostic commands:
# Inspect detailed error information
kubectl describe pod <pod-name> -n <namespace>
kubectl get events --sort-by='.lastTimestamp' --field-selector=involvedObject.name=<pod-name>
# Check for image-related failures
kubectl get pods -o wide | grep -E "(ImagePullBackOff|ErrImagePull|InvalidImageName)"
# Check resource availability
kubectl top nodes
kubectl describe node <node-name> | grep -A 10 -B 5 "Allocatable"
# Check storage status
kubectl get pvc
kubectl get pv
kubectl describe pvc <pvc-name>
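Once `kubectl describe` points at a failing Pod, it helps to filter the Pod list down to only the failure states. A minimal sketch of that filter, run here against an illustrative here-doc that stands in for real `kubectl get pods` output (pod names and ages are made up):

```shell
#!/bin/sh
# Illustrative stand-in for `kubectl get pods` output (not from a real cluster)
sample_output() {
cat <<'EOF'
NAME                READY   STATUS             RESTARTS   AGE
tomcat-6f9c-abcde   1/1     Running            0          2d
tomcat-6f9c-fghij   0/1     ImagePullBackOff   0          5m
tomcat-6f9c-klmno   0/1     CrashLoopBackOff   7          30m
EOF
}

# Keep the header row plus any row whose STATUS column is a known failure state
bad_pods=$(sample_output | awk 'NR==1 || $3 ~ /(ImagePullBackOff|ErrImagePull|CrashLoopBackOff|Pending|InvalidImageName)/')
printf '%s\n' "$bad_pods"
```

Against a live cluster, pipe `kubectl get pods -n <namespace>` into the same awk filter.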
# Complete deployment configuration example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tomcat-deployment
  namespace: tomcat-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
spec:
  replicas: 3
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: tomcat
      version: v1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: tomcat
        version: v1
    spec:
      # Anti-affinity: avoid packing replicas onto the same node
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values: ["tomcat"]
              topologyKey: kubernetes.io/hostname
      # Node scheduling constraints
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "tomcat"
        effect: "NoSchedule"
      containers:
      - name: tomcat
        image: tomcat:8.5.93-jdk8-corretto@sha256:<sha256-digest>
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8080
          name: http
        - containerPort: 8000
          name: debug
        # Resource requests and limits
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
        # Environment variables
        env:
        - name: JAVA_OPTS
          value: "-Xms512m -Xmx1024m -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+PrintGC -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/heapdump"
        - name: SPRING_PROFILES_ACTIVE
          value: "prod"
        # Health checks
        livenessProbe:
          httpGet:
            path: /actuator/health
            port: 8080
            httpHeaders:
            - name: Custom-Header
              value: Awesome
          initialDelaySeconds: 60
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /actuator/health/readiness
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
        startupProbe:
          httpGet:
            path: /actuator/health
            port: 8080
          failureThreshold: 30
          periodSeconds: 10
        # Volume mounts
        volumeMounts:
        - name: tomcat-storage
          mountPath: /usr/local/tomcat/webapps
        - name: gclog
          mountPath: /opt/gclog
        - name: heapdump
          mountPath: /opt/heapdump
        - name: app-config
          mountPath: /app/config
          readOnly: true
        # Security context
        securityContext:
          runAsNonRoot: true
          runAsUser: 1000
          readOnlyRootFilesystem: true
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
      # Volume definitions
      volumes:
      - name: tomcat-storage
        persistentVolumeClaim:
          claimName: tomcat-pvc
      - name: gclog
        emptyDir:
          sizeLimit: 100Mi
      - name: heapdump
        emptyDir:
          sizeLimit: 2Gi
      - name: app-config
        configMap:
          name: tomcat-config
      # Service account
      serviceAccountName: tomcat-service-account
      # Restart policy
      restartPolicy: Always
---
# Service configuration
apiVersion: v1
kind: Service
metadata:
  name: tomcat-service
  namespace: tomcat-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
spec:
  type: NodePort
  selector:
    app: tomcat
    version: v1
  ports:
  - name: http
    port: 8080
    targetPort: 8080
    nodePort: 30080
  - name: metrics
    port: 8081
    targetPort: 8081
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
---
# Ingress configuration
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: tomcat-ingress
  namespace: tomcat-app
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "route"
    nginx.ingress.kubernetes.io/session-cookie-expires: "172800"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "172800"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
spec:
  ingressClassName: nginx
  rules:
  - host: tomcat.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: tomcat-service
            port:
              number: 8080
  tls:
  - hosts:
    - tomcat.example.com
    secretName: tomcat-tls
b. Service/Ingress access issues
Diagnostic commands:
# Check service discovery
kubectl get endpoints <service-name>
kubectl describe service <service-name>
# Check the Ingress
kubectl describe ingress <ingress-name>
kubectl get ingress -o wide
# Network diagnosis
kubectl run network-check --rm -it --image=nicolaka/netshoot -- bash
# Inside the netshoot container:
dig tomcat-service.tomcat-app.svc.cluster.local
# Include the namespace: a bare service name only resolves from the same namespace
curl -v http://tomcat-service.tomcat-app:8080
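The `dig` target above follows the standard Service DNS form `<service>.<namespace>.svc.<cluster-domain>`, which is why a bare service name only resolves from Pods in the same namespace. A small sketch of how that FQDN is assembled (`cluster.local` is the common default cluster domain, an assumption here since clusters can override it):

```shell
#!/bin/sh
# Build the in-cluster DNS name of a Service: <svc>.<ns>.svc.<cluster-domain>
svc_fqdn() {
  svc=$1; ns=$2; domain=${3:-cluster.local}   # cluster.local: default, may differ per cluster
  printf '%s.%s.svc.%s\n' "$svc" "$ns" "$domain"
}

svc_fqdn tomcat-service tomcat-app
# -> tomcat-service.tomcat-app.svc.cluster.local
```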
2. Common runtime issues and solutions
a. Continuously rising memory & OOM issues
Monitoring commands:
# Real-time monitoring
kubectl top pods --containers --use-protocol-buffers
kubectl top nodes --use-protocol-buffers
# Detailed memory analysis (assumes the JVM is PID 1 in the container)
kubectl exec <pod-name> -- pmap -x 1 | head -20
kubectl exec <pod-name> -- jstat -gc <pid> 1000 10
# Generate and retrieve a heap dump
kubectl exec <pod-name> -- jmap -dump:live,format=b,file=/opt/heapdump/heap.hprof 1
kubectl cp <namespace>/<pod-name>:/opt/heapdump/heap.hprof ./heap.hprof
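Once `jstat -gc` samples are collected, old-generation occupancy (used/capacity) is usually the first number worth computing when chasing a leak. A sketch of that calculation, with a here-doc standing in for JDK 8 `jstat -gc` output (the values are illustrative):

```shell
#!/bin/sh
# Illustrative stand-in for JDK 8 `jstat -gc <pid>` output (values in KB)
jstat_sample() {
cat <<'EOF'
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
1024.0 1024.0  0.0   512.0  8192.0   4096.0   524288.0   393216.0  51200.0 49000.0 6400.0 6100.0     120    1.200     4    0.800    2.000
EOF
}

# OC (old-gen capacity) is column 7, OU (old-gen used) is column 8
oldgen=$(jstat_sample | awk 'NR==2 { printf "old gen: %.1f%% used\n", $8 / $7 * 100 }')
echo "$oldgen"
```

An old generation that keeps climbing across consecutive samples (rather than sawtoothing down after full GCs) is the classic leak signature; that is the point to take the heap dump above.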
JVM tuning parameters (JDK 8 flags, matching the jdk8 image above):
env:
- name: JAVA_OPTS
  value: >
    -Xms512m
    -Xmx1024m
    -XX:MaxMetaspaceSize=256m
    -XX:ReservedCodeCacheSize=128m
    -XX:+UseG1GC
    -XX:MaxGCPauseMillis=200
    -XX:InitiatingHeapOccupancyPercent=35
    -XX:G1ReservePercent=15
    -XX:ConcGCThreads=2
    -XX:ParallelGCThreads=4
    -XX:+PrintGC
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintGCTimeStamps
    -Xloggc:/opt/gclog/gc.log
    -XX:+UseGCLogFileRotation
    -XX:NumberOfGCLogFiles=5
    -XX:GCLogFileSize=10M
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:HeapDumpPath=/opt/heapdump
    -XX:ErrorFile=/opt/heapdump/hs_err_pid%p.log
    -XX:+CrashOnOutOfMemoryError
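With `-Xloggc` writing to the gclog volume, pauses that exceed the `MaxGCPauseMillis=200` target can be spotted with a simple filter. A sketch against two illustrative JDK 8 G1 log lines (the timestamps and durations are made up):

```shell
#!/bin/sh
# Illustrative JDK 8 G1 GC log lines (not real output)
gc_sample() {
cat <<'EOF'
2024-05-01T10:00:00.123+0000: 120.001: [GC pause (G1 Evacuation Pause) (young), 0.0123456 secs]
2024-05-01T10:00:05.456+0000: 125.334: [GC pause (G1 Evacuation Pause) (mixed), 0.2150000 secs]
EOF
}

# Print pause durations above 0.2 s (the MaxGCPauseMillis target)
slow=$(gc_sample | awk 'match($0, /, [0-9.]+ secs\]/) {
  t = substr($0, RSTART + 2, RLENGTH - 8)   # strip the ", " prefix and " secs]" suffix
  if (t + 0 > 0.2) print t
}')
printf '%s\n' "$slow"
```

On a live pod, the same filter runs over `kubectl exec <pod-name> -- cat /opt/gclog/gc.log`.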
b. Zombie process issues
Enhanced health-check configuration:
# Multi-dimensional health check
livenessProbe:
  exec:
    command:
    - sh
    - -c
    - |
      # Check that the Java process is running
      ps aux | grep java | grep -v grep || exit 1
      # Check that the port is listening
      netstat -tln | grep :8080 || exit 1
      # Check the application health endpoint
      curl -f http://localhost:8080/actuator/health || exit 1
  initialDelaySeconds: 45
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
    httpHeaders:
    - name: X-Readiness-Check
      value: "true"
  initialDelaySeconds: 15
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2
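The probes above detect a hung JVM but do not remove zombies; those accumulate when PID 1 in the container never reaps exited children (running the JVM under an init process such as `tini` addresses the root cause). Detecting them can also be scripted: processes whose STAT column begins with `Z` are zombies. A sketch of that count, against an illustrative stand-in for `ps` output:

```shell
#!/bin/sh
# Illustrative stand-in for `ps -o pid,stat,comm` output inside the container
ps_sample() {
cat <<'EOF'
  PID STAT COMMAND
    1 Ss   java
   42 Z    defunct-child
   43 Z    defunct-child
EOF
}

# A STAT beginning with Z marks a zombie (defunct) process
zombies=$(ps_sample | awk 'NR > 1 && $2 ~ /^Z/ { n++ } END { print n + 0 }')
echo "zombie count: $zombies"
```

The same awk filter could back an exec probe that fails once the count crosses a threshold.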
3. High availability and autoscaling configuration
HPA autoscaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tomcat-hpa
  namespace: tomcat-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tomcat-deployment
  minReplicas: 2
  maxReplicas: 10
  behavior:
    scaleUp:
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
      - type: Percent
        value: 50
        periodSeconds: 60
      selectPolicy: Max
      stabilizationWindowSeconds: 0
    scaleDown:
      policies:
      - type: Pods
        value: 1
        periodSeconds: 300
      - type: Percent
        value: 10
        periodSeconds: 300
      selectPolicy: Max
      stabilizationWindowSeconds: 300
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  # Custom metric; requires a metrics adapter (e.g. prometheus-adapter)
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: 1000
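The replica count the HPA converges on follows `desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)`, then gets clamped by min/maxReplicas and the behavior policies above. A sketch of that arithmetic:

```shell
#!/bin/sh
# desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)
hpa_desired() {
  awk -v cur="$1" -v val="$2" -v tgt="$3" 'BEGIN {
    d = cur * val / tgt
    d = (d == int(d)) ? d : int(d) + 1   # ceil
    print d
  }'
}

hpa_desired 3 90 70   # 3 pods at 90% CPU against a 70% target -> 4
hpa_desired 2 70 70   # already at target -> 2
```

With multiple metrics configured, the HPA computes this per metric and takes the largest result.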
4. End-to-end monitoring configuration
Writing Prometheus alerting rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tomcat-alerts
  namespace: tomcat-app
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  groups:
  - name: tomcat
    rules:
    - alert: TomcatMemoryUsageHigh
      expr: (container_memory_working_set_bytes{pod=~"tomcat-.*", container="tomcat"} / container_spec_memory_limit_bytes{pod=~"tomcat-.*", container="tomcat"}) > 0.8
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Tomcat memory usage high"
        description: "Pod {{ $labels.pod }} memory usage is {{ $value | humanizePercentage }}"
    - alert: TomcatDown
      expr: up{pod=~"tomcat-.*"} == 0
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Tomcat pod down"
        description: "Pod {{ $labels.pod }} is down"
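Note that the `TomcatMemoryUsageHigh` expression compares a ratio (working set divided by limit) against 0.8, not a percentage. The same condition, sketched as standalone arithmetic for sanity-checking the threshold against concrete byte values:

```shell
#!/bin/sh
# Evaluate working_set / limit > 0.8, mirroring the PromQL alert condition
mem_alert() {
  awk -v used="$1" -v limit="$2" 'BEGIN {
    s = (used / limit > 0.8) ? "firing" : "ok"
    print s
  }'
}

mem_alert 900000000 1073741824   # ~84% of a 1Gi limit -> firing
mem_alert 500000000 1073741824   # ~47% -> ok
```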