1. Common deployment-phase issues and solutions
a. Pod deployment errors
Symptoms:
ImagePullBackOff: image pull failed
Pending: insufficient resources or a storage problem
CrashLoopBackOff: the container keeps crashing on startup
Diagnostic commands:
# Inspect detailed error information
kubectl describe pod <pod-name> -n <namespace>
kubectl get events --sort-by='.lastTimestamp' --field-selector=involvedObject.name=<pod-name>
# Check for image-related failures
kubectl get pods -o wide | grep -E "(ImagePullBackOff|ErrImagePull|InvalidImageName)"
# Check resource availability
kubectl top nodes
kubectl describe node <node-name> | grep -A 10 -B 5 "Allocatable"
# Check storage status
kubectl get pvc
kubectl get pv
kubectl describe pvc <pvc-name>
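Once `kubectl describe` points at a failing Pod, it helps to filter the Pod list down to only the failure states. A minimal sketch of that filter, run here against an illustrative here-doc that stands in for real `kubectl get pods` output (pod names and ages are made up):

```shell
#!/bin/sh
# Illustrative stand-in for `kubectl get pods` output (not from a real cluster)
sample_output() {
cat <<'EOF'
NAME                READY   STATUS             RESTARTS   AGE
tomcat-6f9c-abcde   1/1     Running            0          2d
tomcat-6f9c-fghij   0/1     ImagePullBackOff   0          5m
tomcat-6f9c-klmno   0/1     CrashLoopBackOff   7          30m
EOF
}

# Keep the header row plus any row whose STATUS column is a known failure state
bad_pods=$(sample_output | awk 'NR==1 || $3 ~ /(ImagePullBackOff|ErrImagePull|CrashLoopBackOff|Pending|InvalidImageName)/')
printf '%s\n' "$bad_pods"
```

Against a live cluster, pipe `kubectl get pods -n <namespace>` into the same awk filter.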
# Complete deployment configuration example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tomcat-deployment
  namespace: tomcat-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
spec:
  replicas: 3
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: tomcat
      version: v1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: tomcat
        version: v1
    spec:
      # Anti-affinity: avoid packing replicas onto the same node
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values: ["tomcat"]
              topologyKey: kubernetes.io/hostname
      # Node scheduling constraints
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "tomcat"
        effect: "NoSchedule"
      containers:
      - name: tomcat
        image: tomcat:8.5.93-jdk8-corretto@sha256:<sha256-digest>
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8080
          name: http
        - containerPort: 8000
          name: debug
        # Resource requests and limits
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
        # Environment variables
        env:
        - name: JAVA_OPTS
          value: "-Xms512m -Xmx1024m -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+PrintGC -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/heapdump"
        - name: SPRING_PROFILES_ACTIVE
          value: "prod"
        # Health checks
        livenessProbe:
          httpGet:
            path: /actuator/health
            port: 8080
            httpHeaders:
            - name: Custom-Header
              value: Awesome
          initialDelaySeconds: 60
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /actuator/health/readiness
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
        startupProbe:
          httpGet:
            path: /actuator/health
            port: 8080
          failureThreshold: 30
          periodSeconds: 10
        # Volume mounts
        volumeMounts:
        - name: tomcat-storage
          mountPath: /usr/local/tomcat/webapps
        - name: gclog
          mountPath: /opt/gclog
        - name: heapdump
          mountPath: /opt/heapdump
        - name: app-config
          mountPath: /app/config
          readOnly: true
        # Security context
        securityContext:
          runAsNonRoot: true
          runAsUser: 1000
          readOnlyRootFilesystem: true
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
      # Volume definitions
      volumes:
      - name: tomcat-storage
        persistentVolumeClaim:
          claimName: tomcat-pvc
      - name: gclog
        emptyDir:
          sizeLimit: 100Mi
      - name: heapdump
        emptyDir:
          sizeLimit: 2Gi
      - name: app-config
        configMap:
          name: tomcat-config
      # Service account
      serviceAccountName: tomcat-service-account
      # Restart policy
      restartPolicy: Always
---
# Service configuration
apiVersion: v1
kind: Service
metadata:
  name: tomcat-service
  namespace: tomcat-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
spec:
  type: NodePort
  selector:
    app: tomcat
    version: v1
  ports:
  - name: http
    port: 8080
    targetPort: 8080
    nodePort: 30080
  - name: metrics
    port: 8081
    targetPort: 8081
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
---
# Ingress configuration
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: tomcat-ingress
  namespace: tomcat-app
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "route"
    nginx.ingress.kubernetes.io/session-cookie-expires: "172800"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "172800"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
spec:
  ingressClassName: nginx
  rules:
  - host: tomcat.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: tomcat-service
            port:
              number: 8080
  tls:
  - hosts:
    - tomcat.example.com
    secretName: tomcat-tls
b. Service/Ingress access issues
Diagnostic commands:
# Check service discovery
kubectl get endpoints <service-name>
kubectl describe service <service-name>
# Check the Ingress
kubectl describe ingress <ingress-name>
kubectl get ingress -o wide
# Network diagnosis
kubectl run network-check --rm -it --image=nicolaka/netshoot -- bash
# Inside the netshoot container:
dig tomcat-service.tomcat-app.svc.cluster.local
# Include the namespace: a bare service name only resolves from the same namespace
curl -v http://tomcat-service.tomcat-app:8080
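The `dig` target above follows the standard Service DNS form `<service>.<namespace>.svc.<cluster-domain>`, which is why a bare service name only resolves from Pods in the same namespace. A small sketch of how that FQDN is assembled (`cluster.local` is the common default cluster domain, an assumption here since clusters can override it):

```shell
#!/bin/sh
# Build the in-cluster DNS name of a Service: <svc>.<ns>.svc.<cluster-domain>
svc_fqdn() {
  svc=$1; ns=$2; domain=${3:-cluster.local}   # cluster.local: default, may differ per cluster
  printf '%s.%s.svc.%s\n' "$svc" "$ns" "$domain"
}

svc_fqdn tomcat-service tomcat-app
# -> tomcat-service.tomcat-app.svc.cluster.local
```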
2. Common runtime issues and solutions
a. Continuously rising memory & OOM issues
Monitoring commands:
# Real-time monitoring
kubectl top pods --containers --use-protocol-buffers
kubectl top nodes --use-protocol-buffers
# Detailed memory analysis (assumes the JVM is PID 1 in the container)
kubectl exec <pod-name> -- pmap -x 1 | head -20
kubectl exec <pod-name> -- jstat -gc <pid> 1000 10
# Generate and retrieve a heap dump
kubectl exec <pod-name> -- jmap -dump:live,format=b,file=/opt/heapdump/heap.hprof 1
kubectl cp <namespace>/<pod-name>:/opt/heapdump/heap.hprof ./heap.hprof
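Once `jstat -gc` samples are collected, old-generation occupancy (used/capacity) is usually the first number worth computing when chasing a leak. A sketch of that calculation, with a here-doc standing in for JDK 8 `jstat -gc` output (the values are illustrative):

```shell
#!/bin/sh
# Illustrative stand-in for JDK 8 `jstat -gc <pid>` output (values in KB)
jstat_sample() {
cat <<'EOF'
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
1024.0 1024.0  0.0   512.0  8192.0   4096.0   524288.0   393216.0  51200.0 49000.0 6400.0 6100.0     120    1.200     4    0.800    2.000
EOF
}

# OC (old-gen capacity) is column 7, OU (old-gen used) is column 8
oldgen=$(jstat_sample | awk 'NR==2 { printf "old gen: %.1f%% used\n", $8 / $7 * 100 }')
echo "$oldgen"
```

An old generation that keeps climbing across consecutive samples (rather than sawtoothing down after full GCs) is the classic leak signature; that is the point to take the heap dump above.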
JVM tuning parameters (JDK 8 flags, matching the jdk8 image above):
env:
- name: JAVA_OPTS
  value: >
    -Xms512m
    -Xmx1024m
    -XX:MaxMetaspaceSize=256m
    -XX:ReservedCodeCacheSize=128m
    -XX:+UseG1GC
    -XX:MaxGCPauseMillis=200
    -XX:InitiatingHeapOccupancyPercent=35
    -XX:G1ReservePercent=15
    -XX:ConcGCThreads=2
    -XX:ParallelGCThreads=4
    -XX:+PrintGC
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintGCTimeStamps
    -Xloggc:/opt/gclog/gc.log
    -XX:+UseGCLogFileRotation
    -XX:NumberOfGCLogFiles=5
    -XX:GCLogFileSize=10M
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:HeapDumpPath=/opt/heapdump
    -XX:ErrorFile=/opt/heapdump/hs_err_pid%p.log
    -XX:+CrashOnOutOfMemoryError
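With `-Xloggc` writing to the gclog volume, pauses that exceed the `MaxGCPauseMillis=200` target can be spotted with a simple filter. A sketch against two illustrative JDK 8 G1 log lines (the timestamps and durations are made up):

```shell
#!/bin/sh
# Illustrative JDK 8 G1 GC log lines (not real output)
gc_sample() {
cat <<'EOF'
2024-05-01T10:00:00.123+0000: 120.001: [GC pause (G1 Evacuation Pause) (young), 0.0123456 secs]
2024-05-01T10:00:05.456+0000: 125.334: [GC pause (G1 Evacuation Pause) (mixed), 0.2150000 secs]
EOF
}

# Print pause durations above 0.2 s (the MaxGCPauseMillis target)
slow=$(gc_sample | awk 'match($0, /, [0-9.]+ secs\]/) {
  t = substr($0, RSTART + 2, RLENGTH - 8)   # strip the ", " prefix and " secs]" suffix
  if (t + 0 > 0.2) print t
}')
printf '%s\n' "$slow"
```

On a live pod, the same filter runs over `kubectl exec <pod-name> -- cat /opt/gclog/gc.log`.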
b. Zombie process issues
Enhanced health-check configuration:
# Multi-dimensional health check
livenessProbe:
  exec:
    command:
    - sh
    - -c
    - |
      # Check that the Java process is running
      ps aux | grep java | grep -v grep || exit 1
      # Check that the port is listening
      netstat -tln | grep :8080 || exit 1
      # Check the application health endpoint
      curl -f http://localhost:8080/actuator/health || exit 1
  initialDelaySeconds: 45
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
    httpHeaders:
    - name: X-Readiness-Check
      value: "true"
  initialDelaySeconds: 15
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2
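The probes above detect a hung JVM but do not remove zombies; those accumulate when PID 1 in the container never reaps exited children (running the JVM under an init process such as `tini` addresses the root cause). Detecting them can also be scripted: processes whose STAT column begins with `Z` are zombies. A sketch of that count, against an illustrative stand-in for `ps` output:

```shell
#!/bin/sh
# Illustrative stand-in for `ps -o pid,stat,comm` output inside the container
ps_sample() {
cat <<'EOF'
  PID STAT COMMAND
    1 Ss   java
   42 Z    defunct-child
   43 Z    defunct-child
EOF
}

# A STAT beginning with Z marks a zombie (defunct) process
zombies=$(ps_sample | awk 'NR > 1 && $2 ~ /^Z/ { n++ } END { print n + 0 }')
echo "zombie count: $zombies"
```

The same awk filter could back an exec probe that fails once the count crosses a threshold.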
3. High availability and autoscaling configuration
HPA autoscaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tomcat-hpa
  namespace: tomcat-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tomcat-deployment
  minReplicas: 2
  maxReplicas: 10
  behavior:
    scaleUp:
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
      - type: Percent
        value: 50
        periodSeconds: 60
      selectPolicy: Max
      stabilizationWindowSeconds: 0
    scaleDown:
      policies:
      - type: Pods
        value: 1
        periodSeconds: 300
      - type: Percent
        value: 10
        periodSeconds: 300
      selectPolicy: Max
      stabilizationWindowSeconds: 300
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  # Custom metric; requires a metrics adapter (e.g. prometheus-adapter)
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: 1000
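The replica count the HPA converges on follows `desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)`, then gets clamped by min/maxReplicas and the behavior policies above. A sketch of that arithmetic:

```shell
#!/bin/sh
# desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)
hpa_desired() {
  awk -v cur="$1" -v val="$2" -v tgt="$3" 'BEGIN {
    d = cur * val / tgt
    d = (d == int(d)) ? d : int(d) + 1   # ceil
    print d
  }'
}

hpa_desired 3 90 70   # 3 pods at 90% CPU against a 70% target -> 4
hpa_desired 2 70 70   # already at target -> 2
```

With multiple metrics configured, the HPA computes this per metric and takes the largest result.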
4. End-to-end monitoring configuration
Writing Prometheus alerting rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tomcat-alerts
  namespace: tomcat-app
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  groups:
  - name: tomcat
    rules:
    - alert: TomcatMemoryUsageHigh
      expr: (container_memory_working_set_bytes{pod=~"tomcat-.*", container="tomcat"} / container_spec_memory_limit_bytes{pod=~"tomcat-.*", container="tomcat"}) > 0.8
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Tomcat memory usage high"
        description: "Pod {{ $labels.pod }} memory usage is {{ $value | humanizePercentage }}"
    - alert: TomcatDown
      expr: up{pod=~"tomcat-.*"} == 0
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Tomcat pod down"
        description: "Pod {{ $labels.pod }} is down"
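Note that the `TomcatMemoryUsageHigh` expression compares a ratio (working set divided by limit) against 0.8, not a percentage. The same condition, sketched as standalone arithmetic for sanity-checking the threshold against concrete byte values:

```shell
#!/bin/sh
# Evaluate working_set / limit > 0.8, mirroring the PromQL alert condition
mem_alert() {
  awk -v used="$1" -v limit="$2" 'BEGIN {
    s = (used / limit > 0.8) ? "firing" : "ok"
    print s
  }'
}

mem_alert 900000000 1073741824   # ~84% of a 1Gi limit -> firing
mem_alert 500000000 1073741824   # ~47% -> ok
```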