Kubernetes in Production: A Complete Guide from Cluster Planning to Disaster Recovery

Lear 2025-08-04 13:00:00


💡 Preface: As an ops veteran who has managed more than 100 Kubernetes clusters in production, I have probably fallen into more pits than you have ever seen. This article shares, without holding anything back, the hard-won lessons from real projects, so you can steer clear of the production incidents that cause the worst headaches.


🎯 Why Is This Article Worth Bookmarking?






🏗️ Chapter 1: Cluster Planning - Architecture Determines Destiny


1.1 The Golden Rules of Hardware Resource Planning


❌ A common mistake:


```text
# How a lot of people plan their resources
Master nodes: 2C4G × 3
Worker nodes: 4C8G × 5
```

✅ Production best practice:


```text
# Scientific planning based on business load
Master nodes:   4C8G × 3  (odd number, to avoid split-brain)
Worker nodes:   8C16G × N (size N to peak business load × 1.5)
Dedicated etcd: 2C4G × 3  (SSD storage is a must)
```
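
To make the "peak load × 1.5" rule concrete, here is a minimal bash sketch for estimating the worker count; the peak figures in it are illustrative assumptions, not numbers from any real cluster:

```bash
#!/bin/bash
# Rough worker-count estimate: peak demand x 1.5 headroom, divided by node capacity.
# The peak figures below are placeholders; substitute your own measurements.
PEAK_CPU_CORES=120    # peak CPU demand across all workloads (cores)
PEAK_MEM_GI=240       # peak memory demand (GiB)
NODE_CPU=8            # per-node CPU of an 8C16G worker
NODE_MEM_GI=16        # per-node memory of an 8C16G worker

# ceil(peak * 1.5 / node capacity), done with integer arithmetic
need_cpu=$(( (PEAK_CPU_CORES * 3 / 2 + NODE_CPU - 1) / NODE_CPU ))
need_mem=$(( (PEAK_MEM_GI * 3 / 2 + NODE_MEM_GI - 1) / NODE_MEM_GI ))

echo "Workers needed by CPU:    ${need_cpu}"
echo "Workers needed by memory: ${need_mem}"
echo "Plan for the larger of the two, plus at least one spare node for failures."
```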

💰 Cost optimization tips:





1.2 Key Points of Network Architecture Design


Flannel vs Calico vs Cilium selection comparison:


| Network plugin | Performance | Network policy | Complexity | Recommended scenario |
|---|---|---|---|---|
| Flannel | ⭐⭐⭐ | Not supported | Low | Small clusters, quick deployment |
| Calico | ⭐⭐⭐⭐ | ⭐⭐⭐ | Medium | Medium and large clusters that need network policies |
| Cilium | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Higher | High-performance, cloud-native environments |

🔥 Recommended production configuration (Calico):


```yaml
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    ipPools:
    - blockSize: 26
      cidr: 10.244.0.0/16
      encapsulation: VXLAN
      natOutgoing: Enabled
  nodeMetricsPort: 9091
```
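
With Calico installed, NetworkPolicy objects actually take effect, which is the main reason to pick it over Flannel. As a reference, a minimal default-deny ingress policy might look like the sketch below; the `development` namespace is just an example:

```bash
# Deny all ingress traffic in the target namespace by default;
# individual workloads then get explicit allow policies.
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: development
spec:
  podSelector: {}
  policyTypes:
  - Ingress
EOF
```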

⚙️ Chapter 2: Cluster Deployment - Automation Is King


2.1 Rapid Deployment with kubeadm


🚀 One-click deployment script (personally tested):


```bash
#!/bin/bash
# Automated K8s cluster deployment script v2.0
set -e

# Environment checks
check_requirements() {
    echo "🔍 Checking system environment..."

    # Check the operating system
    if [[ ! -f /etc/redhat-release ]] && [[ ! -f /etc/debian_version ]]; then
        echo "❌ Only CentOS/RHEL or Ubuntu is supported"
        exit 1
    fi

    # Check memory
    total_mem=$(free -m | awk 'NR==2{printf "%.0f", $2}')
    if [[ $total_mem -lt 2048 ]]; then
        echo "⚠️ Warning: less than 2 GB of memory may affect cluster stability"
    fi

    echo "✅ Environment check passed"
}

# System initialization
init_system() {
    echo "🛠️ Initializing system configuration..."

    # Disable the firewall and SELinux
    systemctl disable --now firewalld
    setenforce 0
    sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config

    # Disable swap
    swapoff -a
    sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab

    # Load kernel modules
    cat <<EOF | tee /etc/modules-load.d/k8s.conf
br_netfilter
overlay
EOF

    modprobe br_netfilter
    modprobe overlay

    # Configure kernel parameters
    cat <<EOF | tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF

    sysctl --system
    echo "✅ System initialization complete"
}

# Install the container runtime (yum-based; adapt the package steps for apt on Ubuntu)
install_containerd() {
    echo "📦 Installing containerd..."

    # Install dependencies
    yum install -y yum-utils device-mapper-persistent-data lvm2

    # Add the Docker repository
    yum-config-manager --add-repo https://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo

    # Install containerd
    yum install -y containerd.io

    # Configure containerd
    mkdir -p /etc/containerd
    containerd config default | tee /etc/containerd/config.toml

    # Use the systemd cgroup driver
    sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml

    # Point the sandbox image at a registry mirror
    sed -i 's|registry.k8s.io|registry.aliyuncs.com/google_containers|g' /etc/containerd/config.toml

    systemctl enable --now containerd
    echo "✅ containerd installed"
}

# Install kubeadm, kubelet, and kubectl
install_kubernetes() {
    echo "🎯 Installing Kubernetes components..."

    cat <<EOF | tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://mirrors.aliyun.com/kubernetes/yum/repos/kubernetes-el7-x86_64/
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://mirrors.aliyun.com/kubernetes/yum/doc/yum-key.gpg https://mirrors.aliyun.com/kubernetes/yum/doc/rpm-package-key.gpg
EOF

    yum install -y kubelet kubeadm kubectl --disableexcludes=kubernetes
    systemctl enable --now kubelet

    echo "✅ Kubernetes components installed"
}

# Entry point
main() {
    check_requirements
    init_system
    install_containerd
    install_kubernetes

    echo "🎉 Cluster base environment is ready!"
    echo "📋 Next steps:"
    echo "  Master node: kubeadm init --config=kubeadm-config.yaml"
    echo "  Worker node: kubeadm join <master-ip>:6443 --token <token>"
}

main "$@"
```
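
The join command in the final echo is deliberately a placeholder; on a working control plane, kubeadm can print a ready-to-use one. A minimal sketch, assuming the control plane has already been initialized:

```bash
# Print a complete "kubeadm join ..." command with a fresh bootstrap token
kubeadm token create --print-join-command

# For joining additional control-plane nodes, also upload the control-plane certificates;
# this prints a certificate key to append as: --control-plane --certificate-key <key>
kubeadm init phase upload-certs --upload-certs
```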

2.2 Highly Available Master Node Configuration


Production template for kubeadm-config.yaml:


```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 192.168.1.10
  bindPort: 6443
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.28.2
controlPlaneEndpoint: "k8s-api.example.com:6443"
networking:
  serviceSubnet: "10.96.0.0/16"
  podSubnet: "10.244.0.0/16"
  dnsDomain: "cluster.local"
etcd:
  external:
    endpoints:
    - https://192.168.1.11:2379
    - https://192.168.1.12:2379
    - https://192.168.1.13:2379
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/etcd/server.crt
    keyFile: /etc/kubernetes/pki/etcd/server.key
apiServer:
  certSANs:
  - "k8s-api.example.com"
  - "192.168.1.10"
  - "192.168.1.20"
  - "192.168.1.30"
  extraArgs:
    audit-log-maxage: "30"
    audit-log-maxbackup: "10"
    audit-log-maxsize: "100"
    audit-log-path: "/var/log/audit.log"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
maxPods: 110
```
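
Assuming the template above is saved as kubeadm-config.yaml, bringing up the first control-plane node and preparing to join the others looks roughly like this; the token, hash, and certificate key are values kubeadm prints for you:

```bash
# First control-plane node
kubeadm init --config=kubeadm-config.yaml --upload-certs

# Additional control-plane nodes then use the join command printed by init,
# with the extra control-plane flags:
#   kubeadm join k8s-api.example.com:6443 --token <token> \
#     --discovery-token-ca-cert-hash sha256:<hash> \
#     --control-plane --certificate-key <key>
```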

📊 Chapter 3: Monitoring and Alerting - Prevention Beats Firefighting


3.1 The Golden Combo: Prometheus + Grafana


🔥 Deploying a production-grade monitoring stack:


```yaml
# prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    resources:
      requests:
        memory: 2Gi
        cpu: 1000m
      limits:
        memory: 8Gi
        cpu: 2000m
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
    additionalScrapeConfigs:
    - job_name: 'kubernetes-nodes'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__

grafana:
  persistence:
    enabled: true
    size: 20Gi
    storageClassName: fast-ssd
  plugins:
  - grafana-piechart-panel
  - grafana-kubernetes-app
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
      - name: 'default'
        orgId: 1
        folder: ''
        type: file
        disableDeletion: false
        updateIntervalSeconds: 10
        options:
          path: /var/lib/grafana/dashboards
```
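
This values layout follows the kube-prometheus-stack Helm chart; assuming that is the chart being deployed, installation is roughly as follows (the release and namespace names are examples):

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install (or upgrade) the monitoring stack with the values file above
helm upgrade --install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  -f prometheus-values.yaml
```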

3.2 Key Metrics and Alerting Rules


💀 Must-have alerting rules for production:


```yaml
# critical-alerts.yaml
groups:
- name: kubernetes-critical
  rules:
  - alert: KubernetesNodeNotReady
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.node }} is not ready"
      description: "Node {{ $labels.node }} has been not ready for more than 5 minutes."

  - alert: KubernetesPodCrashLooping
    expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.pod }} is crash looping"
      description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting frequently."

  - alert: KubernetesMemoryPressure
    expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.node }} has memory pressure"
      description: "Node {{ $labels.node }} is under memory pressure."

  - alert: EtcdClusterDown
    expr: up{job="etcd"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Etcd cluster is down"
      description: "Etcd cluster has been down for more than 1 minute."
```
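
If the stack above runs the Prometheus Operator, rules are usually shipped as a PrometheusRule resource rather than a raw rules file. A minimal sketch with just the first alert; the namespace and the `release: monitoring` label are assumptions that must match your Prometheus rule selector:

```bash
kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-critical
  namespace: monitoring
  labels:
    release: monitoring      # must match the ruleSelector of your Prometheus
spec:
  groups:
  - name: kubernetes-critical
    rules:
    - alert: KubernetesNodeNotReady
      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Node {{ $labels.node }} is not ready"
EOF
```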

🛡️ Chapter 4: Security Hardening - No Detail Is Too Small


4.1 RBAC Best Practices


🔐 Fine-grained access control example:


```yaml
# developer-rbac.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: development
---
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: development
  name: developer
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: development
  name: developer-role
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps", "secrets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-binding
  namespace: development
subjects:
- kind: ServiceAccount
  name: developer
  namespace: development
roleRef:
  kind: Role
  name: developer-role
  apiGroup: rbac.authorization.k8s.io
```
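
After applying the manifests, it is worth confirming the ServiceAccount can do exactly what was intended and nothing more; `kubectl auth can-i` with impersonation makes this a quick check:

```bash
# Should print "yes": the developer SA may manage pods in its own namespace
kubectl auth can-i create pods \
  --as=system:serviceaccount:development:developer -n development

# Should print "no": no access outside the development namespace
kubectl auth can-i list pods \
  --as=system:serviceaccount:development:developer -n kube-system

# Should print "no": cluster-scoped resources are out of reach
kubectl auth can-i list nodes \
  --as=system:serviceaccount:development:developer
```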

4.2 Pod Security Policy Configuration


⚠️ Note: PodSecurityPolicy was removed in Kubernetes v1.25, so the manifest below only applies to older clusters; on v1.25+ (including the v1.28.2 cluster configured earlier), use the Pod Security Admission labels shown after it.

🛡️ Hardened Pod security configuration:


```yaml
# pod-security-policy.yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted-psp
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
  - ALL
  volumes:
  - 'configMap'
  - 'emptyDir'
  - 'projected'
  - 'secret'
  - 'downwardAPI'
  - 'persistentVolumeClaim'
  runAsUser:
    rule: 'MustRunAsNonRoot'
  seLinux:
    rule: 'RunAsAny'
  fsGroup:
    rule: 'RunAsAny'
```
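
On Kubernetes v1.25 and later, the equivalent protection comes from Pod Security Admission, which is driven entirely by namespace labels. A minimal sketch that enforces the restricted profile on the `development` namespace used earlier:

```bash
# Enforce the "restricted" Pod Security Standard and surface violations
# in warn and audit modes as well.
kubectl label namespace development \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted --overwrite
```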

🚨 Chapter 5: Failure Recovery - Lifesaving Skills for Staying Calm in a Crisis


5.1 Diagnosing Common Failures


🩺 Systematic troubleshooting checklist:


```bash
#!/bin/bash
# k8s-health-check.sh - cluster health check script

echo "🏥 Starting Kubernetes cluster health check..."

# 1. Check node status
echo "📋 1. Node status"
kubectl get nodes -o wide
echo ""

# 2. Check system pods
echo "📋 2. System pod status"
kubectl get pods -n kube-system
echo ""

# 3. Check storage classes
echo "📋 3. Storage classes"
kubectl get storageclass
echo ""

# 4. Check the network plugin
echo "📋 4. Network plugin status"
kubectl get pods -n kube-system | grep -E "(calico|flannel|weave|cilium)"
echo ""

# 5. Check API server connectivity
echo "📋 5. API server"
kubectl cluster-info
echo ""

# 6. Check etcd status
echo "📋 6. etcd cluster status"
kubectl get pods -n kube-system | grep etcd
echo ""

# 7. Check resource usage
echo "📋 7. Resource usage"
kubectl top nodes
echo ""

# 8. Check recent events
echo "📋 8. Recent cluster events"
kubectl get events --sort-by=.metadata.creationTimestamp | tail -10
echo ""

echo "✅ Health check complete!"
```

5.2 etcd Backup and Restore in Practice


💾 Automated etcd backup script:


```bash
#!/bin/bash
# etcd-backup.sh - automated etcd backup script

BACKUP_DIR="/opt/etcd-backup"
DATE=$(date +%Y%m%d-%H%M%S)
BACKUP_NAME="etcd-backup-${DATE}.db"

# Create the backup directory
mkdir -p ${BACKUP_DIR}

# Take the snapshot
ETCDCTL_API=3 etcdctl snapshot save ${BACKUP_DIR}/${BACKUP_NAME} \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the backup
ETCDCTL_API=3 etcdctl snapshot status ${BACKUP_DIR}/${BACKUP_NAME} \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

if [ $? -eq 0 ]; then
    echo "✅ etcd backup succeeded: ${BACKUP_NAME}"

    # Remove backups older than 7 days
    find ${BACKUP_DIR} -name "etcd-backup-*.db" -mtime +7 -delete

    # Upload to cloud storage (optional)
    # aws s3 cp ${BACKUP_DIR}/${BACKUP_NAME} s3://your-backup-bucket/etcd/
else
    echo "❌ etcd backup failed"
    exit 1
fi
```
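
To make the backup genuinely automatic, schedule the script with cron; a sketch assuming it is installed at /opt/scripts/etcd-backup.sh (both the path and the 03:00 schedule are examples):

```bash
# Run the etcd backup every day at 03:00 and keep a log of the runs
# (add via `crontab -e` on a node that can reach etcd)
0 3 * * * /opt/scripts/etcd-backup.sh >> /var/log/etcd-backup.log 2>&1
```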

🔄 Emergency etcd restore procedure:


```bash
# 1. Stop etcd and the API server on every master node
systemctl stop etcd kubelet

# 2. Move the old etcd data out of the way
mv /var/lib/etcd /var/lib/etcd.backup

# 3. Restore from the snapshot
ETCDCTL_API=3 etcdctl snapshot restore /opt/etcd-backup/etcd-backup-latest.db \
  --data-dir=/var/lib/etcd \
  --name=master1 \
  --initial-cluster=master1=https://192.168.1.10:2380 \
  --initial-cluster-token=etcd-cluster-1 \
  --initial-advertise-peer-urls=https://192.168.1.10:2380

# 4. Fix ownership
chown -R etcd:etcd /var/lib/etcd

# 5. Restart the services
systemctl start etcd kubelet
```

5.3 Emergency Incident Handbook


🚨 Emergency response workflow for production:


| Failure type | Symptom | Immediate action | Root cause analysis |
|---|---|---|---|
| Node down | Node NotReady | 1) Check network connectivity 2) Restart the kubelet service 3) Evict pods to other nodes | Hardware failure / network partition |
| Pod won't start | Pending / ImagePullBackOff | 1) Check resource quotas 2) Verify the image reference 3) Check node selectors | Insufficient resources / image problems |
| Network failure | Services unreachable | 1) Check the network plugin status 2) Restart network-related pods 3) Verify iptables rules | CNI plugin malfunction |
| Storage issue | PVC stuck in Pending | 1) Check the StorageClass 2) Verify the storage backend 3) Clean up stuck PVs | Storage provider problem |
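
For the "Node down" row, the commands behind step 3 are worth having at your fingertips; a minimal sketch, with `worker-03` standing in for the failed node:

```bash
# Mark the failed node unschedulable so nothing new lands on it
kubectl cordon worker-03

# Evacuate its workloads to healthy nodes (DaemonSet pods stay, emptyDir data is discarded)
kubectl drain worker-03 --ignore-daemonsets --delete-emptydir-data --force

# Once the node is repaired and healthy again, allow scheduling
kubectl uncordon worker-03
```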

⚡ Chapter 6: Performance Tuning - Make the Cluster Fly


6.1 API Server Tuning


🚀 API server configuration for high concurrency:


```yaml
# kube-apiserver tuning parameters
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-apiserver
    - --advertise-address=192.168.1.10
    - --max-requests-inflight=2000
    - --max-mutating-requests-inflight=1000
    - --default-watch-cache-size=500
    - --watch-cache-sizes=nodes#1000,pods#5000
    - --enable-admission-plugins=NodeRestriction,LimitRanger,ResourceQuota
    - --audit-log-maxage=30
    - --audit-log-maxbackup=10
    - --audit-log-maxsize=100
    - --request-timeout=300s
    - --storage-backend=etcd3
    - --etcd-compaction-interval=5m
```
6.2 Node Resource Tuning


💡 Key kubelet parameter tuning:


```yaml
# kubelet-config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 110
podsPerCore: 10
enableControllerAttachDetach: true
hairpinMode: promiscuous-bridge
serializeImagePulls: false
registryPullQPS: 10
registryBurst: 20
eventRecordQPS: 50
eventBurst: 100
kubeAPIQPS: 50
kubeAPIBurst: 100
systemReserved:
  cpu: 500m
  memory: 1Gi
  ephemeral-storage: 2Gi
kubeReserved:
  cpu: 500m
  memory: 1Gi
  ephemeral-storage: 1Gi
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
```
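
On kubeadm-provisioned nodes the kubelet normally reads /var/lib/kubelet/config.yaml, so rolling the settings out to one node can look like the sketch below (drain the node first if it carries production traffic):

```bash
# Back up the current configuration, install the tuned one, and restart kubelet
cp /var/lib/kubelet/config.yaml /var/lib/kubelet/config.yaml.bak
cp kubelet-config.yaml /var/lib/kubelet/config.yaml
systemctl restart kubelet

# Confirm the node comes back Ready with the new settings
kubectl get node $(hostname) -o wide
```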

6.3 Network Performance Tuning


🌐 Calico network tuning configuration:


```yaml
# calico-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: calico-config
  namespace: kube-system
data:
  calico_backend: "bird"
  cluster_type: "k8s,bgp"
  felix_ipinipmtu: "1440"
  felix_vxlanmtu: "1410"
  felix_wireguardmtu: "1420"
  felix_bpfenabled: "true"
  felix_bpflogfilters: "all"
  felix_prometheusmetricsenabled: "true"
  felix_prometheusmetricsport: "9091"
  felix_iptablesbackend: "nft"
  felix_chaininsertmode: "insert"
```

🎯 Chapter 7: Real-World Cases - A History of Hard-Won Lessons


Case 1: Handling the Traffic Surge of a Double 11 Sale


Background: During the Double 11 shopping festival, traffic on an e-commerce platform surged to 10x normal and cluster resources ran critically low.


Symptoms:





Solution:


```yaml
# 1. Emergency scale-out HPA configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 10
  maxReplicas: 100
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
```
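
Applying the HPA and watching it react during the surge is straightforward; the filename below is just an example:

```bash
kubectl apply -f web-app-hpa.yaml    # filename is an example
kubectl get hpa web-app-hpa -w       # watch utilization targets and replica count live
kubectl describe hpa web-app-hpa     # shows scaling events and any metric errors
```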

Lessons learned:





Case 2: Recovering from etcd Data Corruption


Background: In production, a disk failure corrupted the etcd data.


Symptoms:





Recovery process:


```bash
# 1. Quickly assess the extent of the damage
etcdctl endpoint health --cluster

# 2. Restore rapidly from the most recent backup
./etcd-restore.sh backup-20231201-030000.db

# 3. Verify cluster state
kubectl get nodes
kubectl get pods --all-namespaces

# 4. Re-apply any configuration that was lost
kubectl apply -f critical-workloads/
```

Preventive measures:





🎉 Conclusion: The Ops Journey Never Ends


After all these years in the trenches, I have come to appreciate just how complex and demanding Kubernetes operations can be. But it is precisely these challenges that keep us growing and make our skills solid.


🔥 Key takeaways:


  1. Prevention beats cure - a solid monitoring and alerting system is the foundation

  2. Automate everything - manual operations will eventually go wrong; automation is the way

  3. Backups, backups, backups - worth saying three times; data is priceless

  4. Document your procedures - standardized runbooks prevent human error

  5. Keep learning - the technology moves fast, so keep a learner's mindset