Kubernetes in Production: A Complete Guide from Cluster Planning to Disaster Recovery
💡 Preface: As an ops veteran who has managed more than 100 K8s clusters in production, I have probably stepped into more pits than most people have ever seen. This article shares, holding nothing back, the hard-won lessons I have collected on real projects, so you can steer clear of the production incidents that cause the worst headaches.
🎯 Why is this article worth bookmarking?
- ✅ Driven by real cases: every best practice comes from a painful lesson learned in production
🏗️ Chapter 1: Cluster Planning - Architecture Determines Destiny
1.1 The Golden Rules of Hardware Resource Planning
❌ A common mistake:
```
# How resources are often planned
Master nodes: 2C4G × 3
Worker nodes: 4C8G × 5
```
✅ Production best practice:
```
# Capacity planning driven by actual business load
Master nodes:         4C8G × 3    (odd number, to avoid split-brain)
Worker nodes:         8C16G × N   (size N as business peak × 1.5)
Dedicated etcd nodes: 2C4G × 3    (SSD storage is a must)
```
💰 Cost optimization tips:
- Run Worker nodes on Spot instances; for interruption-tolerant workloads this can cut compute cost by up to roughly 70%
- Enable the Vertical Pod Autoscaler (VPA) so resource requests are tuned automatically (a minimal sketch follows this list)
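To make the VPA tip concrete, here is a minimal sketch. It assumes the VPA components from the kubernetes/autoscaler project are already installed, and `web-app` is a hypothetical Deployment used only for illustration:

```bash
# Minimal VPA sketch; requires the VPA recommender/updater/admission components
# to be installed, and "web-app" is an illustrative Deployment name.
kubectl apply -f - <<'EOF'
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
  namespace: default
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Auto"   # VPA may evict Pods and re-create them with tuned requests
EOF

# Inspect the recommendations once the recommender has gathered some data:
kubectl describe vpa web-app-vpa -n default
```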
1.2 Key Points of Network Architecture Design
Flannel vs Calico vs Cilium selection comparison:

| Network plugin | Performance | Network policy support | Complexity | Recommended scenario |
| --- | --- | --- | --- | --- |
| Flannel | ⭐⭐⭐ | ❌ | ⭐ | Small clusters, quick deployment |
| Calico | ⭐⭐⭐⭐ | ✅ | ⭐⭐⭐ | Medium to large clusters that need network policy |
| Cilium | ⭐⭐⭐⭐⭐ | ✅ | ⭐⭐⭐⭐ | High performance requirements, cloud-native environments |

🔥 Recommended production configuration (Calico):
```yaml
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    ipPools:
      - blockSize: 26
        cidr: 10.244.0.0/16
        encapsulation: VXLAN
        natOutgoing: Enabled
  nodeMetricsPort: 9091
```
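After applying the Installation resource, a quick sanity check confirms the operator rolled Calico out as expected. The commands below assume a default Tigera-operator install (namespaces and resource names are the operator defaults):

```bash
# All Tigera/Calico components should eventually report Available
kubectl get tigerastatus

# calico-node should be Running on every node
kubectl get pods -n calico-system -o wide

# Confirm the IP pool matches the CIDR/encapsulation from the spec above
kubectl get ippools.crd.projectcalico.org -o yaml | grep -E 'cidr|vxlanMode|blockSize'
```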
⚙️ Chapter 2: Cluster Deployment - Automation Is King
2.1 Rapid Deployment with kubeadm
🚀 One-shot deployment script (tested in real environments):
```bash
#!/bin/bash
# Automated K8s cluster bootstrap script v2.0
# Note: the install steps below are yum-based (CentOS/RHEL); adapt the package
# commands for Ubuntu/Debian systems.
set -e

# Environment checks
check_requirements() {
    echo "🔍 Checking the system environment..."

    # Check the operating system
    if [[ ! -f /etc/redhat-release ]] && [[ ! -f /etc/debian_version ]]; then
        echo "❌ Only CentOS/RHEL or Ubuntu systems are supported"
        exit 1
    fi

    # Check memory
    total_mem=$(free -m | awk 'NR==2{printf "%.0f", $2}')
    if [[ $total_mem -lt 2048 ]]; then
        echo "⚠️ Warning: less than 2GB of memory, cluster stability may suffer"
    fi

    echo "✅ Environment check passed"
}

# System initialization
init_system() {
    echo "🛠️ Initializing system configuration..."

    # Disable the firewall and SELinux
    systemctl disable --now firewalld
    setenforce 0
    sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config

    # Disable swap
    swapoff -a
    sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab

    # Load kernel modules
    cat <<EOF | tee /etc/modules-load.d/k8s.conf
br_netfilter
overlay
EOF
    modprobe br_netfilter
    modprobe overlay

    # Configure kernel parameters
    cat <<EOF | tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
    sysctl --system

    echo "✅ System initialization finished"
}

# Install the container runtime
install_containerd() {
    echo "📦 Installing containerd..."

    # Install dependencies
    yum install -y yum-utils device-mapper-persistent-data lvm2

    # Add the Docker repository
    yum-config-manager --add-repo https://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo

    # Install containerd
    yum install -y containerd.io

    # Configure containerd
    mkdir -p /etc/containerd
    containerd config default | tee /etc/containerd/config.toml

    # Use the systemd cgroup driver
    sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml

    # Configure the registry mirror
    sed -i 's|registry.k8s.io|registry.aliyuncs.com/google_containers|g' /etc/containerd/config.toml

    systemctl enable --now containerd

    echo "✅ containerd installed"
}

# Install kubeadm, kubelet and kubectl
install_kubernetes() {
    echo "🎯 Installing Kubernetes components..."

    cat <<EOF | tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://mirrors.aliyun.com/kubernetes/yum/repos/kubernetes-el7-x86_64/
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://mirrors.aliyun.com/kubernetes/yum/doc/yum-key.gpg https://mirrors.aliyun.com/kubernetes/yum/doc/rpm-package-key.gpg
EOF

    yum install -y kubelet kubeadm kubectl --disableexcludes=kubernetes
    systemctl enable --now kubelet

    echo "✅ Kubernetes components installed"
}

# Entry point
main() {
    check_requirements
    init_system
    install_containerd
    install_kubernetes

    echo "🎉 Base cluster environment is ready!"
    echo "📋 Next steps:"
    echo "  Master node:  kubeadm init --config=kubeadm-config.yaml"
    echo "  Worker nodes: kubeadm join <master-ip>:6443 --token <token>"
}

main "$@"
```
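The script intentionally stops before `kubeadm init`. A sketch of the typical follow-up on the first master (standard kubeadm defaults; adjust for your environment):

```bash
# 1. After `kubeadm init` succeeds, set up kubectl access:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# 2. Bootstrap tokens expire (24h by default); print a fresh worker join command any time:
kubeadm token create --print-join-command

# 3. Verify the control plane is healthy before joining more nodes:
kubectl get nodes
kubectl get pods -n kube-system
```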
2.2 Highly Available Master Configuration
Production template for kubeadm-config.yaml:
```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 192.168.1.10
  bindPort: 6443
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.28.2
controlPlaneEndpoint: "k8s-api.example.com:6443"
networking:
  serviceSubnet: "10.96.0.0/16"
  podSubnet: "10.244.0.0/16"
  dnsDomain: "cluster.local"
etcd:
  external:
    endpoints:
      - https://192.168.1.11:2379
      - https://192.168.1.12:2379
      - https://192.168.1.13:2379
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/etcd/server.crt
    keyFile: /etc/kubernetes/pki/etcd/server.key
apiServer:
  certSANs:
    - "k8s-api.example.com"
    - "192.168.1.10"
    - "192.168.1.20"
    - "192.168.1.30"
  extraArgs:
    audit-log-maxage: "30"
    audit-log-maxbackup: "10"
    audit-log-maxsize: "100"
    audit-log-path: "/var/log/audit.log"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
maxPods: 110
```
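With this config in place, the HA control plane is brought up roughly as follows. This is a sketch that assumes `k8s-api.example.com` already load-balances TCP/6443 to all masters; the token, hash, and key placeholders come from your own `kubeadm init` output:

```bash
# On the first master:
kubeadm init --config=kubeadm-config.yaml --upload-certs

# On each additional master, run the control-plane join command printed by init,
# which has this general shape:
kubeadm join k8s-api.example.com:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane --certificate-key <key>
```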
📊 Chapter 3: Monitoring and Alerting - Prevent Problems Before They Happen
3.1 Prometheus + Grafana: The Golden Combination
🔥 Deploying a production-grade monitoring stack:
```yaml
# prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    resources:
      requests:
        memory: 2Gi
        cpu: 1000m
      limits:
        memory: 8Gi
        cpu: 2000m
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
    additionalScrapeConfigs:
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - source_labels: [__address__]
            regex: '(.*):10250'
            replacement: '${1}:9100'
            target_label: __address__

grafana:
  persistence:
    enabled: true
    size: 20Gi
    storageClassName: fast-ssd
  plugins:
    - grafana-piechart-panel
    - grafana-kubernetes-app
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: 'default'
          orgId: 1
          folder: ''
          type: file
          disableDeletion: false
          updateIntervalSeconds: 10
          options:
            path: /var/lib/grafana/dashboards
```
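These values follow the layout of the kube-prometheus-stack Helm chart's values file; if that is how you install the stack, deployment looks roughly like this (release and namespace names are just examples):

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm upgrade --install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  -f prometheus-values.yaml

kubectl get pods -n monitoring
```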
3.2 Key Metrics and Alerting Rules
💀 Must-have alerting rules for production:
```yaml
# critical-alerts.yaml
groups:
  - name: kubernetes-critical
    rules:
      - alert: KubernetesNodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} is not ready"
          description: "Node {{ $labels.node }} has been not ready for more than 5 minutes."

      - alert: KubernetesPodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting frequently."

      - alert: KubernetesMemoryPressure
        expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} has memory pressure"
          description: "Node {{ $labels.node }} is under memory pressure."

      - alert: EtcdClusterDown
        expr: up{job="etcd"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Etcd cluster is down"
          description: "Etcd cluster has been down for more than 1 minute."
```
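It is worth validating the rule file before loading it; `promtool` ships with Prometheus. Note that with the Prometheus Operator these rules would normally be wrapped in a PrometheusRule resource rather than mounted as a plain file:

```bash
# Validate rule syntax and expressions offline
promtool check rules critical-alerts.yaml
```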
🛡️ Chapter 4: Security Hardening - No Security Issue Is Too Small
4.1 RBAC Access Control Best Practices
🔐 Fine-grained access control example:
```yaml
# developer-rbac.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: development
---
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: development
  name: developer
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: development
  name: developer-role
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps", "secrets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-binding
  namespace: development
subjects:
  - kind: ServiceAccount
    name: developer
    namespace: development
roleRef:
  kind: Role
  name: developer-role
  apiGroup: rbac.authorization.k8s.io
```
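A quick way to confirm the Role grants exactly what you intend (and nothing more) is `kubectl auth can-i`, impersonating the ServiceAccount:

```bash
kubectl apply -f developer-rbac.yaml

kubectl auth can-i create deployments -n development \
  --as=system:serviceaccount:development:developer    # expected: yes

kubectl auth can-i delete nodes \
  --as=system:serviceaccount:development:developer    # expected: no
```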
4.2 Pod Security Policy Configuration
🛡️ Hardened Pod security configuration. Note: PodSecurityPolicy was deprecated in v1.21 and removed in v1.25, so the manifest below only works on older clusters; on v1.25+ (including the v1.28.2 cluster configured above) use the built-in Pod Security Admission instead (a sketch follows the manifest):
```yaml
# pod-security-policy.yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted-psp
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  volumes:
    - 'configMap'
    - 'emptyDir'
    - 'projected'
    - 'secret'
    - 'downwardAPI'
    - 'persistentVolumeClaim'
  runAsUser:
    rule: 'MustRunAsNonRoot'
  seLinux:
    rule: 'RunAsAny'
  fsGroup:
    rule: 'RunAsAny'
```
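On v1.25+ clusters the rough equivalent of this PSP is the built-in Pod Security Admission controller. A minimal sketch, assuming you want the `restricted` profile enforced on the `development` namespace used earlier:

```bash
# Enforce (and also warn/audit on) the built-in "restricted" profile
kubectl label namespace development \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted

# Server-side dry run: see which existing workloads would violate the profile
kubectl label --dry-run=server --overwrite namespace development \
  pod-security.kubernetes.io/enforce=restricted
```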
🚨 Chapter 5: Failure Recovery - Life-Saving Skills for Staying Calm in a Crisis
5.1 Diagnostic Workflow for Common Failures
🩺 A systematic troubleshooting checklist:
```bash
#!/bin/bash
# k8s-health-check.sh - cluster health check script

echo "🏥 Starting the Kubernetes cluster health check..."

# 1. Node status
echo "📋 1. Checking node status"
kubectl get nodes -o wide
echo ""

# 2. System Pods
echo "📋 2. Checking system Pod status"
kubectl get pods -n kube-system
echo ""

# 3. Storage classes
echo "📋 3. Checking storage classes"
kubectl get storageclass
echo ""

# 4. Network plugin
echo "📋 4. Checking network plugin status"
kubectl get pods -n kube-system | grep -E "(calico|flannel|weave|cilium)"
echo ""

# 5. API server connectivity
echo "📋 5. Checking the API server"
kubectl cluster-info
echo ""

# 6. etcd status
echo "📋 6. Checking etcd cluster status"
kubectl get pods -n kube-system | grep etcd
echo ""

# 7. Resource usage
echo "📋 7. Checking resource usage"
kubectl top nodes
echo ""

# 8. Recent events
echo "📋 8. Recent cluster events"
kubectl get events --sort-by=.metadata.creationTimestamp | tail -10
echo ""

echo "✅ Health check finished!"
```
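Once the checklist points at a specific object, these follow-up commands usually tell you why (names in angle brackets are placeholders):

```bash
# Scheduling decisions, probe failures and recent events for one Pod
kubectl describe pod <pod-name> -n <namespace>

# Logs from the previous (crashed) container instance
kubectl logs <pod-name> -n <namespace> --previous

# Events filtered to one object, oldest first
kubectl get events -n <namespace> \
  --field-selector involvedObject.name=<pod-name> \
  --sort-by=.metadata.creationTimestamp
```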
5.2 etcd Backup and Restore in Practice
💾 Automated etcd backup script:
```bash
#!/bin/bash
# etcd-backup.sh - automated etcd backup script

BACKUP_DIR="/opt/etcd-backup"
DATE=$(date +%Y%m%d-%H%M%S)
BACKUP_NAME="etcd-backup-${DATE}.db"

# Create the backup directory
mkdir -p ${BACKUP_DIR}

# Take the snapshot
ETCDCTL_API=3 etcdctl snapshot save ${BACKUP_DIR}/${BACKUP_NAME} \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the backup
ETCDCTL_API=3 etcdctl snapshot status ${BACKUP_DIR}/${BACKUP_NAME} \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

if [ $? -eq 0 ]; then
    echo "✅ etcd backup succeeded: ${BACKUP_NAME}"

    # Delete backups older than 7 days
    find ${BACKUP_DIR} -name "etcd-backup-*.db" -mtime +7 -delete

    # Upload to cloud storage (optional)
    # aws s3 cp ${BACKUP_DIR}/${BACKUP_NAME} s3://your-backup-bucket/etcd/
else
    echo "❌ etcd backup failed"
    exit 1
fi
```
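To actually run this on a schedule, a cron entry is enough. This sketch assumes the script is saved as /opt/scripts/etcd-backup.sh (the path is illustrative):

```bash
# Run the backup daily at 03:00 and keep a log of each run
cat <<'EOF' | sudo tee /etc/cron.d/etcd-backup
0 3 * * * root /opt/scripts/etcd-backup.sh >> /var/log/etcd-backup.log 2>&1
EOF
```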
🔄 Emergency etcd restore steps:
```bash
# 1. Stop etcd and the API server on every master node
#    (assumes etcd is managed by systemd, as with the external etcd setup above)
systemctl stop etcd kubelet

# 2. Move the old etcd data out of the way
mv /var/lib/etcd /var/lib/etcd.backup

# 3. Restore from the backup
ETCDCTL_API=3 etcdctl snapshot restore /opt/etcd-backup/etcd-backup-latest.db \
  --data-dir=/var/lib/etcd \
  --name=master1 \
  --initial-cluster=master1=https://192.168.1.10:2380 \
  --initial-cluster-token=etcd-cluster-1 \
  --initial-advertise-peer-urls=https://192.168.1.10:2380

# 4. Fix ownership
chown -R etcd:etcd /var/lib/etcd

# 5. Restart the services
systemctl start etcd kubelet
```
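After the restore, verify etcd and the API server before declaring victory; the certificate paths below match the ones used in the backup script:

```bash
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

kubectl get nodes
kubectl get pods -n kube-system
```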
5.3 Emergency Incident Playbook
🚨 Emergency handling workflow for production (a few first-response commands follow the table):
| Failure type | Symptoms | Immediate actions | Root-cause analysis |
| --- | --- | --- | --- |
| Node down | Node NotReady | 1. Check network connectivity 2. Restart the kubelet service 3. Evict Pods to other nodes | Hardware failure / network partition |
| Pod won't start | Pending / ImagePullBackOff | 1. Check resource quotas 2. Verify the image reference 3. Check node selectors | Insufficient resources / image issues |
| Network failure | Services unreachable | 1. Check CNI plugin status 2. Restart network-related Pods 3. Verify iptables rules | CNI plugin malfunction |
| Storage issue | PVC Pending | 1. Check the StorageClass 2. Verify the storage backend 3. Clean up stuck PVs | Storage provider problems |
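A few first-response commands that map to the rows above (object names in angle brackets are placeholders):

```bash
# Node down: move workloads off the failed node
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Pod Pending / ImagePullBackOff: check events, quotas and the image reference
kubectl describe pod <pod-name> -n <namespace>
kubectl get resourcequota -n <namespace>

# PVC Pending: check the StorageClass and what the provisioner thinks of the claim
kubectl get storageclass
kubectl describe pvc <pvc-name> -n <namespace>
```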
⚡ Chapter 6: Performance Tuning - Make the Cluster Fly
6.1 API Server Tuning
🚀 API server configuration for high concurrency:
```yaml
# kube-apiserver tuning parameters
# (excerpt of the static Pod manifest; only the relevant flags are shown)
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
    - command:
        - kube-apiserver
        - --advertise-address=192.168.1.10
        - --max-requests-inflight=2000
        - --max-mutating-requests-inflight=1000
        - --default-watch-cache-size=500
        - --watch-cache-sizes=nodes#1000,pods#5000
        - --enable-admission-plugins=NodeRestriction,LimitRanger,ResourceQuota
        - --audit-log-maxage=30
        - --audit-log-maxbackup=10
        - --audit-log-maxsize=100
        - --request-timeout=300s
        - --storage-backend=etcd3
        - --etcd-compaction-interval=5m
```
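Whether the raised in-flight limits are actually being approached can be read from the API server's own metrics; a quick check (assumes you are allowed to read the /metrics endpoint):

```bash
# Current in-flight read-only and mutating requests, as reported by the apiserver
kubectl get --raw /metrics | grep '^apiserver_current_inflight_requests'
```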
6.2 Node Resource Tuning
💡 Tuning key kubelet parameters:
```yaml
# kubelet-config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 110
podsPerCore: 10
enableControllerAttachDetach: true
hairpinMode: promiscuous-bridge
serializeImagePulls: false
registryPullQPS: 10
registryBurst: 20
eventRecordQPS: 50
eventBurst: 100
kubeAPIQPS: 50
kubeAPIBurst: 100
systemReserved:
  cpu: 500m
  memory: 1Gi
  ephemeral-storage: 2Gi
kubeReserved:
  cpu: 500m
  memory: 1Gi
  ephemeral-storage: 1Gi
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
```
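On a kubeadm-managed node the kubelet reads /var/lib/kubelet/config.yaml, so one way to roll this out is the sketch below. Back up the existing file first and drain the node if a kubelet restart is disruptive; `<node-name>` is a placeholder and `jq` is optional:

```bash
sudo cp /var/lib/kubelet/config.yaml /var/lib/kubelet/config.yaml.bak
sudo cp kubelet-config.yaml /var/lib/kubelet/config.yaml
sudo systemctl restart kubelet

# Confirm the running kubelet picked up the new settings
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" | jq .
```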
6.3 Network Performance Tuning
🌐 Calico network tuning configuration:
```yaml
# calico-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: calico-config
  namespace: kube-system
data:
  calico_backend: "bird"
  cluster_type: "k8s,bgp"
  felix_ipinipmtu: "1440"
  felix_vxlanmtu: "1410"
  felix_wireguardmtu: "1420"
  felix_bpfenabled: "true"
  felix_bpflogfilters: "all"
  felix_prometheusmetricsenabled: "true"
  felix_prometheusmetricsport: "9091"
  felix_iptablesbackend: "nft"
  felix_chaininsertmode: "insert"
```
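Assuming a manifest-based Calico install in kube-system (which is what this ConfigMap implies), calico-node has to be restarted to pick the settings up. This is a best-guess verification sketch, not an official procedure:

```bash
kubectl -n kube-system rollout restart daemonset/calico-node
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide

# Look for the MTU / BPF settings in the calico-node logs
kubectl -n kube-system logs -l k8s-app=calico-node -c calico-node --tail=50 | grep -iE 'mtu|bpf'
```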
🎯 Chapter 7: Real-World Cases - A History of Hard-Won Lessons
Case 1: Handling the Double 11 Traffic Surge
Background: during the Double 11 mega sale, traffic on an e-commerce platform surged to 10x normal and cluster resources ran critically short.
Symptoms:
Solution:
```yaml
# 1. Emergency HPA scale-out configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 10
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
```
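To watch the HPA react during the spike, the commands below are enough; they assume metrics-server is installed and that the Deployment's Pods carry an `app=web-app` label (an assumption about this workload):

```bash
kubectl apply -f web-app-hpa.yaml

# Live view of current/target utilization and replica count
kubectl get hpa web-app-hpa -w

# Cross-check actual Pod usage
kubectl top pods -l app=web-app
```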
Lessons learned:
Case 2: Recovering from etcd Data Corruption
Background: a disk failure corrupted the etcd data in a production cluster.
Symptoms:
Recovery process:
```bash
# 1. Assess how badly the data is damaged
etcdctl endpoint health --cluster

# 2. Restore quickly from the backup
./etcd-restore.sh backup-20231201-030000.db

# 3. Verify the cluster state
kubectl get nodes
kubectl get pods --all-namespaces

# 4. Re-create any configuration that was lost
kubectl apply -f critical-workloads/
```
Preventive measures:
🎉 Conclusion: The Road of Operations Never Ends
After all these years of rough-and-tumble experience, I have come to appreciate just how complex and challenging Kubernetes operations can be. But it is exactly these challenges that keep us growing and make our skills more solid.
🔥 Key takeaways:
- Prevention beats cure: a solid monitoring and alerting system is the foundation
- Automate everything: manual operations will eventually go wrong; automation is the only sustainable path
- Backups, backups, backups: worth saying three times, because data is priceless
- Document your procedures: standardized runbooks prevent human error
- Keep learning: the technology moves fast, so stay in learning mode