k8s Sailing Log - Running Aground: Three of Four Turbine Engines Knocked Out, Causing 502 Errors

We are very sorry for the trouble this grounding incident has caused you, and we ask for your understanding.

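After we published the maiden-voyage post of the k8s Sailing Log yesterday, a fellow community member sent congratulations in the comments: "The Titanic has set sail [doge]". True to those auspicious words, the ship ran aground today; fortunately it was not an iceberg. After the grounding we received exactly one congratulatory telegram, signed "the docker swarm cluster next door, just idling by".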

The grounding happened this morning, between about 10:18 and 10:30. At the time the ship was running on four turbine engines (4 nodes).

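Starting at about 10:18, engines 3 and 4 (nodes k8s-n3 and k8s-n4) were knocked out and stalled, and repeated attempts to re-ignite them failed (restarting the blog-web pod kept failing). The syslog error log is shown below.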

Dec 14 10:18:01 k8s-n3 kubelet[702]: E1214 10:18:01.739352     702 pod_workers.go:191] 
Error syncing pod 9b27ac6f-5518-4e12-862f-63b1254457d2 ("blog-web-r4zld_production(9b27ac6f-5518-4e12-862f-63b1254457d2)"), skipping: failed to "StartContainer" for "blog-web" with CrashLoopBackOff: "back-off 2m40s restarting failed container=blog-web pod=blog-web-r4zld_production(9b27ac6f-5518-4e12-862f-63b1254457d2)

At about 10:20, engine 2 (k8s-n2) was also knocked out and stalled.

Dec 14 10:20:12 k8s-n2 kubelet[703]: E1214 10:20:12.138738     703 pod_workers.go:191] 
Error syncing pod 4ab7b193-cf0d-4a41-b83a-689d546acb2f ("blog-web-4dh84_production(4ab7b193-cf0d-4a41-b83a-689d546acb2f)"), skipping: failed to "StartContainer" for "blog-web" with CrashLoopBackOff: "back-off 2m40s restarting failed container=blog-web pod=blog-web-4dh84_production(4ab7b193-cf0d-4a41-b83a-689d546acb2f)"

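The only survivor was engine 1 (k8s-n1), but even giving it everything it had, it could not drive the huge ship forward on its own, so all we could do was stop the ship and send out 502 distress signals.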

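After receiving the distress signal, we manually changed the livenessProbe timeout with the command below. Once the daemonset had redeployed the pods, service returned to normal.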

kubectl edit daemonset blog-web
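
The post does not say which timeout value was chosen. As a minimal sketch, the same change could also be written as a strategic merge patch instead of an interactive edit. The container name blog-web and the namespace production are read off the kubelet log above; the value 5 and the file name are assumptions made only for illustration.

```yaml
# liveness-timeout-patch.yaml (hypothetical file name)
# Strategic merge patch: containers are matched by name, so only the
# field listed here is changed on the blog-web container.
# It could be applied with, for example:
#   kubectl -n production patch daemonset blog-web --patch "$(cat liveness-timeout-patch.yaml)"
spec:
    template:
        spec:
            containers:
                - name: blog-web
                  livenessProbe:
                      timeoutSeconds: 5   # assumed value; the post does not state the one used
```

Keeping the change in a file like this makes it reviewable and repeatable, whereas kubectl edit leaves no record beyond the live object.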

After that, we started engine 5 (k8s-n5), and the k8s-tanic set sail again.

The cause of the failure still needs further investigation.

Health check configuration of the blog-web daemonset:

livenessProbe:
    httpGet:
        path: /alive
        port: 80
    initialDelaySeconds: 10
    periodSeconds: 3
readinessProbe:
    exec:
        command:
            - curl
            - -H
            - 'X-Forwarded-Proto:https'
            - --resolve
            - www.nxrnyq.tw:80:127.0.0.1
            - www.nxrnyq.tw
    initialDelaySeconds: 30
    periodSeconds: 5
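
The liveness probe above sets neither timeoutSeconds nor failureThreshold, so the Kubernetes defaults apply (1 second and 3 consecutive failures). Written out explicitly, the probe as posted behaves like the sketch below. With these values, a container whose /alive endpoint takes longer than 1 second on three consecutive checks, 3 seconds apart, is restarted. That may be why raising the timeout helped, but it is a reading of the defaults, not a confirmed root cause.

```yaml
# The liveness probe from the post with Kubernetes defaults made explicit.
# The last three fields are NOT in the original config; they are the
# documented defaults that take effect when the fields are omitted.
livenessProbe:
    httpGet:
        path: /alive
        port: 80
    initialDelaySeconds: 10
    periodSeconds: 3
    timeoutSeconds: 1      # default: each check must answer within 1 second
    failureThreshold: 3    # default: restart after 3 consecutive failures
    successThreshold: 1    # default (must be 1 for liveness probes)
```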

The following syslog error entries still need to be investigated and confirmed:

Dec 14 10:18:53 k8s-n2 dockerd[1045]: time="2019-12-14T10:18:53.719195677+08:00" level=info msg="Container ddf3e4ed0dd63878dd1c87cb63cfd57d712f8719fb097e6c8ef15587eb3f81da failed to exit within 30 seconds of signal 15 - using the force"

Dec 14 10:18:54 k8s-n2 dockerd[1045]: time="2019-12-14T10:18:54.008174148+08:00" level=error msg="stream copy error: reading from a closed fifo"

Dec 14 10:18:54 k8s-n2 dockerd[1045]: time="2019-12-14T10:18:54.056924047+08:00" level=error msg="Error running exec 827374c9541db5b8d69383798c961078cba8fee08d1c8b93e84622b6a9caa61c in container: OCI runtime exec failed: exec failed: container_linux.go:346: starting container process caused \"process_linux.go:101: executing setns process caused \\\"exit status 1\\\"\": unknown"

Dec 14 10:18:54 k8s-n2 dockerd[1045]: time="2019-12-14T10:18:54.129287298+08:00" level=warning msg="ddf3e4ed0dd63878dd1c87cb63cfd57d712f8719fb097e6c8ef15587eb3f81da cleanup: failed to unmount IPC: umount /var/lib/docker/containers/ddf3e4ed0dd63878dd1c87cb63cfd57d712f8719fb097e6c8ef15587eb3f81da/mounts/shm, flags: 0x2: no such file or directory"
posted @ 2019-12-14 13:19  cnblogs team (博客園團隊)