A docker hang causing "container runtime is down"

1 Background

We recently rolled out a new kubelet release to fix an issue where slow Pod deletion in kubelet caused cluster deletion on our platform to time out. While canarying it on the isolated redis clusters, we found that after upgrading kubelet and restarting the service, a small number of hosts turned NotReady, and rolling kubelet back to the previous version did not bring them back. The node status showed 'container runtime is down', which in our experience usually means the container runtime itself is in trouble. Our Elastic Cloud platform uses docker as its container runtime, so we checked docker's health, with the following results:

  • docker ps (list the state of all containers): works fine
  • docker inspect (detailed state of one particular container): blocks

A textbook docker hang. We happen to be in the middle of a docker upgrade: existing hosts run docker 1.13.1 and are being gradually moved to 18.06.3, and all new hosts already run 18.06.3. On 1.13.1 the hang is even more thorough: docker ps itself hangs, so as soon as one container goes bad the whole daemon stops responding. docker 18.06.3 contains a small optimisation that removes the per-container locking from docker ps, but docker inspect still takes the container lock, so a single bad container no longer renders the daemon unresponsive; only that container is affected, and no operation on it can make progress.

As for why docker ps and docker inspect are the indicators we use to check docker's health: these are exactly the two docker APIs that kubelet relies on to obtain container state.

So we now have two questions:

  1. What is the root cause of the docker hang?
  2. While docker is hung, why does restarting kubelet flip the node to NotReady?

2 Restarting kubelet flips the node status

Compared with the docker hang itself, the question of why a kubelet restart flips the node from Ready to NotReady is the simpler one, so we investigated it first.

kubelet sets several Conditions on a node to describe its current state, e.g. whether memory or PID resources are under pressure, and whether the node is ready. The ReadyCondition indicates whether the node is ready; the Status column shown by kubectl when listing nodes is exactly the content of the ReadyCondition. The common states and their meanings are:

  1. Ready: the node is fine and can handle Pod events normally
  2. NotReady: kubelet is still running but can no longer handle Pod events; in the vast majority of cases this means the container runtime is in trouble
  3. Unknown: kubelet on the node has stopped running

The ReadyCondition is computed by kubelet as follows:

// defaultNodeStatusFuncs is a factory that generates the default set of
// setNodeStatus funcs
func (kl *Kubelet) defaultNodeStatusFuncs() []func(*v1.Node) error {
    ......
    setters = append(setters,
        nodestatus.OutOfDiskCondition(kl.clock.Now, kl.recordNodeStatusEvent),
        nodestatus.MemoryPressureCondition(kl.clock.Now, kl.evictionManager.IsUnderMemoryPressure, kl.recordNodeStatusEvent),
        nodestatus.DiskPressureCondition(kl.clock.Now, kl.evictionManager.IsUnderDiskPressure, kl.recordNodeStatusEvent),
        nodestatus.PIDPressureCondition(kl.clock.Now, kl.evictionManager.IsUnderPIDPressure, kl.recordNodeStatusEvent),
        nodestatus.ReadyCondition(kl.clock.Now, kl.runtimeState.runtimeErrors, kl.runtimeState.networkErrors, validateHostFunc, kl.containerManager.Status, kl.recordNodeStatusEvent),
        nodestatus.VolumesInUse(kl.volumeManager.ReconcilerStatesHasBeenSynced, kl.volumeManager.GetVolumesInUse),
        // TODO(mtaufen): I decided not to move this setter for now, since all it does is send an event
        // and record state back to the Kubelet runtime object. In the future, I'd like to isolate
        // these side-effects by decoupling the decisions to send events and partial status recording
        // from the Node setters.
        kl.recordNodeSchedulableEvent,
    )
    return setters
}

Digging into the implementation of nodestatus.ReadyCondition shows that node readiness depends on many things: the runtime check, the network check, basic resource checks, and so on. Here we only need to look at the runtime check:

func (s *runtimeState) runtimeErrors() []string {
    s.RLock()
    defer s.RUnlock()
    var ret []string
    if !s.lastBaseRuntimeSync.Add(s.baseRuntimeSyncThreshold).After(time.Now()) { // 1
        ret = append(ret, "container runtime is down")
    }
    if s.internalError != nil {
        ret = append(ret, s.internalError.Error())
    }
    for _, hc := range s.healthChecks { // 2
        if ok, err := hc.fn(); !ok {
            ret = append(ret, fmt.Sprintf("%s is not healthy: %v", hc.name, err))
        }
    }
    return ret
}

The runtime check fails when either of the following holds:

  1. The time since the last runtime sync exceeds a threshold (30s by default)
  2. A runtime health check fails

So which of the two caused the NotReady here? The kubelet log gives the answer: kubelet printed the following line every 5 seconds:

......
I0715 10:43:28.049240 16315 kubelet.go:1835] skipping pod synchronization - [container runtime is down]
I0715 10:43:33.049359 16315 kubelet.go:1835] skipping pod synchronization - [container runtime is down]
I0715 10:43:38.049492 16315 kubelet.go:1835] skipping pod synchronization - [container runtime is down]
......

So condition 1 is the culprit behind the NotReady status.

Next we looked at why kubelet failed to update lastBaseRuntimeSync as expected. At startup kubelet spawns a goroutine that sets lastBaseRuntimeSync in a loop:

func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
    ......
    go wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop)
    ......
}

func (kl *Kubelet) updateRuntimeUp() {
    kl.updateRuntimeMux.Lock()
    defer kl.updateRuntimeMux.Unlock()
    ......
    kl.oneTimeInitializer.Do(kl.initializeRuntimeDependentModules)
    kl.runtimeState.setRuntimeSync(kl.clock.Now())
}

Normally kubelet sets lastBaseRuntimeSync to the current time every 5 seconds, but on the affected node this timestamp was never updated, which means updateRuntimeUp was stuck at some step before setting lastBaseRuntimeSync. All we had to do was walk through the calls inside updateRuntimeUp one by one (details omitted); the final call chain is:

initializeRuntimeDependentModules -> kl.cadvisor.Start -> cc.Manager.Start -> self.createContainer -> m.createContainerLocked -> container.NewContainerHandler -> factory.CanHandleAndAccept -> self.client.ContainerInspect

Because one container was in a bad state, the docker inspect issued by kubelet hung as well.

So the root cause of the node flipping from Ready to NotReady on kubelet restart is that one container was in a bad state and docker inspect on it hung. Had the docker inspect hang occurred after kubelet was already running, the node's Ready status would not have been affected at all, because oneTimeInitializer is a sync.Once and therefore runs only once, at kubelet startup. In that case kubelet would merely be unable to process events for that particular Pod (deletion, updates, and so on) while still handling events for every other Pod.
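
To make this mechanism concrete, here is a small self-contained model (not kubelet code; the variables merely stand in for the ones named above) of how the first updateRuntimeUp call, stuck inside the sync.Once, keeps the sync timestamp stale and trips the 30s check:

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    var (
        updateMu sync.Mutex   // stands in for kl.updateRuntimeMux
        stateMu  sync.RWMutex // stands in for runtimeState's lock
        once     sync.Once    // stands in for kl.oneTimeInitializer
        lastSync time.Time    // stands in for lastBaseRuntimeSync
    )
    threshold := 30 * time.Second

    updateRuntimeUp := func() {
        updateMu.Lock()
        defer updateMu.Unlock()
        // initializeRuntimeDependentModules hangs forever in docker inspect.
        once.Do(func() { select {} })
        stateMu.Lock()
        lastSync = time.Now() // never reached
        stateMu.Unlock()
    }

    // wait.Until(kl.updateRuntimeUp, 5*time.Second, ...): the first call
    // never returns, so the timestamp is never refreshed.
    go func() {
        for {
            updateRuntimeUp()
            time.Sleep(5 * time.Second)
        }
    }()

    // The node status loop evaluates the same condition as runtimeErrors().
    for i := 0; i < 3; i++ {
        time.Sleep(5 * time.Second)
        stateMu.RLock()
        stale := !lastSync.Add(threshold).After(time.Now())
        stateMu.RUnlock()
        if stale {
            fmt.Println("container runtime is down")
        }
    }
}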

One might ask: why is there no timeout on the docker inspect call made during kubelet startup? Indeed, with a timeout in place, restarting kubelet would not change the node status. We will dig into that in a follow-up; for now let's turn to the docker hang itself.
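
As a hedged illustration of what such timeout control could look like (using the Docker Go SDK directly rather than the real kubelet/cAdvisor code path), wrapping the inspect call in a context with a deadline makes the client give up instead of blocking forever, even though the daemon-side handler stays stuck:

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/docker/docker/client"
)

func inspectWithTimeout(cli *client.Client, id string) error {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    // When dockerd can never acquire the container lock, this returns
    // context.DeadlineExceeded instead of hanging the caller.
    _, err := cli.ContainerInspect(ctx, id)
    return err
}

func main() {
    // On older SDK versions this is client.NewEnvClient().
    cli, err := client.NewClientWithOpts(client.FromEnv)
    if err != nil {
        panic(err)
    }
    // "bbd5e4b5f9c1" is just the example container ID from this incident.
    if err := inspectWithTimeout(cli, "bbd5e4b5f9c1"); err != nil {
        fmt.Println("inspect failed:", err)
    }
}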

3 The docker hang

docker hangs are nothing new to us; we have hit quite a few, with a variety of symptoms. Previous investigations on docker 1.13.1 turned up clues but never the root cause, and almost all of those incidents were eventually "fixed" by restarting docker. This time, for the hang on docker 18.06.3, a four-person task force spent close to a week on the diagnosis and finally nailed the cause. Note that there is more than one way for docker to hang, so this prescription is by no means a cure-all.

At this point all we knew was that docker was unhealthy and could not answer docker inspect for one particular container; we knew nothing else.

3.1 Tracing the call chain

First we wanted a global picture of what docker was doing. Anyone familiar with Go development will immediately think of pprof, and with it we drew a blueprint of docker at that moment:

goroutine profile: total 722373
717594 @ 0x7fe8bc202980 0x7fe8bc202a40 0x7fe8bc2135d8 0x7fe8bc2132ef 0x7fe8bc238c1a 0x7fe8bd56f7fe 0x7fe8bd56f6bd 0x7fe8bcea8719 0x7fe8bcea938b 0x7fe8bcb726ca 0x7fe8bcb72b01 0x7fe8bc71c26b 0x7fe8bcb85f4a 0x7fe8bc4b9896 0x7fe8bc72a438 0x7fe8bcb849e2 0x7fe8bc4bc67e 0x7fe8bc4b88a3 0x7fe8bc230711
# 0x7fe8bc2132ee sync.runtime_SemacquireMutex+0x3e /usr/local/go/src/runtime/sema.go:71
# 0x7fe8bc238c19 sync.(*Mutex).Lock+0x109 /usr/local/go/src/sync/mutex.go:134
# 0x7fe8bd56f7fd github.com/docker/docker/daemon.(*Daemon).ContainerInspectCurrent+0x8d /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/daemon/inspect.go:40
# 0x7fe8bd56f6bc github.com/docker/docker/daemon.(*Daemon).ContainerInspect+0x11c /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/daemon/inspect.go:29
# 0x7fe8bcea8718 github.com/docker/docker/api/server/router/container.(*containerRouter).getContainersByName+0x118 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/router/container/inspect.go:15
# 0x7fe8bcea938a github.com/docker/docker/api/server/router/container.(*containerRouter).(github.com/docker/docker/api/server/router/container.getContainersByName)-fm+0x6a /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/router/container/container.go:39
# 0x7fe8bcb726c9 github.com/docker/docker/api/server/middleware.ExperimentalMiddleware.WrapHandler.func1+0xd9 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/middleware/experimental.go:26
# 0x7fe8bcb72b00 github.com/docker/docker/api/server/middleware.VersionMiddleware.WrapHandler.func1+0x400 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/middleware/version.go:62
# 0x7fe8bc71c26a github.com/docker/docker/pkg/authorization.(*Middleware).WrapHandler.func1+0x7aa /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/pkg/authorization/middleware.go:59
# 0x7fe8bcb85f49 github.com/docker/docker/api/server.(*Server).makeHTTPHandler.func1+0x199 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/server.go:141
# 0x7fe8bc4b9895 net/http.HandlerFunc.ServeHTTP+0x45 /usr/local/go/src/net/http/server.go:1947
# 0x7fe8bc72a437 github.com/docker/docker/vendor/github.com/gorilla/mux.(*Router).ServeHTTP+0x227 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/vendor/github.com/gorilla/mux/mux.go:103
# 0x7fe8bcb849e1 github.com/docker/docker/api/server.(*routerSwapper).ServeHTTP+0x71 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/router_swapper.go:29
# 0x7fe8bc4bc67d net/http.serverHandler.ServeHTTP+0xbd /usr/local/go/src/net/http/server.go:2694
# 0x7fe8bc4b88a2 net/http.(*conn).serve+0x652 /usr/local/go/src/net/http/server.go:1830

4175 @ 0x7fe8bc202980 0x7fe8bc202a40 0x7fe8bc2135d8 0x7fe8bc2132ef 0x7fe8bc238c1a 0x7fe8bcc2eccf 0x7fe8bd597af4 0x7fe8bcea2456 0x7fe8bcea956b 0x7fe8bcb73dff 0x7fe8bcb726ca 0x7fe8bcb72b01 0x7fe8bc71c26b 0x7fe8bcb85f4a 0x7fe8bc4b9896 0x7fe8bc72a438 0x7fe8bcb849e2 0x7fe8bc4bc67e 0x7fe8bc4b88a3 0x7fe8bc230711
# 0x7fe8bc2132ee sync.runtime_SemacquireMutex+0x3e /usr/local/go/src/runtime/sema.go:71
# 0x7fe8bc238c19 sync.(*Mutex).Lock+0x109 /usr/local/go/src/sync/mutex.go:134
# 0x7fe8bcc2ecce github.com/docker/docker/container.(*State).IsRunning+0x2e /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/container/state.go:240
# 0x7fe8bd597af3 github.com/docker/docker/daemon.(*Daemon).ContainerStats+0xb3 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/daemon/stats.go:30
# 0x7fe8bcea2455 github.com/docker/docker/api/server/router/container.(*containerRouter).getContainersStats+0x1e5 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/router/container/container_routes.go:115
# 0x7fe8bcea956a github.com/docker/docker/api/server/router/container.(*containerRouter).(github.com/docker/docker/api/server/router/container.getContainersStats)-fm+0x6a /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/router/container/container.go:42
# 0x7fe8bcb73dfe github.com/docker/docker/api/server/router.cancellableHandler.func1+0xce /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/router/local.go:92
# 0x7fe8bcb726c9 github.com/docker/docker/api/server/middleware.ExperimentalMiddleware.WrapHandler.func1+0xd9 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/middleware/experimental.go:26
# 0x7fe8bcb72b00 github.com/docker/docker/api/server/middleware.VersionMiddleware.WrapHandler.func1+0x400 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/middleware/version.go:62
# 0x7fe8bc71c26a github.com/docker/docker/pkg/authorization.(*Middleware).WrapHandler.func1+0x7aa /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/pkg/authorization/middleware.go:59
# 0x7fe8bcb85f49 github.com/docker/docker/api/server.(*Server).makeHTTPHandler.func1+0x199 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/server.go:141
# 0x7fe8bc4b9895 net/http.HandlerFunc.ServeHTTP+0x45 /usr/local/go/src/net/http/server.go:1947
# 0x7fe8bc72a437 github.com/docker/docker/vendor/github.com/gorilla/mux.(*Router).ServeHTTP+0x227 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/vendor/github.com/gorilla/mux/mux.go:103
# 0x7fe8bcb849e1 github.com/docker/docker/api/server.(*routerSwapper).ServeHTTP+0x71 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/router_swapper.go:29
# 0x7fe8bc4bc67d net/http.serverHandler.ServeHTTP+0xbd /usr/local/go/src/net/http/server.go:2694
# 0x7fe8bc4b88a2 net/http.(*conn).serve+0x652 /usr/local/go/src/net/http/server.go:1830

1 @ 0x7fe8bc202980 0x7fe8bc202a40 0x7fe8bc2135d8 0x7fe8bc2131fb 0x7fe8bc239a3b 0x7fe8bcbb679d 0x7fe8bcc26774 0x7fe8bd570b20 0x7fe8bd56f81c 0x7fe8bd56f6bd 0x7fe8bcea8719 0x7fe8bcea938b 0x7fe8bcb726ca 0x7fe8bcb72b01 0x7fe8bc71c26b 0x7fe8bcb85f4a 0x7fe8bc4b9896 0x7fe8bc72a438 0x7fe8bcb849e2 0x7fe8bc4bc67e 0x7fe8bc4b88a3 0x7fe8bc230711
# 0x7fe8bc2131fa sync.runtime_Semacquire+0x3a /usr/local/go/src/runtime/sema.go:56
# 0x7fe8bc239a3a sync.(*RWMutex).RLock+0x4a /usr/local/go/src/sync/rwmutex.go:50
# 0x7fe8bcbb679c github.com/docker/docker/daemon/exec.(*Store).List+0x4c /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/daemon/exec/exec.go:140
# 0x7fe8bcc26773 github.com/docker/docker/container.(*Container).GetExecIDs+0x33 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/container/container.go:423
# 0x7fe8bd570b1f github.com/docker/docker/daemon.(*Daemon).getInspectData+0x5cf /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/daemon/inspect.go:178
# 0x7fe8bd56f81b github.com/docker/docker/daemon.(*Daemon).ContainerInspectCurrent+0xab /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/daemon/inspect.go:42
# 0x7fe8bd56f6bc github.com/docker/docker/daemon.(*Daemon).ContainerInspect+0x11c /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/daemon/inspect.go:29
# 0x7fe8bcea8718 github.com/docker/docker/api/server/router/container.(*containerRouter).getContainersByName+0x118 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/router/container/inspect.go:15
# 0x7fe8bcea938a github.com/docker/docker/api/server/router/container.(*containerRouter).(github.com/docker/docker/api/server/router/container.getContainersByName)-fm+0x6a /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/router/container/container.go:39
# 0x7fe8bcb726c9 github.com/docker/docker/api/server/middleware.ExperimentalMiddleware.WrapHandler.func1+0xd9 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/middleware/experimental.go:26
# 0x7fe8bcb72b00 github.com/docker/docker/api/server/middleware.VersionMiddleware.WrapHandler.func1+0x400 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/middleware/version.go:62
# 0x7fe8bc71c26a github.com/docker/docker/pkg/authorization.(*Middleware).WrapHandler.func1+0x7aa /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/pkg/authorization/middleware.go:59
# 0x7fe8bcb85f49 github.com/docker/docker/api/server.(*Server).makeHTTPHandler.func1+0x199 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/server.go:141
# 0x7fe8bc4b9895 net/http.HandlerFunc.ServeHTTP+0x45 /usr/local/go/src/net/http/server.go:1947
# 0x7fe8bc72a437 github.com/docker/docker/vendor/github.com/gorilla/mux.(*Router).ServeHTTP+0x227 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/vendor/github.com/gorilla/mux/mux.go:103
# 0x7fe8bcb849e1 github.com/docker/docker/api/server.(*routerSwapper).ServeHTTP+0x71 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/router_swapper.go:29
# 0x7fe8bc4bc67d net/http.serverHandler.ServeHTTP+0xbd /usr/local/go/src/net/http/server.go:2694
# 0x7fe8bc4b88a2 net/http.(*conn).serve+0x652 /usr/local/go/src/net/http/server.go:1830

1 @ 0x7fe8bc202980 0x7fe8bc212946 0x7fe8bc8b6881 0x7fe8bc8b699d 0x7fe8bc8e259b 0x7fe8bc8e1695 0x7fe8bc8c47d5 0x7fe8bd2e0c06 0x7fe8bd2eda96 0x7fe8bc8c42fb 0x7fe8bc8c4613 0x7fe8bd2a6474 0x7fe8bd2e6976 0x7fe8bd3661c5 0x7fe8bd56842f 0x7fe8bcea7bdb 0x7fe8bcea9f6b 0x7fe8bcb726ca 0x7fe8bcb72b01 0x7fe8bc71c26b 0x7fe8bcb85f4a 0x7fe8bc4b9896 0x7fe8bc72a438 0x7fe8bcb849e2 0x7fe8bc4bc67e 0x7fe8bc4b88a3 0x7fe8bc230711
# 0x7fe8bc8b6880 github.com/docker/docker/vendor/google.golang.org/grpc/transport.(*Stream).waitOnHeader+0x100 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/vendor/google.golang.org/grpc/transport/transport.go:222
# 0x7fe8bc8b699c github.com/docker/docker/vendor/google.golang.org/grpc/transport.(*Stream).RecvCompress+0x2c /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/vendor/google.golang.org/grpc/transport/transport.go:233
# 0x7fe8bc8e259a github.com/docker/docker/vendor/google.golang.org/grpc.(*csAttempt).recvMsg+0x63a /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/vendor/google.golang.org/grpc/stream.go:515
# 0x7fe8bc8e1694 github.com/docker/docker/vendor/google.golang.org/grpc.(*clientStream).RecvMsg+0x44 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/vendor/google.golang.org/grpc/stream.go:395
# 0x7fe8bc8c47d4 github.com/docker/docker/vendor/google.golang.org/grpc.invoke+0x184 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/vendor/google.golang.org/grpc/call.go:83
# 0x7fe8bd2e0c05 github.com/docker/docker/vendor/github.com/containerd/containerd.namespaceInterceptor.unary+0xf5 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/vendor/github.com/containerd/containerd/grpc.go:35
# 0x7fe8bd2eda95 github.com/docker/docker/vendor/github.com/containerd/containerd.(namespaceInterceptor).(github.com/docker/docker/vendor/github.com/containerd/containerd.unary)-fm+0xf5 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/vendor/github.com/containerd/containerd/grpc.go:51
# 0x7fe8bc8c42fa github.com/docker/docker/vendor/google.golang.org/grpc.(*ClientConn).Invoke+0x10a /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/vendor/google.golang.org/grpc/call.go:35
# 0x7fe8bc8c4612 github.com/docker/docker/vendor/google.golang.org/grpc.Invoke+0xc2 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/vendor/google.golang.org/grpc/call.go:60
# 0x7fe8bd2a6473 github.com/docker/docker/vendor/github.com/containerd/containerd/api/services/tasks/v1.(*tasksClient).Start+0xd3 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/vendor/github.com/containerd/containerd/api/services/tasks/v1/tasks.pb.go:421
# 0x7fe8bd2e6975 github.com/docker/docker/vendor/github.com/containerd/containerd.(*process).Start+0xf5 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/vendor/github.com/containerd/containerd/process.go:109
# 0x7fe8bd3661c4 github.com/docker/docker/libcontainerd.(*client).Exec+0x4b4 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/libcontainerd/client_daemon.go:381
# 0x7fe8bd56842e github.com/docker/docker/daemon.(*Daemon).ContainerExecStart+0xb4e /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/daemon/exec.go:251
# 0x7fe8bcea7bda github.com/docker/docker/api/server/router/container.(*containerRouter).postContainerExecStart+0x34a /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/router/container/exec.go:125
# 0x7fe8bcea9f6a github.com/docker/docker/api/server/router/container.(*containerRouter).(github.com/docker/docker/api/server/router/container.postContainerExecStart)-fm+0x6a /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/router/container/container.go:59
# 0x7fe8bcb726c9 github.com/docker/docker/api/server/middleware.ExperimentalMiddleware.WrapHandler.func1+0xd9 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/middleware/experimental.go:26
# 0x7fe8bcb72b00 github.com/docker/docker/api/server/middleware.VersionMiddleware.WrapHandler.func1+0x400 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/middleware/version.go:62
# 0x7fe8bc71c26a github.com/docker/docker/pkg/authorization.(*Middleware).WrapHandler.func1+0x7aa /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/pkg/authorization/middleware.go:59
# 0x7fe8bcb85f49 github.com/docker/docker/api/server.(*Server).makeHTTPHandler.func1+0x199 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/server.go:141
# 0x7fe8bc4b9895 net/http.HandlerFunc.ServeHTTP+0x45 /usr/local/go/src/net/http/server.go:1947
# 0x7fe8bc72a437 github.com/docker/docker/vendor/github.com/gorilla/mux.(*Router).ServeHTTP+0x227 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/vendor/github.com/gorilla/mux/mux.go:103
# 0x7fe8bcb849e1 github.com/docker/docker/api/server.(*routerSwapper).ServeHTTP+0x71 /root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/router_swapper.go:29
# 0x7fe8bc4bc67d net/http.serverHandler.ServeHTTP+0xbd /usr/local/go/src/net/http/server.go:2694
# 0x7fe8bc4b88a2 net/http.(*conn).serve+0x652 /usr/local/go/src/net/http/server.go:1830

Note that this is a trimmed-down goroutine dump for docker. From this blueprint we can conclude the following:

  1. 717594 goroutines are blocked in docker inspect
  2. 4175 goroutines are blocked in docker stats
  3. 1 goroutine is blocked retrieving the exec IDs of a docker exec
  4. 1 goroutine is blocked inside a docker exec

These numbers basically explain why the bad container is hung: a docker exec on it never returned (4), which in turn blocked the goroutine fetching the docker exec IDs (3); since (3) had already acquired the container lock, docker inspect (1) and docker stats (2) got stuck behind it. So the real culprit is not docker inspect but docker exec.
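
The cascade can be modelled in a few lines of Go. This is a schematic sketch, not Docker source; in particular it simply assumes that the stuck exec path is what keeps the exec store locked, which is how the dump reads:

package main

import (
    "fmt"
    "sync"
    "time"
)

type container struct {
    mu        sync.Mutex   // per-container lock taken by inspect/stats
    execStore sync.RWMutex // exec store lock taken by GetExecIDs
}

func main() {
    c := &container{}
    containerdNeverReplies := make(chan struct{})

    // (4) exec start: holds the exec store and waits on containerd forever.
    go func() {
        c.execStore.Lock()
        defer c.execStore.Unlock()
        <-containerdNeverReplies
    }()
    time.Sleep(100 * time.Millisecond)

    // (3) one inspect: takes the container lock, then blocks on the exec store.
    go func() {
        c.mu.Lock()
        defer c.mu.Unlock()
        c.execStore.RLock() // never acquired
        c.execStore.RUnlock()
    }()
    time.Sleep(100 * time.Millisecond)

    // (1)(2) every later inspect/stats call piles up on the container lock.
    for i := 0; i < 3; i++ {
        go func(i int) {
            c.mu.Lock()
            defer c.mu.Unlock()
            fmt.Println("inspect", i, "returned") // never printed
        }(i)
    }

    time.Sleep(time.Second)
    fmt.Println("all inspect/stats goroutines are still blocked")
}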

Before digging further, some background is needed. The full call path for kubelet to start a container or execute a command inside one is:

+--------------------------------------------------------------+
| |
| +------------+ |
| | | |
| | kubelet | |
| | | |
| +------|-----+ |
| | |
| +------v-----+ +---------------+ |
| | | | | |
| | dockerd ------->| containerd | |
| | | | | |
| +------------+ +-------|-------+ |
| | |
| +-------v-------+ +-----------+ |
| | | | | |
| |containerd-shim----->| runc | |
| | | | | |
| +---------------+ +-----------+ |
| |
+--------------------------------------------------------------+

dockerd and containerd can be thought of as two layers of nginx-style proxies, containerd-shim is the container's guardian, and runc is the tool that actually starts containers and executes commands. What runc does is quite simple: create namespaces according to the user-supplied config, or enter existing namespaces, and then execute the user's command. In other words, creating a container means creating new namespaces and running the user-specified command inside them.

With this background we moved one layer down to containerd. Fortunately, pprof let us draw containerd's blueprint as well:

goroutine profile: total 430
1 @ 0x7f6e55f82740 0x7f6e55f92616 0x7f6e56a8412c 0x7f6e56a83d6d 0x7f6e56a911bf 0x7f6e56ac6e3b 0x7f6e565093de 0x7f6e5650dd3b 0x7f6e5650392b 0x7f6e56b51216 0x7f6e564e5909 0x7f6e563ec76a 0x7f6e563f000a 0x7f6e563f6791 0x7f6e55fb0151
# 0x7f6e56a8412b github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc.(*Client).dispatch+0x24b /go/src/github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc/client.go:102
# 0x7f6e56a83d6c github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc.(*Client).Call+0x15c /go/src/github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc/client.go:73
# 0x7f6e56a911be github.com/containerd/containerd/linux/shim/v1.(*shimClient).Start+0xbe /go/src/github.com/containerd/containerd/linux/shim/v1/shim.pb.go:1745
# 0x7f6e56ac6e3a github.com/containerd/containerd/linux.(*Process).Start+0x8a /go/src/github.com/containerd/containerd/linux/process.go:125
# 0x7f6e565093dd github.com/containerd/containerd/services/tasks.(*local).Start+0x14d /go/src/github.com/containerd/containerd/services/tasks/local.go:187
# 0x7f6e5650dd3a github.com/containerd/containerd/services/tasks.(*service).Start+0x6a /go/src/github.com/containerd/containerd/services/tasks/service.go:72
# 0x7f6e5650392a github.com/containerd/containerd/api/services/tasks/v1._Tasks_Start_Handler.func1+0x8a /go/src/github.com/containerd/containerd/api/services/tasks/v1/tasks.pb.go:624
# 0x7f6e56b51215 github.com/containerd/containerd/vendor/github.com/grpc-ecosystem/go-grpc-prometheus.UnaryServerInterceptor+0xa5 /go/src/github.com/containerd/containerd/vendor/github.com/grpc-ecosystem/go-grpc-prometheus/server.go:29
# 0x7f6e564e5908 github.com/containerd/containerd/api/services/tasks/v1._Tasks_Start_Handler+0x168 /go/src/github.com/containerd/containerd/api/services/tasks/v1/tasks.pb.go:626
# 0x7f6e563ec769 github.com/containerd/containerd/vendor/google.golang.org/grpc.(*Server).processUnaryRPC+0x849 /go/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/server.go:920
# 0x7f6e563f0009 github.com/containerd/containerd/vendor/google.golang.org/grpc.(*Server).handleStream+0x1319 /go/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/server.go:1142
# 0x7f6e563f6790 github.com/containerd/containerd/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1+0xa0 /go/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/server.go:637

Again only the key goroutine is kept. The stack shows containerd blocked waiting for the exec result to come back; the relevant code confirms it:

func (c *Client) dispatch(ctx context.Context, req *Request, resp *Response) error {
    errs := make(chan error, 1)
    call := &callRequest{
        req:  req,
        resp: resp,
        errs: errs,
    }

    select {
    case c.calls <- call:
    case <-c.done:
        return c.err
    }

    select { // this select is .../ttrpc/client.go:102 in the goroutine stack above
    case err := <-errs:
        return filterCloseErr(err)
    case <-c.done:
        return c.err
    }
}

After forwarding the request to containerd-shim, containerd just sits there waiting for the shim's reply.

Normally, walking the goroutine dumps of every component along the call chain would quickly pinpoint the problem. Unfortunately, docker was not running in debug mode in production, so we could not collect pprof data from containerd-shim, and runc had no pprof enabled either. Pure goroutine-stack analysis was therefore a dead end.

Still, we had collected several key facts and narrowed the suspect area down to somewhere between containerd-shim and runc. Time to try a different angle.

3.2 Inspecting the processes

When we could no longer pull runtime state from the components themselves, we switched perspective and looked at the container's own state, i.e. the state of the processes belonging to the bad container.

Since docker ps still worked while docker inspect hung, the first step was to locate the bad container:

docker ps | grep -v NAME | awk '{print $1}' | while read cid; do echo $cid; docker inspect -f {{.State.Pid}} $cid; done
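
The shell loop above identifies the culprit simply by stalling on it. Building on the same timeout idea as earlier, here is a small sketch using the Docker Go SDK (type names are from the SDK of that era and may be named slightly differently in newer releases) that reports the suspect instead of hanging:

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/docker/docker/api/types"
    "github.com/docker/docker/client"
)

func main() {
    cli, err := client.NewClientWithOpts(client.FromEnv)
    if err != nil {
        panic(err)
    }
    // docker ps still answers, so listing is safe.
    containers, err := cli.ContainerList(context.Background(), types.ContainerListOptions{})
    if err != nil {
        panic(err)
    }
    for _, c := range containers {
        ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
        _, err := cli.ContainerInspect(ctx, c.ID)
        cancel()
        if err != nil {
            // The container whose inspect times out is the one holding the stuck lock.
            fmt.Printf("suspect container: %s (%v)\n", c.ID[:12], err)
        }
    }
}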

Once we had the ID of the bad container, we could list every process related to it:

root     11646  6655  0 Jun17 ?        00:01:04 docker-containerd-shim -namespace moby -workdir /home/docker_rt/containerd/daemon/io.containerd.runtime.v1.linux/moby/bbd5e4b5f9c13666dd0ec7ff7afb2c4c2b0ede40a4adf1de43cc31c606f283f5 -address /var/run/docker/containerd/docker-containerd.sock -containerd-binary /usr/bin/docker-containerd -runtime-root /var/run/docker/runtime-runc
root 11680 11646 0 Jun17 ? 00:00:00 /dockerinit
root 15581 11646 0 Jun17 ? 00:00:00 docker-runc --root /var/run/docker/runtime-runc/moby --log /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/bbd5e4b5f9c13666dd0ec7ff7afb2c4c2b0ede40a4adf1de43cc31c606f283f5/log.json --log-format json exec --process /tmp/runc-process616674997 --detach --pid-file /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/bbd5e4b5f9c13666dd0ec7ff7afb2c4c2b0ede40a4adf1de43cc31c606f283f5/0594c5897a41d401e4d1d7ddd44dacdd316c7e7d53bfdae7f16b0f6b26fcbcda.pid bbd5e4b5f9c13666dd0ec7ff7afb2c4c2b0ede40a4adf1de43cc31c606f283f5
root 15638 15581 0 Jun17 ? 00:00:00 docker-runc init

The core processes are listed above; a few notes:

  • 6655: the containerd process
  • 11646: the containerd-shim of the bad container
  • 11680: the container's startup process (PID 1 inside the container, thanks to PID namespace isolation)
  • 15581: the process executing a user command inside the bad container (runc exec)
  • 15638: the process entering the container's namespaces for that command (runc init)

One more piece of background: when starting a container, a runc init process is created first, which creates and enters the new namespaces; likewise, when executing a command inside a running container, a runc init process is created first to enter the container's existing namespaces. Only once inside the container's isolated namespaces is the user-specified command executed.
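
As a toy illustration of "enter a container's namespaces, then execute a command" (this is emphatically not how runc implements it: real runc joins all the namespaces from C code, nsexec, before the Go runtime starts, because joining mount or user namespaces from a multithreaded Go process fails), here is a minimal program that joins only the network namespace of a target PID and then execs a command there:

package main

import (
    "fmt"
    "os"
    "os/exec"
    "runtime"

    "golang.org/x/sys/unix"
)

func main() {
    if len(os.Args) < 3 {
        fmt.Fprintln(os.Stderr, "usage: nsrun <pid> <cmd> [args...]")
        os.Exit(1)
    }
    // setns affects only the calling thread, so pin this goroutine to it.
    runtime.LockOSThread()
    defer runtime.UnlockOSThread()

    nsFile, err := os.Open(fmt.Sprintf("/proc/%s/ns/net", os.Args[1]))
    if err != nil {
        panic(err)
    }
    defer nsFile.Close()
    if err := unix.Setns(int(nsFile.Fd()), unix.CLONE_NEWNET); err != nil {
        panic(err)
    }

    path, err := exec.LookPath(os.Args[2])
    if err != nil {
        panic(err)
    }
    // execve replaces this process while keeping the namespace we just joined,
    // roughly mirroring how runc init finally runs the user's command.
    panic(unix.Exec(path, os.Args[2:], os.Environ()))
}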

The process list by itself does not make it obvious which process is causing the trouble, so we also needed to know what each process was currently doing. strace shows their activity one by one:

// 11646 (containerd-shim)
Process 11646 attached with 10 threads
[pid 37342] epoll_pwait(5, <unfinished ...>
[pid 11656] futex(0x818cc0, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 11655] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 11654] futex(0x818bd8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 11653] futex(0x7fc730, FUTEX_WAKE, 1 <unfinished ...>
[pid 11651] futex(0xc4200b4148, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 11650] futex(0xc420082948, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 11649] futex(0xc420082548, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 11647] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 11646] futex(0x7fd008, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 11653] <... futex resumed> ) = 0
[pid 11647] <... restart_syscall resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 11653] epoll_wait(4, <unfinished ...>
[pid 11647] pselect6(0, NULL, NULL, NULL, {0, 20000}, 0) = 0 (Timeout)
[pid 11647] futex(0x7fc730, FUTEX_WAIT, 0, {60, 0}

// 15581 (runc exec)
Process 15581 attached with 7 threads
[pid 15619] read(6, <unfinished ...>
[pid 15592] futex(0xc4200be148, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 15591] futex(0x7fd6d25f6238, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 15590] futex(0xc420084d48, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 15586] futex(0x7fd6d25f6320, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 15584] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 15581] futex(0x7fd6d25d9b28, FUTEX_WAIT, 0, NULL

// 15638 (runc init)
Process 15638 attached with 7 threads
[pid 15648] futex(0x7f512cea5320, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 15647] futex(0x7f512cea5238, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 15645] futex(0xc4200bc148, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 15643] futex(0xc420082d48, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 15642] futex(0xc420082948, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 15639] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 15638] write(2, "/usr/local/go/src/runtime/proc.g"..., 33

From the activity of these related processes we can conclude:

  • runc exec is waiting to read from FD 6
  • runc init is blocked writing to FD 2

Which files do these FDs actually refer to? lsof tells us:

// 15638 (runc init)
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
runc:[2:I 15638 root cwd DIR 0,41 192 1066743071 /
runc:[2:I 15638 root rtd DIR 0,41 192 1066743071 /
runc:[2:I 15638 root txt REG 0,4 7644224 1070360467 /memfd:runc_cloned:/proc/self/exe (deleted)
runc:[2:I 15638 root mem REG 8,3 2107816 1053962 /usr/lib64/libc-2.17.so
runc:[2:I 15638 root mem REG 8,3 19512 1054285 /usr/lib64/libdl-2.17.so
runc:[2:I 15638 root mem REG 8,3 266688 1050626 /usr/lib64/libseccomp.so.2.3.1
runc:[2:I 15638 root mem REG 8,3 142296 1055698 /usr/lib64/libpthread-2.17.so
runc:[2:I 15638 root mem REG 8,3 27168 3024893 /usr/local/gundam/gundam_client/preload/lib64/gundam_preload.so
runc:[2:I 15638 root mem REG 8,3 164432 1054515 /usr/lib64/ld-2.17.so
runc:[2:I 15638 root 0r FIFO 0,8 0t0 1070361745 pipe
runc:[2:I 15638 root 1w FIFO 0,8 0t0 1070361746 pipe
runc:[2:I 15638 root 2w FIFO 0,8 0t0 1070361747 pipe
runc:[2:I 15638 root 3u unix 0xffff881ff8273000 0t0 1070361341 socket
runc:[2:I 15638 root 5u a_inode 0,9 0 7180 [eventpoll]

// 15581 (runc exec)
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
docker-ru 15581 root cwd DIR 0,18 120 1066743076 /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/bbd5e4b5f9c13666dd0ec7ff7afb2c4c2b0ede40a4adf1de43cc31c606f283f5
docker-ru 15581 root rtd DIR 8,3 4096 2 /
docker-ru 15581 root txt REG 8,3 7644224 919775 /usr/bin/docker-runc
docker-ru 15581 root mem REG 8,3 2107816 1053962 /usr/lib64/libc-2.17.so
docker-ru 15581 root mem REG 8,3 19512 1054285 /usr/lib64/libdl-2.17.so
docker-ru 15581 root mem REG 8,3 266688 1050626 /usr/lib64/libseccomp.so.2.3.1
docker-ru 15581 root mem REG 8,3 142296 1055698 /usr/lib64/libpthread-2.17.so
docker-ru 15581 root mem REG 8,3 27168 3024893 /usr/local/gundam/gundam_client/preload/lib64/gundam_preload.so
docker-ru 15581 root mem REG 8,3 164432 1054515 /usr/lib64/ld-2.17.so
docker-ru 15581 root 0r FIFO 0,8 0t0 1070361745 pipe
docker-ru 15581 root 1w FIFO 0,8 0t0 1070361746 pipe
docker-ru 15581 root 2w FIFO 0,8 0t0 1070361747 pipe
docker-ru 15581 root 3w REG 0,18 5456 1066709902 /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/bbd5e4b5f9c13666dd0ec7ff7afb2c4c2b0ede40a4adf1de43cc31c606f283f5/log.json
docker-ru 15581 root 4u a_inode 0,9 0 7180 [eventpoll]
docker-ru 15581 root 6u unix 0xffff881ff8275400 0t0 1070361342 socket

// 11646 (containerd-shim)
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
docker-co 11646 root cwd DIR 0,18 120 1066743076 /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/bbd5e4b5f9c13666dd0ec7ff7afb2c4c2b0ede40a4adf1de43cc31c606f283f5
docker-co 11646 root rtd DIR 8,3 4096 2 /
docker-co 11646 root txt REG 8,3 4173632 919772 /usr/bin/docker-containerd-shim
docker-co 11646 root 0r CHR 1,3 0t0 2052 /dev/null
docker-co 11646 root 1w CHR 1,3 0t0 2052 /dev/null
docker-co 11646 root 2w CHR 1,3 0t0 2052 /dev/null
docker-co 11646 root 3r FIFO 0,8 0t0 1070361745 pipe
docker-co 11646 root 4u a_inode 0,9 0 7180 [eventpoll]
docker-co 11646 root 5u a_inode 0,9 0 7180 [eventpoll]
docker-co 11646 root 6u unix 0xffff881e8cac2800 0t0 1066743079 @/containerd-shim/moby/bbd5e4b5f9c13666dd0ec7ff7afb2c4c2b0ede40a4adf1de43cc31c606f283f5/shim.sock
docker-co 11646 root 7u unix 0xffff881e8cac3400 0t0 1066743968 @/containerd-shim/moby/bbd5e4b5f9c13666dd0ec7ff7afb2c4c2b0ede40a4adf1de43cc31c606f283f5/shim.sock
docker-co 11646 root 8r FIFO 0,8 0t0 1066743970 pipe
docker-co 11646 root 9w FIFO 0,8 0t0 1070361745 pipe
docker-co 11646 root 10r FIFO 0,8 0t0 1066743971 pipe
docker-co 11646 root 11u FIFO 0,18 0t0 1066700778 /run/docker/containerd/bbd5e4b5f9c13666dd0ec7ff7afb2c4c2b0ede40a4adf1de43cc31c606f283f5/init-stdout
docker-co 11646 root 12r FIFO 0,8 0t0 1066743972 pipe
docker-co 11646 root 13w FIFO 0,18 0t0 1066700778 /run/docker/containerd/bbd5e4b5f9c13666dd0ec7ff7afb2c4c2b0ede40a4adf1de43cc31c606f283f5/init-stdout
docker-co 11646 root 14u FIFO 0,18 0t0 1066700778 /run/docker/containerd/bbd5e4b5f9c13666dd0ec7ff7afb2c4c2b0ede40a4adf1de43cc31c606f283f5/init-stdout
docker-co 11646 root 15r FIFO 0,18 0t0 1066700778 /run/docker/containerd/bbd5e4b5f9c13666dd0ec7ff7afb2c4c2b0ede40a4adf1de43cc31c606f283f5/init-stdout
docker-co 11646 root 16u FIFO 0,18 0t0 1066700779 /run/docker/containerd/bbd5e4b5f9c13666dd0ec7ff7afb2c4c2b0ede40a4adf1de43cc31c606f283f5/init-stderr
docker-co 11646 root 17w FIFO 0,18 0t0 1066700779 /run/docker/containerd/bbd5e4b5f9c13666dd0ec7ff7afb2c4c2b0ede40a4adf1de43cc31c606f283f5/init-stderr
docker-co 11646 root 18u FIFO 0,18 0t0 1066700779 /run/docker/containerd/bbd5e4b5f9c13666dd0ec7ff7afb2c4c2b0ede40a4adf1de43cc31c606f283f5/init-stderr
docker-co 11646 root 19r FIFO 0,18 0t0 1066700779 /run/docker/containerd/bbd5e4b5f9c13666dd0ec7ff7afb2c4c2b0ede40a4adf1de43cc31c606f283f5/init-stderr
docker-co 11646 root 20r FIFO 0,8 0t0 1070361746 pipe
docker-co 11646 root 26r FIFO 0,8 0t0 1070361747 pipe

Combining the strace and lsof output, an attentive reader can already draw the conclusion:

runc init is blocked writing to FD 2, and FD 2 is a pipe. A Linux pipe has a fixed buffer capacity (64 KB by default on modern kernels; it can be queried with fcntl F_GETPIPE_SZ); once a writer has filled the buffer and no reader drains the other end, further writes block. To summarise: containerd-shim launches runc exec to run a user command inside the container, runc exec launches runc init to enter the container's namespaces, and runc init blocks because it writes more data to FD 2 than the pipe can hold. With the bottom-most runc init blocked, every process along the call chain is blocked with it:

runc init → runc exec → containerd-shim exec → containerd exec → dockerd exec
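
A minimal demonstration of the blocking behaviour described above: once the pipe buffer is full and nothing reads the other end, the next write blocks for good:

package main

import (
    "fmt"
    "os"
    "time"
)

func main() {
    r, w, err := os.Pipe()
    if err != nil {
        panic(err)
    }
    defer r.Close() // the read end exists but is never read, like the shim in this story

    go func() {
        buf := make([]byte, 4096)
        total := 0
        for {
            n, err := w.Write(buf) // blocks for good once ~64 KB are buffered
            if err != nil {
                panic(err)
            }
            total += n
            fmt.Println("written so far:", total)
        }
    }()

    time.Sleep(2 * time.Second)
    fmt.Println("the writer goroutine is now stuck in Write, just like runc init")
}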

With this, we understood why docker hung. Two questions remained, however:

  • Why would runc init write more data to FD 2 (os.Stderr in Go terms) than a Linux pipe can buffer?
  • Why does this runc init problem only happen to one particular container?

If runc init routinely needed to write that much to os.Stdout or os.Stderr, every container creation would break. So something unexpected must have happened to this particular container that made runc init write an unusually large amount of data to os.Stderr, and that data very likely reveals what the anomaly is.

So we needed the data runc init was trying to write. FD 2 of runc init is an anonymous pipe, so it cannot be read the way a regular file is. Thanks to our colleague 鹤哥 for blazing the trail and finding a way to read the contents of an anonymous pipe:

# cat /proc/15638/fd/2
runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x7f512b7365f7 m=0 sigcode=18446744073709551610

goroutine 0 [idle]:
runtime: unknown pc 0x7f512b7365f7
stack: frame={sp:0x7ffe1121a658, fp:0x0} stack=[0x7ffe0ae1bb28,0x7ffe1121ab50)
00007ffe1121a558: 00007ffe1121a6d8 00007ffe1121a6b0
00007ffe1121a568: 0000000000000001 00007f512c527660
00007ffe1121a578: 00007f512c54d560 00007f512c54d208
00007ffe1121a588: 00007f512c333e6f 0000000000000000
00007ffe1121a598: 00007f512c527660 0000000000000005
00007ffe1121a5a8: 0000000000000000 0000000000000001
00007ffe1121a5b8: 00007f512c54d208 00007f512c528000
00007ffe1121a5c8: 00007ffe1121a600 00007f512b704b0c
00007ffe1121a5d8: 00007f512b7110c0 0000000000000000
00007ffe1121a5e8: 00007f512c54d560 00007ffe1121a620
00007ffe1121a5f8: 00007ffe1121a610 000000000f11ed7d
00007ffe1121a608: 00007f512c550153 00000000ffffffff
00007ffe1121a618: 00007f512c550a9b 00007f512b707d00
00007ffe1121a628: 00007f512babc868 00007f512c9e9e5e
00007ffe1121a638: 00007f512d3bb080 00000000000000f1
00007ffe1121a648: 0000000000000011 0000000000000000
00007ffe1121a658: <00007f512b737ce8 0000000000000020
00007ffe1121a668: 0000000000000000 0000000000000000
00007ffe1121a678: 0000000000000000 0000000000000000
00007ffe1121a688: 0000000000000000 0000000000000000
00007ffe1121a698: 0000000000000000 0000000000000000
00007ffe1121a6a8: 0000000000000000 0000000000000000
00007ffe1121a6b8: 0000000000000000 0000000000000000
00007ffe1121a6c8: 0000000000000000 0000000000000000
00007ffe1121a6d8: 0000000000000000 00007f512babc868
00007ffe1121a6e8: 00007f512c9e9e5e 00007f512d3bb080
00007ffe1121a6f8: 00007f512c33f260 00007f512babc1c0
00007ffe1121a708: 00007f512babc1c0 0000000000000001
00007ffe1121a718: 00007f512babc243 00000000000000f1
00007ffe1121a728: 00007f512b7787ec 0000000000000001
00007ffe1121a738: 00007f512babc1c0 000000000000000a
00007ffe1121a748: 00007f512b7e8a4d 000000000000000a
runtime: unknown pc 0x7f512b7365f7
stack: frame={sp:0x7ffe1121a658, fp:0x0} stack=[0x7ffe0ae1bb28,0x7ffe1121ab50)
00007ffe1121a558: 00007ffe1121a6d8 00007ffe1121a6b0
00007ffe1121a568: 0000000000000001 00007f512c527660
00007ffe1121a578: 00007f512c54d560 00007f512c54d208
00007ffe1121a588: 00007f512c333e6f 0000000000000000
00007ffe1121a598: 00007f512c527660 0000000000000005
00007ffe1121a5a8: 0000000000000000 0000000000000001
00007ffe1121a5b8: 00007f512c54d208 00007f512c528000
00007ffe1121a5c8: 00007ffe1121a600 00007f512b704b0c
00007ffe1121a5d8: 00007f512b7110c0 0000000000000000
00007ffe1121a5e8: 00007f512c54d560 00007ffe1121a620
00007ffe1121a5f8: 00007ffe1121a610 000000000f11ed7d
00007ffe1121a608: 00007f512c550153 00000000ffffffff
00007ffe1121a618: 00007f512c550a9b 00007f512b707d00
00007ffe1121a628: 00007f512babc868 00007f512c9e9e5e
00007ffe1121a638: 00007f512d3bb080 00000000000000f1
00007ffe1121a648: 0000000000000011 0000000000000000
00007ffe1121a658: <00007f512b737ce8 0000000000000020
00007ffe1121a668: 0000000000000000 0000000000000000
00007ffe1121a678: 0000000000000000 0000000000000000
00007ffe1121a688: 0000000000000000 0000000000000000
00007ffe1121a698: 0000000000000000 0000000000000000
00007ffe1121a6a8: 0000000000000000 0000000000000000
00007ffe1121a6b8: 0000000000000000 0000000000000000
00007ffe1121a6c8: 0000000000000000 0000000000000000
00007ffe1121a6d8: 0000000000000000 00007f512babc868
00007ffe1121a6e8: 00007f512c9e9e5e 00007f512d3bb080
00007ffe1121a6f8: 00007f512c33f260 00007f512babc1c0
00007ffe1121a708: 00007f512babc1c0 0000000000000001
00007ffe1121a718: 00007f512babc243 00000000000000f1
00007ffe1121a728: 00007f512b7787ec 0000000000000001
00007ffe1121a738: 00007f512babc1c0 000000000000000a
00007ffe1121a748: 00007f512b7e8a4d 000000000000000a

goroutine 1 [running, locked to thread]:
runtime.systemstack_switch()
/usr/local/go/src/runtime/asm_amd64.s:363 fp=0xc4200a3ed0 sp=0xc4200a3ec8 pc=0x7f512c7281d0
runtime.startTheWorld()
/usr/local/go/src/runtime/proc.go:978 +0x2f fp=0xc4200a3ee8 sp=0xc4200a3ed0 pc=0x7f512c70221f
runtime.GOMAXPROCS(0x1, 0xc42013d9a0)
/usr/local/go/src/runtime/debug.go:30 +0xa0 fp=0xc4200a3f10 sp=0xc4200a3ee8 pc=0x7f512c6d9810
main.init.0()
/go/src/github.com/opencontainers/runc/init.go:14 +0x61 fp=0xc4200a3f30 sp=0xc4200a3f10 pc=0x7f512c992801
main.init()
<autogenerated>:1 +0x624 fp=0xc4200a3f88 sp=0xc4200a3f30 pc=0x7f512c9a1014
runtime.main()
/usr/local/go/src/runtime/proc.go:186 +0x1d2 fp=0xc4200a3fe0 sp=0xc4200a3f88 pc=0x7f512c6ff962
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc4200a3fe8 sp=0xc4200a3fe0 pc=0x7f512c72ad71

goroutine 6 [syscall]:
os/signal.signal_recv(0x0)
/usr/local/go/src/runtime/sigqueue.go:139 +0xa8
os/signal.loop()
/usr/local/go/src/os/signal/signal_unix.go:22 +0x24
created by os/signal.init.0
/usr/local/go/src/os/signal/signal_unix.go:28 +0x43

rax 0x0
rbx 0x7f512babc868
rcx 0xffffffffffffffff
rdx 0x6
rdi 0x271
rsi 0x271
rbp 0x7f512c9e9e5e
rsp 0x7ffe1121a658
r8 0xa
r9 0x7f512c524740
r10 0x8
r11 0x206
r12 0x7f512d3bb080
r13 0xf1
r14 0x11
r15 0x0
rip 0x7f512b7365f7
rflags 0x206
cs 0x33
fs 0x0
gs 0x0
exec failed: container_linux.go:348: starting container process caused "read init-p: connection reset by peer"

Huh, runc init failed to create a thread for lack of resources?! This is clearly not runc's own output but an unexpected dump from the Go runtime. So which resource exactly was exhausted? Cross-checking with /var/log/messages:

Jun 17 03:18:17 host-xx kernel: runc:[2:INIT] invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=997
Jun 17 03:18:17 host-xx kernel: CPU: 14 PID: 12788 Comm: runc:[2:INIT] Tainted: G W OE ------------ T 3.10.0-514.16.1.el7.stable.v1.4.x86_64 #1
Jun 17 03:18:17 host-xx kernel: Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016
Jun 17 03:18:17 host-xx kernel: ffff88103841dee0 00000000c4394691 ffff880263e4bcb8 ffffffff8168863d
Jun 17 03:18:17 host-xx kernel: ffff880263e4bd50 ffffffff81683585 ffff88203cc5e300 ffff880ee02b2380
Jun 17 03:18:17 host-xx kernel: 0000000000000001 0000000000000000 0000000000000000 0000000000000046
Jun 17 03:18:17 host-xx kernel: Call Trace:
Jun 17 03:18:17 host-xx kernel: [<ffffffff8168863d>] dump_stack+0x19/0x1b
Jun 17 03:18:17 host-xx kernel: [<ffffffff81683585>] dump_header+0x85/0x27f
Jun 17 03:18:17 host-xx kernel: [<ffffffff81185b06>] ? find_lock_task_mm+0x56/0xc0
Jun 17 03:18:17 host-xx kernel: [<ffffffff81185fbe>] oom_kill_process+0x24e/0x3c0
Jun 17 03:18:17 host-xx kernel: [<ffffffff81093c2e>] ? has_capability_noaudit+0x1e/0x30
Jun 17 03:18:17 host-xx kernel: [<ffffffff811f4d91>] mem_cgroup_oom_synchronize+0x551/0x580
Jun 17 03:18:17 host-xx kernel: [<ffffffff811f41b0>] ? mem_cgroup_charge_common+0xc0/0xc0
Jun 17 03:18:17 host-xx kernel: [<ffffffff81186844>] pagefault_out_of_memory+0x14/0x90
Jun 17 03:18:17 host-xx kernel: [<ffffffff816813fa>] mm_fault_error+0x68/0x12b
Jun 17 03:18:17 host-xx kernel: [<ffffffff81694405>] __do_page_fault+0x395/0x450
Jun 17 03:18:17 host-xx kernel: [<ffffffff816944f5>] do_page_fault+0x35/0x90
Jun 17 03:18:17 host-xx kernel: [<ffffffff81690708>] page_fault+0x28/0x30
Jun 17 03:18:17 host-xx kernel: memory: usage 3145728kB, limit 3145728kB, failcnt 14406932
Jun 17 03:18:17 host-xx kernel: memory+swap: usage 3145728kB, limit 9007199254740988kB, failcnt 0
Jun 17 03:18:17 host-xx kernel: kmem: usage 3143468kB, limit 9007199254740988kB, failcnt 0
Jun 17 03:18:17 host-xx kernel: Memory cgroup stats for /kubepods/burstable/pod6c4333b3-a663-11ea-b39f-6c92bf85beda: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Jun 17 03:18:17 host-xx kernel: Memory cgroup stats for /kubepods/burstable/pod6c4333b3-a663-11ea-b39f-6c92bf85beda/b761e05249245695278b3f409d2d6e5c6a5bff6995ff0cf44d03af4aa9764a30: cache:0KB rss:40KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:40KB inactive_file:0KB active_file:0KB unevictable:0KB
Jun 17 03:18:17 host-xx kernel: Memory cgroup stats for /kubepods/burstable/pod6c4333b3-a663-11ea-b39f-6c92bf85beda/1d1750ecc627cc5d60d80c071b2eb4d515ee8880c5b5136883164f08319869b0: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Jun 17 03:18:17 host-xx kernel: Memory cgroup stats for /kubepods/burstable/pod6c4333b3-a663-11ea-b39f-6c92bf85beda/bbd5e4b5f9c13666dd0ec7ff7afb2c4c2b0ede40a4adf1de43cc31c606f283f5: cache:0KB rss:2220KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:2140KB inactive_file:0KB active_file:0KB unevictable:0KB
Jun 17 03:18:17 host-xx kernel: Memory cgroup stats for /kubepods/burstable/pod6c4333b3-a663-11ea-b39f-6c92bf85beda/bbd5e4b5f9c13666dd0ec7ff7afb2c4c2b0ede40a4adf1de43cc31c606f283f5/super-agent: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Jun 17 03:18:17 host-xx kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Jun 17 03:18:17 host-xx kernel: [30598] 0 30598 255 1 4 0 -998 pause
Jun 17 03:18:17 host-xx kernel: [11680] 0 11680 164833 1118 20 0 997 dockerinit
Jun 17 03:18:17 host-xx kernel: [12788] 0 12788 150184 1146 23 0 997 runc:[2:INIT]
Jun 17 03:18:17 host-xx kernel: oom-kill:,cpuset=bbd5e4b5f9c13666dd0ec7ff7afb2c4c2b0ede40a4adf1de43cc31c606f283f5,mems_allowed=0-1,oom_memcg=/kubepods/burstable/pod6c4333b3-a663-11ea-b39f-6c92bf85beda,task_memcg=/kubepods/burstable/pod6c4333b3-a663-11ea-b39f-6c92bf85beda/bbd5e4b5f9c13666dd0ec7ff7afb2c4c2b0ede40a4adf1de43cc31c606f283f5,task=runc:[2:INIT],pid=12800,uid=0
Jun 17 03:18:17 host-xx kernel: Memory cgroup out of memory: Kill process 12800 (runc:[2:INIT]) score 997 or sacrifice child
Jun 17 03:18:17 host-xx kernel: Killed process 12788 (runc:[2:INIT]) total-vm:600736kB, anon-rss:3296kB, file-rss:276kB, shmem-rss:1012kB

/var/log/messages contains a large number of OOM records for this container from roughly a month earlier, and the timestamps line up.

To sum up: at a rather unfortunate moment, runc init failed to create a thread because the container's memory cgroup was out of memory, which triggered the Go runtime's unexpected crash dump, which in turn left runc init blocked writing to the pipe.

The full picture is now basically clear. One question remains, though: if runc init is writing into a pipe, isn't somebody supposed to be reading it?

Remember the lsof output above? An attentive reader will have spotted who holds the read end of that pipe: containerd-shim, for the pipe with inode 1070361747. So why isn't containerd-shim reading it? The code explains:

func (e *execProcess) start(ctx context.Context) (err error) {
    ......
    if err := e.parent.runtime.Exec(ctx, e.parent.id, e.spec, opts); err != nil { // runs runc exec (and thus runc init)
        close(e.waitBlock)
        return e.parent.runtimeError(err, "OCI runtime exec failed")
    }
    ......
    else if !e.stdio.IsNull() {
        fifoCtx, cancel := context.WithTimeout(ctx, 15*time.Second)
        defer cancel()

        if err := copyPipes(fifoCtx, e.io, e.stdio.Stdin, e.stdio.Stdout, e.stdio.Stderr, &e.wg, &copyWaitGroup); err != nil { // reads the pipes
            return errors.Wrap(err, "failed to start io pipe copy")
        }
    }
    ......
}

func (r *Runc) Exec(context context.Context, id string, spec specs.Process, opts *ExecOpts) error {
    ......
    cmd := r.command(context, append(args, id)...)
    if opts != nil && opts.IO != nil {
        opts.Set(cmd)
    }
    ......
    ec, err := Monitor.Start(cmd)
    ......
    status, err := Monitor.Wait(cmd, ec)
    ......
}

So containerd-shim is designed to wait for runc exec (and hence runc init) to finish before it starts reading the pipes, while runc init, having unexpectedly produced a large amount of output, is blocked writing to those very pipes. A perfect deadlock.

With that, the core of this docker hang is fully understood. Now let's talk about how to fix it.

4 Solutions

Knowing how the hang comes about, we can propose several targeted fixes.

4.1 The most obvious fix

Since docker exec can trigger the hang, the most direct fix is to eliminate every docker exec on the system. The biggest source is kubelet probes: today we add a ReadinessProbe to every Pod by default, and it is an exec probe that runs a command inside the container. Switching kubelet's probes to tcp or http probes removes the exec, as sketched below.
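
For illustration only, this is roughly what the replacement probe object looks like when built with the Kubernetes Go types of that era (assuming k8s.io/api before v1.23, where the embedded field is still named Handler; newer versions call it ProbeHandler). The port is a placeholder:

package main

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
)

// tcpReadinessProbe builds a readiness probe that opens a TCP connection
// instead of exec'ing into the container, so no runc exec is involved.
func tcpReadinessProbe(port int) *corev1.Probe {
    return &corev1.Probe{
        Handler: corev1.Handler{
            TCPSocket: &corev1.TCPSocketAction{
                Port: intstr.FromInt(port),
            },
        },
        InitialDelaySeconds: 10,
        PeriodSeconds:       10,
    }
}

func main() {
    _ = tcpReadinessProbe(8080) // attach this to the container spec's ReadinessProbe
}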

Although the change itself is small, retrofitting it onto business containers is expensive, and migrating the existing clusters is a real problem.

4.2 The most fundamental fix

Since containerd-shim currently waits for runc exec to finish before reading the pipes, moving the pipe-reading ahead of the runc exec invocation would, in theory, also break the deadlock.

Again, though, the upgrade cost is prohibitive: rolling out a new containerd-shim means restarting all existing containers, so this option is basically a non-starter.

4.3 The simplest fix

Since runc init is blocked writing to the pipe, we can read the pipe ourselves, which lets runc init finish and exit.

Automating this fix poses a small challenge of its own: recognising that a given docker hang really is caused by a blocked pipe write. Compared with the two options above, however, it is well worth trying, since its blast radius is negligible.
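
A rough sketch of such a tool, with illustrative assumptions throughout (the PID of the stuck runc init is supplied by the operator, and the syscall-number check assumes x86_64): it verifies the process really is blocked in write(2) on a pipe, then opens that pipe through /proc and drains it, which is exactly what the manual cat above did:

package main

import (
    "fmt"
    "io"
    "os"
    "strconv"
    "strings"
)

func main() {
    if len(os.Args) != 2 {
        fmt.Fprintln(os.Stderr, "usage: draindump <pid-of-stuck-runc-init>")
        os.Exit(1)
    }
    pid := os.Args[1]

    // /proc/<pid>/syscall: the first field is the syscall number (1 == write
    // on x86_64), the second field is the fd being written to.
    raw, err := os.ReadFile("/proc/" + pid + "/syscall")
    if err != nil {
        panic(err)
    }
    fields := strings.Fields(string(raw))
    if len(fields) < 2 || fields[0] != "1" {
        fmt.Println("process is not blocked in write(2); nothing to do")
        return
    }
    fd, err := strconv.ParseInt(fields[1], 0, 64)
    if err != nil {
        panic(err)
    }

    target, err := os.Readlink(fmt.Sprintf("/proc/%s/fd/%d", pid, fd))
    if err != nil {
        panic(err)
    }
    if !strings.HasPrefix(target, "pipe:") {
        fmt.Println("the blocked write is not on a pipe:", target)
        return
    }

    // Opening /proc/<pid>/fd/<n> read-only yields a new read end of the same
    // pipe, so we can drain what runc init is trying to write (this is what
    // `cat /proc/<pid>/fd/2` did above).
    f, err := os.Open(fmt.Sprintf("/proc/%s/fd/%d", pid, fd))
    if err != nil {
        panic(err)
    }
    defer f.Close()
    n, _ := io.Copy(os.Stdout, f) // interrupt once runc init has exited
    fmt.Fprintf(os.Stderr, "\ndrained %d bytes\n", n)
}

Run it as root on the affected host; once the dump is drained, runc init completes its write, aborts, and the exec call chain above it unwinds on its own.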

Source: www.likakuli.com
