Container Technology Review: Revisiting the cgroup freezer Subsystem Through a Container Process in the "D" State

2024-01-10 14:18

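Today I ran into a problem: a process stuck in the "D" state. A process being in the D state briefly is normal, but staying there for a long time indicates a problem. For background on Linux "D" state processes, see the article 一文让你应对Linux 进程“D”状态 (a guide to dealing with Linux processes in the "D" state).
Sending kill -9 to a "D" state process has no immediate effect, because a task in uninterruptible sleep does not act on signals until it wakes up. Still, hoping to get lucky, I wanted to see what force-deleting the Pod would do: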
kubectl delete pod <pod-name> --force --grace-period=0
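As expected, it did not help: the Pod ended up in the Error state.
Troubleshooting
Since force-killing was not an option, I went through the analysis step by step.
First, find the PIDs of the processes in the D state: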
ps -eo pid,stat | awk '$2 ~ /^D/'
Next, use the proc filesystem to look at the kernel stack of the process and locate where it is blocked:
cat /proc/<pid>/stack

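The stack contained one key kernel function, __refrigerator (a kernel function, not a system call). The relevant documentation (https://www.kernel.org/doc/html/next/power/freezing-of-tasks.html) has the following key information: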
I. What is the freezing of tasks?
The freezing of tasks is a mechanism by which user space processes and some kernel threads are controlled during hibernation or system-wide suspend (on some architectures).
......
__refrigerator() must not be called directly. Instead, use the try_to_freeze() function (defined in include/linux/freezer.h), that checks if the task is to be frozen and makes the task enter __refrigerator().
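The two lookups above can be combined into one pass. The following is a minimal sketch rather than part of the original investigation: it assumes root privileges (needed to read /proc/<pid>/stack) and a kernel that exposes that file, and the function names are hypothetical.

```shell
#!/bin/bash
# Sketch: list D-state processes and flag those parked in the freezer.
# Function names are illustrative, not from the original article.

# Print the PIDs whose STAT column starts with "D".
# Reads "PID STAT" lines (as produced by `ps -eo pid,stat`) from stdin.
d_state_pids() {
    awk '$2 ~ /^D/ { print $1 }'
}

# Succeed if the kernel stack text on stdin mentions __refrigerator.
stack_is_frozen() {
    grep -q '__refrigerator'
}

# Report every D-state process and whether its stack shows the freezer.
# Needs root to read /proc/<pid>/stack.
report_frozen() {
    local pid
    for pid in $(ps -eo pid,stat | d_state_pids); do
        if stack_is_frozen 2>/dev/null < "/proc/$pid/stack"; then
            echo "$pid: frozen (__refrigerator in stack)"
        else
            echo "$pid: D state, not in the freezer (or stack unreadable)"
        fi
    done
}
```

Calling report_frozen as root prints one line per D-state process, which makes it easy to separate frozen tasks from tasks blocked on I/O.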
At this point I remembered that cgroup has a freezer subsystem, so I pulled up its documentation (https://www.kernel.org/doc/Documentation/cgroup-v1/freezer-subsystem.txt):
the cgroup freezer is useful to batch job management system which start and stop sets of tasks in order to schedule the resources of a machine according to the desires of a system administrator. This sort of program is often used on HPC clusters to schedule access to the cluster as a whole. The cgroup freezer uses cgroups to describe the set of tasks to be started/stopped by the batch job management system. It also provides a means to start and stop the tasks composing the job.
......
The following cgroupfs files are created by cgroup freezer.

* freezer.state: Read-write.
  When read, returns the effective state of the cgroup - "THAWED", "FREEZING" or "FROZEN". This is the combined self and parent-states. If any is freezing, the cgroup is freezing (FREEZING or FROZEN).
  FREEZING cgroup transitions into FROZEN state when all tasks belonging to the cgroup and its descendants become frozen. Note that a cgroup reverts to FREEZING from FROZEN after a new task is added to the cgroup or one of its descendant cgroups until the new task is frozen.
  When written, sets the self-state of the cgroup. Two values are allowed - "FROZEN" and "THAWED". If FROZEN is written, the cgroup, if not already freezing, enters FREEZING state along with all its descendant cgroups.
  If THAWED is written, the self-state of the cgroup is changed to THAWED. Note that the effective state may not change to THAWED if the parent-state is still freezing. If a cgroup's effective state becomes THAWED, all its descendants which are freezing because of the cgroup also leave the freezing state.
* freezer.self_freezing: Read only.
  Shows the self-state. 0 if the self-state is THAWED; otherwise, 1. This value is 1 iff the last write to freezer.state was "FROZEN".
* freezer.parent_freezing: Read only.
  Shows the parent-state. 0 if none of the cgroup's ancestors is frozen; otherwise, 1.

The root cgroup is non-freezable and the above interface files don't exist.

* Examples of usage :
   # mkdir /sys/fs/cgroup/freezer
   # mount -t cgroup -ofreezer freezer /sys/fs/cgroup/freezer
   # mkdir /sys/fs/cgroup/freezer/0
   # echo $some_pid > /sys/fs/cgroup/freezer/0/tasks

to get status of the freezer subsystem :
   # cat /sys/fs/cgroup/freezer/0/freezer.state
   THAWED

to freeze all tasks in the container :
   # echo FROZEN > /sys/fs/cgroup/freezer/0/freezer.state
   # cat /sys/fs/cgroup/freezer/0/freezer.state
   FREEZING
   # cat /sys/fs/cgroup/freezer/0/freezer.state
   FROZEN

to unfreeze all tasks in the container :
   # echo THAWED > /sys/fs/cgroup/freezer/0/freezer.state
   # cat /sys/fs/cgroup/freezer/0/freezer.state
   THAWED

This is the basic mechanism which should do the right thing for user space task in a simple scenario.
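FREEZING is not a steady state. It is a transitional state in which the tasks of the current cgroup (and its descendant cgroups) are on their way to FROZEN. Likewise, if a new task joins the current cgroup or one of its descendants, the state goes from FROZEN back to FREEZING until that task has been frozen.
Only FROZEN and THAWED can be written. If FROZEN is written and the cgroup is not already freezing, it enters the FREEZING state together with all of its descendant cgroups.
If THAWED is written, the self-state of the cgroup becomes THAWED. The exception is that the effective state does not change to THAWED while an ancestor cgroup is still freezing. When the effective state of a cgroup does become THAWED, all descendants that were freezing because of this cgroup leave the freezing state as well.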
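The rules above decide where a thaw has to be written. As a rough illustration, the sketch below maps the three file values to a one-line diagnosis. explain_freezer is a hypothetical helper invented for this example, not a kernel or runc tool.

```shell
#!/bin/bash
# Sketch: turn the three freezer files into a one-line diagnosis.
# explain_freezer is a hypothetical helper, not an existing tool.
# Arguments: <freezer.state> <freezer.self_freezing> <freezer.parent_freezing>
explain_freezer() {
    local state=$1 self=$2 parent=$3
    if [ "$state" = "THAWED" ]; then
        echo "not frozen"
    elif [ "$self" -eq 1 ] && [ "$parent" -eq 1 ]; then
        echo "$state: frozen by itself and by an ancestor; thaw both"
    elif [ "$self" -eq 1 ]; then
        echo "$state: frozen by itself; write THAWED to its freezer.state"
    elif [ "$parent" -eq 1 ]; then
        echo "$state: frozen by an ancestor; thaw the ancestor instead"
    else
        echo "$state: inconsistent values, re-read the files"
    fi
}
```

Inside a freezer cgroup directory it could be called as explain_freezer "$(cat freezer.state)" "$(cat freezer.self_freezing)" "$(cat freezer.parent_freezing)".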
Next, let's write a script that checks whether a container process has been suspended by the cgroup freezer subsystem.
#!/bin/bash
# Check that exactly one argument was given
if [ "$#" -ne 1 ]; then
    echo "Please provide a freezer cgroup path or a PID as the argument"
    exit 1
fi
# The input is either a freezer cgroup path or a PID
input=$1
# If the input is numeric, treat it as a PID and look up its freezer cgroup path
if [[ "$input" =~ ^[0-9]+$ ]]; then
    pid=$input
    if [ ! -d "/proc/$pid" ]; then
        echo "PID $pid does not exist"
        exit 1
    fi
    cgroup_path=$(grep -m 1 -E '^[[:digit:]]+:freezer:' "/proc/$pid/cgroup" | cut -d: -f3)
else
    cgroup_path=$input
fi
# Prepend the cgroup v1 freezer mount point, /sys/fs/cgroup/freezer
cgroup_path="/sys/fs/cgroup/freezer${cgroup_path}"
# Enter the freezer cgroup
cd "$cgroup_path" || exit 1
# Read freezer.parent_freezing, freezer.state and freezer.self_freezing
parent_freezing=$(cat freezer.parent_freezing)
current_state=$(cat freezer.state)
self_freezing=$(cat freezer.self_freezing)
# Print the current state
echo "Current freezer.parent_freezing: $parent_freezing"
echo "Current freezer.state: $current_state"
echo "Current freezer.self_freezing: $self_freezing"
The output is as follows:

Sure enough, it had been frozen. So I extended the script above to thaw a cgroup that is self-frozen:
#!/bin/bash
# Check that exactly one argument was given
if [ "$#" -ne 1 ]; then
    echo "Please provide a freezer cgroup path or a PID as the argument"
    exit 1
fi
# The input is either a freezer cgroup path or a PID
input=$1
# If the input is numeric, treat it as a PID and look up its freezer cgroup path
if [[ "$input" =~ ^[0-9]+$ ]]; then
    pid=$input
    if [ ! -d "/proc/$pid" ]; then
        echo "PID $pid does not exist"
        exit 1
    fi
    cgroup_path=$(grep -m 1 -E '^[[:digit:]]+:freezer:' "/proc/$pid/cgroup" | cut -d: -f3)
else
    cgroup_path=$input
fi
# Prepend the cgroup v1 freezer mount point, /sys/fs/cgroup/freezer
cgroup_path="/sys/fs/cgroup/freezer${cgroup_path}"
# Enter the freezer cgroup
cd "$cgroup_path" || exit 1
# Read freezer.parent_freezing, freezer.state and freezer.self_freezing
parent_freezing=$(cat freezer.parent_freezing)
current_state=$(cat freezer.state)
self_freezing=$(cat freezer.self_freezing)
# Print the current state
echo "Current freezer.parent_freezing: $parent_freezing"
echo "Current freezer.state: $current_state"
echo "Current freezer.self_freezing: $self_freezing"
# If freezer.state is FROZEN and freezer.self_freezing is 1, set freezer.state to THAWED
if [ "$current_state" = "FROZEN" ] && [ "$self_freezing" -eq 1 ]; then
    echo "Detected FROZEN state with self_freezing = 1. Thawing..."
    echo "THAWED" > freezer.state
    echo "freezer.state set to THAWED"
fi
# Print the final freezer.state
echo "Final freezer.state: $(cat freezer.state)"
Output:

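When I checked the process again, it had exited and the Pod had been deleted, presumably because the pending termination signals could finally be delivered once the tasks were thawed. Everything was back to normal!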
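The scripts above assume the cgroup v1 freezer hierarchy at /sys/fs/cgroup/freezer. On hosts that use cgroup v2, the freezer is exposed through the cgroup.freeze and cgroup.events files instead. The following is a minimal sketch for that case, assuming the unified hierarchy is mounted at /sys/fs/cgroup. The function names are hypothetical and this was not part of the original troubleshooting.

```shell
#!/bin/bash
# Sketch for cgroup v2 hosts, where the freezer is exposed as cgroup.freeze
# and cgroup.events. Assumes the unified hierarchy is mounted at /sys/fs/cgroup.
# Function names are illustrative.

# Extract the cgroup v2 path from /proc/<pid>/cgroup content on stdin.
# The v2 entry has the form "0::<path>".
v2_cgroup_path() {
    sed -n 's/^0:://p' | head -n 1
}

# Print the value of the "frozen" key from cgroup.events content on stdin.
frozen_flag() {
    awk '$1 == "frozen" { print $2 }'
}

# Report the freezer settings of the cgroup that the given PID belongs to.
check_v2_frozen() {
    local pid=$1 path dir
    path=$(v2_cgroup_path < "/proc/$pid/cgroup") || return 1
    dir="/sys/fs/cgroup${path}"
    echo "cgroup.freeze: $(cat "$dir/cgroup.freeze")"
    echo "frozen (from cgroup.events): $(frozen_flag < "$dir/cgroup.events")"
}
```

Here cgroup.freeze is the cgroup's own setting, while the frozen key in cgroup.events reflects the effective state, which can also come from an ancestor. Writing 0 to cgroup.freeze in the cgroup that was frozen thaws it.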
Further thoughts
I went through the kubectl command reference, and there is no command for stopping (pausing) a Pod.
kubectl --help
kubectl controls the Kubernetes cluster manager.

 Find more information at: https://kubernetes.io/docs/reference/kubectl/

Basic Commands (Beginner):
  create          Create a resource from a file or from stdin
  expose          Take a replication controller, service, deployment or pod and expose it as a new Kubernetes service
  run             Run a particular image on the cluster
  set             Set specific features on objects

Basic Commands (Intermediate):
  explain         Get documentation for a resource
  get             Display one or many resources
  edit            Edit a resource on the server
  delete          Delete resources by file names, stdin, resources and names, or by resources and label selector

Deploy Commands:
  rollout         Manage the rollout of a resource
  scale           Set a new size for a deployment, replica set, or replication controller
  autoscale       Auto-scale a deployment, replica set, stateful set, or replication controller

Cluster Management Commands:
  certificate     Modify certificate resources
  cluster-info    Display cluster information
  top             Display resource (CPU/memory) usage
  cordon          Mark node as unschedulable
  uncordon        Mark node as schedulable
  drain           Drain node in preparation for maintenance
  taint           Update the taints on one or more nodes

Troubleshooting and Debugging Commands:
  describe        Show details of a specific resource or group of resources
  logs            Print the logs for a container in a pod
  attach          Attach to a running container
  exec            Execute a command in a container
  port-forward    Forward one or more local ports to a pod
  proxy           Run a proxy to the Kubernetes API server
  cp              Copy files and directories to and from containers
  auth            Inspect authorization
  debug           Create debugging sessions for troubleshooting workloads and nodes
  events          List events

Advanced Commands:
  diff            Diff the live version against a would-be applied version
  apply           Apply a configuration to a resource by file name or stdin
  patch           Update fields of a resource
  replace         Replace a resource by file name or stdin
  wait            Experimental: Wait for a specific condition on one or many resources
  kustomize       Build a kustomization target from a directory or URL

Settings Commands:
  label           Update the labels on a resource
  annotate        Update the annotations on a resource
  completion      Output shell completion code for the specified shell (bash, zsh, fish, or powershell)

Other Commands:
  api-resources   Print the supported API resources on the server
  api-versions    Print the supported API versions on the server, in the form of "group/version"
  config          Modify kubeconfig files
  plugin          Provides utilities for interacting with plugins
  version         Print the client and server version information

Usage:
  kubectl [flags] [options]

Use "kubectl <command> --help" for more information about a given command.
Use "kubectl options" for a list of global command-line options (applies to all commands).
Docker, on the other hand, provides a pause command. The official documentation (https://docs.docker.com/engine/reference/commandline/pause/) describes it as follows:
docker pause
Pause all processes within one or more containers
Usage
docker pause CONTAINER [CONTAINER...]
Description
The docker pause command suspends all processes in the specified containers. On Linux, this uses the freezer cgroup. Traditionally, when suspending a process the SIGSTOP signal is used, which is observable by the process being suspended. With the freezer cgroup the process is unaware, and unable to capture, that it is being suspended, and subsequently resumed. On Windows, only Hyper-V containers can be paused.
See the freezer cgroup documentation for further details.
From this, it seems quite likely that someone ran docker pause, which left the container in the frozen state.
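One way to check this suspicion on a node that runs Docker is to list the containers Docker itself reports as paused. The following is a minimal sketch, assuming the docker CLI is installed and can reach the daemon. The function names are hypothetical. On nodes that use a different container runtime the docker CLI is typically not available, and the cgroup files remain the place to look.

```shell
#!/bin/bash
# Sketch: find paused containers on a node that runs Docker.
# Assumes the docker CLI is installed and can reach the daemon.
# Function names are illustrative.

# List the IDs and names of all paused containers.
list_paused() {
    docker ps --filter status=paused --format '{{.ID}} {{.Names}}'
}

# Resume every paused container.
unpause_all() {
    local id
    for id in $(docker ps -q --filter status=paused); do
        docker unpause "$id"
    done
}
```

If list_paused shows the affected container, docker unpause is the counterpart of docker pause and avoids writing to freezer.state by hand.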
Related issue
https://github.com/opencontainers/runc/issues/2105