NPU Exporter Fails the Dynamic Library Path Check and Logs "check uid or mode failed"
2025/01/26
Issue Information
Source | Product Category | Product Subcategory | Keywords |
---|---|---|---|
Official | Installation and Deployment | MindCluster cluster scheduling | NPU Exporter, check uid or mode failed |
Symptom
- Run kubectl get pod -A | grep npu-exporter. The output shows that the NPU Exporter container fails to start.
npu-exporter npu-exporter-rtgpg 0/1 CrashLoopBackOff 2 39s
- Run kubectl logs -fn npu-exporter npu-exporter-rtgpg to view the error details. The log shows the following.
[INFO] 2023/10/24 09:55:04.454169 1 hwlog/api.go:108 npu-exporter.log's logger init success
[INFO] 2023/10/24 09:55:04.454389 1 npu-exporter/main.go:205 listen on: 0.0.0.0
[INFO] 2023/10/24 09:55:04.454607 1 npu-exporter/main.go:325 npu exporter starting and the version is v{version}_linux-aarch64
2023/10/24 09:55:04 command exec failed, &exec.ExitError{ProcessState:(*os.ProcessState)(0x4000495c80), Stderr:[]uint8(nil)}
[ERROR] 2023/10/24 09:55:04.458386 1 devmanager/devmanager.go:83 deviceManager init failed, prepare dcmi failed, err: &errors.errorString{s:"cannot found valid driver lib, fromEnv: lib path is invalid, [/usr/local: check uid or mode failed; /usr/local: check uid or mode failed;], fromLdCmd: can't find valid lib"}
[ERROR] 2023/10/24 09:55:04.458589 1 collector/npu_collector.go:136 new npu collector failed, error is auto init failed, err: get chip info failed, err: device Manager is nil, may encounter an exception during initialization. You can check the system log to confirm
[ERROR] 2023/10/24 09:55:04.458678 1 npu-exporter/main.go:329 register prometheus failed
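The directories that fail the check are buried in a long error line. As a minimal sketch, the filter below extracts them from log text; failed_paths is a hypothetical helper name, and the pattern assumes the message format shown in the example log above.

```shell
# Hypothetical helper: read log text on stdin and print each distinct path
# that was reported with "check uid or mode failed".
failed_paths() {
    grep -oE '/[^:; ]*: check uid or mode failed' \
        | sed 's/: check uid or mode failed$//' \
        | sort -u
}

# Example call (namespace and pod name taken from the example above):
#   kubectl logs -n npu-exporter npu-exporter-rtgpg | failed_paths
```

For the example log, the filter reports only /usr/local, which is the directory examined in the rest of this article.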
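Cause Analysis
The /usr/local directory in the container image has incorrect permissions: it is writable by group and others (drwxrwxrwx), so the driver library path check fails.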
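The source does not state the exact rule behind "check uid or mode failed". The sketch below assumes a plausible one, namely that a directory passes only when it is owned by root or the current user and is not writable by group or others. Both the rule and the helper name check_dir_perm are assumptions for illustration, not the NPU Exporter implementation.

```shell
# Sketch only. Assumed rule (not confirmed by the source): the directory must
# be owned by root (uid 0) or the current user, and must not be writable by
# group or others. check_dir_perm is a hypothetical helper name.
check_dir_perm() {
    # ls -ldn prints: <perms> <links> <uid> <gid> ...
    info=$(ls -ldn "$1") || return 2
    perms=${info%% *}
    owner=$(printf '%s\n' "$info" | awk '{print $3}')

    # Character 6 of the permission string is group-write, character 9 is other-write.
    case $perms in
        ?????w*|????????w*)
            echo "$1: mode $perms is writable by group or others"
            return 1 ;;
    esac

    if [ "$owner" -ne 0 ] && [ "$owner" -ne "$(id -u)" ]; then
        echo "$1: unexpected owner uid $owner"
        return 1
    fi
    echo "$1: ok"
}
```

Run inside the container as check_dir_perm /usr/local, this would flag a directory with mode drwxrwxrwx and accept one with drwxr-xr-x.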
Solution
- In the NPU Exporter installation path, run the following command to obtain the container image ID.
docker ps -a | grep npu-exporter
Example output is as follows. 15bca02e16e9 is the required container image ID.
37a084a19207 15bca02e16e9 "/bin/bash -c -- 'um…" 25 seconds ago Exited (0) 24 seconds ago k8s_npu-exporter_npu-exporter-rtgpg_npu-exporter_2fa00320-fd40-4b1a-81d0-145a26a8f4e1_4
2dbb86d6619f k8s.gcr.io/pause:3.2 "/pause" About a minute ago Up About a minute k8s_POD_npu-exporter-rtgpg_npu-exporter_2fa00320-fd40-4b1a-81d0-145a26a8f4e1_0
- Run the following command to view the image information.
docker images | grep 15bca02e16e9
Example output is as follows.
npu-exporter v{version} 15bca02e16e9 3 minutes ago 93.2MB
- Run the following commands in sequence to check the permissions of the problem directory.
docker run -it 15bca02e16e9 bash
ll /usr/
Example output is as follows. local/ is the directory with incorrect permissions.
total 44
drwxr-xr-x 1 root root 4096 Oct 19 2022 ./
drwxr-xr-x 1 root root 4096 Oct 24 09:58 ../
drwxr-xr-x 2 root root 4096 Oct 19 2022 bin/
drwxr-xr-x 2 root root 4096 Apr 24 2018 games/
drwxr-xr-x 2 root root 4096 Apr 24 2018 include/
drwxr-xr-x 10 root root 4096 Oct 19 2022 lib/
drwxrwxrwx 1 root root 4096 Oct 19 2022 local/
drwxr-xr-x 2 root root 4096 Oct 19 2022 sbin/
drwxr-xr-x 33 root root 4096 Oct 19 2022 share/
drwxr-xr-x 2 root root 4096 Apr 24 2018 src/
- Run the following commands to change the directory permissions.
root@493a58982af9:/# chmod 755 /usr/local
root@493a58982af9:/# ll /usr/
493a58982af9 is the container ID.
Example output is as follows, indicating that the permissions are now set correctly.
total 44
drwxr-xr-x 1 root root 4096 Oct 19 2022 ./
drwxr-xr-x 1 root root 4096 Oct 24 09:58 ../
drwxr-xr-x 2 root root 4096 Oct 19 2022 bin/
drwxr-xr-x 2 root root 4096 Apr 24 2018 games/
drwxr-xr-x 2 root root 4096 Apr 24 2018 include/
drwxr-xr-x 10 root root 4096 Oct 19 2022 lib/
drwxr-xr-x 1 root root 4096 Oct 19 2022 local/
drwxr-xr-x 2 root root 4096 Oct 19 2022 sbin/
drwxr-xr-x 33 root root 4096 Oct 19 2022 share/
drwxr-xr-x 2 root root 4096 Apr 24 2018 src/
- Run the following command to exit the container.
root@493a58982af9:/# exit
- Commit the container changes, specifying the container ID and the image name with its tag.
docker commit 493a58982af9 npu-exporter:v{version}
Example output is as follows.
sha256:34a360670e213cc8817b352a055969e620ed15ac7d26dcb05e391f0a4ad2682a
- Check the status of the NPU Exporter container again.
kubectl get po -A | grep npu-exporter
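You can wait for the container to restart automatically, or force a restart manually, and then check its status.
Example output is as follows, indicating that the NPU Exporter container is running normally.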
npu-exporter npu-exporter-rtgpg 1/1 Running 7 10m
- Run the following command to delete the container copy that was created.
docker rm 493a58982af9
Example output is as follows.
493a58982af9
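The manual steps above (start a container, run chmod, commit, remove the container) can also be sketched as one non-interactive function. This is an illustrative wrapper, not part of the official procedure: the function name fix_usr_local_perm and the helper container name are hypothetical, and it assumes the image has no entrypoint that would intercept the chmod command. Note that docker commit also records the command the container was started with; the manual procedure has the same property, and in the example the pod appears to supply its own command, so this is assumed not to matter here.

```shell
# Hypothetical wrapper around the manual steps; names are illustrative.
# Usage: fix_usr_local_perm <image-id> <repo:tag>
fix_usr_local_perm() {
    image_id=$1
    target=$2
    name="npu-exporter-permfix-$$"   # helper container name (assumption)

    # Run chmod in a throwaway container instead of an interactive shell.
    docker run --name "$name" "$image_id" chmod 755 /usr/local || return 1

    # Save the modified filesystem under the tag the pod already uses.
    docker commit "$name" "$target" || return 1

    # Remove the stopped helper container.
    docker rm "$name"
}

# Example call, using the image ID from the example output and the
# placeholder tag from this article:
#   fix_usr_local_perm 15bca02e16e9 "npu-exporter:v{version}"
```

After the function returns, check the pod status with kubectl get po -A | grep npu-exporter as described above.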