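HCCL Initialization Fails Because TLS Information Is Inconsistent Across Servers Participating in Collective Communication
Symptom
In a distributed training scenario, link setup for collective communication fails. The key HCCL log information is as follows: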
[EVENT] HCCP(23457,all_reduce_test):2023-10-16-08:39:56.291.195 [ra_host.c:1672]tid:23468,ra_socket_white_list_add(1672) : Input parameters: phy_id[0], local_ip[192.168.100.101], num[1]
[ERROR] HCCL(23457,all_reduce_test):2023-10-16-08:41:56.291.748 [comm_base.cc:967][23468] _________________________LINK_ERROR_INFO___________________________
[ERROR] HCCL(23457,all_reduce_test):2023-10-16-08:41:56.291.771 [comm_base.cc:968][23468] | comm error, device[0] num[1]
[ERROR] HCCL(23457,all_reduce_test):2023-10-16-08:41:56.291.777 [comm_base.cc:969][23468] | dest_ip(rank_id) | src_ip(rank_id) | Role | Status |
[ERROR] HCCL(23457,all_reduce_test):2023-10-16-08:41:56.291.783 [comm_base.cc:970][23468] |--------------------|--------------------|----------|------------|
[ERROR] HCCL(23457,all_reduce_test):2023-10-16-08:41:56.291.797 [comm_base.cc:1018][23468] | 192.168.100.100(1) | 192.168.100.101(0) | server | no connect |
[ERROR] HCCL(23457,all_reduce_test):2023-10-16-08:41:56.291.804 [comm_base.cc:983][23468] ___________________________________________________________________
[ERROR] HCCL(23457,all_reduce_test):2023-10-16-08:41:56.291.809 [comm_base.cc:984][23468]the connection failure between this device and target device may be due to the following reasons:
[ERROR] HCCL(23457,all_reduce_test):2023-10-16-08:41:56.291.815 [comm_base.cc:985][23468]1. the connection between this device and the target device is abnormal.
[ERROR] HCCL(23457,all_reduce_test):2023-10-16-08:41:56.291.820 [comm_base.cc:986][23468]2. an exception occurred at the target devices.
[ERROR] HCCL(23457,all_reduce_test):2023-10-16-08:41:56.291.825 [comm_base.cc:988][23468]3. the time difference between the execution of hcom on this device and the target device exceeds the timeout threshold. make sure this by keyworld [Entry-]
[ERROR] HCCL(23457,all_reduce_test):2023-10-16-08:41:56.291.830 [comm_base.cc:990][23468]4. the behavior of executing the calculation graph on this device and the target device is inconsistent.
[ERROR] HCCL(23457,all_reduce_test):2023-10-16-08:41:56.291.836 [comm_base.cc:991][23468]5. The TLS switch is inconsistent, or the TLS certificate expires.
[ERROR] HCCL(23457,all_reduce_test):2023-10-16-08:41:56.291.841 [comm_base.cc:992][23468]6. If src and dst IP address can be pinged, check whether the device IP address conflicts.
[ERROR] HCCL(23457,all_reduce_test):2023-10-16-08:41:56.291.847 [comm_base.cc:437][23468][Get][RaSocket]in comm, get rasocket error role[0], rank[0], num[1], goten[0], timeout[120]
[ERROR] HCCL(23457,all_reduce_test):2023-10-16-08:41:56.291.853 [comm_base.cc:768][23468]call trace: hcclRet -> 9
[ERROR] HCCL(23457,all_reduce_test):2023-10-16-08:41:56.291.861 [comm_base.cc:827][23468]call trace: hcclRet -> 9
[ERROR] HCCL(23457,all_reduce_test):2023-10-16-08:41:56.291.867 [comm_base.cc:199][23468]call trace: hcclRet -> 9
[ERROR] HCCL(23457,all_reduce_test):2023-10-16-08:41:56.291.873 [comm_base.cc:82][23468]call trace: hcclRet -> 9
The HCCP debug log contains the following key error message:
[ERROR] HCCP(6594,hccp_service.bin):2023-10-16-08:39:56.335.828 [rs_ssl.c:206]tid:6610,rs_ssl_err_string(206) ; ssl fd 42 err 1 errno 0, err code 167773208, err msg error:0A000418:SSL routines::tlsv1 alert unknown ca, The possible cause is that the TLS switch is inconsistent.
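The HCCL link error output lists TLS only as one of several possible causes, so the HCCP SSL error line is the more specific signal to search for. The following is a minimal sketch rather than an official tool: the function names `has_ssl_error` and `show_ssl_errors` are hypothetical, and the log file path is an assumption, because the location of HCCP debug logs depends on your environment.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of HCCL/HCCP): succeed if the given log file
# contains an SSL error reported by rs_ssl_err_string, as in the excerpt above.
has_ssl_error() {
    local log_file="$1"
    # -q: print nothing, report the result through the exit status only.
    grep -q 'rs_ssl_err_string' "$log_file"
}

# Hypothetical helper: print the matching lines so the err msg can be read.
show_ssl_errors() {
    local log_file="$1"
    grep 'rs_ssl_err_string' "$log_file"
}
```

For example, `has_ssl_error /path/to/hccp_debug.log && show_ssl_errors /path/to/hccp_debug.log` prints the SSL error lines if any exist; the path shown is a placeholder, not a documented location.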
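Cause Analysis
The TLS switch state differs among the servers participating in collective communication, or the TLS switch is enabled on all of them but their TLS certificate information differs.
Solution
- Query the TLS switch state of each server participating in collective communication.
Run the following command on each server to obtain the TLS switch state.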
hccn_tool -i <device_id> -tls -g [host]
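Here <device_id> is the logical ID of the Device. You can also use the following for loop to query the TLS information of all Devices at once.
for i in $(seq 0 7); do hccn_tool -i "$i" -tls -g; done # 0 and 7 are the first and last Device IDs to query.
The command output is similar to the following: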
dev_id:0, tls switch[0](0:disable, 1:enable), tls alarm time threshold[60]days
dev_id:0, [pub cert] info:
    issuer[/C=CN/ST=GD/O=HUAWEI/OU=2012/CN=2_1thCA]
    start_time[Wed Feb 19 03:19:21 2020 GMT]
    end_time[Sat Feb 16 03:19:21 2030 GMT]
dev_id:0, [ca1 cert] info:
    issuer[/C=CN/ST=GD/L=SZ/O=HUAWEI/CN=1thCA]
    start_time[Wed Feb 19 03:19:07 2020 GMT]
    end_time[Sat Feb 16 03:19:07 2030 GMT]
dev_id:0, [ca2 cert] info:
    issuer[/C=CN/ST=GD/L=SZ/O=HUAWEI/CN=1thCA]
    start_time[Wed Feb 19 03:19:10 2020 GMT]
    end_time[Sat Feb 16 03:19:10 2030 GMT]
dev_id:1, tls switch[0](0:disable, 1:enable), tls alarm time threshold[60]days
dev_id:1, [pub cert] info:
    issuer[/C=CN/ST=GD/O=HUAWEI/OU=2012/CN=2_1thCA]
    start_time[Wed Feb 19 03:19:21 2020 GMT]
    end_time[Sat Feb 16 03:19:21 2030 GMT]
dev_id:1, [ca1 cert] info:
    issuer[/C=CN/ST=GD/L=SZ/O=HUAWEI/CN=1thCA]
    start_time[Wed Feb 19 03:19:07 2020 GMT]
    end_time[Sat Feb 16 03:19:07 2030 GMT]
dev_id:1, [ca2 cert] info:
    issuer[/C=CN/ST=GD/L=SZ/O=HUAWEI/CN=1thCA]
    start_time[Wed Feb 19 03:19:10 2020 GMT]
    end_time[Sat Feb 16 03:19:10 2030 GMT]
... ...
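In this output, tls switch[0] indicates that TLS is disabled, and tls switch[1] indicates that TLS is enabled.
- Check whether the TLS switch state is the same on all Devices of all servers.
- If the states differ, set TLS to enabled on all Devices with the following command. When TLS is disabled, collective communication traffic is exposed to eavesdropping, tampering, and spoofing.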
hccn_tool -i <device_id> -tls -s enable 1
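enable is the switch parameter: 1 enables TLS and 0 disables it.
- If the states are the same and TLS is enabled, continue with the next step to check whether the TLS certificate information is the same on all nodes.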
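- Check whether the TLS certificate information is the same on every Device of all servers.
Use the output of the query in the first step to compare the TLS certificate information of each Device. If it differs, replace the certificate suite with the following command.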
hccn_tool -i 0 -tls -s path /root pri pri.pem pub pub.pem ca1 ca1.pem ca2 ca2.pem crl xxx.crl
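-i specifies the Device ID. path specifies the directory that holds the certificates, private key, and certificate revocation list. pri is the private key file name, and pub is the device certificate file name. ca1, ca2, and crl are the file names of the root certificate, the second-level root certificate, and the certificate revocation list, respectively.
For more hccn_tool usage and parameter descriptions, see the HCCN Tool Interface Reference (《HCCN Tool 接口参考》) for your device.
To obtain the HCCN Tool Interface Reference, click Link to open the "昇腾计算 文档中心" (Ascend Computing documentation center) on the enterprise business website, select your hardware model under "中心训练硬件" (central training hardware), and open the corresponding documentation page, where you can find the HCCN Tool Interface Reference that matches your version.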
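The consistency checks described above can be partly scripted. The following is a minimal sketch, not an official procedure: the function names `tls_switch_table`, `tls_switch_consistent`, and `tls_normalize` are hypothetical, and the parsing assumes that each device's switch state appears on a line beginning with `dev_id:<n>, tls switch[<0|1>]`, as in the sample output shown earlier. Other tool versions may print a different format.

```shell
#!/usr/bin/env bash
# All functions read `hccn_tool -i <id> -tls -g` output from stdin.
# They are illustrative helpers, not part of hccn_tool.

# Print one "dev_id switch" pair per device, for example "0 1".
tls_switch_table() {
    sed -n 's/^dev_id:\([0-9][0-9]*\), tls switch\[\([01]\)].*/\1 \2/p'
}

# Succeed only if at least one device was found and all devices
# report the same switch value.
tls_switch_consistent() {
    tls_switch_table | awk '
        { seen[$2] = 1 }
        END { n = 0; for (k in seen) n++; exit (n == 1 ? 0 : 1) }'
}

# Drop the "dev_id:<n>, " prefix so that the output of different devices
# (switch state and certificate details) can be compared directly.
tls_normalize() {
    sed 's/^dev_id:[0-9][0-9]*, //'
}

# Usage sketch (requires hccn_tool on the server):
#   for i in $(seq 0 7); do hccn_tool -i "$i" -tls -g; done | tls_switch_consistent \
#       || echo "TLS switch differs between Devices"
#   for i in $(seq 0 7); do hccn_tool -i "$i" -tls -g | tls_normalize | cksum; done | sort -u
```

In the second usage line, a single line of `sort -u` output suggests that every queried Device reports the same switch state and certificate details. Running the same loop on each server and comparing the checksums extends the comparison across servers.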