Device network unreachable: error "retcode 4"

Symptom

In a multi-server scenario, the HCCL Test tool fails with the error "retcode: 4" when it runs, as shown below:

-bash: hccn: command not found
[root@node-9 hccl_test]# hccn_tool -i 0 -ip -s address *.*.*.* netmask 255.255.255.0
[root@node-9 hccl_test]# mpirun -f hostfile -n 16 ./bin/all_reduce_test -b 8K -e 64M -f 2 -d fp32 -o sum -p 8

Authorized users only. All activities may be monitored and reported.
the minbytes is 8192, maxbytes is 67108864, iters is 20, warmup_iters is 5
hccl interface return err ./opbase_test/hccl_allreduce_rootinfo_test.cc:135, retcode: 4
hccl_op_base execute failed, Detailed logs are stored in path: /root/ascend/log/This is an error in opbase_test_by_data_size.
hccl interface return err ./opbase_test/hccl_allreduce_rootinfo_test.cc:135, retcode: 4
hccl_op_base execute failed, Detailed logs are stored in path: /root/ascend/log/This is an error in opbase_test_by_data_size.
hccl interface return err ./opbase_test/hccl_allreduce_rootinfo_test.cc:135, retcode: 0
hccl_op_base execute failed, Detailed logs are stored in path: /root/ascend/log/This is an error in opbase_test_by_data_size.

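Cause Analysis

The Device network is unreachable, so link setup between devices fails.

Solution

Run the following command on the Host side to ping each card in turn and confirm that the network is connected.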

hccn_tool -i 0 -ping -g address 192.169.150.60

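Cards on the same plane must all be able to reach one another: cards with the same number on every machine must ping each other successfully (for example, card 0 to card 0 and card 1 to card 1 between every pair of machines, and so on). In addition, if each machine has 16 cards, card 0 and card 8, card 1 and card 9, card 2 and card 10, and so on, across all machines must also be able to reach one another.

The command is as follows: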

hccn_tool -i 0 -ping -g address 192.169.150.60 # Card 0 on the current machine pings the device IP of a card on another machine.
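
Checking every card by hand is repetitive, so the same-numbered-card check could be wrapped in a small loop. The following is a minimal sketch, not part of the HCCL Test tool: the function name `ping_same_plane` and the variable `HCCN_TOOL` are hypothetical, and the peer device IPs are supplied by the caller. It only repeats the ping command shown above and prints the tool's output for manual inspection. It does not interpret the result, because the exit status of `hccn_tool` is not described here.

```shell
# Sketch only: repeats the ping command from this section for several cards.
# HCCN_TOOL and ping_same_plane are illustrative names, not part of hccn_tool.
HCCN_TOOL="${HCCN_TOOL:-hccn_tool}"

# Usage: ping_same_plane <peer IP of card 0> <peer IP of card 1> ...
# Local card N pings the N-th address in the list (same-numbered card on the peer).
# The body runs in a subshell so the loop variables do not leak into the caller.
ping_same_plane() (
    card=0
    for ip in "$@"; do
        echo "== card $card -> $ip =="
        # Same command form as above, with the card number and peer IP filled in.
        "$HCCN_TOOL" -i "$card" -ping -g address "$ip"
        card=$((card + 1))
    done
)
```

It would be called with the peer machine's device IPs in card order, for example `ping_same_plane <ip of card 0> <ip of card 1>`. On 16-card machines, the additional pairs (card 0 with card 8, card 1 with card 9, and so on) still need their own ping commands.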

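If the IP address was changed but the gateway was not updated to match, the devices will also fail to reach each other. The ip, netmask, and gateway must be configured consistently.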
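
One way to sanity-check such a configuration is to confirm that the gateway falls inside the subnet defined by the device IP and netmask. The sketch below assumes that this is what "consistent" means here, which this section does not state explicitly. The helper names `ip_to_int` and `same_subnet` are hypothetical, and the three values are typed in by hand rather than read from the card.

```shell
# Sketch only: ip_to_int and same_subnet are illustrative helpers.

# Convert a dotted-quad IPv4 address (no leading zeros) to an integer.
ip_to_int() (
    IFS=.
    set -- $1            # split the address into its four octets
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Usage: same_subnet <device ip> <netmask> <gateway>
# Succeeds when the gateway is in the same subnet as the device IP.
same_subnet() (
    ip=$(ip_to_int "$1")
    mask=$(ip_to_int "$2")
    gw=$(ip_to_int "$3")
    [ $(( ip & mask )) -eq $(( gw & mask )) ]
)
```

For example, `same_subnet 192.168.1.10 255.255.255.0 192.168.2.1 || echo "gateway is outside the device subnet"` prints the warning, because the gateway's third octet differs under a /24 mask.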