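Firewall Not Disabled
Symptom
The test fails with the error shown below. Checking the machine shows that its firewall has not been disabled.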
[root@node-87-66 hccl_test]# mpirun -f hostfile -n 16 ./bin/all_reduce_test -b 8K -e 4G -f 2 -d fp32 -p 8
the minbytes is 8192, maxbytes is 4294967296, iters is 20, warmup_iters is 5
Fatal error in PMPI_Barrier: Unknown error class, error stack:
PMPI_Barrier(425)...............: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(332)..........: Failure during collective
MPIR_Barrier_impl(327)..........:
MPIR_Barrier(292)...............:
MPIR_Barrier_intra(150).........:
barrier_smp_intra(96)...........:
MPIR_Barrier_impl(332)..........: Failure during collective
MPIR_Barrier_impl(327)..........:
MPIR_Barrier(292)...............:
MPIR_Barrier_intra(169).........:
MPIDI_CH3U_Recvq_FDU_or_AEP(629): Communication error with rank 8
barrier_smp_intra(111)..........:
MPIR_Bcast_impl(1452)...........:
MPIR_Bcast(1476)................:
MPIR_Bcast_intra(1287)..........:
MPIR_Bcast_binomial(310)........: Failure during collective
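Cause Analysis
This problem is usually caused by a firewall that has not been disabled. A running firewall can block the connections that the MPI processes open between nodes, which matches the "Communication error with rank 8" entry in the error stack.
The command for checking the firewall status differs slightly between operating systems. For example: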
systemctl status firewalld
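When several nodes need to be checked, a one-word verdict is easier to compare than the full status output. The following is a minimal sketch, assuming a systemd-based system that uses firewalld. The function names `firewall_verdict` and `check_firewall` are hypothetical helpers introduced here for illustration and are not part of HCCL Test.

```shell
#!/usr/bin/env bash
# Sketch only: assumes systemd with firewalld. Other distributions may use
# a different firewall service, in which case the unit name must be changed.

# Hypothetical helper: map the state printed by `systemctl is-active`
# to a one-word verdict.
firewall_verdict() {
    case "$1" in
        active)          echo "RUNNING" ;;
        inactive|failed) echo "STOPPED" ;;
        *)               echo "UNKNOWN" ;;
    esac
}

# Hypothetical helper: query firewalld on the local node and print the verdict.
# `systemctl is-active` exits non-zero when the unit is not active, so the
# exit code is ignored and only the printed state is used.
check_firewall() {
    local state
    state="$(systemctl is-active firewalld 2>/dev/null || true)"
    echo "firewalld state: ${state:-unknown} ($(firewall_verdict "$state"))"
}
```

Calling `check_firewall` on every node listed in the hostfile shows which nodes still report RUNNING. A verdict of UNKNOWN suggests that the node may use a different firewall service and should be checked manually.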