Inconsistent Collective Communication Arguments Between Ranks (EI0005)
Symptom
The execution log reports error EI0005, "The arguments for collective communication are inconsistent between ranks", as shown below.
custom_group :None
test_type : sum
dtype :float32
data :1024
fusion :0
fusion_id :1
fusion_num :2
iter :1
profiling :false
para_err_type :1
pid :1052803
ranksize is 8, rankid is 4.
time start: 2024-04-24 06:32:20.702705
ERROR : GeOp3_OGEOP::::DoRunAsync Failed
Error Message is:
EI0005: 2024-04-24-06:32:27.781.599 The arguments for collective communication are inconsistent between ranks:tag[HcomAllReduce_6629421139219749105_0],parameter[count],local[16512],remote [8320]
Solution: Check whether the training script and ranktable of each NPU are consistent.
TraceBack (most recent call last):
Transport init error. Reason: [Create] [DestLink]Create Dest error! creakLink para:rank[5]-localUserrank[4]-localIpAddr[192.1.1. 243], dst_rank[6]-remoteUserrank[7]-remote_ip_addr[192.1.1.243]
Transport init error. Reason: [Create] [DestLink]Create Dest error! creakLink para:rank[5]-localUserrank[4]-localIpAddr[192.1.1. 243], dst_rank[4]-remoteUserrank[5]-remote_ip_addr[192.1.1.243]
call hccl op:HcomAllReduce(HcomAllReduce) load task fail[FUNC:Distribute][FILE:hccl_task_info.cc] [LINE:329]
[[{[node Ge0p3_0]}]]
train fail
pid is 1052803
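Cause Analysis
This error is raised when the verification parameters of the local end and the remote end do not match, and it typically appears during link setup. During link setup, the local end receives a check frame sent by the remote end and compares it against its own data to confirm that the two are consistent.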
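The comparison described above can be illustrated with a minimal sketch. This is not HCCL's actual implementation or frame layout: the `CheckFrame` class, its fields (`tag`, `count`, `data_type`, `op`), and `find_mismatches` are hypothetical names chosen only to show the idea of checking a remote frame against local values field by field.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CheckFrame:
    """Hypothetical stand-in for the parameters exchanged at link setup."""
    tag: str
    count: int
    data_type: str
    op: str


def find_mismatches(local, remote):
    """Return (field name, local value, remote value) for every differing field."""
    return [
        (f.name, getattr(local, f.name), getattr(remote, f.name))
        for f in fields(local)
        if getattr(local, f.name) != getattr(remote, f.name)
    ]


# Values for tag and count are taken from the sample log; data_type and op are assumed.
local_frame = CheckFrame("HcomAllReduce_6629421139219749105_0", 16512, "float32", "sum")
remote_frame = CheckFrame("HcomAllReduce_6629421139219749105_0", 8320, "float32", "sum")

mismatches = find_mismatches(local_frame, remote_frame)
for name, local_value, remote_value in mismatches:
    print(f"parameter[{name}], local[{local_value}], remote[{remote_value}]")
```

In this sketch only `count` differs, which mirrors the shape of the EI0005 message in the sample log.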
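The usual cause is incorrect arguments passed by the upper-layer framework or by the user's own call, depending on whether HCCL is driven by a framework or invoked directly by the user. Use the "tag" in the error message to identify the failing operator and "parameter" to identify which of its arguments is inconsistent. In the sample log, the tag is HcomAllReduce_6629421139219749105_0 and the parameter is count, with a local value of 16512 and a remote value of 8320.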
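When many rank logs need to be checked, the "tag" and "parameter" fields can be pulled out of the EI0005 line with a short script. The following is a minimal sketch that assumes the message keeps the `tag[...],parameter[...],local[...],remote [...]` layout seen in the sample log; the helper name `parse_ei0005` is hypothetical and not part of any HCCL tooling.

```python
import re

# Matches the detail fields of an EI0005 message as they appear in the sample log.
# Optional whitespace is tolerated before "[" because the sample shows "remote [8320]".
EI0005_PATTERN = re.compile(
    r"tag\[(?P<tag>[^\]]+)\],\s*"
    r"parameter\[(?P<parameter>[^\]]+)\],\s*"
    r"local\s*\[(?P<local>[^\]]+)\],\s*"
    r"remote\s*\[(?P<remote>[^\]]+)\]"
)


def parse_ei0005(line):
    """Return a dict with tag, parameter, local, and remote, or None if absent."""
    match = EI0005_PATTERN.search(line)
    if match is None:
        return None
    return match.groupdict()


sample = (
    "EI0005: 2024-04-24-06:32:27.781.599 The arguments for collective "
    "communication are inconsistent between ranks:"
    "tag[HcomAllReduce_6629421139219749105_0],parameter[count],"
    "local[16512],remote [8320]"
)

info = parse_ei0005(sample)
print(info)
```

The values are returned as strings; the operator name is the leading part of the tag (here `HcomAllReduce`), and `parameter` names the argument to compare across the scripts running on each rank.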