The following errors are reported during the enqueue phase while running the model training script:
Error Message is EL9999: Inner Error!
EL9999 [drv api] halQueueEnQueueBuff failed: deviceId=0, qid=1, timeout=3000, drvRetCode=58.[FUNC:MemQueueEnQueueBuff][FILE:npu_driver.cc][LINE:3205]
    TraceBack (most recent call last):
    rtMemQueueEnQueueBuff execute failed, reason=[invalid value][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:49]
    Fail to execute acltdtSendTensor, device is 0, name is 17125841410799885412[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]

Error Message is EL9999: Inner Error!
EL9999 [drv api] halQueueEnQueueBuff failed: deviceId=0, qid=1, timeout=3000, drvRetCode=27.[FUNC:MemQueueEnQueueBuff][FILE:npu_driver.cc][LINE:3205]
    TraceBack (most recent call last):
    rtMemQueueEnQueueBuff execute failed, reason=[hdc send msg fail][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:49]
    Fail to execute acltdtSendTensor, device is 0, name is 4174979144421111244[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]

Error Message is EL9999: Inner Error!
EL9999 [drv api] halQueueEnQueueBuff failed: deviceId=0, qid=1, timeout=3000, drvRetCode=74.[FUNC:MemQueueEnQueueBuff][FILE:npu_driver.cc][LINE:3190]
    TraceBack (most recent call last):
    rtMemQueueEnQueueBuff execute failed, reason=[driver error:copy data fail][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:49]
    Fail to execute acltdtSendTensor, device is 0, name is 12934931840515960683[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]

Error Message is EL9999: Inner Error!
EL9999 [drv api] halQueueEnQueueBuff failed: deviceId=0, qid=1, timeout=3000, drvRetCode=16.[FUNC:MemQueueEnQueueBuff][FILE:npu_driver.cc][LINE:3205]
    TraceBack (most recent call last):
    rtMemQueueEnQueueBuff execute failed, reason=[report timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:49]
    Fail to execute acltdtSendTensor, device is 0, name is 16561572617881319536[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
For the error scenarios above, the possible causes are indicated by the following logs:
[ERROR] DRV(3078,python3):2023-04-27-20:04:09.700.666 [ascend][curpid: 3078, 3420][drv][queuemng][QueueIoctl 133]Ioctl failed. (cmd=40585102; error=0; ret=58)
[ERROR] DRV(3078,python3):2023-04-27-20:04:09.700.674 [ascend][curpid: 3078, 3420][drv][queuemng][QueueSubmitEventSync 724]enqueue ioctl failed. (ret=58; event_id=4; gid=20; tid=0; timeout=5000ms; subevent_id=1).
[ERROR] DRV(3078,python3):2023-04-27-20:04:09.700.677 [ascend][curpid: 3078, 3420][drv][queuemng][QueueSendQueueEventSyncTimeout 871]Submit event failed. (ret=58; devId=0; qid=1)
[ERROR] RUNTIME(3078,python3):2023-04-27-20:04:09.700.686 [npu_driver.cc:3205]3420 MemQueueEnQueueBuff:report error module_type=1, module_name=EL9999
[ERROR] RUNTIME(3078,python3):2023-04-27-20:04:09.700.688 [npu_driver.cc:3205]3420 MemQueueEnQueueBuff:[drv api] halQueueEnQueueBuff failed: deviceId=0, qid=1, timeout=3000, drvRetCode=58.
[ERROR] RUNTIME(3078,python3):2023-04-27-20:04:09.700.721 [api_c.cc:3738]3420 rtMemQueueEnQueueBuff:ErrCode=107000, desc=[invalid value], InnerCode=0x7020025

[ERROR] DRV(253915,python3):2023-05-16-22:35:00.917.147 [ascend][curpid: 253915, 254303][drv][queuemng][QueueIoctl 133]Ioctl failed. (cmd=40585102; error=0; ret=27)
[ERROR] DRV(253915,python3):2023-05-16-22:35:00.917.161 [ascend][curpid: 253915, 254303][drv][queuemng][QueueSubmitEventSync 724]enqueue ioctl failed. (ret=27; event_id=4; gid=20; tid=0; timeout=5000ms; subevent_id=90).
[ERROR] DRV(253915,python3):2023-05-16-22:35:00.917.174 [ascend][curpid: 253915, 254303][drv][queuemng][QueueSendQueueEventSyncTimeout 871]Submit event failed. (ret=27; devId=0; qid=1)
[ERROR] RUNTIME(253915,python3):2023-05-16-22:35:00.917.197 [npu_driver.cc:3205]254303 MemQueueEnQueueBuff:report error module_type=1, module_name=EL9999
[ERROR] RUNTIME(253915,python3):2023-05-16-22:35:00.917.209 [npu_driver.cc:3205]254303 MemQueueEnQueueBuff:[drv api] halQueueEnQueueBuff failed: deviceId=0, qid=1, timeout=3000, drvRetCode=27.
[ERROR] RUNTIME(253915,python3):2023-05-16-22:35:00.917.347 [api_c.cc:3783]254303 rtMemQueueEnQueueBuff:ErrCode=507051, desc=[hdc send msg fail], InnerCode=0x7110013

[ERROR] RUNTIME(179263,python3):2023-05-17-09:35:01.035.824 [npu_driver.cc:3205]179650 MemQueueEnQueueBuff:report error module_type=1, module_name=EL9999
[ERROR] RUNTIME(179263,python3):2023-05-17-09:35:01.035.946 [npu_driver.cc:3205]179650 MemQueueEnQueueBuff:[drv api] halQueueEnQueueBuff failed: deviceId=0, qid=1, timeout=3000, drvRetCode=74.
[ERROR] RUNTIME(179263,python3):2023-05-17-09:35:01.035.988 [api_c.cc:3783]179650 rtMemQueueEnQueueBuff:ErrCode=507052, desc=[driver error:copy data fail], InnerCode=0x7020024
[ERROR] RUNTIME(179263,python3):2023-05-17-09:35:01.035.993 [error_message_manage.cc:49]179650 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(179263,python3):2023-05-17-09:35:01.035.998 [error_message_manage.cc:49]179650 FuncErrorReason:rtMemQueueEnQueueBuff execute failed, reason=[driver error:copy data fail]

[ERROR] DRV(2622,python3):2023-05-17-10:05:11.057.003 [ascend][curpid: 2622, 3009][drv][queuemng][QueueIoctl 133]Ioctl failed. (cmd=40585102; error=0; ret=16)
[ERROR] DRV(2622,python3):2023-05-17-10:05:11.057.125 [ascend][curpid: 2622, 3009][drv][queuemng][QueueSubmitEventSync 724]enqueue ioctl failed. (ret=16; event_id=4; gid=20; tid=0; timeout=5000ms; subevent_id=1).
[ERROR] DRV(2622,python3):2023-05-17-10:05:11.057.132 [ascend][curpid: 2622, 3009][drv][queuemng][QueueSendQueueEventSyncTimeout 871]Submit event failed. (ret=16; devId=0; qid=1)
[ERROR] RUNTIME(2622,python3):2023-05-17-10:05:11.057.143 [npu_driver.cc:3205]3009 MemQueueEnQueueBuff:report error module_type=1, module_name=EL9999
[ERROR] RUNTIME(2622,python3):2023-05-17-10:05:11.057.147 [npu_driver.cc:3205]3009 MemQueueEnQueueBuff:[drv api] halQueueEnQueueBuff failed: deviceId=0, qid=1, timeout=3000, drvRetCode=16.
[ERROR] RUNTIME(2622,python3):2023-05-17-10:05:11.057.187 [api_c.cc:3783]3009 rtMemQueueEnQueueBuff:ErrCode=507012, desc=[report timeout], InnerCode=0x711000c
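When collecting logs for technical support, it can help to first identify which of the four variants occurred. The sketch below is only an illustration, not an official tool: the function name `summarize_enqueue_failures` and the table `OBSERVED_REASONS` are hypothetical, and the code-to-reason mapping is taken solely from the log excerpts above, so it should not be read as a complete driver error-code reference. It assumes the log text keeps the `drvRetCode=` and `ErrCode=..., desc=[...], InnerCode=...` fields exactly as shown.

```python
import re

# Reasons observed in the log excerpts above for each drvRetCode.
# Assumption: this table only covers the four cases in this section.
OBSERVED_REASONS = {
    58: "invalid value",
    27: "hdc send msg fail",
    74: "driver error:copy data fail",
    16: "report timeout",
}

# Matches the driver return code reported by MemQueueEnQueueBuff.
DRV_RET_RE = re.compile(r"halQueueEnQueueBuff failed:.*?drvRetCode=(\d+)")

# Matches the runtime error line emitted by rtMemQueueEnQueueBuff.
RUNTIME_ERR_RE = re.compile(
    r"rtMemQueueEnQueueBuff:ErrCode=(\d+), desc=\[([^\]]*)\], "
    r"InnerCode=(0x[0-9a-fA-F]+)"
)


def summarize_enqueue_failures(log_text):
    """Extract enqueue failure codes from host log text (hypothetical helper)."""
    # Unique driver return codes, sorted for stable output.
    drv_codes = sorted({int(code) for code in DRV_RET_RE.findall(log_text)})
    # (ErrCode, desc, InnerCode) tuples in order of appearance.
    runtime_errors = [
        (int(m.group(1)), m.group(2), m.group(3))
        for m in RUNTIME_ERR_RE.finditer(log_text)
    ]
    # Only codes seen in this section get a reason; others are left out.
    observed = {
        code: OBSERVED_REASONS[code]
        for code in drv_codes
        if code in OBSERVED_REASONS
    }
    return {
        "drv_ret_codes": drv_codes,
        "runtime_errors": runtime_errors,
        "observed_reasons": observed,
    }
```

Passing the contents of a collected host log to `summarize_enqueue_failures` returns the driver return codes and runtime error descriptions found in it, which can be included when reporting the issue. A code that does not appear in `OBSERVED_REASONS` is still listed under `drv_ret_codes` but gets no reason attached.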
Errors of this type must be located and diagnosed by technical support. After collecting the logs, click Link to contact technical support.