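Error Reported During Memcpy Asynchronous Copy Operator Execution

Applicable Scenarios

Symptom

Runtime reports an execution error, and Runtime prints the error message "Memory async copy failed" in the plog logs.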

The plog logs are stored in {install_path}/ascend/log/debug/plog, and the log files are named in the format plog-pid_yyyymmddhhmmss.log.
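To locate the affected log lines quickly, the search can be scripted. The following is a minimal sketch, assuming the log files match the glob pattern `plog-*.log` (derived from the naming format above); the function name `find_memcpy_failures` and its parameters are hypothetical, not part of any Ascend tool.

```python
from pathlib import Path


def find_memcpy_failures(plog_dir, marker="Memory async copy failed"):
    """Yield (file name, line number, line) for each plog line containing marker."""
    # Assumption: plog files follow the plog-pid_timestamp.log naming scheme.
    for log_file in sorted(Path(plog_dir).glob("plog-*.log")):
        # errors="replace" keeps the scan going if a log contains odd bytes.
        with log_file.open(encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                if marker in line:
                    yield log_file.name, line_number, line.rstrip("\n")
```

Pass the plog directory of the failing run as `plog_dir`; each hit identifies the file and line to inspect further.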

[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.338.095 [engine.cc:1103]420723 ReportExceptProc:Task exception! device_id=0, stream_id=0, task_id=1, type=13, retCode=0x91, [the model stream execute failed].
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.338.955 [device_error_proc.cc:648]420723 ProcessSdmaErrorInfo:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.075 [task.cc:1582]420723 PrintErrorInfo:Memory async copy failed, device_id=0, stream_id=3, task_id=2, flip_num=0, copy_type=10, memcpy_type=0, copy_data_type=0, length=4096.
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.079 [task.cc:1589]420723 PrintErrorInfo:Memory async copy failed, device_id=0, stream_id=3, task_id=2, flip_num=0, copy_type=10, memcpy_type=0, copy_data_type=0, length=4096, src_addr=0x124080015000, dst_addr=0x124080016000.
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.082 [task.cc:3276]420723 ReportErrorInfo:model execute error, retCode=0x91, [the model stream execute failed].
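The key=value fields in a "Memory async copy failed" line (for example src_addr, dst_addr, and length) are often the first thing to check. The following is a minimal sketch of extracting them, assuming the fields keep the `key=value` layout shown in the log above; `parse_memcpy_failure` is a hypothetical helper name.

```python
import re

# Matches key=value pairs such as "stream_id=3" or "src_addr=0x124080015000".
_FIELD_RE = re.compile(r"(\w+)=(0x[0-9a-fA-F]+|\d+)")


def parse_memcpy_failure(line):
    """Return the numeric fields of a 'Memory async copy failed' line, or None."""
    marker = "Memory async copy failed"
    pos = line.find(marker)
    if pos < 0:
        return None
    fields = {}
    # Only look at the text after the marker, so the log prefix is ignored.
    for key, value in _FIELD_RE.findall(line[pos:]):
        # Hexadecimal values carry a 0x prefix; everything else is decimal.
        fields[key] = int(value, 16) if value.startswith("0x") else int(value)
    return fields
```

With the parsed fields, the source and destination address ranges can be compared against the copy length when checking for an incorrect copy address.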

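Possible Causes

Possible causes include computation overflow, an incorrect copy address, or a process exiting during multi-device (multi-P) training. To determine the specific cause, search the plog logs for the CQE status value (the log format differs across processor platforms):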

[ERROR] RUNTIME(109239,ascend-dmi):2023-06-12-10:25:40.168.867 [device_error_proc.cc:1123]109239 ProcessStarsSdmaErrorInfo:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(109239,ascend-dmi):2023-06-12-10:25:40.168.877 [device_error_proc.cc:1123]109239 ProcessStarsSdmaErrorInfo:The error from device(chipId:0, dieId:0), serial number is 1633. there is a sdma error, sdma channel is 11, sdmaBlkFsmState=0x7, dfxSdmaBlkFsmOstCnt=0x0, sdmaChFree=0x0, irqStatus=0x220000, cqeStatus=0x3 
[ERROR] RUNTIME(109239,ascend-dmi):2023-06-12-10:25:40.168.934 [task.cc:2251]109239 DoCompleteSuccess:mem async copy error, retCode=0x20b, [memcpy exception].

[ERROR] RUNTIME(92299,python):2023-06-08-12:25:55.259.270 [device_error_proc.cc:652]97319 ProcessSdmaErrorInfo:The error from device(0), serial number is 1. there is a sdma error, sdma channel is 0, the channel exist the following problems: The SMMU returns a Terminate error during page table translation.. the value of CQE status is 2. the description of CQE status: When the SQE translates a page table, the SMMU returns a Terminate error.it's config include: setting1=0xc0000008c090000, setting2=0xff009000ff004c, setting3=0x1f, sq base addr=0x6243d000
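Because the two platforms print the CQE status differently, a small helper can normalize both forms to an integer. The following is a minimal sketch based only on the two log formats shown above; `extract_cqe_status` is a hypothetical name, and treating an unprefixed value as decimal is an assumption.

```python
import re

# One pattern per log format shown above.
_CQE_PATTERNS = (
    re.compile(r"cqeStatus=(0x[0-9a-fA-F]+|\d+)"),
    re.compile(r"the value of CQE status is (0x[0-9a-fA-F]+|\d+)"),
)


def extract_cqe_status(line):
    """Return the CQE status found in a plog line as an int, or None."""
    for pattern in _CQE_PATTERNS:
        match = pattern.search(line)
        if match:
            text = match.group(1)
            # Assumption: values without a 0x prefix are decimal.
            return int(text, 16) if text.startswith("0x") else int(text)
    return None
```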
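
Table 1 CQE status values

| Value | Description |
|-------|-------------|
| 0000h | Successful completion: The command completed successfully. |
| 0001h | One of the following errors may have occurred: a Submission Descriptor read response error, or an illegal OPCODE in the SQE. |
| 0002h | The SMMU returned a Terminate error while the SQE was translating a page table. |
| 0003h | Reserved. |
| 0004h | Error reported by SDMAA: secure DDR space was accessed with a non-secure attribute. |
| 0005h | Error reported by SDMAA: the transfer address issued by SDMAM is not mapped in the DAW. |
| 0006h | Error reported by SDMAA: an operation type error occurred. |
| 0007h | Error reported by SDMAA: a DDRC error occurred during the SDMAA transfer. |
| 0008h | Error reported by SDMAA: an ECC error occurred during the SDMAA transfer. |
| 0009h | Error reported by SDMAA: a COMPERR occurred during the SDMAA transfer. |
| 000Ah | Error reported by SDMAA: a COMPDATAERR occurred during the SDMAA transfer. |
| 000Bh | Error reported by SDMAA: an overflow occurred in the reduce operation. |
| 000Ch | Error reported by SDMAA: an underflow occurred in the reduce operation. |
| 000Dh | Error reported by SDMAA: the reduce source data format does not meet the requirements. |
| 000Eh | Error reported by SDMAA: the reduce destination data format does not meet the requirements. |
| 000Fh | Error reported by SDMAA: neither the reduce source nor the destination data format meets the requirements. |
| else | Reserved |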
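When many logs need to be triaged, the table can be encoded as a lookup. The following is a minimal sketch that paraphrases the table entries in English; the names `CQE_STATUS_DESCRIPTIONS` and `describe_cqe_status` are hypothetical, and any value not listed is reported as reserved, as in the table.

```python
# Descriptions paraphrased from Table 1; 0x3 and all unlisted values are reserved.
CQE_STATUS_DESCRIPTIONS = {
    0x0: "Successful completion",
    0x1: "Submission Descriptor read response error, or illegal OPCODE in the SQE",
    0x2: "SMMU returned a Terminate error during SQE page table translation",
    0x4: "SDMAA: secure DDR space accessed with a non-secure attribute",
    0x5: "SDMAA: transfer address issued by SDMAM is not mapped in the DAW",
    0x6: "SDMAA: operation type error",
    0x7: "SDMAA: DDRC error during transfer",
    0x8: "SDMAA: ECC error during transfer",
    0x9: "SDMAA: COMPERR during transfer",
    0xA: "SDMAA: COMPDATAERR during transfer",
    0xB: "SDMAA: reduce operation overflow",
    0xC: "SDMAA: reduce operation underflow",
    0xD: "SDMAA: reduce source data format does not meet the requirements",
    0xE: "SDMAA: reduce destination data format does not meet the requirements",
    0xF: "SDMAA: reduce source and destination data formats do not meet the requirements",
}


def describe_cqe_status(status):
    """Return the description for a CQE status value, or 'Reserved' if unlisted."""
    return CQE_STATUS_DESCRIPTIONS.get(status, "Reserved")
```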

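Handling Procedure

Take follow-up action according to the error code.

For example, error code 0x2 indicates a missing page table.

These three types of errors must be analyzed by the caller case by case.

Examples include error codes 0x9, 0xA, 0xB, 0xC, 0xD, 0xE, and 0xF.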
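The guidance above can be sketched as a small triage function. This is only an illustration: `handling_hint` and `CALLER_ANALYSIS_CODES` are hypothetical names, and the fallback message for other codes is an assumption rather than documented behavior.

```python
# Codes the procedure above lists as requiring analysis by the caller.
CALLER_ANALYSIS_CODES = frozenset(range(0x9, 0x10))


def handling_hint(status):
    """Map a CQE status value to the follow-up suggested by the procedure above."""
    if status == 0x2:
        return "missing page table"
    if status in CALLER_ANALYSIS_CODES:
        return "caller must analyze the specific error"
    # Assumption: all other codes are looked up in the CQE status table.
    return "see the CQE status table"
```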

Possible Resulting Faults

This problem causes ACL to report the error "Execute model failed", which is printed in the plog logs.

[ERROR] ASCENDCL(16243,msame):2022-08-17-04:49:06.339.187 [model.cpp:599]21674 RuntimeV2ModelExecute: [EXEC][DEFAULT][Exec][Model]Execute model failed, ge result[4294967295], modelId[0]
[ERROR] ASCENDCL(16243,msame):2022-08-17-04:49:06.339.193 [model.cpp:1547]21674 aclmdlExecute: [EXEC][DEFAULT][Exec][Model]modelId[0] execute failed, result[500002]
...
[ERROR] ASCENDCL(16243,msame):2022-08-17-04:49:06.342.397 [model.cpp:599]21674 RuntimeV2ModelExecute: [EXEC][DEFAULT][Exec][Model]Execute model failed, ge result[4294967295], modelId[1]
[ERROR] ASCENDCL(16243,msame):2022-08-17-04:49:06.342.399 [model.cpp:1547]21674 aclmdlExecute: [EXEC][DEFAULT][Exec][Model]modelId[1] execute failed, result[500002]
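To see which models were affected, the aclmdlExecute failure lines can be collected from the log. The following is a minimal sketch, assuming the line layout shown above; `failed_models` is a hypothetical helper name.

```python
import re

# Matches the aclmdlExecute failure lines shown above.
_EXEC_FAILED_RE = re.compile(r"modelId\[(\d+)\] execute failed, result\[(\d+)\]")


def failed_models(lines):
    """Return a list of (model id, result code) pairs found in the given log lines."""
    failures = []
    for line in lines:
        match = _EXEC_FAILED_RE.search(line)
        if match:
            failures.append((int(match.group(1)), int(match.group(2))))
    return failures
```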