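The Runtime reports an error during execution, and the plog log contains the Runtime error message "Memory async copy failed".
The plog logs are stored in {install_path}/ascend/log/debug/plog, and the log files are named in the format plog-pid_yyymmddhhmmss.log.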
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.338.095 [engine.cc:1103]420723 ReportExceptProc:Task exception! device_id=0, stream_id=0, task_id=1, type=13, retCode=0x91, [the model stream execute failed].
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.338.955 [device_error_proc.cc:648]420723 ProcessSdmaErrorInfo:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.075 [task.cc:1582]420723 PrintErrorInfo:Memory async copy failed, device_id=0, stream_id=3, task_id=2, flip_num=0, copy_type=10, memcpy_type=0, copy_data_type=0, length=4096.
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.079 [task.cc:1589]420723 PrintErrorInfo:Memory async copy failed, device_id=0, stream_id=3, task_id=2, flip_num=0, copy_type=10, memcpy_type=0, copy_data_type=0, length=4096, src_addr=0x124080015000, dst_addr=0x124080016000.
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.082 [task.cc:3276]420723 ReportErrorInfo:model execute error, retCode=0x91, [the model stream execute failed].
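When many plog files are present, it can help to pull the "Memory async copy failed" entries out programmatically. The following is a minimal sketch, assuming each log entry occupies its own line in the plog file and that the fields follow the key=value shape shown in the sample above. The helper names `parse_copy_failure` and `scan_plog_dir` are hypothetical and are not part of any Ascend tooling.

```python
import re
from pathlib import Path

# Matches key=value pairs such as "stream_id=3" or "src_addr=0x124080015000".
FIELD_RE = re.compile(r"(\w+)=(0x[0-9a-fA-F]+|\d+)")
MARKER = "Memory async copy failed"


def to_int(text):
    """Convert decimal or 0x-prefixed hexadecimal text to an int."""
    return int(text, 16) if text.lower().startswith("0x") else int(text)


def parse_copy_failure(line):
    """Return the key=value fields that follow the marker, or None."""
    _, found, tail = line.partition(MARKER)
    if not found:
        return None
    return {key: to_int(value) for key, value in FIELD_RE.findall(tail)}


def scan_plog_dir(plog_dir):
    """Yield (file name, fields) for every matching line under plog_dir."""
    for path in sorted(Path(plog_dir).glob("plog-*.log")):
        with path.open(errors="replace") as handle:
            for line in handle:
                fields = parse_copy_failure(line)
                if fields is not None:
                    yield path.name, fields
```

Calling `scan_plog_dir` on the plog directory yields one dictionary per failed copy, so the `src_addr`, `dst_addr`, and `length` values can be compared against the buffers the application allocated.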
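Possible causes include computation overflow, an incorrect copy address, and a process exiting during multi-device (multi-P) training. To determine the specific cause, look up the CQE status value in the plog log (the log format differs across processor platforms):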
[ERROR] RUNTIME(109239,ascend-dmi):2023-06-12-10:25:40.168.867 [device_error_proc.cc:1123]109239 ProcessStarsSdmaErrorInfo:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(109239,ascend-dmi):2023-06-12-10:25:40.168.877 [device_error_proc.cc:1123]109239 ProcessStarsSdmaErrorInfo:The error from device(chipId:0, dieId:0), serial number is 1633. there is a sdma error, sdma channel is 11, sdmaBlkFsmState=0x7, dfxSdmaBlkFsmOstCnt=0x0, sdmaChFree=0x0, irqStatus=0x220000, cqeStatus=0x3
[ERROR] RUNTIME(109239,ascend-dmi):2023-06-12-10:25:40.168.934 [task.cc:2251]109239 DoCompleteSuccess:mem async copy error, retCode=0x20b, [memcpy exception].
or
[ERROR] RUNTIME(92299,python):2023-06-08-12:25:55.259.270 [device_error_proc.cc:652]97319 ProcessSdmaErrorInfo:The error from device(0), serial number is 1. there is a sdma error, sdma channel is 0, the channel exist the following problems: The SMMU returns a Terminate error during page table translation.. the value of CQE status is 2. the description of CQE status: When the SQE translates a page table, the SMMU returns a Terminate error.it's config include: setting1=0xc0000008c090000, setting2=0xff009000ff004c, setting3=0x1f, sq base addr=0x6243d000
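| Value | Description |
|---|---|
| 0000h | Successful completion: The command completed successfully. |
| 0001h | Indicates that one of the following errors may have occurred: |
| 0002h | The SMMU returned a Terminate error while the SQE was translating a page table. |
| 0003h | Reserved. |
| 0004h | Error reported by SDMAA: a non-secure attribute was used to access secure DDR space. |
| 0005h | Error reported by SDMAA: the transfer address issued by SDMAM is not mapped in the DAW. |
| 0006h | Error reported by SDMAA: an operation type error occurred. |
| 0007h | Error reported by SDMAA: a DDRC error occurred during the SDMAA transfer. |
| 0008h | Error reported by SDMAA: an ECC error occurred during the SDMAA transfer. |
| 0009h | Error reported by SDMAA: a COMPERR occurred during the SDMAA transfer. |
| 000Ah | Error reported by SDMAA: a COMPDATAERR occurred during the SDMAA transfer. |
| 000Bh | Error reported by SDMAA: the reduce operation overflowed. |
| 000Ch | Error reported by SDMAA: the reduce operation underflowed. |
| 000Dh | Error reported by SDMAA: the reduce source data format does not meet the requirements. |
| 000Eh | Error reported by SDMAA: the reduce destination data format does not meet the requirements. |
| 000Fh | Error reported by SDMAA: the reduce source and destination data formats do not meet the requirements. |
| else | Reserved |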
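Take follow-up action according to the error code.
For example, error code 0x2 indicates a missing page table.
For these three kinds of errors, the caller must analyze the specific error situation.
For example, error codes 0x9, 0xA, 0xB, 0xC, 0xD, 0xE, and 0xF.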
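The lookup described above (find the CQE status value in a log line, then map it to a description) can be sketched in a few lines. This is an illustration only: the two regular expressions cover the two log shapes shown in the samples in this section, the descriptions are shortened paraphrases of the table, and the helper names `extract_cqe_status` and `describe_cqe_status` are hypothetical. It also assumes the value in the "the value of CQE status is N" form is printed in decimal, which the samples do not confirm.

```python
import re

# Shortened paraphrases of the CQE status table in this section.
CQE_STATUS = {
    0x0: "Successful completion",
    0x1: "One of several possible errors",
    0x2: "SMMU returned a Terminate error during SQE page table translation",
    0x3: "Reserved",
    0x4: "Non-secure attribute used to access secure DDR space",
    0x5: "Transfer address issued by SDMAM is not mapped in the DAW",
    0x6: "Operation type error",
    0x7: "DDRC error during SDMAA transfer",
    0x8: "ECC error during SDMAA transfer",
    0x9: "COMPERR during SDMAA transfer",
    0xA: "COMPDATAERR during SDMAA transfer",
    0xB: "Reduce operation overflow",
    0xC: "Reduce operation underflow",
    0xD: "Reduce source data format does not meet requirements",
    0xE: "Reduce destination data format does not meet requirements",
    0xF: "Reduce source and destination data formats do not meet requirements",
}

HEX_FORM = re.compile(r"cqeStatus=0x([0-9a-fA-F]+)")
TEXT_FORM = re.compile(r"value of CQE status is (\d+)")


def extract_cqe_status(line):
    """Return the CQE status found in a log line, or None if absent."""
    match = HEX_FORM.search(line)
    if match:
        return int(match.group(1), 16)
    match = TEXT_FORM.search(line)
    if match:
        # Assumption: this log form prints the value in decimal.
        return int(match.group(1))
    return None


def describe_cqe_status(code):
    """Map a CQE status code to its description; other values are reserved."""
    return CQE_STATUS.get(code, "Reserved")
```

For a line containing `cqeStatus=0x3`, `extract_cqe_status` returns 3, and `describe_cqe_status(3)` returns the table entry for 0003h.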
This problem causes ACL to report the error "Execute model failed", which is printed in the plog log.
[ERROR] ASCENDCL(16243,msame):2022-08-17-04:49:06.339.187 [model.cpp:599]21674 RuntimeV2ModelExecute: [EXEC][DEFAULT][Exec][Model]Execute model failed, ge result[4294967295], modelId[0]
[ERROR] ASCENDCL(16243,msame):2022-08-17-04:49:06.339.193 [model.cpp:1547]21674 aclmdlExecute: [EXEC][DEFAULT][Exec][Model]modelId[0] execute failed, result[500002]
...
[ERROR] ASCENDCL(16243,msame):2022-08-17-04:49:06.342.397 [model.cpp:599]21674 RuntimeV2ModelExecute: [EXEC][DEFAULT][Exec][Model]Execute model failed, ge result[4294967295], modelId[1]
[ERROR] ASCENDCL(16243,msame):2022-08-17-04:49:06.342.399 [model.cpp:1547]21674 aclmdlExecute: [EXEC][DEFAULT][Exec][Model]modelId[1] execute failed, result[500002]
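To see which models were affected, the `aclmdlExecute` entries can be reduced to (modelId, result) pairs. This is a rough sketch that assumes the entries keep the wording shown in the sample above; `failed_models` is a hypothetical helper name.

```python
import re

# Matches e.g. "aclmdlExecute: ...modelId[0] execute failed, result[500002]".
ACL_FAIL_RE = re.compile(
    r"aclmdlExecute:.*?modelId\[(\d+)\] execute failed, result\[(\d+)\]"
)


def failed_models(log_text):
    """Return a list of (model_id, result_code) tuples found in log_text."""
    return [(int(model), int(result))
            for model, result in ACL_FAIL_RE.findall(log_text)]
```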