getnext算子超时

适用场景

现象描述

执行训练脚本时出现getnext算子超时。

Error Message is :
E30008: AI CPU operator execution time out.
Possible Cause: 1. For the GetNext operator, its preprocessing duration may be too long. 2. For a custom operator, its logic may be improper.
Solution: 1. For the GetNext operator, check its preprocessing or set OpExecuteTimeOut to a larger value. 2. For a custom operator, make sure its logic is proper.
TraceBack (most recent call last):
Aicpu kernel execute failed, device_id=0, stream_id=2, task_id=2, fault op_name=aicpu_getnext_IteratorGetNext[FUNC:GetError][FILE:stream.cc][LINE:1133]
rtStreamSynchronizeWithTimeout execute failed, reason=[aicpu timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:49]

[[{{node GeOp2_0}}]]

可能原因

可通过查询Device日志定界,日志中能记录getnext超时后Device侧驱动队列中的出队/入队相关信息:

debug/device-0/device-1927_20230406202333033.log:147:[ERROR] AICPU(3480,aicpu_scheduler):2023-04-06-20:24:34.559.758 [kernel_util.cc:101][AICPU][operator():101][tid:3543]:device_id:0, queue_name:12658048656348665736, queue_id:1, size:0, depth:2, status:1, workMode:1, type:2, enqueCnt:0, dequeCnt=0, enqueFailCnt=0, dequeFailCnt=1, enqueEventOk=0, enqueEventFail=0, FullToNotFullEventOkCnt=0, FullToNotFullEventFailCnt = 0, lastEnqueTime.tv_sec:0, lastEnqueTime.tv_usec:0, lastDequeTime.tv_sec:0, lastDequeTime.tv_usec:0

可通过入队/出队信息确认数据集是否正常发送到了Device侧,如果入队数量很少,则可能是数据集生成不稳定或者数据集传输网络不稳定或者预处理阶段耗时较大。

处理步骤

针对以上可能原因,可参考以下步骤处理:

  1. 检查训练模型的输入数据集是否正常生成以及传输是否稳定。
  2. 检查host侧预处理过程处理逻辑是否存在耗时较大情况(数据集正常情况下,getnext超时后aicpu记录的出队/入队统计中enqueue数量较少或者lastEnqueueTime时间比较大说明预处理阶段耗时大),如果确认预处理阶段耗时较久,可通过OpExecuteTimeOut接口修改算子超时时间更长(默认超时时间28s)。