Runtime Reports a Notify Wait Error

Applicable Scenarios

Symptom

Runtime execution fails. In the plog logs, Runtime prints the error message PrintErrorInfo:Notify wait execute failed together with the stream snapshot information from PrintStreamTimeoutSnapshotInfo.

The plog logs are stored in {install_path}/ascend/log/debug/plog, and the log files are named in the format plog-pid_yyyymmddhhmmss.log.

[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.101.548 [engine.cc:1361]935255 ReportExceptProc:Real task exception! device_id=0, stream_id=1113, task_id=2, task_type=14 (NOTIFY_WAIT)
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.101.557 [engine.cc:1366]935255 ReportExceptProc:Task exception! device_id=0, stream_id=1867, task_id=1, type=13(MODEL_EXECUTE), failuremode =0, retCode=0x91, [the model stream execute failed]
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.101.593 [device_error_proc.cc:968]935255 PrintStreamTimeoutSnapshotInfo:stream_id=1867, task_id=1, taskType=13 (MODEL_EXECUTE), .
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.102.330 [task.cc:3786]935255 ReportErrorInfo:model execute error, retCode=0x91, [the model stream execute failed].
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.102.336 [task.cc:3766]935255 PrintErrorInfo:model execute task failed, device_id=0, model stream_id=1867, model task_id=1, flip_num=0, model_id=773, first_task_id=65535.
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.102.344 [task.cc:4106]935255 PrintErrorInfo:Notify wait execute failed, device_id=0, stream_id=1113, task_id=2, flip_num=0, notify_id=535
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.102.350 [callback.cc:91]935255 Notify:notify [_DEFAULT_MODEL_NAME_] task fail start.notify taskid:2 streamid:1113 retcode:507011
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.102.363 [stream.cc:1139]935255 GetError:Stream Synchronize failed, stream_id=1867, retCode=0x91, [the model stream execute failed].
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.102.366 [stream.cc:1142]935255 GetError:report error module_type=2, module_name=EI9999
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.102.369 [stream.cc:1142]935255 GetError:Notify wait execute failed, device_id=0, stream_id=1113, task_id=2, flip_num=0, notify_id=535
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.110.670 [logger.cc:364]935255 StreamSynchronize:Stream synchronize failed
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.110.688 [api_c.cc:735]935255 rtStreamSynchronize:ErrCode=507011, desc=[the model stream execute failed], InnerCode=0x7150050
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.110.691 [error_message_manage.cc:49]935255 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.110.695 [error_message_manage.cc:49]935255 FuncErrorReason:rtStreamSynchronize execute failed, reason=[the model stream execute failed]
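
When several logs have to be compared, the IDs in the Notify wait execute failed line (device, stream, task, notify) can be extracted programmatically. The following is a minimal sketch, assuming the line layout matches the sample above; the function name parse_notify_wait is illustrative and is not part of any Ascend tool.

```python
import re

# The pattern assumes the field order shown in the sample log line;
# other Runtime versions may print the fields differently.
NOTIFY_WAIT_RE = re.compile(
    r"Notify wait execute failed, device_id=(?P<device_id>\d+), "
    r"stream_id=(?P<stream_id>\d+), task_id=(?P<task_id>\d+), "
    r"flip_num=(?P<flip_num>\d+), notify_id=(?P<notify_id>\d+)"
)


def parse_notify_wait(line):
    """Return the IDs from a 'Notify wait execute failed' line, or None."""
    match = NOTIFY_WAIT_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


# Sample line copied from the log excerpt above.
sample = (
    "[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.102.344 "
    "[task.cc:4106]935255 PrintErrorInfo:Notify wait execute failed, "
    "device_id=0, stream_id=1113, task_id=2, flip_num=0, notify_id=535"
)
print(parse_notify_wait(sample))
```

The extracted stream_id and task_id can then be matched against the ReportExceptProc line that names the NOTIFY_WAIT task.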

Possible Causes

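In multi-device (multi-P) training, one device may have hit an exception, so the other devices time out while waiting for it; one device's computation may have fallen out of sync, which likewise makes the other devices time out; or a communication failure may have occurred, producing a communication-loss error.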

Procedure

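  1. In the plog log directory, run grep -rns ERROR to view the ERROR logs of all training devices, then identify and resolve every fault that occurred before the Notify wait timeout.
  2. If the root node has been located in a collective communication scenario, or the node has been located in a single-server scenario, and this error is the first error reported on that node, contact technical support for further diagnosis.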

    You can collect the logs and then click Link to contact technical support.
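
Step 1 amounts to ordering the ERROR lines by time and looking at whatever was logged before the first Notify wait failure. The sketch below illustrates that idea on a list of log lines. It assumes the timestamp layout of the sample logs (for example 2023-05-11-20:08:58.101.548); the helper names log_time and errors_before_notify_wait are hypothetical.

```python
import re
from datetime import datetime

# Timestamp layout taken from the sample logs: 2023-05-11-20:08:58.101.548
TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2}-\d{2}:\d{2}:\d{2})\.(\d{3})\.(\d{3})")
# Matches both "Notify wait" and "NOTIFY_WAIT" as they appear in the logs.
NOTIFY_RE = re.compile(r"notify[ _]wait", re.IGNORECASE)


def log_time(line):
    """Parse the plog timestamp of a line, or return None if absent."""
    match = TS_RE.search(line)
    if match is None:
        return None
    base = datetime.strptime(match.group(1), "%Y-%m-%d-%H:%M:%S")
    micros = int(match.group(2)) * 1000 + int(match.group(3))
    return base.replace(microsecond=micros)


def errors_before_notify_wait(lines):
    """Return ERROR lines, in time order, logged before the first Notify wait failure."""
    stamped = [(log_time(line), line) for line in lines if "[ERROR]" in line]
    stamped = sorted((ts, line) for ts, line in stamped if ts is not None)
    earlier = []
    for _, line in stamped:
        if NOTIFY_RE.search(line):
            break  # everything from here on is the timeout itself or later
        earlier.append(line)
    return earlier
```

The input lines could come from reading every file under the plog directory. When logs from several hosts are merged, the ordering is only as reliable as the clock synchronization between those hosts.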

Faults That May Result

This problem causes HCCL to report Task run failed with the task type Notify Wait; the error is printed in the plog logs.

[ERROR]HCCL(22728,python3.7):2023-05-11-11:15:22.967.300 [task_exception_handler.cc:223][22728][64817][EXEC][EXEC][TaskExceptionHandler][Callback]Task run failed, base information is streamID:[67], taskID[55], taskType[Notify Wait], tag[HcomAllGather_6629421139219749105_5].
[ERROR]HCCL(22728,python3.7):2023-05-11-11:15:22.967.335 [task_exception_handler.cc:225][22728][64817][EXEC][EXEC][TaskExceptionHandler][Callback]Task run failed, para information is notify id:[0x00000003000000a0], stage:[ffffffff], remote rank:[11].
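
The two HCCL lines carry the stream, task, tag, notify ID, and remote rank of the failed task. As a rough sketch, assuming the bracketed field layout of the sample lines above, these fields can be collected into one record; the function name parse_hccl_task_failure is illustrative only.

```python
import re

# Patterns assume the bracketed field layout of the two sample HCCL lines.
BASE_RE = re.compile(
    r"streamID:\[(?P<stream_id>\d+)\], taskID\[(?P<task_id>\d+)\], "
    r"taskType\[(?P<task_type>[^\]]+)\], tag\[(?P<tag>[^\]]+)\]"
)
PARA_RE = re.compile(
    r"notify id:\[(?P<notify_id>0x[0-9a-fA-F]+)\], "
    r"stage:\[(?P<stage>[0-9a-fA-F]+)\], remote rank:\[(?P<remote_rank>\d+)\]"
)


def parse_hccl_task_failure(lines):
    """Merge the 'base information' and 'para information' fields into one dict."""
    info = {}
    for line in lines:
        for pattern in (BASE_RE, PARA_RE):
            match = pattern.search(line)
            if match is not None:
                info.update(match.groupdict())
    return info


# Field text copied from the two sample lines above (log prefixes omitted).
hccl_lines = [
    "Task run failed, base information is streamID:[67], taskID[55], "
    "taskType[Notify Wait], tag[HcomAllGather_6629421139219749105_5].",
    "Task run failed, para information is notify id:[0x00000003000000a0], "
    "stage:[ffffffff], remote rank:[11].",
]
print(parse_hccl_task_failure(hccl_lines))
```

All values are returned as strings. The remote_rank and tag fields may help decide which rank's logs and which collective operation to inspect next.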