Runtime reports an execution error. In the plog log, Runtime prints the error message PrintErrorInfo:Notify wait execute failed together with the stream snapshot output from PrintStreamTimeoutSnapshotInfo.
The plog logs are stored under {install_path}/ascend/log/debug/plog, and the log files are named in the format plog-pid_yyyymmddhhmmss.log.
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.101.548 [engine.cc:1361]935255 ReportExceptProc:Real task exception! device_id=0, stream_id=1113, task_id=2, task_type=14 (NOTIFY_WAIT)
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.101.557 [engine.cc:1366]935255 ReportExceptProc:Task exception! device_id=0, stream_id=1867, task_id=1, type=13(MODEL_EXECUTE), failuremode =0, retCode=0x91, [the model stream execute failed]
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.101.593 [device_error_proc.cc:968]935255 PrintStreamTimeoutSnapshotInfo:stream_id=1867, task_id=1, taskType=13 (MODEL_EXECUTE), .
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.102.330 [task.cc:3786]935255 ReportErrorInfo:model execute error, retCode=0x91, [the model stream execute failed].
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.102.336 [task.cc:3766]935255 PrintErrorInfo:model execute task failed, device_id=0, model stream_id=1867, model task_id=1, flip_num=0, model_id=773, first_task_id=65535.
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.102.344 [task.cc:4106]935255 PrintErrorInfo:Notify wait execute failed, device_id=0, stream_id=1113, task_id=2, flip_num=0, notify_id=535
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.102.350 [callback.cc:91]935255 Notify:notify [_DEFAULT_MODEL_NAME_] task fail start.notify taskid:2 streamid:1113 retcode:507011
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.102.363 [stream.cc:1139]935255 GetError:Stream Synchronize failed, stream_id=1867, retCode=0x91, [the model stream execute failed].
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.102.366 [stream.cc:1142]935255 GetError:report error module_type=2, module_name=EI9999
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.102.369 [stream.cc:1142]935255 GetError:Notify wait execute failed, device_id=0, stream_id=1113, task_id=2, flip_num=0, notify_id=535
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.110.670 [logger.cc:364]935255 StreamSynchronize:Stream synchronize failed
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.110.688 [api_c.cc:735]935255 rtStreamSynchronize:ErrCode=507011, desc=[the model stream execute failed], InnerCode=0x7150050
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.110.691 [error_message_manage.cc:49]935255 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(935255,rtstest_host):2023-05-11-20:08:58.110.695 [error_message_manage.cc:49]935255 FuncErrorReason:rtStreamSynchronize execute failed, reason=[the model stream execute failed]
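When many plog files have to be checked, the Runtime entry quoted in the log above can be located with a small script. The following is a minimal sketch, not an official tool: the function names find_notify_wait_failures and scan_plog_dir are hypothetical, and the regular expression assumes the field order seen in the sample entry (device_id, stream_id, task_id, flip_num, notify_id), which may differ in other versions.

```python
import re
from pathlib import Path

# Assumption: the entry layout matches the sample log in this document.
NOTIFY_WAIT_RE = re.compile(
    r"PrintErrorInfo:Notify wait execute failed, "
    r"device_id=(?P<device_id>\d+), stream_id=(?P<stream_id>\d+), "
    r"task_id=(?P<task_id>\d+), flip_num=(?P<flip_num>\d+), "
    r"notify_id=(?P<notify_id>\d+)"
)


def find_notify_wait_failures(text):
    """Return one dict of integer fields per matching entry in text."""
    return [
        {name: int(value) for name, value in match.groupdict().items()}
        for match in NOTIFY_WAIT_RE.finditer(text)
    ]


def scan_plog_dir(plog_dir):
    """Map each plog-*.log file name under plog_dir to its matching entries.

    Hypothetical helper: plog_dir is expected to be the plog directory
    named in this document, with {install_path} already resolved.
    """
    results = {}
    for path in sorted(Path(plog_dir).glob("plog-*.log")):
        hits = find_notify_wait_failures(path.read_text(errors="replace"))
        if hits:
            results[path.name] = hits
    return results
```

The pattern is anchored on PrintErrorInfo, so the later GetError entry that repeats the same message is not counted a second time.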
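Possible causes: in multi-device (multi-P) training, one device (1P) fails and the other devices time out while waiting; or the computation on one device falls out of sync, so the other devices time out while waiting; or communication fails, so a communication message is lost and the error is reported.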
After collecting the logs, click Link to contact technical support.
This problem causes HCCL to report Task run failed with the task type Notify Wait, and the error is printed in the plog log.
[ERROR]HCCL(22728,python3.7):2023-05-11-11:15:22.967.300 [task_exception_handler.cc:223][22728][64817][EXEC][EXEC][TaskExceptionHandler][Callback]Task run failed, base information is streamID:[67], taskID[55], taskType[Notify Wait], tag[HcomAllGather_6629421139219749105_5].
[ERROR]HCCL(22728,python3.7):2023-05-11-11:15:22.967.335 [task_exception_handler.cc:225][22728][64817][EXEC][EXEC][TaskExceptionHandler][Callback]Task run failed, para information is notify id:[0x00000003000000a0], stage:[ffffffff], remote rank:[11].
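The two HCCL entries above carry the fields most useful for narrowing down the failing operation: the task type, the collective tag, and the remote rank. The sketch below extracts them, assuming the bracket layout of the sample entries; the function name parse_hccl_task_failure is hypothetical and the patterns may need adjusting for other HCCL versions.

```python
import re

# Assumption: both patterns follow the sample HCCL entries in this document.
BASE_RE = re.compile(
    r"Task run failed, base information is "
    r"streamID:\[(?P<stream_id>\d+)\], taskID\[(?P<task_id>\d+)\], "
    r"taskType\[(?P<task_type>[^\]]+)\], tag\[(?P<tag>[^\]]+)\]"
)
PARA_RE = re.compile(
    r"Task run failed, para information is "
    r"notify id:\[(?P<notify_id>0x[0-9a-fA-F]+)\], "
    r"stage:\[(?P<stage>[0-9a-fA-F]+)\], "
    r"remote rank:\[(?P<remote_rank>\d+)\]"
)


def parse_hccl_task_failure(text):
    """Return the fields of the first base/para entry pair, or None."""
    base = BASE_RE.search(text)
    para = PARA_RE.search(text)
    if base is None or para is None:
        return None
    info = {**base.groupdict(), **para.groupdict()}
    # Numeric identifiers become ints; notify_id and stage stay as hex text.
    for key in ("stream_id", "task_id", "remote_rank"):
        info[key] = int(info[key])
    return info
```

For the sample entries this yields task_type "Notify Wait", the tag HcomAllGather_6629421139219749105_5, and remote_rank 11. Checking the plog of that remote rank is a reasonable next step, since it may be the device that failed or fell out of sync.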