Process Fails to Restart After a User Process Exits Abnormally

Symptom

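After a user process hangs or is forcibly terminated, restarting it fails and the process cannot start normally. Log messages similar to the following are reported: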

AscendCL log: aclrtProcessReport failed

aclrtProcessReport failed, ret = 107012
aclrtProcessReport failed, ret = 107012

Runtime log: halResourceIdAlloc xxx failed

[ERROR] DRV(2086,rtstest_host):2021-06-09-02:14:46.034.368 [ascend][curpid: 2086, 2086][drv][tsdrv][halResourceIdAlloc 477]id is exhausted, type(0 stream), range[0, 1024), dev_id(0), tsid(0).
[ERROR] RUNTIME(2086,rtstest_host):2021-06-09-02:14:46.034.380 [npu_driver.cc:285]2086 StreamIdAlloc:[driver interface] halResourceIdAlloc streamid failed: device_id=0, tsId=0, drvRetCode=48!
[ERROR] RUNTIME(2086,rtstest_host):2021-06-09-02:14:46.034.401 [stream.cc:448]2086 Setup:Failed to alloc stream id, retCode=0x702001a.
[ERROR] RUNTIME(2086,rtstest_host):2021-06-09-02:14:46.034.416 [context.cc:1251]2086 StreamCreate:Setup stream failed, retCode=0x702001a.
[ERROR] RUNTIME(2086,rtstest_host):2021-06-09-02:14:46.034.440 [logger.cc:211]2086 StreamCreate:Create stream failed, priority=7 ,flags=0.
[ERROR] RUNTIME(2086,rtstest_host):2021-06-09-02:14:46.034.458 [api_c.cc:461]2086 rtStreamCreateWithFlags:ErrCode=207008, desc=[driver error:no stream resource], InnerCode=0x702001a
[ERROR] RUNTIME(2086,rtstest_host):2021-06-09-02:14:46.034.469 [error_message_manage.cc:26]2086 ReportFuncErrorReason:rtStreamCreateWithFlags execute failed, reason=[driver error:no stream resource]
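
When the logs are long, it can help to pull out only the driver lines that report an exhausted resource. The following is a minimal Python sketch, not part of the Ascend toolchain. The helper name find_exhausted_resources and the regular expression are illustrative assumptions modeled only on the sample driver line above; other driver versions may format this message differently.

```python
import re

# Pattern modeled on the "id is exhausted" driver line in the sample log.
# Assumption: other driver versions may word or order these fields differently.
EXHAUSTED = re.compile(
    r"id is exhausted, type\((?P<type_id>\d+) (?P<type_name>\w+)\), "
    r"range\[(?P<low>\d+), (?P<high>\d+)\), "
    r"dev_id\((?P<dev_id>\d+)\), tsid\((?P<tsid>\d+)\)"
)


def find_exhausted_resources(log_text):
    """Return one dict per 'id is exhausted' line found in log_text."""
    hits = []
    for line in log_text.splitlines():
        match = EXHAUSTED.search(line)
        if match:
            hits.append({
                "resource": match.group("type_name"),
                "range": (int(match.group("low")), int(match.group("high"))),
                "dev_id": int(match.group("dev_id")),
                "tsid": int(match.group("tsid")),
            })
    return hits


# The driver line copied from the sample log above.
sample = (
    "[ERROR] DRV(2086,rtstest_host):2021-06-09-02:14:46.034.368 "
    "[ascend][curpid: 2086, 2086][drv][tsdrv][halResourceIdAlloc 477]"
    "id is exhausted, type(0 stream), range[0, 1024), dev_id(0), tsid(0)."
)

for hit in find_exhausted_resources(sample):
    print(hit)
```

Run against a full log file, this lists which resource type ran out on which device, which narrows down the cause before any further handling.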

Possible Cause

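Log analysis indicates that the restart may fail because resources such as public task IDs, stream IDs, and event IDs cannot be allocated. In the sample Runtime log, the driver reports that the stream IDs in range [0, 1024) on device 0 are exhausted (drvRetCode=48), and Runtime reports this as ErrCode=207008, "driver error:no stream resource".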

Procedure

To address the possible cause above, handle the problem as follows: