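Abnormal Termination Caused by an Application Process Exceeding the Memory Limit
Symptom
Take the ATC tool as an example. When a virtual memory limit is enabled and atc tries to use more memory than the limit allows, it reports an error. The error output typically looks like this: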
ATC start working now, please wait for a moment.
terminate called after throwing an instance of 'std::system_error'
  what(): Resource temporarily unavailable
/usr/local/Ascend/ascend-toolkit/latest/bin/atc: line 17: 44752 Aborted (core dumped) ${PKG_PATH}/bin/atc.bin "$@"
(base) root@davinci-mini:/home/lanxi-samples/samples/cplusplus/level1_single_api/4_op_dev/2_verify_op/acl_execute_add/run/out# Process ForkServerProcess-2:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/root/miniconda3/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 61, in wrapper
    raise exp
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 58, in wrapper
    func(*args, **kwargs)
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 275, in task_distribute
    key, func_name, detail = resource_proxy[TASK_QUEUE].get()
  File "<string>", line 2, in get
  File "/root/miniconda3/lib/python3.9/multiprocessing/managers.py", line 809, in _callmethod
    kind, result = conn.recv()
  File "/root/miniconda3/lib/python3.9/multiprocessing/connection.py", line 255, in recv
    buf = self._recv_bytes()
  File "/root/miniconda3/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/root/miniconda3/lib/python3.9/multiprocessing/connection.py", line 388, in _recv
    raise EOFError
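Solution
- Press Ctrl+C to stop the current process.
If the system appears to hang, press Ctrl+C immediately in the SSH client to stop the current process.
Note: The system may take up to 50 seconds to recover from the unresponsive state and start responding again.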
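- Use the ulimit command to set an upper limit on memory usage before starting memory-intensive processes such as atc, so that memory exhaustion does not bring the system down.
In testing so far, limiting virtual memory usage alone has also helped prevent the system from going down.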
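- Enable the virtual memory limit
ulimit -v 2048000
2048000 is the virtual memory limit in KB, that is, approximately 2 GB.
To check whether the virtual memory limit has taken effect, run ulimit -a or simply ulimit -v.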
Output of ulimit -a:
(base) root@davinci-mini:~# ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 12584
max locked memory           (kbytes, -l) 449876
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1024
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 12584
virtual memory              (kbytes, -v) 2048000
file locks                          (-x) unlimited
Output of ulimit -v:
(base) root@davinci-mini:~# ulimit -v
2048000
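- Disable the virtual memory limit
Once you are sure you no longer need to run atc or other programs that risk bringing the system down, you can disable the memory limit.
ulimit -v unlimited
To confirm that the limit has been removed, run ulimit -a again and check the details.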
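The enable and disable steps above change the limit for the whole shell session. As an alternative, the limit can be scoped to a single command by setting it inside a subshell, so the interactive shell keeps its original setting and nothing has to be switched off afterwards. The following is only a minimal sketch: run_with_vmem_limit is a hypothetical helper name, not part of the Ascend toolkit, and the limit value is the one used in the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption for illustration, not an Ascend tool).
# Usage: run_with_vmem_limit <limit_in_kb> <command> [args...]
run_with_vmem_limit() {
    local limit_kb="$1"
    shift
    (
        # The parentheses start a subshell: the limit set here applies only
        # to the subshell and the command it launches, not to the caller.
        ulimit -v "$limit_kb" || exit 1
        "$@"
    )
}
```

For example, run_with_vmem_limit 2048000 followed by the usual atc command line would run atc under the limit. Running ulimit -v afterwards in the same terminal should still print the value it had before the call.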