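"bash: orted: command not found" Error

Symptom

With Open MPI 4.1.5 installed, running the performance test command fails with the error "bash: orted: command not found", as shown below:

bash: orted: command not found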
--------------------------------------------------------------------------
A daemon (pid 8793) died unexpectedly with status 127 while attempting
to launch so we are aborting.
 
There may be more information reported by the environment (see above).
 
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.

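Possible Cause

Open MPI processes that have not exited remain in the cluster.

Solution

Use mpirun to terminate the residual Open MPI processes, as follows:
  1. Create a custom Hostfile, for example hostfile_1, with content in the following format: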
    worker-0 slots=1
    worker-1 slots=1
    worker-2 slots=1
    worker-3 slots=1
    ... ...
    worker-510 slots=1
    worker-511 slots=1

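    Here, "worker-0" through "worker-511" are the hostnames of the nodes in the cluster, and "slots=1" means that only one process is started on that node. The Hostfile must list every node that participates in collective communication.

  2. Run the following command to terminate the leftover Open MPI processes on all nodes in the cluster.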

    /usr/local/openmpi-4.1.5/bin/mpirun -hostfile hostfile_1 -n 512 pkill -9 -f openmpi

    • -n 512: 512 is the number of nodes in the cluster.
    • hostfile_1: the Hostfile defined in step 1.
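
Writing a 512-line Hostfile by hand is error-prone, and the value passed to -n has to match the number of entries. The following is a minimal sketch of how the two steps above could be scripted. It assumes the nodes really are named worker-0, worker-1, and so on, as in the example; the helper names gen_hostfile, count_nodes, and cleanup_cmd are hypothetical and not part of Open MPI. The last helper only prints the command so that it can be reviewed before it is run.

```shell
# Hypothetical helpers; not part of Open MPI.

# Print a Hostfile for N nodes named worker-0 .. worker-(N-1), one slot each.
# The worker-<i> naming is an assumption taken from the example above;
# replace it with the real hostnames of the cluster.
gen_hostfile() {
    gh_count="$1"
    gh_i=0
    while [ "$gh_i" -lt "$gh_count" ]; do
        printf 'worker-%d slots=1\n' "$gh_i"
        gh_i=$((gh_i + 1))
    done
}

# Count the non-empty lines of a Hostfile. With slots=1 on every line,
# this is the value to pass to -n.
count_nodes() {
    grep -c -v '^[[:space:]]*$' "$1"
}

# Print (do not execute) the cleanup command from step 2 for a given Hostfile.
cleanup_cmd() {
    printf '/usr/local/openmpi-4.1.5/bin/mpirun -hostfile %s -n %d pkill -9 -f openmpi\n' \
        "$1" "$(count_nodes "$1")"
}
```

For example, `gen_hostfile 512 > hostfile_1` followed by `cleanup_cmd hostfile_1` prints the command shown in step 2, which can then be checked and run manually.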