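Incorrect Host Name Specified
Symptom
When an MPI job is submitted, the mpirun command fails because a host name specified in the hostfile is incorrect.
The following is an example of the failure: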
$ mpirun -np 16 -N 2 --hostfile ~/hmpifile_2021/hostfile/hf8 -x UCX_NET_DEVICES=mlx5_0:1 /home/hmpi_user/hmpifile_2021/allreduce/AllReduce
ssh: Could not resolve hostname arm-node056: Name or service not known
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:
my node: arm-node056
target node: arm-node011
This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
[arm-node132:2640389] 5 more processes have sent help message help-errmgr-base.txt / no-path
[arm-node132:2640389] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
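The host name to correct is the one on the "ssh: Could not resolve hostname" line of the output. For long logs, a small filter can list those names. The following is a minimal sketch: the function name unresolved_hosts is hypothetical, and it assumes the ssh message has exactly the wording shown in the example above, which may differ between ssh versions.

```shell
# List the host names that ssh reported as unresolvable.
# Reads captured mpirun output on standard input and prints each name once.
# Assumes the message format "ssh: Could not resolve hostname <name>: ...".
unresolved_hosts() {
    sed -n 's/^ssh: Could not resolve hostname \([^:]*\):.*/\1/p' | sort -u
}
```

For example, the output of the failing run can be piped into it by appending `2>&1 | unresolved_hosts` to the mpirun command.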
Possible Cause
The host name specified in the hf8 file passed to the mpirun command does not exist on the local area network.
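To confirm this cause before rerunning the job, each entry in the hostfile can be checked against the name resolver. The following is a minimal sketch, not part of Hyper MPI: the function name check_hostfile is hypothetical, it assumes the common Open MPI hostfile layout (one host per line, optional fields such as slots=N after the name, "#" for comments), and it relies on the getent tool available on glibc-based Linux systems. It only reflects name resolution on the node where it is run.

```shell
# Report hostfile entries that the local resolver cannot resolve.
# Usage: check_hostfile <path-to-hostfile>
# Returns 0 if every host resolves, 1 if at least one does not.
check_hostfile() {
    hostfile="$1"
    status=0
    # The "|| [ -n "$host" ]" part handles a last line without a trailing newline.
    while read -r host rest || [ -n "$host" ]; do
        # Skip blank lines and comment lines.
        case "$host" in
            ''|'#'*) continue ;;
        esac
        # getent consults the same sources as the system resolver
        # (/etc/hosts, DNS, and so on, as configured in nsswitch.conf).
        if ! getent hosts "$host" >/dev/null 2>&1; then
            echo "unresolvable: $host"
            status=1
        fi
    done < "$hostfile"
    return "$status"
}
```

For the example above, the check would be run as `check_hostfile ~/hmpifile_2021/hostfile/hf8`. Any name it prints is a candidate for correction in the hostfile.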
Recovery Procedure
- Use PuTTY to log in to the job execution node as a common Hyper MPI user (for example, "hmpi_user").
- Modify the "hf8" file to correct the invalid host name (arm-node056 in the example above).
- Run the following command to verify that the "hf8" file has been modified correctly.
mpirun -np 16 -N 2 --hostfile ~/hmpifile_2021/hostfile/hf8 -x UCX_NET_DEVICES=mlx5_0:1 ~/hmpifile_2021/allreduce/AllReduce
- ~/hmpifile_2021/allreduce: the path of the job to run.
- AllReduce: the job to run. Change it as required.
If the following output is displayed, the "hf8" file has been modified successfully.
All tests are success