The mpirun command fails because an incorrect network device name was specified when the MPI job was submitted.
Example of a failed run:
mpirun -np 8 -N 1 --hostfile ~/hmpifile_2021/hostfile/hf8 -x UCX_NET_DEVICES=mlx5_0:2 ~/hmpifile_2021/allreduce/AllReduce
[1632383945.549496] [arm-node132:2635376:0] ucp_context.c:732 UCX WARN network device 'mlx5_0:2' is not available, please use one or more of: 'enp189s0f0'(tcp), 'enp1s0'(tcp), 'mlx5_0:1'(ib)
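The network device name passed to mpirun through UCX_NET_DEVICES is wrong. In this example, mlx5_0:2 (port 2 of mlx5_0) is not available on the node; the warning lists mlx5_0:1 as the only available IB device.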
Run ibdev2netdev to check which devices and ports are available:
ibdev2netdev
mlx5_0 port 1 ==> enp1s0 (Up)
Run mpirun again with a device name that is reported as available (here mlx5_0:1):
mpirun -np 16 -N 2 --hostfile ~/hmpifile_2021/hostfile/hf8 -x UCX_NET_DEVICES=mlx5_0:1 ~/hmpifile_2021/allreduce/AllReduce
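
To avoid typing the device name by hand, the value for UCX_NET_DEVICES can be derived from the ibdev2netdev output. The following is only a minimal sketch: the helper name up_ucx_devices is hypothetical, and it assumes the output format seen in this example ("<device> port <n> ==> <netdev> (Up)"). It runs on a sample line copied from the output above, so it can be tried without InfiniBand hardware.

```shell
# Hypothetical helper: read ibdev2netdev-style output on stdin and print
# one "device:port" entry for each port reported as Up.
# Assumes lines of the form: <device> port <n> ==> <netdev> (Up)
up_ucx_devices() {
    awk '$2 == "port" && $NF == "(Up)" { print $1 ":" $3 }'
}

# Sample line copied from the output in this article; on a real node,
# pipe the output of ibdev2netdev into the helper instead.
sample='mlx5_0 port 1 ==> enp1s0 (Up)'

# Join multiple entries with commas, the list form UCX_NET_DEVICES accepts.
devices=$(printf '%s\n' "$sample" | up_ucx_devices | paste -sd, -)
echo "UCX_NET_DEVICES=$devices"
```

The resulting value can then be passed to mpirun with -x UCX_NET_DEVICES=$devices. If several ports are Up, all of them are listed, so narrow the list manually when only one device should be used.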