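Exception When Running mpirun on Multiple Nodes
Symptom
- When the mpirun command is run on multiple nodes, nothing happens. The top command shows no mpirun process.
- When the mpirun command is run on multiple nodes, the following error messages are displayed: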
  [1632387881.405868] [arm-node88:57923:0] mm_posix.c:194 UCX ERROR shm_open(file_name=/ucx_shm_posix_23f3f65f flags=0xc2) failed: Permission denied
  [1632387881.405910] [arm-node88:57923:0] uct_mem.c:132 UCX ERROR failed to allocate 8447 bytes using md posix for mm_recv_fifo: Shared memory error
  [1632387881.405917] [arm-node88:57923:0] mm_iface.c:605 UCX ERROR mm_iface failed to allocate receive FIFO
  [arm-node88:57923] coll_ucx_component.c:360 Warning: Failed to create UCG worker, automatically select other available and highest priority collective component.
  [1632387881.411347] [arm-node88:57923:0] mm_posix.c:194 UCX ERROR shm_open(file_name=/ucx_shm_posix_6ae5143e flags=0xc2) failed: Permission denied
  [1632387881.411359] [arm-node88:57923:0] uct_mem.c:132 UCX ERROR failed to allocate 8447 bytes using md posix for mm_recv_fifo: Shared memory error
  [1632387881.411366] [arm-node88:57923:0] mm_iface.c:605 UCX ERROR mm_iface failed to allocate receive FIFO
  [arm-node88:57923] pml_ucx.c:274 Error: Failed to create UCP worker
  [arm-node88:57923] *** An error occurred in MPI_Allreduce
  [arm-node88:57923] *** reported by process [878510081,70368744177671]
  [arm-node88:57923] *** on communicator MPI_COMM_WORLD
  [arm-node88:57923] *** MPI_ERR_INTERN: internal error
  [arm-node88:57923] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
  [arm-node88:57923] ***    and potentially your MPI job)
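When the output is long, it can help to pull out only the UCX ERROR entries and see which node and source file reported them. The following is a minimal Python sketch, not part of Hyper MPI. The function name ucx_errors and the regular expression are illustrative assumptions based only on the line layout of the messages above (one message per line).

```python
import re

# Assumed layout, taken from the messages above:
# [timestamp] [host:pid:tid] file.c:line UCX ERROR message
UCX_ERROR = re.compile(
    r"\[(?P<ts>\d+\.\d+)\]\s+\[(?P<host>[^:\]]+):(?P<pid>\d+):\d+\]\s+"
    r"(?P<src>\S+):(?P<line>\d+)\s+UCX\s+ERROR\s+(?P<msg>.*)"
)


def ucx_errors(log_text):
    """Return (host, source_file, message) for each UCX ERROR line."""
    found = []
    for line in log_text.splitlines():
        match = UCX_ERROR.search(line)
        if match:
            found.append(
                (match.group("host"), match.group("src"), match.group("msg").strip())
            )
    return found


# Sample input copied from the error output above.
sample = """\
[1632387881.405868] [arm-node88:57923:0] mm_posix.c:194 UCX ERROR shm_open(file_name=/ucx_shm_posix_23f3f65f flags=0xc2) failed: Permission denied
[1632387881.405910] [arm-node88:57923:0] uct_mem.c:132 UCX ERROR failed to allocate 8447 bytes using md posix for mm_recv_fifo: Shared memory error
[1632387881.405917] [arm-node88:57923:0] mm_iface.c:605 UCX ERROR mm_iface failed to allocate receive FIFO
[arm-node88:57923] pml_ucx.c:274 Error: Failed to create UCP worker
"""

for host, src, msg in ucx_errors(sample):
    print(host, src, msg)
```

Lines that do not carry the UCX ERROR tag, such as the pml_ucx.c message, are skipped by this sketch.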
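Possible Causes
When the mpirun command is run on multiple nodes, some of the nodes cannot communicate with the others.
Recovery Procedure
- Use PuTTY to log in to the job execution node as a common Hyper MPI user, for example, hmpi_user.
- You are advised to install Hyper MPI in a mounted shared directory.
- Check whether the environment variables are configured correctly. For details, see "Configuring Environment Variables".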
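One rough way to test for this condition is to check, from the job execution node, whether every other node accepts a TCP connection. The Python sketch below assumes that the nodes are reached over SSH on port 22, which this document does not state. The function name unreachable_hosts is hypothetical and is not part of Hyper MPI.

```python
import socket


def unreachable_hosts(hosts, port=22, timeout=3.0):
    """Return the hosts that do not accept a TCP connection on `port`.

    Port 22 (SSH) is an assumption; use the port your cluster relies on.
    """
    failed = []
    for host in hosts:
        try:
            # create_connection resolves the name and opens a TCP connection.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            # Covers name resolution failures, refusals, and timeouts.
            failed.append(host)
    return failed
```

Calling unreachable_hosts with the node names from your host list returns the subset that could not be reached. An empty list means the TCP check passed for every node, although it does not prove that password-free SSH login or the high-speed interconnect is working.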
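The environment variable check in the last step can be sketched as a small Python script to run on each node. It assumes that the installation directory contains bin and lib subdirectories that must appear in PATH and LD_LIBRARY_PATH. That layout and the function name check_mpi_environment are assumptions made for illustration; the authoritative settings are the ones described in "Configuring Environment Variables".

```python
import os
import shutil


def _dirs(value):
    """Split a PATH-style string into normalized directory names."""
    return [os.path.normpath(d) for d in value.split(os.pathsep) if d]


def check_mpi_environment(install_dir, env=None):
    """Return a list of problems found in PATH and LD_LIBRARY_PATH.

    Assumes a hypothetical layout of <install_dir>/bin and <install_dir>/lib.
    """
    env = os.environ if env is None else env
    bin_dir = os.path.normpath(os.path.join(install_dir, "bin"))
    lib_dir = os.path.normpath(os.path.join(install_dir, "lib"))
    problems = []
    if bin_dir not in _dirs(env.get("PATH", "")):
        problems.append("PATH does not contain " + bin_dir)
    if lib_dir not in _dirs(env.get("LD_LIBRARY_PATH", "")):
        problems.append("LD_LIBRARY_PATH does not contain " + lib_dir)
    # Confirm that an executable mpirun is actually found on that PATH.
    if shutil.which("mpirun", path=env.get("PATH", "")) is None:
        problems.append("mpirun not found on PATH")
    return problems
```

Running the same check on every node, with the shared installation directory as the argument, shows whether a node is missing the settings. An empty result means that none of the three checks failed on that node.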