
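mpirun Fails When Run on Multiple Nodes

Symptom

  • When the mpirun command is run on multiple nodes, nothing happens. The top command shows no mpirun process.
  • When the mpirun command is run on multiple nodes, the following error information is displayed: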
    [1632387881.405868] [arm-node88:57923:0]       mm_posix.c:194  UCX  ERROR shm_open(file_name=/ucx_shm_posix_23f3f65f flags=0xc2) failed: Permission denied
    [1632387881.405910] [arm-node88:57923:0]        uct_mem.c:132  UCX  ERROR failed to allocate 8447 bytes using md posix for mm_recv_fifo: Shared memory error
    [1632387881.405917] [arm-node88:57923:0]       mm_iface.c:605  UCX  ERROR mm_iface failed to allocate receive FIFO
    [arm-node88:57923] coll_ucx_component.c:360  Warning: Failed to create UCG worker, automatically select other available and highest priority collective component.
    [1632387881.411347] [arm-node88:57923:0]       mm_posix.c:194  UCX  ERROR shm_open(file_name=/ucx_shm_posix_6ae5143e flags=0xc2) failed: Permission denied
    [1632387881.411359] [arm-node88:57923:0]        uct_mem.c:132  UCX  ERROR failed to allocate 8447 bytes using md posix for mm_recv_fifo: Shared memory error
    [1632387881.411366] [arm-node88:57923:0]       mm_iface.c:605  UCX  ERROR mm_iface failed to allocate receive FIFO
    [arm-node88:57923] pml_ucx.c:274  Error: Failed to create UCP worker
    [arm-node88:57923] *** An error occurred in MPI_Allreduce
    [arm-node88:57923] *** reported by process [878510081,70368744177671]
    [arm-node88:57923] *** on communicator MPI_COMM_WORLD
    [arm-node88:57923] *** MPI_ERR_INTERN: internal error
    [arm-node88:57923] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
    [arm-node88:57923] ***    and potentially your MPI job)

Possible Causes

When the mpirun command is run on multiple nodes, some nodes cannot communicate with the other nodes.
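
One way to narrow down which nodes are affected is to probe each node listed in the hostfile from the node where mpirun is launched. The following is a minimal sketch, not part of Hyper MPI. It assumes an Open MPI style hostfile (one host per line, optionally followed by "slots=N") and that the nodes accept SSH connections on TCP port 22. The function names parse_hostfile, can_connect, and unreachable_nodes are hypothetical helpers introduced for illustration.

```python
import socket


def parse_hostfile(text):
    """Return host names from hostfile text.

    Assumed format: one host per line, optional "slots=N" after the
    host name, and "#" starting a comment.
    """
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if line:
            hosts.append(line.split()[0])     # first token is the host name
    return hosts


def can_connect(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds (port 22 is an assumption)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def unreachable_nodes(hosts, probe=can_connect):
    """Return the hosts for which the probe fails, in input order."""
    return [host for host in hosts if not probe(host)]
```

For example, calling unreachable_nodes(parse_hostfile(open("hostfile").read())) on the launch node returns the list of nodes that could not be reached. The file name "hostfile" is a placeholder. A successful TCP probe only shows basic network reachability; it does not confirm that password-free SSH login or the MPI transport itself works.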

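Recovery Procedure

  1. Use PuTTY to log in to the job execution node as a common Hyper MPI user, for example, "hmpi_user".
  2. You are advised to install Hyper MPI in a mounted shared directory.
  3. Check whether the environment variables are configured correctly. For details, see "Configuring Environment Variables".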
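
The environment variable check in the last step can be scripted. The following is a minimal sketch that assumes the installation prefix contains "bin" and "lib" subdirectories that must appear in PATH and LD_LIBRARY_PATH respectively. That layout and the function name missing_env_entries are assumptions made for illustration; the authoritative list of variables is the one given in the environment variable configuration instructions.

```python
import posixpath


def missing_env_entries(env, prefix):
    """Report which expected directories are absent from the search paths.

    env    : mapping of variable names to values, for example os.environ
    prefix : installation prefix (assumed to hold "bin" and "lib")
    """
    expected = {
        "PATH": posixpath.join(prefix, "bin"),
        "LD_LIBRARY_PATH": posixpath.join(prefix, "lib"),
    }
    problems = []
    for var, path in expected.items():
        # Normalize entries so that trailing slashes do not cause false alarms.
        entries = [posixpath.normpath(p) for p in env.get(var, "").split(":") if p]
        if posixpath.normpath(path) not in entries:
            problems.append(f"{var} does not contain {path}")
    return problems
```

Running missing_env_entries(os.environ, prefix) on each node (after importing os, with prefix set to the actual installation directory) returns an empty list when both directories are present. Because mpirun starts processes on remote nodes, it may help to run the check in the same kind of non-interactive shell that the remote launch uses.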