Number of Job Processes Is Too Large

Symptom

The number of processes requested for an MPI job exceeds the total number of CPU cores on the cluster's job execution nodes, so the mpirun command fails.

Example of the failure:

$ mpirun -np 1025 --hostfile hf8 hmpifile_2021/allreduce/AllReduce
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 1025
slots that were requested by the application:
 
  hmpifile_2021/allreduce/AllReduce
 
Either request fewer slots for your application, or make more slots
available for use.
 
A "slot" is the Open MPI term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which Open MPI processes are run:
 
  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, Open MPI defaults to the number of processor cores
 
In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.
 
Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------

"1025" is the number of processes requested for the MPI job.
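
As the error text itself notes, besides reducing the value of -np there are two workarounds when running more processes than cores is acceptable for the workload. A minimal sketch, reusing the hostfile hf8 and the binary path from the failing command above (the hostname node01 below is hypothetical):

    # Workaround 1: tell Open MPI to ignore the available slot count.
    # Oversubscribed ranks share cores, so performance may degrade.
    $ mpirun -np 1025 --oversubscribe --hostfile hf8 hmpifile_2021/allreduce/AllReduce

    # Workaround 2: raise the slot count per node with a "slots=N" clause
    # in the hostfile, for example a line such as:
    #   node01 slots=129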

Possible Causes

The number of processes specified in the mpirun command exceeds the total number of CPU cores on the cluster's job execution nodes.

Recovery Procedure

  1. Use PuTTY to log in to a job execution node as a Hyper MPI common user, for example, hmpi_user.
  2. Run the following command to query the number of CPU cores on each job execution node:

    lscpu

    Architecture:          aarch64
    Byte Order:            Little Endian
    CPU(s):                128
    On-line CPU(s) list:   0-127
    Thread(s) per core:    1
    Core(s) per socket:    64
    Socket(s):             2
    NUMA node(s):          4
    Model:                 0
    CPU max MHz:           2600.0000
    CPU min MHz:           200.0000
    BogoMIPS:              200.00
    L1d cache:             64K
    L1i cache:             64K
    L2 cache:              512K
    L3 cache:              65536K
    NUMA node0 CPU(s):     0-31
    NUMA node1 CPU(s):     32-63
    NUMA node2 CPU(s):     64-95
    NUMA node3 CPU(s):     96-127
    Flags:                 fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop
  3. Calculate the total number of CPU cores across the cluster's job execution nodes and make sure the number of processes passed to mpirun is less than or equal to that total. In the example output above, each node has 128 cores (2 sockets × 64 cores per socket, with one thread per core), so eight such nodes provide 1024 cores in total; a sketch that computes this sum is given after this procedure. Taking eight nodes with 1024 total cores as an example, the following command submits the MPI job successfully:
    mpirun -np 1024 --hostfile hf8 hmpifile_2021/allreduce/AllReduce
    All tests are success
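
The total used in step 3 can be computed rather than tallied by hand. A minimal sketch, assuming the hostfile hf8 lists one hostname per line (possibly followed by a "slots=N" clause) and that passwordless SSH to each node is configured; both assumptions go beyond the original procedure:

    # Sum the core counts reported by every node listed in the hostfile.
    # With one thread per core (as in the lscpu output above), nproc
    # reports the number of physical cores on each node.
    total=0
    for host in $(awk '{print $1}' hf8); do
        cores=$(ssh "$host" nproc)
        total=$((total + cores))
    done
    echo "Total cores across the cluster: ${total}"   # upper bound for -np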