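"failed to bind memory" Is Reported at Run Time, but the Application Still Finishes Normally
Symptom
When an MPI job is run on a single node or on multiple nodes with -bind-to core specified, the warning "failed to bind memory" is reported: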
    --------------------------------------------------------------------------
    WARNING: Open MPI tried to bind a process but failed.  This is a
    warning only; your job will continue, though performance may
    be degraded.

      Local host:        node175
      Application name:  /root/tqa/osu-micro-benchmarks-7.1-1/c/mpi/collective/blocking/osu_bcast
      Error message:     failed to bind memory
      Location:          rtc_hwloc.c:447

    --------------------------------------------------------------------------
    [node175:06049] 63 more processes have sent help message help-orte-odls-default.txt / memory not bound
    [node175:06049] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
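Possible Cause
The memory slots on the node are not fully populated, so some NUMA nodes have no local memory. You can check this by running the numactl --hardware command: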
    available: 4 nodes (0-3)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
    node 0 size: 31566 MB
    node 0 free: 30548 MB
    node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
    node 1 size: 0 MB
    node 1 free: 0 MB
    node 2 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
    node 2 size: 31049 MB
    node 2 free: 29424 MB
    node 3 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
    node 3 size: 0 MB
    node 3 free: 0 MB
    node distances:
    node   0   1   2   3
      0:  10  12  35  37
      1:  12  10  37  40
      2:  35  37  10  12
      3:  37  40  12  10
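In this output, node 1 and node 3 have no memory (size: 0 MB), so processes bound to cores 32-63 or 96-127 cannot bind to local memory.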
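The check above can also be scripted. The following is a minimal sketch, not part of the original procedure: it assumes the numactl --hardware output uses the "node N size: M MB" layout shown in this FAQ, and the function name memoryless_numa_nodes is hypothetical.

```python
import re
import subprocess


def memoryless_numa_nodes(numactl_output):
    """Return the IDs of NUMA nodes that report a size of 0 MB.

    Assumes lines of the form 'node <id> size: <n> MB', as in the
    numactl --hardware output shown in this FAQ.
    """
    nodes = []
    for line in numactl_output.splitlines():
        match = re.match(r"\s*node (\d+) size: (\d+) MB", line)
        if match and int(match.group(2)) == 0:
            nodes.append(int(match.group(1)))
    return nodes


if __name__ == "__main__":
    # Run numactl on the local node and list the NUMA nodes without memory.
    output = subprocess.run(
        ["numactl", "--hardware"],
        capture_output=True, text=True, check=True,
    ).stdout
    print(memoryless_numa_nodes(output))
```

An empty list means every NUMA node has memory, in which case the warning probably has a different cause.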
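Recovery Procedure
- If the number of processes is small, try specifying a rankfile so that no process is bound to a NUMA node that has no memory.
- Populate all memory slots.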
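As an illustration of the rankfile option, the sketch below generates a rankfile that uses only cores on NUMA nodes with memory. It is an example under stated assumptions, not the documented procedure: the helper make_rankfile and the output file name rankfile are hypothetical, the core ranges 0-31 and 64-95 are taken from the numactl output in this FAQ, and the "rank N=host slot=C" line format is the basic Open MPI rankfile syntax. Depending on the Open MPI version and settings, slot numbers may be interpreted as logical rather than OS core indexes, so verify the resulting binding (for example with --report-bindings) before relying on it.

```python
def make_rankfile(host, cpus, nranks):
    """Build rankfile text that places rank i on cpus[i] of a single host.

    host   -- host name as it appears in the hostfile (for example, node175)
    cpus   -- core IDs that belong to NUMA nodes with memory
    nranks -- number of MPI ranks to place on this host
    """
    if nranks > len(cpus):
        raise ValueError("more ranks requested than usable cores")
    lines = ["rank {}={} slot={}".format(r, host, cpus[r]) for r in range(nranks)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Cores of NUMA node 0 and node 2, the only nodes with memory
    # in the example output above (assumption: adjust for your system).
    usable = list(range(0, 32)) + list(range(64, 96))
    with open("rankfile", "w") as f:
        f.write(make_rankfile("node175", usable, 64))
```

The generated file could then be passed to mpirun with the -rf (--rankfile) option in place of -bind-to core. With this layout, at most 64 ranks fit on the node, which is why the approach suits only small process counts.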