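Allgatherv Algorithm 4 Fails at Large np When TCP Transport Is Specified
Symptom
When the Allgatherv collective runs algorithm 4 with the transport mode set to tcp, launching processes on every core of multiple nodes fails with the following errors: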
[autotest1@hmpi01 ~]$ mpirun --allow-run-as-root --timeout 350 -np 1024 -N 128 --hostfile ~/hmpifile_2021/hostfile/hf8 -x UCX_TLS=tcp -x UCG_PLANC_UCX_ALLGATHERV_ATTR=I:4 ~/hmpifile_2021/allgatherv/allgatherv
Authorized users only. All activities may be monitored and reported.
[hmpi03:04566] pml_ucx.c:428 Error: ucp_ep_create(proc=514) failed: Destination is unreachable
[hmpi03:04563] pml_ucx.c:428 Error: ucp_ep_create(proc=515) failed: Destination is unreachable
[1684893155.703980] [hmpi06:4797 :1] tcp_cm.c:749 UCX WARN tcp_iface 0x3c6e6b10: connection establishment for socket fd 755 from <invalid address family> to 192.168.0.1:51311 was unsuccessful
[1684893155.703997] [hmpi06:4797 :1] tcp_cm.c:749 UCX WARN tcp_iface 0x3c6e6b10: connection establishment for socket fd 755 from <invalid address family> to 192.168.0.1:51311 was unsuccessful
[1684893256.847887] [hmpi05:11722:0] tcp_cm.c:705 UCX ERROR tcp_ep 0x86843a0: reached maximum number of connection retries (25)
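Possible Causes
Establishing the TCP connections generates a heavy soft-interrupt (softirq) load. The CPUs cannot service it in time, so TCP connection establishment times out. For the linear Allgatherv algorithm this is a limitation of TCP itself and is expected behavior.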
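To give a sense of scale, the following is a rough sketch of how many TCP connections the failing run could require. It assumes that the linear algorithm has every rank exchange data directly with every other rank and that each pair of ranks uses one TCP connection; the actual connection pattern inside UCX may differ. The variable names are illustrative, and the values of np and ppn are taken from the -np and -N options in the command above.

```shell
#!/bin/sh
# Rough estimate only. Assumption: full pairwise exchange between ranks,
# one TCP connection per pair of ranks.
np=1024    # total ranks (-np in the failing command)
ppn=128    # ranks per node (-N in the failing command)

peers_per_rank=$(( np - 1 ))                    # every other rank
endpoints_per_node=$(( ppn * peers_per_rank ))  # endpoints one node must handle
total_pairs=$(( np * (np - 1) / 2 ))            # distinct rank pairs job-wide

echo "peers per rank:     $peers_per_rank"
echo "endpoints per node: $endpoints_per_node"
echo "total rank pairs:   $total_pairs"
```

Under these assumptions a single node has to handle on the order of 130,000 connection endpoints in a short burst, which is consistent with the softirq overload described above.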
Recovery Procedure
When Allgatherv uses the TCP transport mode, prefer an algorithm other than algorithm 4.
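As a minimal sketch of this recovery step, the script below rebuilds the failing command with a different algorithm ID in UCG_PLANC_UCX_ALLGATHERV_ATTR. ALGO_ID and build_cmd are hypothetical names introduced here for illustration, and the default value 2 is only a placeholder: check which Allgatherv algorithm IDs your Hyper MPI release supports before using one. The script only prints the command so that it can be reviewed before it is run.

```shell
#!/bin/sh
# ALGO_ID is a placeholder (assumption): replace it with an Allgatherv
# algorithm ID, other than 4, that your Hyper MPI release documents.
ALGO_ID="${ALGO_ID:-2}"

# Build the command line. It is identical to the failing run except for
# the algorithm ID passed in as the first argument.
build_cmd() {
    echo "mpirun --allow-run-as-root --timeout 350 -np 1024 -N 128" \
         "--hostfile $HOME/hmpifile_2021/hostfile/hf8" \
         "-x UCX_TLS=tcp" \
         "-x UCG_PLANC_UCX_ALLGATHERV_ATTR=I:$1" \
         "$HOME/hmpifile_2021/allgatherv/allgatherv"
}

# Print the command for review; run it manually once the ID is confirmed.
build_cmd "$ALGO_ID"
```

Keeping the rest of the command unchanged makes it easier to confirm that the algorithm choice, rather than some other setting, is what resolves the connection timeouts.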