Allgatherv Algorithm 4 Fails over TCP Transport When Run with a Large -np

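Symptom

When the Allgatherv collective runs with algorithm 4 over the TCP transport, and processes are launched on every core across multiple nodes (here 1024 processes, 128 per node), the job reports errors: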

[autotest1@hmpi01 ~]$ mpirun --allow-run-as-root --timeout 350 -np 1024  -N 128  --hostfile ~/hmpifile_2021/hostfile/hf8  -x UCX_TLS=tcp  -x UCG_PLANC_UCX_ALLGATHERV_ATTR=I:4 ~/hmpifile_2021/allgatherv/allgatherv
 
Authorized users only. All activities may be monitored and reported.
[hmpi03:04566] pml_ucx.c:428  Error: ucp_ep_create(proc=514) failed: Destination is unreachable
[hmpi03:04563] pml_ucx.c:428  Error: ucp_ep_create(proc=515) failed: Destination is unreachable
[1684893155.703980] [hmpi06:4797 :1]         tcp_cm.c:749  UCX  WARN  tcp_iface 0x3c6e6b10: connection establishment for socket fd 755 from <invalid address family> to 192.168.0.1:51311 was unsuccessful
[1684893155.703997] [hmpi06:4797 :1]         tcp_cm.c:749  UCX  WARN  tcp_iface 0x3c6e6b10: connection establishment for socket fd 755 from <invalid address family> to 192.168.0.1:51311 was unsuccessful
[1684893256.847887] [hmpi05:11722:0]         tcp_cm.c:705  UCX  ERROR tcp_ep 0x86843a0: reached maximum number of connection retries (25)

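Possible Cause

Establishing the TCP connections generates a heavy soft-interrupt (softirq) load that the CPUs cannot keep up with, so connection establishment times out. For the linear Allgatherv algorithm this is a limitation of TCP itself and is expected behavior.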
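To get a feel for the scale involved, the following is a rough count for the failing run. It is only a sketch under two assumptions that the text above does not state: that the linear algorithm has every rank exchange data directly with every other rank, and that UCX opens one TCP connection per pair of ranks (with UCX_TLS=tcp, ranks on the same node also use TCP). The variable names are illustrative.

```shell
# Values taken from the failing command: -np 1024, -N 128.
NP=1024
PPN=128

# Number of nodes in the job.
NODES=$((NP / PPN))

# Assumption: in the linear algorithm each rank talks to all other ranks.
PEERS_PER_RANK=$((NP - 1))

# Connection endpoints that the CPUs of a single node have to service.
ENDPOINTS_PER_NODE=$((PPN * PEERS_PER_RANK))

# Assumption: one TCP connection per pair of ranks.
TOTAL_PAIRS=$((NP * (NP - 1) / 2))

echo "nodes=$NODES peers_per_rank=$PEERS_PER_RANK"
echo "endpoints_per_node=$ENDPOINTS_PER_NODE total_pairs=$TOTAL_PAIRS"
```

Under these assumptions each node has to service on the order of a hundred thousand connection endpoints within the startup window. That is consistent with the retry limit being reached in the log.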

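Recovery Procedure

When Allgatherv uses the TCP transport, prefer an algorithm other than the linear one (algorithm 4), for example by changing the algorithm ID passed in UCG_PLANC_UCX_ALLGATHERV_ATTR.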
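A minimal sketch of the rerun follows. It reuses the options from the failing command and changes only the algorithm ID. The default ID of 2 is a placeholder and an assumption, not a recommendation from this article; check the documentation for your UCG version for the valid Allgatherv algorithm IDs and which of them are non-linear.

```shell
# ALGO is a placeholder (assumption): replace it with an Allgatherv
# algorithm ID that your UCG version documents as non-linear.
ALGO="${ALGO:-2}"

# Same options as the failing run; only the algorithm ID differs.
CMD="mpirun --allow-run-as-root --timeout 350 -np 1024 -N 128 \
--hostfile $HOME/hmpifile_2021/hostfile/hf8 \
-x UCX_TLS=tcp \
-x UCG_PLANC_UCX_ALLGATHERV_ATTR=I:${ALGO} \
$HOME/hmpifile_2021/allgatherv/allgatherv"

# Print the command for review instead of launching it directly.
echo "$CMD"
```

After reviewing the printed command, it can be launched with `eval "$CMD"`. Printing first makes it easy to confirm the algorithm ID before starting a 1024-process job.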