Timeout error caused by improper NIC configuration: connect... failed: Connection timed out

Symptom

When an MPI job is run across multiple nodes, the job fails to run and returns the following error messages:

[1703209660.081479] [node167:2042308:0]           sock.c:272 UCX ERROR connect(fd=37, dest_addr=66.66.66.168:49703) failed: Connection timed out 
[node167:2042308] pml_ucx.c:426  Error: ucp_ep_create(proc=0) failed: Destination is unreachable
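
To confirm which address the failing node tried to reach, the destination can be pulled out of the UCX error line and probed over plain TCP. The following is a minimal Python sketch, not part of any UCX or Open MPI tooling; the function names parse_ucx_connect_error and probe_tcp are hypothetical, and the regular expression assumes the log line has the same layout as the sample above.

```python
import re
import socket

# Assumed layout, taken from the sample log line above:
#   ... UCX ERROR connect(fd=N, dest_addr=A.B.C.D:PORT) failed: REASON
UCX_CONNECT_ERROR = re.compile(
    r"UCX\s+ERROR\s+connect\(fd=\d+,\s*"
    r"dest_addr=(?P<host>[0-9.]+):(?P<port>\d+)\)\s+"
    r"failed:\s*(?P<reason>.+?)\s*$"
)


def parse_ucx_connect_error(line):
    """Return (host, port, reason) for a UCX connect error line, else None."""
    match = UCX_CONNECT_ERROR.search(line)
    if match is None:
        return None
    return match.group("host"), int(match.group("port")), match.group("reason")


def probe_tcp(host, port, timeout=3.0):
    """Attempt a TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except (socket.timeout, TimeoutError):
        return "timeout"
    except OSError:
        return "error"


if __name__ == "__main__":
    sample = (
        "[1703209660.081479] [node167:2042308:0]           sock.c:272 UCX ERROR "
        "connect(fd=37, dest_addr=66.66.66.168:49703) failed: Connection timed out "
    )
    parsed = parse_ucx_connect_error(sample)
    print(parsed)
    if parsed is not None:
        host, port, _reason = parsed
        # Run this on the node that reported the error (node167 in the sample).
        print(probe_tcp(host, port))
```

The port in the log is assigned at run time, so after the job has exited a "refused" result usually still means the address answered, whereas a "timeout" result reproduces the symptom and suggests the address is not reachable from this node.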

Possible Causes

Recovery Procedure