Timeout Error Caused by a Missing ARP Entry: ibv_create_ah...failed: Connection timed out

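Symptom

When users submit large-scale MPI jobs, connection establishment frequently times out. The job output log contains a fatal "failed to get peer address" error followed by a backtrace: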

[agent363:373418:0:373418]    ud_iface.c:49   Fatal: iface 0x1ddfcb30: failed to get peer address
=== backtrace (tid: 373418) ====
0 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/libucs.so.0(ucs_fatal_error_message+0x38) [0x40012c420128]
1 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/libucs.so.0(+0x2025c) [0x40012c42025c]
2 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/ucx/libuct_ib.so.0(uct_ud_iface_cep_insert_ep+0) [0x40012c5157e0]
3 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/ucx/libuct_ib.so.0(uct_ud_ep_create_connected_common+0xd4) [0x40012c5184c4]
4 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/libucp.so.0(ucp_wireup_ep_connect_aux+0xc0) [0x400127f51be0]
5 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/libucp.so.0(ucp_wireup_ep_connect+0xe4)[0x400127f522e4]
6 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/libucp.so.0(ucp_wireup_init_lanes+0x8d4)[0x400127f53e94]
7 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/libucp.so.0(ucp_ep_create_to_worker_addr+0x78) [0x400127f1cf58]
8 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/libucp.so.0(ucp_ep_create+0x4b0) [0x400127f1dbbe]
9 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hmpi/lib/openmpi/mca_pml_ucx.so(+0x5940)[0x400127e55940]
10 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hmpi/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0x198) [0x400127e54304]
11 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hmpi/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0xbc)[0x4001266ffafc]
12 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hmpi/lib/libmpi.so.40(ompi_coll_base_sendrecv_intra_bruck+0xac)[0x4001266fe710]
13 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hmpi/lib/openmpi/mca_coll_ucx.so(+0x5b08) [0x40012cb25b08]
14 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hmpi/lib/libmpi.so.40(mca_coll_base_comm_select+0x880)[0x4001266f3d64]
15 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hmpi/lib/libmpi.so.40(ompi_mpi_init+0xe30) [0x40012672a7a4]
16 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hmpi/lib/libmpi.so.40(MPI_Init+0xa8) [0x4001266d8468]
17 hello_mpi() [0x4008c0]
18 /usr/lib64/libc.so.6(__libc_start_main+0xe0) [0x400126813f40]
19 hello_mpi() [0x4007dc]
================================

The job error log contains "Connection timed out" errors reported by ibv_create_ah:

[1693797506.380997] [agent289:3924149:0]        ib_device.c:1252 UCX  ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.168.1.115 sgid_index=5 traffic_class=106) on mlx5_0 failed: Connection timed out
[1693797506.384196] [agent276:1958414:0]        ib_device.c:1252 UCX  ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.168.1.62 sgid_index=5 traffic_class=106) on mlx5_0 failed: Connection timed out
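When many ranks fail at once, it can help to list which peer addresses each node could not reach. The following is a minimal Python sketch, not part of the original procedure. It assumes the log lines follow the UCX format shown above, with the peer IPv4 address embedded in the dgid field as ::ffff:a.b.c.d. The function name unreachable_peers is hypothetical.

```python
import re

# Matches UCX "ibv_create_ah(...) failed: Connection timed out" lines and
# captures the reporting host and the IPv4 address embedded in dgid.
AH_TIMEOUT = re.compile(
    r"\[(?P<host>[^:\]]+):\d+:\d+\].*?ibv_create_ah\(.*?"
    r"dgid=::ffff:(?P<peer>\d{1,3}(?:\.\d{1,3}){3}).*?"
    r"failed: Connection timed out"
)


def unreachable_peers(log_text):
    """Return {reporting_host: sorted list of peer IPs} for AH-creation timeouts."""
    result = {}
    for line in log_text.splitlines():
        match = AH_TIMEOUT.search(line)
        if match:
            result.setdefault(match.group("host"), set()).add(match.group("peer"))
    return {host: sorted(peers) for host, peers in result.items()}
```

Passing the contents of a job error log to unreachable_peers gives, for each reporting node, the peer addresses whose ARP entries are worth checking first.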

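Possible Cause

The ARP entry for the peer node's IP address cannot be found, so ibv_create_ah times out while resolving the peer address.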

Recovery Procedure

Configure static ARP entries on the compute nodes.
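The source does not say how the static entries should be configured. One common approach on Linux is the iproute2 command "ip neigh replace ... nud permanent". The Python sketch below only generates those command lines from an IP-to-MAC mapping so they can be reviewed before being run as root on each node. The function name static_arp_commands, the interface name, and the MAC addresses in any example input are hypothetical placeholders, not values from this case.

```python
import ipaddress
import re

# Colon-separated, lowercase, six-octet MAC address.
MAC_RE = re.compile(r"^[0-9a-f]{2}(:[0-9a-f]{2}){5}$")


def static_arp_commands(peers, dev):
    """Build `ip neigh replace` commands that pin each peer IPv4 address to its MAC.

    peers: dict mapping IPv4 address strings to MAC address strings.
    dev:   name of the network interface that carries the RDMA traffic.
    Raises ValueError on a malformed IP or MAC address.
    """
    commands = []
    # Sort numerically by address so the output is stable and easy to diff.
    for ip, mac in sorted(peers.items(), key=lambda kv: ipaddress.IPv4Address(kv[0])):
        mac = mac.lower()
        if not MAC_RE.match(mac):
            raise ValueError(f"malformed MAC address: {mac!r}")
        commands.append(f"ip neigh replace {ip} lladdr {mac} dev {dev} nud permanent")
    return commands
```

Generating the commands instead of executing them keeps the sketch free of side effects. The resulting lines can be placed in a node provisioning script so the entries are restored after a reboot.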