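When users submit large-scale MPI jobs, the jobs frequently fail with connection-establishment timeouts. The job output log shows a fatal "failed to get peer address" error followed by a backtrace: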
[agent363:373418:0:373418] ud_iface.c:49 Fatal: iface 0x1ddfcb30: failed to get peer address
=== backtrace (tid: 373418) ====
0 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/libucs.so.0(ucs_fatal_error_message+0x38) [0x40012c420128]
1 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/libucs.so.0(+0x2025c) [0x40012c42025c]
2 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/ucx/libuct_ib.so.0(uct_ud_iface_cep_insert_ep+0) [0x40012c5157e0]
3 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/ucx/libuct_ib.so.0(uct_ud_ep_create_connected_common+0xd4) [0x40012c5184c4]
4 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/libucp.so.0(ucp_wireup_ep_connect_aux+0xc0) [0x400127f51be0]
5 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/libucp.so.0(ucp_wireup_ep_connect+0xe4) [0x400127f522e4]
6 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/libucp.so.0(ucp_wireup_init_lanes+0x8d4) [0x400127f53e94]
7 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/libucp.so.0(ucp_ep_create_to_worker_addr+0x78) [0x400127f1cf58]
8 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/libucp.so.0(ucp_ep_create+0x4b0) [0x400127f1dbbe]
9 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hmpi/lib/openmpi/mca_pml_ucx.so(+0x5940) [0x400127e55940]
10 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hmpi/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0x198) [0x400127e54304]
11 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hmpi/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0xbc) [0x4001266ffafc]
12 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hmpi/lib/libmpi.so.40(ompi_coll_base_sendrecv_intra_bruck+0xac) [0x4001266fe710]
13 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hmpi/lib/openmpi/mca_coll_ucx.so(+0x5b08) [0x40012cb25b08]
14 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hmpi/lib/libmpi.so.40(mca_coll_base_comm_select+0x880) [0x4001266f3d64]
15 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hmpi/lib/libmpi.so.40(ompi_mpi_init+0xe30) [0x40012672a7a4]
16 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hmpi/lib/libmpi.so.40(MPI_Init+0xa8) [0x4001266d8468]
17 hello_mpi() [0x4008c0]
18 /usr/lib64/libc.so.6(__libc_start_main+0xe0) [0x400126813f40]
19 hello_mpi() [0x4007dc]
================================
The job error log shows "Connection timed out" errors from ibv_create_ah:
[1693797506.380997] [agent289:3924149:0] ib_device.c:1252 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.168.1.115 sgid_index=5 traffic_class=106) on mlx5_0 failed: Connection timed out
[1693797506.384196] [agent276:1958414:0] ib_device.c:1252 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.168.1.62 sgid_index=5 traffic_class=106) on mlx5_0 failed: Connection timed out
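When many ranks fail at once, it can help to list which peers could not be reached. The following is a minimal Python sketch, not part of the original procedure, that extracts the dgid field from failed ibv_create_ah entries and converts IPv4-mapped values to plain IPv4 addresses. The function name failed_peers is hypothetical, and the pattern assumes the error entries follow the format of the excerpt above.

```python
import ipaddress
import re

# Assumption: entries look like
#   ibv_create_ah(... dgid=<gid> ...) on <device> failed: Connection timed out
_AH_FAILURE = re.compile(
    r"ibv_create_ah\([^)]*?dgid=(\S+)[^)]*\)[^\n]*?failed: Connection timed out"
)


def failed_peers(log_text):
    """Return sorted, de-duplicated peer addresses from failed ibv_create_ah calls."""
    peers = set()
    for match in _AH_FAILURE.finditer(log_text):
        try:
            gid = ipaddress.IPv6Address(match.group(1))
        except ValueError:
            continue  # skip entries whose dgid is not a parseable address
        # An IPv4-mapped GID (::ffff:a.b.c.d) carries the peer's IPv4 address;
        # any other GID is reported unchanged.
        peers.add(str(gid.ipv4_mapped or gid))
    return sorted(peers)
```

Calling failed_peers on the text of the job error log returns the distinct peer addresses, which can then be compared against the neighbor tables on the affected nodes.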
To resolve this, configure static ARP entries on the compute nodes.
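As an illustration of what configuring static ARP entries can involve, the sketch below generates one `ip neigh replace ... nud permanent` command (iproute2) per peer from an IP-to-MAC table. The helper name static_arp_commands, the interface name, and the addresses in the example are hypothetical placeholders; real values must come from the cluster's own inventory. The script only prints the commands and does not run them.

```python
import ipaddress
import re

_MAC = re.compile(r"^[0-9a-f]{2}(:[0-9a-f]{2}){5}$")


def static_arp_commands(ip_to_mac, dev):
    """Build one 'ip neigh replace' command per (IPv4, MAC) pair for interface dev."""
    commands = []
    for ip, mac in sorted(ip_to_mac.items()):
        ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
        mac = mac.lower()
        if not _MAC.match(mac):
            raise ValueError(f"malformed MAC address: {mac!r}")
        # 'nud permanent' marks the neighbor entry as static.
        commands.append(
            f"ip neigh replace {ip} lladdr {mac} dev {dev} nud permanent"
        )
    return commands


if __name__ == "__main__":
    # Hypothetical example values, not taken from the logs above.
    table = {
        "192.168.1.62": "aa:bb:cc:00:00:3e",
        "192.168.1.115": "aa:bb:cc:00:00:73",
    }
    for command in static_arp_commands(table, "eth0"):
        print(command)
```

Entries added with `ip neigh` do not survive a reboot, so a site would typically apply the generated commands from whatever node-provisioning mechanism it already uses.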