UD timeout error caused by packet loss or other network problems: UD endpoint ...... unhandled timeout error
Symptom
The following error is reported when an MPI job runs:
[arm-129:435170:0:435170] ud_ep.c:262 Fatal: UD endpoint 0x6c29690 to <no debug data>: unhandled timeout error
==== backtrace (tid: 435170) ====
 0 /workspace/cw/ccsuite/hmpi/install/hucx/lib/libucs.so.0(ucs_handle_error+0x250) [0x4000237b3630]
 1 /workspace/cw/ccsuite/hmpi/install/hucx/lib/libucs.so.0(ucs_fatal_error_message+0xd0) [0x4000237b0940]
 2 /workspace/cw/ccsuite/hmpi/install/hucx/lib/libucs.so.0(+0x1fa08) [0x4000237b0a08]
 3 /workspace/cw/ccsuite/hmpi/install/hucx/lib/ucx/libuct_ib.so.0(+0x475a8) [0x4000238795a8]
 4 /workspace/cw/ccsuite/hmpi/install/hucx/lib/libucs.so.0(+0x18d10) [0x4000237a9d10]
 5 /workspace/cw/ccsuite/hmpi/install/hucx/lib/libucp.so.0(ucp_worker_progress+0x60) [0x4000236f47b0]
 6 /workspace/cw/ccsuite/hmpi/install/hmpi/lib/libopen-pal.so.40(opal_progress+0x38) [0x4000223cbef8]
 7 /workspace/cw/ccsuite/hmpi/install/hmpi/lib/libmpi.so.40(ompi_mpi_init+0xc78) [0x4000220be608]
 8 /workspace/cw/ccsuite/hmpi/install/hmpi/lib/libmpi.so.40(MPI_Init+0x64) [0x400022066404]
 9 /workspace/cw/cwScript/mpijob/bcast_sleep_accurate() [0x400a6c]
10 /usr/lib64/libc.so.6(+0x2afbc) [0x400022146fbc]
11 /usr/lib64/libc.so.6(__libc_start_main+0x94) [0x400022147094]
12 /workspace/cw/cwScript/mpijob/bcast_sleep_accurate() [0x400930]
=================================
[arm-129:435170] *** Process received signal ***
[arm-129:435170] Signal: Aborted (6)
[arm-129:435170] Signal code: (-6)
[arm-129:435170] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0x400021fd393c]
[arm-129:435170] [ 1] /usr/lib64/libc.so.6(+0x80e78)[0x40002219ce78]
[arm-129:435170] [ 2] /usr/lib64/libc.so.6(raise+0x1c)[0x400022158cfc]
[arm-129:435170] [ 3] /usr/lib64/libc.so.6(abort+0xe0)[0x400022146d2c]
[arm-129:435170] [ 4] /workspace/cw/ccsuite/hmpi/install/hucx/lib/libucs.so.0(+0x1f944)[0x4000237b0944]
[arm-129:435170] [ 5] /workspace/cw/ccsuite/hmpi/install/hucx/lib/libucs.so.0(+0x1fa08)[0x4000237b0a08]
[arm-129:435170] [ 6] /workspace/cw/ccsuite/hmpi/install/hucx/lib/ucx/libuct_ib.so.0(+0x475a8)[0x4000238795a8]
[arm-129:435170] [ 7] /workspace/cw/ccsuite/hmpi/install/hucx/lib/libucs.so.0(+0x18d10)[0x4000237a9d10]
[arm-129:435170] [ 8] /workspace/cw/ccsuite/hmpi/install/hucx/lib/libucp.so.0(ucp_worker_progress+0x60)[0x4000236f47b0]
[arm-129:435170] [ 9] /workspace/cw/ccsuite/hmpi/install/hmpi/lib/libopen-pal.so.40(opal_progress+0x38)[0x4000223cbef8]
[arm-129:435170] [10] /workspace/cw/ccsuite/hmpi/install/hmpi/lib/libmpi.so.40(ompi_mpi_init+0xc78)[0x4000220be608]
[arm-129:435170] [11] /workspace/cw/ccsuite/hmpi/install/hmpi/lib/libmpi.so.40(MPI_Init+0x64)[0x400022066404]
[arm-129:435170] [12] /workspace/cw/cwScript/mpijob/bcast_sleep_accurate[0x400a6c]
[arm-129:435170] [13] /usr/lib64/libc.so.6(+0x2afbc)[0x400022146fbc]
[arm-129:435170] [14] /usr/lib64/libc.so.6(__libc_start_main+0x94)[0x400022147094]
[arm-129:435170] [15] /workspace/cw/cwScript/mpijob/bcast_sleep_accurate[0x400930]
[arm-129:435170] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 5 with PID 435171 on node arm-129 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
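Possible Causes
- The HUCX source package in use is not a patched version. UCX_TLS=ud is specified through the scheduler, and the MPI job is suspended for a period of time and then resumed, which triggers the timeout error.
- The job runs over a RoCE network, but lossless networking is not configured on the NIC side and the switch side.
- The network link between compute nodes is faulty.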
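Recovery Procedure
- The latest patch package for the corresponding HUCX version already handles the suspend-and-resume scenario. Download the patch package for your HUCX version, then recompile and install it.
- If lossless networking is not configured on the NIC side and the switch side, configure it before running the job.
- If the lossless network configuration is correct, check whether the network link between the failing nodes is faulty.
- If no problem is found in the physical links or the hardware configuration, set -x UCX_UD_MLX5_TIMER_BACKOFF=1 -x UCX_UD_MLX5_TIMER_TICK=100ms -x UCX_UD_MLX5_TIMEOUT=600s to increase the timeout and temporarily work around the problem.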
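As a rough illustration of the workaround in the last step, the sketch below assembles an mpirun command line that carries the three UCX settings. The process count (8) and the application path (./my_mpi_app) are hypothetical placeholders, not values from this FAQ; only the -x options are taken from the step above. The script prints the command rather than launching it, so it can be reviewed first.

```shell
# Hypothetical placeholders: adjust the process count and application path.
NP=8
APP=./my_mpi_app

# UCX settings from the recovery step: -x exports each variable to the MPI processes.
UCX_OPTS="-x UCX_UD_MLX5_TIMER_BACKOFF=1 -x UCX_UD_MLX5_TIMER_TICK=100ms -x UCX_UD_MLX5_TIMEOUT=600s"

# Assemble the full command line and print it for review.
CMD="mpirun -np $NP $UCX_OPTS $APP"
echo "$CMD"

# To launch the job, run the printed command, for example:
# $CMD
```

Because this only lengthens the timeout, it is a stopgap. If the error persists, the patch, lossless network, and link checks in the earlier steps still need to be revisited.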