Communication Operator Receives a Non-Contiguous Tensor
Symptom
The console log contains the keyword "RuntimeError: Tensors must be contiguous", with output similar to the following:
```
Traceback (most recent call last):
  File "distributed/_mode_cases/error_discontinuous_tensor.py", line 21, in <module>
    discontinuous_tensor()
  File "distributed/_mode_cases/error_discontinuous_tensor.py", line 18, in discontinuous_tensor
    dist.all_reduce(input)
  File "/root/miniconda3/envs/pt2.1/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/pt2.1/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: Tensors must be contiguous
[ERROR] 2024-08-18-22:15:47 (PID:23232, Device:0, RankID:0) ERR02002 DIST invalid type
```
Root Cause
Key observation: the launched distributed job fails with this error.
Root cause analysis: a non-contiguous tensor was passed to a communication operator.
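For background, view-producing operations such as transpose or strided slicing return tensors whose memory layout is no longer contiguous. A minimal illustration, runnable on CPU with no distributed setup required:

```python
import torch

x = torch.randn(4, 8)
print(x.is_contiguous())   # True: freshly allocated, row-major layout

y = x.t()                  # transpose returns a view with swapped strides
print(y.is_contiguous())   # False: same storage, non-contiguous layout

z = x[:, ::2]              # strided slicing also yields a non-contiguous view
print(z.is_contiguous())   # False
```

Passing a tensor like y or z to a collective such as dist.all_reduce triggers the error shown above.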
Solution
Use the traceback to locate the failing line of code, check whether the input tensor is contiguous, and convert any non-contiguous tensor to a contiguous one with .contiguous().
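A minimal sketch of the fix, assuming a process group has already been initialized via dist.init_process_group(); the wrapper name safe_all_reduce is hypothetical:

```python
import torch
import torch.distributed as dist

def safe_all_reduce(t: torch.Tensor) -> torch.Tensor:
    # .contiguous() copies the data into a row-major layout when needed,
    # and is a no-op (returns the same tensor) when it is already contiguous.
    t = t.contiguous()
    dist.all_reduce(t)  # in-place collective; the result lands in t
    return t

# Usage: non-contiguous views (e.g. from a transpose) become safe to reduce.
# reduced = safe_all_reduce(activations.t())
```

Note that all_reduce operates in place: when .contiguous() makes a copy, the reduced values end up in the copy, so callers must use the returned tensor rather than the original view.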
| Error Code | ERR02002 |
| --- | --- |
| Fault event name | Communication operator receives a non-contiguous tensor |
| Fault explanation / possible causes | Issue in the code or script |
| Fault impact | The communication operator fails |
| Self-service handling | Check the code and ensure that every tensor passed to a communication operator is contiguous |
| System handling recommendation | No action required |