A non-contiguous tensor is passed to a communication operator

Symptom

The on-screen log contains the keyword "RuntimeError: Tensors must be contiguous", with output similar to the following:

Traceback (most recent call last):
  File "distributed/_mode_cases/error_discontinuous_tensor.py", line 21, in <module>
    discontinuous_tensor()
  File "distributed/_mode_cases/error_discontinuous_tensor.py", line 18, in discontinuous_tensor
    dist.all_reduce(input)
  File "/root/miniconda3/envs/pt2.1/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/pt2.1/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: Tensors must be contiguous
[ERROR] 2024-08-18-22:15:47 (PID:23232, Device:0, RankID:0) ERR02002 DIST invalid type

Root Cause

Key process: the error is reported when the launched distributed job runs.

Root cause analysis: a non-contiguous tensor was passed to a communication operator.
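Non-contiguous tensors typically come from view-producing operations such as transposes or slices. The following minimal sketch (the tensor shapes and variable names are illustrative, and it assumes an already initialized process group, e.g. launched via torchrun with the hccl backend on Ascend devices as in the log above) shows how such an input triggers the error:

import torch
import torch.distributed as dist

# Illustrative only: assumes dist.init_process_group() has already run.
x = torch.randn(4, 8)
view = x.t()                     # transpose returns a non-contiguous view
assert not view.is_contiguous()  # same storage, but the strides are swapped

dist.all_reduce(view)            # fails with "RuntimeError: Tensors must be contiguous"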

Solution

Locate the failing line of code from the log, check whether the input tensor is contiguous, and convert any non-contiguous tensor to a contiguous one with .contiguous(), as shown in the sketch below.
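A minimal sketch of the fix, under the same assumptions as the reproduction above (process group already initialized, illustrative shapes and names):

import torch
import torch.distributed as dist

x = torch.randn(4, 8)
view = x.t()                     # non-contiguous view

if not view.is_contiguous():     # check the contiguity of the input
    view = view.contiguous()     # copy the data into a contiguous buffer

dist.all_reduce(view)            # succeeds: the input is now contiguous

Note that .contiguous() returns a new tensor when the input is not already contiguous, so reassign the result rather than calling it as an in-place operation.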

Error Code

ERR02002

Fault Event Name

A non-contiguous tensor is passed to a communication operator

Fault Explanation / Possible Cause

Issue in the code script

Fault Impact

The communication operator fails

Fault Self-Handling Mode

Check the code and ensure that the tensors passed to communication operators are contiguous

System Handling Recommendation

No action required