
Variable Needed for Gradient Computation Modified by an In-place Operation

Symptom

The console log contains the keyword "one of the variables needed for gradient computation has been modified by an inplace operation", with output similar to the following:

ERROR: test_autograd_backward (__main__.TestMode)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/miniconda3/envs/pt2.1/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 2388, in wrapper
    method(*args, **kwargs)
  File "npu/test_fault_mode.py", line 159, in test_autograd_backward
    torch.autograd.grad(d2.sum(), a)
  File "/root/miniconda3/envs/pt2.1/lib/python3.8/site-packages/torch/autograd/__init__.py", line 394, in grad
    result = Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [5]], which is output 0 of AddBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
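As the hint at the end of the error message suggests, `torch.autograd.set_detect_anomaly(True)` makes the backward error also report the forward-pass traceback of the operation that produced the offending tensor. A minimal sketch reproducing this error pattern (the tensor names are illustrative, not taken from the failing test):

```python
import torch

# With anomaly detection on, the backward error is preceded by a warning
# showing the forward traceback of the operation whose saved tensor was modified.
torch.autograd.set_detect_anomaly(True)

a = torch.ones(5, requires_grad=True)
b = a + 1            # output 0 of AddBackward0, at version 0
c = (b * b).sum()    # the multiplication saves b for the backward pass
b.add_(1)            # in-place update bumps b's version to 1

try:
    c.backward()
except RuntimeError as err:
    # "one of the variables needed for gradient computation has been
    #  modified by an inplace operation: ... is at version 1; expected version 0"
    error_message = str(err)

print(error_message)
```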

Root Cause

Key process: the failure occurs during the call to torch.autograd.backward.

Root cause analysis: an in-place operation modifies a tensor directly instead of creating a new copy. If autograd has saved that tensor for the backward pass, the modification invalidates the saved value, gradients can no longer be computed correctly, and the error above is raised.

Solution

Locate the offending line of code from the log, then replace the in-place operation with its out-of-place counterpart. For example, change x += 2 to y = x + 2.
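A minimal sketch of the fix, continuing the error pattern above: write the result of the update to a new tensor so the version of the tensor saved by autograd stays unchanged (the names `b_next` etc. are illustrative):

```python
import torch

a = torch.ones(5, requires_grad=True)
b = a + 1            # b == 2, saved by the multiplication below
c = (b * b).sum()    # c = sum((a + 1) ** 2)

# Out-of-place fix: instead of b.add_(1), assign to a new tensor.
# The saved b keeps version 0, so the backward pass can use it.
b_next = b + 1

c.backward()         # succeeds: dc/da = 2 * (a + 1) = 4
print(a.grad)        # tensor([4., 4., 4., 4., 4.])
```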

Error Code

Fault event name

Variable needed for gradient computation modified by an inplace operation

Fault explanation / possible cause

Code or script issue

Fault impact

Backpropagation cannot be computed correctly

Fault self-recovery method

Replace the in-place operator with its out-of-place counterpart

Recommended system action

No action required