Variable Needed for Gradient Computation Modified by an In-Place Operation
Symptom
The console log contains the keyword "one of the variables needed for gradient computation has been modified by an inplace operation", with output similar to the following:
```
ERROR: test_autograd_backward (__main__.TestMode)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/miniconda3/envs/pt2.1/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 2388, in wrapper
    method(*args, **kwargs)
  File "npu/test_fault_mode.py", line 159, in test_autograd_backward
    torch.autograd.grad(d2.sum(), a)
  File "/root/miniconda3/envs/pt2.1/lib/python3.8/site-packages/torch/autograd/__init__.py", line 394, in grad
    result = Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [5]], which is output 0 of AddBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
```
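The following minimal script is a hypothetical reproduction of this error (the variable names are illustrative, not taken from the failing test): a tensor that autograd saved for the backward pass is modified in place before the gradient is computed.

```python
import torch

a = torch.randn(5, requires_grad=True)
b = a + 1            # b is output 0 of AddBackward0
c = b * b            # mul saves b; it is needed to compute dc/db
b += 1               # in-place write bumps b's version counter to 1

# Fails: the saved b is now at version 1, but version 0 was expected,
# matching the RuntimeError shown in the log above.
torch.autograd.grad(c.sum(), a)
```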
Root Cause
Key process: the failure occurs while the autograd engine runs the backward pass (via torch.autograd.grad or torch.autograd.backward).
Root cause analysis: an in-place operation modifies a tensor directly instead of creating a new copy. If autograd has saved that tensor for the backward pass, the in-place write invalidates the saved value (the tensor's internal version counter no longer matches the version recorded at save time), so the gradient can no longer be computed correctly and the error above is raised.
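Autograd detects this through a per-tensor version counter that every in-place operation increments. A short sketch of the mechanism; the `_version` attribute read below is a private implementation detail, used here only for illustration:

```python
import torch

t = torch.ones(3)
print(t._version)    # 0: no in-place modification yet
t.add_(1)            # trailing underscore marks an in-place op
print(t._version)    # 1: the counter was bumped
```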
Solution
Locate the offending line of code from the log, then replace the in-place operation with an out-of-place equivalent. For example, change x += 2 to y = x + 2.
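A sketch of this fix applied to the hypothetical reproduction above: rebinding the name with an out-of-place add creates a new tensor and leaves the value saved by autograd untouched. The error's own hint, torch.autograd.set_detect_anomaly(True), can be enabled first to pinpoint the offending operation.

```python
import torch

torch.autograd.set_detect_anomaly(True)   # optional: locate the failing op

a = torch.randn(5, requires_grad=True)
b = a + 1
c = b * b            # saves this version of b for the backward pass
b = b + 1            # out-of-place: a new tensor; the saved b stays at version 0

grad, = torch.autograd.grad(c.sum(), a)   # now succeeds
print(grad)
```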
| Error Code | None |
|---|---|
| Fault event name | Variable needed for gradient computation modified by an in-place operation |
| Fault explanation / possible cause | Issue in the training script |
| Fault impact | Backward propagation cannot be computed correctly |
| Fault self-handling action | Replace the in-place operator with its out-of-place equivalent |
| System handling suggestion | No action required |