Overflow Detection Scenario

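Overflow detection applies to PyTorch APIs running on the NPU and checks whether any of them overflow. Currently, only AI Core floating-point overflow can be identified.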

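How overflow detection works: for the stage where an overflow occurs, the tool enables ACL dump mode, re-executes that stage, and writes the data to disk.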
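To make "floating-point overflow" concrete, here is a minimal sketch in plain Python. It is not how ptdbg_ascend detects overflow. The helper name `overflows_fp16` is hypothetical, and comparing against the largest finite half-precision value is a simplification that ignores rounding behavior.

```python
import math

# Largest finite IEEE 754 half-precision (fp16) value.
FP16_MAX = 65504.0


def overflows_fp16(value: float) -> bool:
    """Return True if value is NaN, infinite, or beyond the finite fp16 range.

    Hypothetical helper for illustration only; simplified, ignores rounding.
    """
    return math.isnan(value) or math.isinf(value) or abs(value) > FP16_MAX


# Both inputs are well inside the fp16 range, but their product is not.
a, b = 300.0, 300.0
product = a * b  # 90000.0, larger than FP16_MAX
```

Here `overflows_fp16(a)` and `overflows_fp16(b)` are both false while `overflows_fp16(product)` is true, which is the "normal input, overflowing output" pattern that overflow detection is meant to surface.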

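Install the ptdbg_ascend package by following the tool installation instructions.

Insert the ptdbg_ascend overflow detection interface into the NPU training script.

Run full overflow detection.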

from ptdbg_ascend import PrecisionDebugger
debugger = PrecisionDebugger(dump_path="./overflow_dump", hook_name="overflow_check", step=[0])
debugger.configure_hook(overflow_nums=-1)
# Do not place the initialization above inside a loop

# Model initialization
# The calls below can also be written as PrecisionDebugger.start() and PrecisionDebugger.stop()
debugger.start()

# Code segment 1 to be dumped

debugger.stop()
debugger.start()

# Code segment 2 to be dumped

debugger.stop()
debugger.step()

In multi-device runs, each device counts its overflows independently.

Dump ACL-level overflow data for a specified forward API.

from ptdbg_ascend import PrecisionDebugger
debugger = PrecisionDebugger(dump_path="./overflow_dump", hook_name="overflow_check", step=[0])
debugger.configure_hook(mode="acl", acl_config="./dump.json")
# Do not place the initialization above inside a loop

# Model initialization
# The calls below can also be written as PrecisionDebugger.start() and PrecisionDebugger.stop()
debugger.start()

# Code segment 1 to be dumped

debugger.stop()
debugger.start()

# Code segment 2 to be dumped

debugger.stop()
debugger.step()

Dump ACL-level overflow data for a specified backward API.

First, run full overflow detection.

from ptdbg_ascend import PrecisionDebugger
debugger = PrecisionDebugger(dump_path="./overflow_dump", hook_name="overflow_check", step=[0])
debugger.configure_hook(overflow_nums=-1)
# Do not place the initialization above inside a loop

# Model initialization
# The calls below can also be written as PrecisionDebugger.start() and PrecisionDebugger.stop()
debugger.start()

# Code segment 1 to be dumped

debugger.stop()
debugger.start()

# Code segment 2 to be dumped

debugger.stop()
debugger.step()

Then dump the ACL-level overflow data for the specified backward API.

from ptdbg_ascend import PrecisionDebugger
debugger = PrecisionDebugger(dump_path="./overflow_dump", hook_name="dump", step=[0])
debugger.configure_hook(mode="acl", scope=["Functional_conv2d_1_backward"], acl_config="./dump.json", backward_input=["./overflow_dump/ptdbg_dump_v4.0/step0/rank0/Functional_conv2d_1_backward_1/Functional_conv2d_1_backward_input.0.npy"])
# Do not place the initialization above inside a loop

# Model initialization
# The calls below can also be written as PrecisionDebugger.start() and PrecisionDebugger.stop()
debugger.start()

# Code segment 1 to be dumped

debugger.stop()
debugger.start()

# Code segment 2 to be dumped

debugger.stop()
debugger.step()

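For forward APIs that overflow, overflow_nums sets the number of overflows allowed. All ACL data of the overflowing API is dumped on each overflow, and the run stops once the specified count is reached. After it stops, the stack trace contains the following message:

ValueError: [overflow xxx times]: dump file is saved in 'xxxxx.pkl'.

Here, xxx times is the count set by the user, and xxxxx.pkl is the path of the generated file.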
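If a wrapper script needs the overflow count and the dump file path from that message, they can be extracted with a regular expression. The sketch below assumes the message follows the template quoted above exactly; the actual wording may differ between tool versions, and the helper name `parse_overflow_message` is hypothetical.

```python
import re

# Pattern derived from the message template in this document (an assumption,
# not a guaranteed format).
_OVERFLOW_MSG = re.compile(
    r"\[overflow (\d+) times\]: dump file is saved in '([^']+)'"
)


def parse_overflow_message(message: str):
    """Return (overflow_count, pkl_path), or None if the message does not match."""
    match = _OVERFLOW_MSG.search(message)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)
```

A training script could catch the `ValueError`, pass `str(error)` to this helper, and re-raise the exception when the helper returns `None`, so that unrelated errors are not swallowed.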

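Run training in the NPU environment to dump overflow data.
For APIs whose inputs are normal but whose outputs overflow, the tool dumps the overflowing API information into the training execution directory as forward_info_{pid}.json and backward_info_{pid}.json. Parsing these JSON files with the Ascend model accuracy pre-check tool reports whether each overflowing API is a normal or an abnormal overflow, which helps users judge the result quickly.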

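Run the accuracy pre-check tool as follows:
```
# Run the following commands after downloading the att code repository
export PYTHONPATH=$PYTHONPATH:$ATT_HOME/debug/accuracy_tools/
cd $ATT_HOME/debug/accuracy_tools/api_accuracy_checker/run_ut
python run_overflow_check.py -forward ./forward_info_0.json
```
The accuracy pre-check does not currently support APIs that overflow during the backward pass.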
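Because the JSON file names include the process ID, a run may leave several forward_info_{pid}.json files behind. The following sketch collects them and builds one pre-check command per file. It only assembles argument lists and does not run anything. The helper name `build_overflow_check_commands` is hypothetical, and the sketch assumes the commands are later executed from the run_ut directory shown above, which is why it emits absolute paths.

```python
import glob
import os


def build_overflow_check_commands(run_dir: str):
    """Return one run_overflow_check.py argv list per forward_info_*.json in run_dir.

    backward_info_*.json files are skipped on purpose, because the pre-check
    does not support backward overflow.
    """
    base = glob.escape(os.path.abspath(run_dir))
    pattern = os.path.join(base, "forward_info_*.json")
    return [
        ["python", "run_overflow_check.py", "-forward", path]
        for path in sorted(glob.glob(pattern))
    ]
```

Each returned list can be passed to `subprocess.run(..., cwd=...)` with `cwd` set to the run_ut directory, after PYTHONPATH has been exported as in the shell commands above.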

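When the overflow detection dump is run repeatedly, delete the overflow detection dump data in the previous dump directory first; otherwise, an error is raised because of duplicate names.
- In the dump_mode="acl" scenario, NPU memory consumption increases; enable it with caution.
- Some APIs call other APIs internally; for example, functional.batch_norm actually calls torch.batch_norm. In this case ACL init is executed multiple times, which causes the feature to malfunction.
- In mixed-precision training with dynamic loss scaling, normal training prints the "Gradient overflow. Skipping step" log, which disappears after overflow detection is added. To resolve this, set the environment variable with export OVERFLOW_DEBUG_MODE_ENABLE=1 and move the register_hook call before amp.initialize. This feature requires a matching CANN package; unsupported versions fail with error EZ3003.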
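The cleanup required before a repeated run (removing the previous overflow dump data) can be automated. This is a minimal sketch that assumes the dump directory holds nothing else worth keeping, since it deletes the whole directory and recreates it empty. The helper name `reset_dump_dir` is hypothetical and is not part of ptdbg_ascend.

```python
import os
import shutil


def reset_dump_dir(dump_path: str) -> None:
    """Delete dump_path with all its contents, then recreate it empty.

    Assumption: everything under dump_path is disposable dump output.
    """
    if os.path.isdir(dump_path):
        shutil.rmtree(dump_path)
    os.makedirs(dump_path)
```

Calling `reset_dump_dir("./overflow_dump")` once, before creating the `PrecisionDebugger`, would match the dump_path used in the examples in this section.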
