(beta) torch_npu.npu.stress_detect

This is a beta, experimental interface. It may behave abnormally in some scenarios, so use it with caution.

Prototype

torch_npu.npu.stress_detect()

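Description

Provides an online hardware precision detection interface for models to call. It is implemented mainly through the StressDetect interface, which stress-tests the hardware to detect silent precision problems.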

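Return value

Returns an int that indicates the result type:

0: online hardware precision detection passed.

1: the hardware precision detection interface was called more than once within one hour, so detection was skipped.

Other values: online hardware precision detection failed; the hardware is faulty.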
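The return-code rules above can be sketched as a small mapping. The names `StressDetectStatus` and `classify_stress_detect_result` are hypothetical helpers introduced for illustration and are not part of torch_npu; only the meanings of 0, 1, and "other" come from this page.

```python
from enum import Enum


class StressDetectStatus(Enum):
    """Hypothetical labels for the three documented outcomes."""
    PASSED = "passed"
    SKIPPED = "skipped"
    FAILED = "failed"


def classify_stress_detect_result(code: int) -> StressDetectStatus:
    """Map the int returned by stress_detect() to a documented outcome."""
    if code == 0:
        return StressDetectStatus.PASSED
    if code == 1:
        # Called again within one hour, so detection did not run.
        return StressDetectStatus.SKIPPED
    # Any other value is documented as a hardware fault.
    return StressDetectStatus.FAILED
```

In a training script this would typically be called as `classify_stress_detect_result(torch_npu.npu.stress_detect())`, treating `SKIPPED` as "no new information" rather than as a pass.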

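Constraints

Using online hardware precision detection requires modifying the model training script. Call it before training starts, after training ends, or between two steps, and reserve 2 GB of memory for the stress-test interface.
Online hardware precision detection must run concurrently and in sync on all nodes of the cluster. Keep the execution-time skew across nodes within about 1 second (recommended); otherwise slow nodes can cause NotifyWait timeouts.
The hardware precision test cases can expose weak points in a chip early, possibly N months before they would otherwise surface, so the chip is sent back for repair sooner. This raises the hardware return rate.
If there is no other way to obtain a checkpoint (CKPT) to recover from, roll back to the checkpoint taken at the time of the previous hardware precision detection.
The current AIC hardware failure rate is about one occurrence every 15 days per 10,000 cards (1.5 NPUs found faulty by stress testing per month per 10,000 cards). Customers should evaluate how often to run the test cases in their training scripts; running them once a day (every 24 hours) is recommended.
Keep running the hardware precision test cases on an ongoing basis. According to the bathtub curve of hardware failure, failures still occur at the bottom of the curve; the failure rate is only lower and the interval between failures longer.
Decide whether to keep voltage adjustment based on how effective stress testing is with it. The offline AIC stress-test tool currently has a detection rate of 70-80%.
Stress-test execution needs foolproof protection (a minimum interval of 1 hour between runs) to prevent users from running the test cases continuously and degrading training performance.
The online hardware precision test cases support only Atlas A2 training series products and do not support running multiple training jobs on the same node. Voltage adjustment does not support compute-partitioning scenarios. Running the online precision test cases from multiple threads is not recommended.
Run the online precision test cases on a stream separate from the main training stream, so that a test-case failure does not affect the main training stream.
For the BUS voltage, a device hot reset does not restore it automatically; the device must be power-cycled to recover. Checking the environment before starting a training job is recommended.
Voltage adjustment requires MCU version 23.3.8 or later.
Hardware precision detection triggers SoC voltage adjustment (an offset from the chip's rated voltage). After the adjustment the ACG must be reinitialized, and that reinitialization changes the frequency (1850M->1300M). As a result, you may observe frequency changes, or even overclocking.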
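The frequency and rollback guidance above (run about once every 24 hours, and fall back to the checkpoint from the previous detection) could be wired into a training loop along these lines. This is a minimal sketch, not a torch_npu API: `StressDetectScheduler`, `maybe_run`, and `last_passed_step` are hypothetical names, and the detection call and clock are injected so the logic can be exercised without an NPU.

```python
import time
from typing import Callable, Optional


class StressDetectScheduler:
    """Hypothetical helper: run a detection callable at most once per interval."""

    def __init__(
        self,
        detect_fn: Callable[[], int],
        interval_s: float = 24 * 3600,  # once a day, as recommended above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._detect_fn = detect_fn
        self._interval_s = interval_s
        self._clock = clock
        self._last_run: Optional[float] = None
        # Step of the most recent passing detection; a checkpoint saved at
        # this step is the rollback candidate if a later detection fails.
        self.last_passed_step: Optional[int] = None

    def maybe_run(self, step: int) -> Optional[int]:
        """Run detection if the interval has elapsed; return its code, else None."""
        now = self._clock()
        if self._last_run is not None and now - self._last_run < self._interval_s:
            return None
        self._last_run = now
        code = self._detect_fn()
        if code == 0:
            self.last_passed_step = step
        return code
```

In a real script, `detect_fn` would be `torch_npu.npu.stress_detect` and `maybe_run(global_step)` would be called between steps. Because each rank has its own clock, a multi-node job would probably need to base the decision on something all ranks agree on, such as the step count, so that every node enters detection together.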

Supported models

Atlas A2 training series products

Example

import torch
import torch_npu

# Custom exception for stress detection failure
class StressDetectionException(Exception):
    def __init__(self, error_code):
        super().__init__(f"Stress detection failed with error code: {error_code}")

# Simple example of model training
def train_model(model, dataloader, optimizer, loss_fn, num_epochs):
    for epoch in range(num_epochs):
        model.train()

        running_loss = 0.0
        for inputs, labels in dataloader:
            inputs, labels = inputs.to("npu"), labels.to("npu")

            # Clear gradients
            optimizer.zero_grad()

            # Forward pass
            outputs = model(inputs)
            loss = loss_fn(outputs, labels)

            # Backward pass and optimization
            loss.backward()
            optimizer.step()

            running_loss += loss.item()

        # Call hardware stress detection after each epoch
        stress_detect_result = torch_npu.npu.stress_detect()
        if stress_detect_result == 0:
            print(f"Epoch {epoch + 1}/{num_epochs}: Stress detection passed.")
        elif stress_detect_result == 1:
            print(f"Epoch {epoch + 1}/{num_epochs}: Stress detection skipped (called too frequently).")
        else:
            # Raise an exception for any other non-zero result
            raise StressDetectionException(stress_detect_result)

        print(f"Epoch {epoch + 1} Loss: {running_loss/len(dataloader)}")

    print("Training complete.")

# Define a simple model
class SimpleModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(10, 1)

    def forward(self, x):
        return self.fc(x)

# Sample dataloader
dataloader = [ (torch.randn(32, 10), torch.randn(32, 1)) for _ in range(100) ]

# Create model and move it to Ascend device
model = SimpleModel().to("npu")
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

# Train the model and call stress detection
try:
    train_model(model, dataloader, optimizer, loss_fn, num_epochs=10)
except StressDetectionException as e:
    print(f"Training halted due to: {e}")
    # Handle the failure here, for example roll back to the checkpoint
    # saved at the previous passing detection.
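The example above runs on a single device. For a multi-node job, the constraint that all nodes run detection in sync could be approached by aligning ranks at a barrier immediately before the call. The sketch below is an assumption-laden illustration: `synchronized_stress_detect` is a hypothetical wrapper, and whether a barrier alone keeps the skew within the recommended 1 second depends on the cluster.

```python
from typing import Callable


def synchronized_stress_detect(
    barrier_fn: Callable[[], None],
    detect_fn: Callable[[], int],
) -> int:
    """Hypothetical wrapper: wait for all ranks, then run detection."""
    # Every rank blocks here until all ranks have arrived, so they
    # leave the barrier, and start detection, at roughly the same time.
    barrier_fn()
    return detect_fn()
```

In a distributed job, `barrier_fn` would typically be `torch.distributed.barrier` and `detect_fn` would be `torch_npu.npu.stress_detect`; both are passed in here so the ordering can be checked without a cluster.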
