Single-Operator Performance Simulation Pipeline Diagram

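CAModel performance simulation supports features such as setting a simulation timeout and setting the number of blockdims to run. For details, see CAModel Performance Simulation.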

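This scenario uses the AddCustom operator as an example. The CAModel simulation debugging procedure is shown below. Modify the sample code as needed for your own environment.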

import torch
import numpy as np
import ascendebug

# Set the log file and clear any existing content
ascendebug.set_log_file('test.log', clean=True)

# 1. Generate the input and golden (reference) data
x = torch.rand(size=(1, 16384), dtype=torch.float16)
y = torch.rand(size=(1, 16384), dtype=torch.float16)
z = x + y

# 2. Build the operator information
debug_op = ascendebug.create_debug_op('add_custom', 'VectorCore', '${chip_version}') \
        .scalar_input('tileNumIn', 'uint32', 10) \
        .tensor_input('x', x) \
        .tensor_input('y', y) \
        .tensor_output('z', z)

# 3. Create the debug object and initialize the workspace
install_pkg = "/home/run_pkg/"
op_executor = ascendebug.create_op_executor(install_path=install_pkg)

# 4. Configure the kernel function source information
kernel_info = ascendebug.OpKernelInfo("/path_to/add_custom.cpp", 'add_custom', [])

# 5. Call the NPU compile interface to generate the kernel.o file
npu_option = ascendebug.CompileNpuOptions(simulator=True)
kernel_name, kernel_file, extern = op_executor.compile_call_kernel_npu(kernel_info, npu_option)

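# 6. Call the CAModel run interface to generate the operator simulation pipeline diagram
# Running an operator under CAModel simulation is usually slow; setting block_num to 1 and using a larger timeout is recommended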
npu_compile_info = ascendebug.NpuCompileInfo(syncall=extern['cross_core_sync'], task_ration=extern['task_ration'])
run_simulator_options = ascendebug.RunSimuOptions(block_num=1, timeout=1200)
op_executor.run_camodel(kernel_file, run_simulator_options, npu_compile_info=npu_compile_info)

For sample CAModel performance simulation results and pipeline diagrams, see "CAModel Performance Simulation > Debugging Outputs".
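
Step 1 of the sample builds the inputs and the golden data `z = x + y` in float16. The following is a minimal sketch of how equivalent golden data could be produced with NumPy and how an operator output could be compared against it within a tolerance. It is not part of ascendebug: the helper names `make_add_golden` and `matches_golden`, the fixed seed, and the `rtol`/`atol` values are assumptions chosen for illustration, and how the operator output is obtained as an array is outside the scope of this sketch.

```python
import numpy as np


def make_add_golden(shape=(1, 16384), seed=0):
    """Build float16 inputs x, y and the golden output z = x + y.

    Hypothetical helper: the shape mirrors the sample above, the seed is
    an arbitrary choice that makes the data reproducible between runs.
    """
    rng = np.random.default_rng(seed)
    # Draw in float32, then cast down, because Generator.random only
    # supports float32 and float64.
    x = rng.random(shape, dtype=np.float32).astype(np.float16)
    y = rng.random(shape, dtype=np.float32).astype(np.float16)
    z = x + y  # float16 addition, same dtype as the inputs
    return x, y, z


def matches_golden(actual, golden, rtol=1e-3, atol=1e-3):
    """Return True if `actual` matches `golden` within the given tolerance.

    The tolerances are assumed values for float16 data; tune them to the
    precision requirements of the operator under test.
    """
    # Compare in float32 so the comparison itself adds no float16 rounding.
    actual = np.asarray(actual, dtype=np.float32)
    golden = np.asarray(golden, dtype=np.float32)
    if actual.shape != golden.shape:
        return False
    return bool(np.allclose(actual, golden, rtol=rtol, atol=atol))
```

A fixed seed keeps the inputs identical across repeated simulation runs, which can make it easier to compare results when only `block_num` or `timeout` changes between runs.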
