Quick Start

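For the ops_adv operator project scenario, the debugging flow is shown in Figure 1. The supported debugging functions are Tiling debugging, CPU twin debugging, NPU compilation to generate the kernel bin file, on-board NPU precision comparison, on-board NPU Profiling data collection, and performance simulation pipeline charts.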

Figure 1 API call flow in the ops_adv project scenario

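Prepare the environment. For the detailed steps, see Environment Preparation.
Develop the operator based on the ops_adv code framework.
This chapter uses the flash_attention_score operator provided in the ops_adv repository as the example.
Prepare the input data and the golden (reference) data. You can use existing data files in bin format, or generate Tensor data with torch/numpy (for details, see the data preparation notes for the API mode).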
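The data preparation step can be sketched with numpy as follows. This is a minimal illustration, not the tool's prescribed procedure: it assumes that a bin file is a raw, headerless dump of the tensor in row-major order, it uses the query/key/value shape from this chapter's example, and the output directory and the helper name save_random_fp16 are hypothetical.

```python
import os
import tempfile

import numpy as np

# Hypothetical output directory; replace it with your own data path.
data_path = tempfile.mkdtemp()

rng = np.random.default_rng(seed=0)
shape = (24, 144, 1280)  # query/key/value shape used in this chapter's example


def save_random_fp16(name):
    """Write a random float16 tensor as raw bytes and return the array."""
    tensor = rng.uniform(-1.0, 1.0, size=shape).astype(np.float16)
    tensor.tofile(os.path.join(data_path, name))
    return tensor


q = save_random_fp16('q.bin')
k = save_random_fp16('k.bin')
v = save_random_fp16('v.bin')

# A raw bin file carries no header, so dtype and shape must be supplied on load.
q_loaded = np.fromfile(os.path.join(data_path, 'q.bin'), dtype=np.float16).reshape(shape)
```

Because the file stores neither dtype nor shape, the values declared for each input when the operator information is built have to match the ones used when the file was written.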

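Build the operator information.

Call the ascendebug.create_debug_op interface to construct the operator's DebugOp object and set the input/output information. For example: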

import os

import ascendebug
DATA_PATH = '/user_data_path/'
debug_op = ascendebug.create_debug_op('FlashAttentionScore', 'MixCore', 'Ascendxxx') \
    .custom_input('query', 'float16', [24, 144, 1280], os.path.join(DATA_PATH, 'q.bin')) \
    .custom_input('key', 'float16', [24, 144, 1280], os.path.join(DATA_PATH, 'k.bin')) \
    .custom_input('value', 'float16', [24, 144, 1280], os.path.join(DATA_PATH, 'v.bin')) \
    .custom_input('real_shift', 'float16', None, None, ['optional']) \
    .custom_input('drop_mask', 'uint8', [1244160], os.path.join(DATA_PATH, 'drop_mask.bin'), ['optional']) \
    .custom_input('padding_mask', 'float16', None, None, ['optional']) \
    .custom_input('atten_mask', 'bool', None, None, ['optional']) \
    .custom_input('prefix', 'int64', None, None, ['optional']) \
    .custom_input('actual_seq_qlen', 'int64', None, None, ['optional']) \
    .custom_input('actual_seq_kvlen', 'int64', None, None, ['optional']) \
    .custom_input('q_start_idx', 'int64', None, None, ['optional']) \
    .custom_input('kv_start_idx', 'int64', None, None, ['optional']) \
    .custom_output('softmax_max', 'float32', [24, 20, 144, 8], None) \
    .custom_output('softmax_sum', 'float32', [24, 20, 144, 8], None) \
    .custom_output('softmax_out', 'float16', [24, 20, 144, 144], None) \
    .custom_output('attention_out', 'float16', [24, 20, 144, 64], os.path.join(DATA_PATH, 'attention_out.bin')) \
    .attr('scale_value', 'float', 1.0) \
    .attr('keep_prob', 'float', 0.8) \
    .attr('pre_tockens', 'int', 2147483647) \
    .attr('next_tockens', 'int', 2147483647) \
    .attr('head_num', 'int', 20) \
    .attr('input_layout', 'string', 'BSH') \
    .attr('inner_precise', 'int', 0)
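The shapes in the example above are related to each other, and a quick arithmetic check can catch mismatches before a run. The sketch below only restates relations that can be observed in the example's numbers; the interpretation (per-head dimension, one mask bit per attention score) is an assumption and not a statement of the operator's specification, and the variable names are hypothetical.

```python
# Values taken from the example: BSH input layout and the head_num attribute.
B, S, H = 24, 144, 1280   # batch, sequence length, hidden size
N = 20                    # head_num

# Assumption: the hidden size splits evenly across heads.
assert H % N == 0
D = H // N                # per-head dimension

# Shapes as they appear in the example's custom_output calls.
softmax_stat_shape = [B, N, S, 8]      # softmax_max / softmax_sum
softmax_out_shape = [B, N, S, S]       # softmax_out
attention_out_shape = [B, N, S, D]     # attention_out

# Assumption: drop_mask packs one bit per attention score into uint8 bytes.
drop_mask_len = B * N * S * S // 8

print(D, attention_out_shape, drop_mask_len)
```

With the example's values this gives a per-head dimension of 64 and a drop_mask length of 1244160, which agree with the shapes passed to custom_output and custom_input.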

Create the operator debugger object. For example:

op_executor = ascendebug.create_op_executor(debug_op=debug_op, work_dir='./debug_workspace', install_path='/usr/local/Ascend')

Construct the input parameters and call a debugging API. The CPU debugging interface is used as the example here.

# repo_path and tiling_info are not defined in this snippet; they must be prepared beforehand.
cpu_options = ascendebug.CpuOptions()
op_executor.run_ops_adv_cpu(repo_path, tiling_info, cpu_options)
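Once a run has produced output data, it is typically checked against the golden data prepared earlier. The following sketch shows one common way to do such a check with numpy. It is not ascendebug's own comparison logic: the function name compare_with_golden and the tolerance values are hypothetical and should be adapted to the precision requirements of the operator.

```python
import numpy as np


def compare_with_golden(actual, golden, rtol=1e-3, atol=1e-3):
    """Return (passed, max_abs_err, mismatch_ratio) for two same-shaped arrays."""
    # Compare in float32 so that float16 rounding does not distort the error itself.
    actual = np.asarray(actual, dtype=np.float32)
    golden = np.asarray(golden, dtype=np.float32)
    if actual.shape != golden.shape:
        raise ValueError(f'shape mismatch: {actual.shape} vs {golden.shape}')
    abs_err = np.abs(actual - golden)
    # An element mismatches when its error exceeds the combined tolerance.
    mismatched = abs_err > (atol + rtol * np.abs(golden))
    return (not mismatched.any(), float(abs_err.max()), float(mismatched.mean()))


# Small synthetic demonstration: perturb one of 16 elements.
golden = np.linspace(-1.0, 1.0, 16, dtype=np.float16)
actual = golden.astype(np.float32)
actual[3] += 0.5

passed, max_err, ratio = compare_with_golden(actual, golden)
print(passed, max_err, ratio)
```

For real data, both arrays would be loaded from the output and golden bin files with np.fromfile, using the dtype and shape declared for that output.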

List of APIs Used

Table 1 lists all the debugging APIs involved in this scenario.

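Table 1 Debugging APIs for the ops_adv project scenario
Debugging API	Description
create_debug_op	Constructs a DebugOp object from the given op_type, core_type, and other information, and manages the operator's descriptive information.
create_op_executor	Builds the debugging object, initializes the workspace, sets environment variables, and performs other debugging-related setup.
compile_ops_adv_tiling	Tiling compilation interface for the ops_adv operator project scenario. Compiles the local code into a Tiling so file.
run_tiling	General-purpose Tiling run interface.
run_ops_adv_tiling	Tiling run interface for the ops_adv operator project scenario. Automatically obtains the Tiling so from the CANN package, then runs the Tiling function computation.
run_ops_adv_cpu	CPU-side compile-and-run interface for operators in the ops_adv operator project scenario.
compile_ops_adv_npu	NPU-side compilation interface for operators in the ops_adv operator project scenario. Generates the kernel.o build artifact.
run_npu	General-purpose interface for running an operator on the NPU board.
run_camodel	General-purpose CAModel run interface.
run_profiling	General-purpose Profiling run interface.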
