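Quick Start

Introduction

The MindStudio operator development toolkit comprises several tools, including msKPP, msOpGen, msOpST, msSanitizer, msDebug, and msProf. This document uses a simple sample to walk through the complete workflow of these tools.

Using single-operator AclNN invocation as the example, the sample shows how to use the tools for operator design, operator project creation, functional testing, exception detection, debugging, and performance tuning.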

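Environment Preparation

Prepare a server with Atlas A2 training series products or Atlas 800I A2 inference products, and install the matching driver and firmware. For details, see "Installing the NPU Driver and Firmware".
Install Ascend-cann-toolkit. For details, see "Installing the Toolkit Development Kit".
To view results in MindStudio Insight, install the MindStudio Insight software package separately. For the download link, see "Installation and Uninstallation".

${git_clone_path} is the path where the samples repository is cloned.
Replace ${INSTALL_DIR} with the path where the CANN files are stored after installation. For example, if the Ascend-cann-toolkit package is installed as root, the path is /usr/local/Ascend/ascend-toolkit/latest.
On the server where the Ascend AI processor is installed, run the npu-smi info command to obtain the Chip Name. The value to configure is Ascend followed by the Chip Name. For example, if Chip Name is xxxyy, configure Ascendxxxyy. When Ascendxxxyy is used as part of a code sample path, configure ascendxxxyy instead.
If you need the instruction proportion pie chart (instruction_cycle_consumption.html), install plotly, the third-party Python library used to generate it.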
```
pip3 install plotly
```
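The chip naming rule described above can be captured in a small helper. This is a minimal sketch: the function name `soc_version` and its `for_path` parameter are illustrative and are not part of any CANN tool, and the path form is assumed to be the fully lowercased value.

```python
def soc_version(chip_name: str, for_path: bool = False) -> str:
    """Build the configuration value from the Chip Name reported by npu-smi info."""
    value = "Ascend" + chip_name.strip()
    # Assumption: code sample paths use the all-lowercase form, e.g. ascendxxxyy.
    return value.lower() if for_path else value


print(soc_version("xxxyy"))                 # value used for command options
print(soc_version("xxxyy", for_path=True))  # value used inside sample paths
```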

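Operator Design (msKPP)

msKPP is used before operator development. It produces operator performance modeling results within seconds, so developers can quickly validate an implementation plan.

Complete the msKPP configuration as described in Environment Preparation.

Obtain the Python script that models the operator (the add operator is used as an example).
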
```
from mskpp import vadd, Tensor, Chip

def my_vadd(gm_x, gm_y, gm_z):
    # Basic data path of the vector Add:
    # augend x: GM-UB
    # addend y: GM-UB
    # result vector z: UB-GM

    # Define and allocate the variables on UB
    x = Tensor("UB")
    y = Tensor("UB")
    z = Tensor("UB")

    # Move the data from GM to the corresponding memory space on UB
    x.load(gm_x)
    y.load(gm_y)

    # The data is now on UB; call the instruction to compute, and keep the result on UB
    out = vadd(x, y, z)()

    # Move the data on UB to the address space of the GM variable gm_z
    gm_z.load(out[0])

if __name__ == '__main__':
    with Chip("Ascendxxxyy") as chip:  # xxxyy is the chip type actually in use
        chip.enable_trace()
        chip.enable_metrics()

        # Run the operator as an AICORE computation
        in_x = Tensor("GM", "FP16", [32, 48], format="ND")
        in_y = Tensor("GM", "FP16", [32, 48], format="ND")
        in_z = Tensor("GM", "FP16", [32, 48], format="ND")
        my_vadd(in_x, in_y, in_z)
```
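As a quick sanity check on the modeling input, the amount of data the script moves can be worked out by hand. The sketch below is plain Python arithmetic, not an msKPP API; the helper `tensor_bytes` and the `DTYPE_BYTES` table are illustrative names, and only the FP16 type used in the script is listed.

```python
from math import prod

# Bytes per element; only the dtype used in the modeling script is listed.
DTYPE_BYTES = {"FP16": 2}


def tensor_bytes(shape, dtype):
    """Size in bytes of a dense tensor with the given shape and dtype."""
    return prod(shape) * DTYPE_BYTES[dtype]


per_tensor = tensor_bytes([32, 48], "FP16")  # 32 * 48 elements, 2 bytes each
gm_to_ub = 2 * per_tensor                    # x and y are loaded from GM to UB
ub_to_gm = 1 * per_tensor                    # z is written back from UB to GM
print(per_tensor, gm_to_ub, ub_to_gm)
```

These hand-computed byte counts are only a reference point for reading the data volume reported in the modeling results.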


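Run the Python script from step 2. The following result files are generated in the current directory. For an analysis of their contents, see "Operator Computation and Data-Movement Specification Analysis", "Peak Performance Analysis", and "Preliminary Operator Tiling Design".

Table 1 Modeling result files
File name	Function
Pipe statistics (Pipe_statistic.csv)	Reports the amount of data moved, the operation count, and the time consumed, broken down by PIPE.
Instruction statistics (Instruction_statistic.csv)	Reports the total amount of data moved, the operation count, and the time consumed per instruction, which helps locate instruction-level bottlenecks.
Instruction proportion pie chart (instruction_cycle_consumption.html)	Shows the time consumed per instruction as a pie chart.
Instruction pipeline chart (trace.json)	Shows the time consumed per instruction in a visual timeline.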
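After the script runs, it can be useful to confirm that all four result files listed in the table exist. The following is a minimal sketch; `missing_results` is a hypothetical helper, and it assumes the files sit directly in the directory that is passed in.

```python
from pathlib import Path

# File names taken from the modeling result table.
EXPECTED_FILES = [
    "Pipe_statistic.csv",
    "Instruction_statistic.csv",
    "instruction_cycle_consumption.html",  # the pie chart requires plotly
    "trace.json",
]


def missing_results(result_dir="."):
    """Return the expected result files that are not present under result_dir."""
    root = Path(result_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means every expected file was found. If only instruction_cycle_consumption.html is missing, check whether plotly is installed.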

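Creating an Operator Project (msOpGen)

msOpGen is used during operator development to generate a custom operator project, so that developers can focus on the operator's core logic and algorithm instead of repetitive work such as project setup and build configuration.

Generate the operator directory.

Place the operator definition file AddCustom.json in the working directory. For details about the JSON configuration parameters, see Table 1.
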
```
[
    {
        "op": "AddCustom",
        "language": "cpp",
        "input_desc": [
            {
                "name": "x",
                "param_type": "required",
                "format": [
                    "ND"
                ],
                "type": [
                    "float16"
                ]
            },
            {
                "name": "y",
                "param_type": "required",
                "format": [
                    "ND"
                ],
                "type": [
                    "float16"
                ]
            }
        ],
        "output_desc": [
            {
                "name": "z",
                "param_type": "required",
                "format": [
                    "ND"
                ],
                "type": [
                    "float16"
                ]
            }
        ]
    }
]
```
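Before passing the definition file to msOpGen, a quick structural check can catch typos early. The sketch below is illustrative only: `check_op_definition` is a hypothetical helper, and the set of required keys is an assumption based on the fields that appear in the sample above, not on the official parameter table.

```python
import json

# Assumption: the keys every input/output description carries in the sample.
REQUIRED_DESC_KEYS = {"name", "param_type", "format", "type"}


def check_op_definition(text):
    """Return a list of problems found in an operator definition JSON string."""
    problems = []
    for op in json.loads(text):
        op_name = op.get("op", "<missing op>")
        if "op" not in op:
            problems.append("entry without an 'op' name")
        for section in ("input_desc", "output_desc"):
            for desc in op.get(section, []):
                missing = REQUIRED_DESC_KEYS - desc.keys()
                if missing:
                    problems.append(f"{op_name}.{section}: missing {sorted(missing)}")
    return problems
```

To check the file on disk, read AddCustom.json as text and pass it to the function. An empty list means no structural problems were detected.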


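Run the following command to generate the operator development project. For parameter descriptions, see Table 2.

```
msopgen gen -i AddCustom.json -f tf -c ai_core-ascendxxxyy -lan cpp -out AddCustom  # xxxyy is the chip type actually in use
```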
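When the project is generated from a script, the command can be assembled as an argument list. This is a minimal sketch that uses only the options shown in the command above; the function name `msopgen_gen_args` and its parameters are illustrative.

```python
def msopgen_gen_args(json_path, chip_name, out_dir, framework="tf", language="cpp"):
    """Assemble the `msopgen gen` argument list from the options shown above."""
    return [
        "msopgen", "gen",
        "-i", json_path,
        "-f", framework,
        "-c", f"ai_core-ascend{chip_name}",
        "-lan", language,
        "-out", out_dir,
    ]


# On a machine with CANN installed, the list can be passed to
# subprocess.run(args, check=True).
print(" ".join(msopgen_gen_args("AddCustom.json", "xxxyy", "AddCustom")))
```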

Run the following command to view the generated directory.
```
tree -C -L 2 AddCustom/
```

The operator project is generated in the specified directory.

```
AddCustom
├── build.sh
├── cmake
├── CMakeLists.txt
├── CMakePresets.json
├── framework
│   ├── CMakeLists.txt
│   └── tf_plugin
├── op_host
│   ├── add_custom.cpp
│   ├── add_custom_tiling.h
│   └── CMakeLists.txt
├── op_kernel
│   ├── add_custom.cpp
│   └── CMakeLists.txt
└── scripts
    ├── help.info
    ├── install.sh
    └── upgrade.sh
```

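Obtain the code sample for operator kernel function development and Tiling implementation from the samples repository. Then run the following command to copy the operator implementation files from the sample directory into the directory generated in step 1 of the msOpGen procedure.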
```
cp -r ${git_clone_path}/samples/operator/ascendc/0_introduction/1_add_frameworklaunch/AddCustom/* AddCustom/
```
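- After the operator project is created, you would normally develop the operator by following the Ascend C Operator Development Guide. Because this step only needs to demonstrate the operator development tools, the code sample is used directly.
- When downloading the code sample, run the following command to specify the branch version.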
```
git clone https://gitee.com/ascend/samples.git -b v0.2-8.0.0.beta1
```
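Build the operator project.
1. Complete the build configuration as described in "Preparations Before Building".
2. In the operator project directory, run the following command to build the operator project.
  After the build completes, the .run operator package is generated in the build_out directory.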
```
./build.sh
```
In the directory that contains the custom operator package, run the following command to deploy the operator package.
```
./build_out/custom_opp_<target_os>_<target_architecture>.run
```
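Because the package name depends on the target OS and architecture, a script may need to look it up instead of hard-coding it. The following is a minimal sketch; `find_run_packages` is a hypothetical helper that relies only on the custom_opp_ prefix and .run suffix shown above.

```python
from pathlib import Path


def find_run_packages(project_dir):
    """Return custom_opp_*.run packages under <project_dir>/build_out, sorted by name."""
    return sorted(Path(project_dir, "build_out").glob("custom_opp_*.run"))
```

For example, `find_run_packages("AddCustom")` lists the packages produced by build.sh in the AddCustom project.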

Verify the operator's functionality and generate the executable file execute_add_op.

Switch to the AclNNInvocation directory.

```
cd ${git_clone_path}/samples/operator/ascendc/0_introduction/1_add_frameworklaunch/AclNNInvocation
```

Run the following command.
```
./run.sh
```

The precision comparison succeeds and the executable file execute_add_op is generated.

```
INFO: execute op!
[INFO]  Set device[0] success
[INFO]  Get RunMode[1] success
[INFO]  Init resource success
[INFO]  Set input success
[INFO]  Copy input[0] success
[INFO]  Copy input[1] success
[INFO]  Create stream success
[INFO]  Execute aclnnAddCustomGetWorkspaceSize success, workspace size 0
[INFO]  Execute aclnnAddCustom success
[INFO]  Synchronize stream success
[INFO]  Copy output[0] success
[INFO]  Write output success
[INFO]  Run op success
[INFO]  Reset Device success
[INFO]  Destroy resource success
INFO: acl executable run success!
error ratio: 0.0000, tolerance: 0.0010
test pass
```
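The last two lines of the log report an error ratio and a tolerance. One plausible way to compute such a check is sketched below. This is an assumption for illustration: the sample's own verification script may use a different criterion, and the names `error_ratio` and `verify` as well as the per-element thresholds are hypothetical.

```python
def error_ratio(actual, expected, rtol=1e-3, atol=1e-3):
    """Fraction of elements whose difference exceeds atol + rtol * |expected|."""
    if len(actual) != len(expected):
        raise ValueError("length mismatch")
    if not actual:
        return 0.0
    bad = sum(
        1 for a, e in zip(actual, expected)
        if abs(a - e) > atol + rtol * abs(e)
    )
    return bad / len(actual)


def verify(actual, expected, tolerance=1e-3):
    """Pass when the share of out-of-tolerance elements is at most `tolerance`."""
    ratio = error_ratio(actual, expected)
    print(f"error ratio: {ratio:.4f}, tolerance: {tolerance:.4f}")
    return ratio <= tolerance
```

Here the test passes when the share of mismatching elements does not exceed the tolerance, which is consistent with an error ratio of 0.0000 passing against a tolerance of 0.0010.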


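Operator Functional Testing (msOpST)

msOpST is used after operator development is complete to run a preliminary functional test of the operator. It makes analyzing and optimizing the operator more efficient, which improves execution efficiency and lowers development cost.

This sample follows the AscendCL-based flow: it generates a single-operator OM file and executes it to verify that the operator produces correct results.

Generate the ST test case.

After step 2 of Creating an Operator Project is complete, run the following command, replacing the path with your msOpGen operator project directory.

```
msopst create -i "$HOME/AddCustom/op_host/add_custom.cpp" -out ./st
```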

Output similar to the following indicates that the ST test case has been generated.

```
2024-09-10 19:47:15 (3995495) - [INFO] Start to parse AscendC operator prototype definition in $HOME/AddCustom/op_host/add_custom.cpp.
2024-09-10 19:47:15 (3995495) - [INFO] Start to check valid for op info.
2024-09-10 19:47:15 (3995495) - [INFO] Finish to check valid for op info.
2024-09-10 19:47:15 (3995495) - [INFO] Generate test case file $HOME/AddCustom/st/AddCustom_case_20240910194715.json successfully.
2024-09-10 19:47:15 (3995495) - [INFO] Process finished!
```

The ST test case is generated in the ./st directory.
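The test case file name carries a timestamp, so a script that runs the test afterwards has to find the newest file. The following is a minimal sketch; `latest_case_file` is a hypothetical helper, and it assumes the timestamp is always the 14-digit form seen in the log above.

```python
import re
from pathlib import Path

# Assumption: names follow AddCustom_case_<YYYYMMDDhhmmss>.json as in the log.
CASE_PATTERN = re.compile(r"^AddCustom_case_(\d{14})\.json$")


def latest_case_file(st_dir):
    """Return the newest AddCustom_case_<timestamp>.json in st_dir, or None."""
    candidates = []
    for path in Path(st_dir).iterdir():
        match = CASE_PATTERN.match(path.name)
        if match:
            candidates.append((match.group(1), path))
    # A fixed-width timestamp sorts chronologically as a string.
    return max(candidates)[1] if candidates else None
```

The returned path can then be used as the -i argument when running the ST test.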

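Run the ST test.

Set the environment variables according to the CANN package path.

```
export DDK_PATH=${INSTALL_DIR}
export NPU_HOST_LIB=${INSTALL_DIR}/{arch-os}/devlib
```

Run the ST test and write the results to the specified path.

```
msopst run -i ./st/AddCustom_case_{TIMESTAMP}.json -soc Ascendxxxyy -out ./st/out   # xxxyy is the chip type actually in use
```

After the test succeeds, the results are written to the st.report.json file under ./st/out/xxxx/. For details, see Table 3.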

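Operator Exception Detection (msSanitizer)

msSanitizer applies throughout the operator development cycle and helps developers ensure operator quality and stability. Finding and fixing exceptions early reduces post-release risk and later maintenance cost.

After the tool starts, it automatically generates the run log file mssanitizer_{TIMESTAMP}_{PID}.log in the current directory. When the user program finishes, the exception report is printed on the screen.

In the ${git_clone_path}/samples/operator/ascendc/0_introduction/1_add_frameworklaunch directory, run the following command to generate the custom operator project with the host-side and kernel-side operator implementations.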
```
bash install.sh -v Ascendxxxyy    # xxxyy is the chip type actually in use
```
In the ${git_clone_path}/samples/operator/ascendc/0_introduction/1_add_frameworklaunch/CustomOp directory, run the following commands to recompile and deploy the operator.
```
bash build.sh
./build_out/custom_opp_<target_os>_<target_architecture>.run   # name of the .run package in the current directory
```
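Switch to the ${git_clone_path}/samples/operator/ascendc/0_introduction/1_add_frameworklaunch/AclNNInvocation directory and launch the operator API run script to perform memory checking.

${git_clone_path} is the path of the samples repository.
1. Enable memory checking:
  - You can specify memory checking explicitly. By default, detection of illegal reads and writes, cross-core memory overwrites, unaligned access, and illegal frees is enabled: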
```
mssanitizer --tool=memcheck bash run.sh
```
  - Run the following command to manually enable memory leak detection:
```
mssanitizer --tool=memcheck --leak-check=yes bash run.sh
```
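2. Locate memory exceptions. For details, see "Memory Exception Report Analysis".
Perform race detection.
1. Run the following command to enable race detection.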
```
mssanitizer --tool=racecheck bash run.sh
```
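2. Locate memory races. For details, see "Race Exception Report Analysis".
  The run log file mssanitizer_{TIMESTAMP}_{PID}.log is generated automatically in the current directory. When the user program finishes, the exception report is printed on the screen.
Perform uninitialized-memory detection.
1. Run the following command to manually enable uninitialized-memory detection.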
```
mssanitizer --tool=initcheck bash run.sh 
```
2. Locate memory exceptions. For details, see "Uninitialized Exception Report Analysis".
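The three detection modes in this section differ only in the --tool value and the optional leak check. A script that runs them in sequence could build the command lines as sketched below. This is illustrative only: `mssanitizer_args` is a hypothetical helper, and it uses only the options that appear in this section.

```python
def mssanitizer_args(tool, command, leak_check=False):
    """Assemble an mssanitizer command line from the options shown in this section."""
    if tool not in ("memcheck", "racecheck", "initcheck"):
        raise ValueError(f"unknown tool: {tool}")
    args = ["mssanitizer", f"--tool={tool}"]
    if leak_check:
        # This section shows --leak-check only in combination with memcheck.
        if tool != "memcheck":
            raise ValueError("--leak-check is shown only together with memcheck")
        args.append("--leak-check=yes")
    return args + list(command)


for tool in ("memcheck", "racecheck", "initcheck"):
    print(" ".join(mssanitizer_args(tool, ["bash", "run.sh"])))
```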

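Operator Debugging (msDebug)

msDebug can debug all Ascend operators. Choose the features you need, for example setting breakpoints, printing variables and memory, single-stepping, interrupting execution, and switching cores.

In the ${git_clone_path}/samples/operator/ascendc/0_introduction/1_add_frameworklaunch directory, run the following command to generate the custom operator project with the host-side and kernel-side operator implementations.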
```
bash install.sh -v Ascendxxxyy    # xxxyy is the chip type actually in use
```

In the ${git_clone_path}/samples/operator/ascendc/0_introduction/1_add_frameworklaunch/CustomOp directory, edit the cacheVariables settings in the CMakePresets.json file and change "Release" to "Debug".

```
"cacheVariables": {
    "CMAKE_BUILD_TYPE": {
        "type": "STRING",
        "value": "Debug"
    },
```
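The same change can be scripted. The following is a minimal sketch, not a tool shipped with CANN: `set_build_type` and `update_presets_file` are hypothetical helpers, and the code searches the whole JSON document for cacheVariables entries so that it makes no assumption about where they sit in the file.

```python
import json


def set_build_type(node, build_type="Debug"):
    """Recursively set cacheVariables.CMAKE_BUILD_TYPE.value; return the number of changes."""
    changed = 0
    if isinstance(node, dict):
        cache = node.get("cacheVariables")
        if isinstance(cache, dict):
            entry = cache.get("CMAKE_BUILD_TYPE")
            if isinstance(entry, dict) and entry.get("value") != build_type:
                entry["value"] = build_type
                changed += 1
        for child in node.values():
            changed += set_build_type(child, build_type)
    elif isinstance(node, list):
        for child in node:
            changed += set_build_type(child, build_type)
    return changed


def update_presets_file(path, build_type="Debug"):
    """Rewrite a presets file in place; note that this reformats the JSON."""
    with open(path, encoding="utf-8") as f:
        presets = json.load(f)
    changed = set_build_type(presets, build_type)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(presets, f, indent=4)
        f.write("\n")
    return changed
```

Calling `update_presets_file("CMakePresets.json")` in the CustomOp directory returns the number of entries that were switched to Debug.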

Recompile and deploy the operator.
1. In the ${git_clone_path}/samples/operator/ascendc/0_introduction/1_add_frameworklaunch/CustomOp directory, run the following commands to recompile and deploy the operator.
```
bash build.sh
./build_out/custom_opp_<target_os>_<target_architecture>.run  # name of the .run package in the current directory
```
2. Switch to the ${git_clone_path}/samples/operator/ascendc/0_introduction/1_add_frameworklaunch/AclNNInvocation directory and run the following commands. The executable file execute_add_op is generated in the ./output directory.
```
bash run.sh
cd  ./output
```

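Before debugging, set the following environment variable to specify the operator load path and import the debug information.

```
export LAUNCH_KERNEL_PATH=${INSTALL_DIR}/opp/vendors/customize/op_impl/ai_core/tbe/kernel/${soc_version}/add_custom/AddCustom_1e04ee05ab491cc5ae9c3d5c9ee8950b.o   # soc_version is the Chip Name of the Ascend AI processor
```

Specify the path of the dynamic libraries that the operator depends on so that the .so files can be loaded.

```
export LD_LIBRARY_PATH=$ASCEND_HOME_PATH/opp/vendors/customize/op_api/lib:$LD_LIBRARY_PATH
```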
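The kernel object file name in LAUNCH_KERNEL_PATH contains a hash-like suffix that may differ in your environment, so it can be convenient to look the file up. The following is a minimal sketch; `find_kernel_objects` is a hypothetical helper that assumes the directory layout shown in the export command above.

```python
from pathlib import Path


def find_kernel_objects(install_dir, soc_version, op_dir="add_custom", prefix="AddCustom_"):
    """List kernel .o files under the vendor kernel directory used by LAUNCH_KERNEL_PATH."""
    kernel_dir = Path(
        install_dir, "opp", "vendors", "customize", "op_impl",
        "ai_core", "tbe", "kernel", soc_version, op_dir,
    )
    return sorted(kernel_dir.glob(f"{prefix}*.o"))
```

If exactly one path is returned, it is the value to export as LAUNCH_KERNEL_PATH.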

In the directory that contains the executable file, run msdebug execute_add_op to enter msdebug.
```
msdebug execute_add_op
```

Breakpoints.

Set a breakpoint.
```
(msdebug) b add_custom.cpp:55
```

The output confirms that the breakpoint was added.

```
Breakpoint 1: where = AddCustom_1e04ee05ab491cc5ae9c3d5c9ee8950b.o`KernelAdd::Compute(int) (.vector) + 68 at add_custom.cpp:55:9, address = 0x00000000000014f4
```

Type the r command to run the operator program, and wait until the breakpoint is hit:

```
(msdebug) r
Process 1454802 launched: '${INSTALL_DIR}/add_cus/AclNNInvocation/output/execute_add_op' (aarch64)
[INFO]  Set device[0] success
[INFO]  Get RunMode[1] success
[INFO]  Init resource success
[INFO]  Set input success
[INFO]  Copy input[0] success
[INFO]  Copy input[1] success
[INFO]  Create stream success
[INFO]  Execute aclnnAddCustomGetWorkspaceSize success, workspace size 0
[Launch of Kernel AddCustom_1e04ee05ab491cc5ae9c3d5c9ee8950b on Device 0]
[INFO]  Execute aclnnAddCustom success
Process 1454802 stopped
[Switching to focus on Kernel AddCustom_1e04ee05ab491cc5ae9c3d5c9ee8950b, CoreId 39, Type aiv]
* thread #1, name = 'execute_add_op', stop reason = breakpoint 1.1
    frame #0: 0x00000000000014f4 AddCustom_1e04ee05ab491cc5ae9c3d5c9ee8950b.o`KernelAdd::Compute(this=0x00000000003078a8, progress=0) (.vector) at add_custom.cpp:55:9
   52       __aicore__ inline void Compute(int32_t progress)
   53       {
   54           LocalTensor<DTYPE_X> xLocal = inQueueX.DeQue<DTYPE_X>();
-> 55           LocalTensor<DTYPE_Y> yLocal = inQueueY.DeQue<DTYPE_Y>();   // Only the breakpoint line number needs to match; other details depend on your environment
   56           LocalTensor<DTYPE_Z> zLocal = outQueueZ.AllocTensor<DTYPE_Z>();
   57           Add(zLocal, xLocal, yLocal, this->tileLength);
   58           outQueueZ.EnQue<DTYPE_Z>(zLocal);
```

Continue execution.

Type the following command to continue.
```
(msdebug) c
```

The output shows that the program hits the breakpoint again.

```
Process 1454802 resuming
Process 1454802 stopped
[Switching to focus on Kernel AddCustom_1e04ee05ab491cc5ae9c3d5c9ee8950b, CoreId 39, Type aiv]
* thread #1, name = 'execute_add_op', stop reason = breakpoint 1.1
    frame #0: 0x00000000000014f4 AddCustom_1e04ee05ab491cc5ae9c3d5c9ee8950b.o`KernelAdd::Compute(this=0x00000000003078a8, progress=0) (.vector) at add_custom.cpp:55:9
   52       __aicore__ inline void Compute(int32_t progress)
   53       {
   54           LocalTensor<DTYPE_X> xLocal = inQueueX.DeQue<DTYPE_X>();
-> 55           LocalTensor<DTYPE_Y> yLocal = inQueueY.DeQue<DTYPE_Y>();   // Only the breakpoint line number needs to match; other details depend on your environment
   56           LocalTensor<DTYPE_Z> zLocal = outQueueZ.AllocTensor<DTYPE_Z>();
   57           Add(zLocal, xLocal, yLocal, this->tileLength);
   58           outQueueZ.EnQue<DTYPE_Z>(zLocal);
```

End the debugging session:
```
(msdebug) q
```

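Operator Tuning (msProf)

msProf is mainly used in the performance optimization stage of operator development. It helps developers make sure that operators run efficiently across hardware platforms, which improves overall software performance and user experience.

In the ${git_clone_path}/samples/operator/ascendc/0_introduction/1_add_frameworklaunch directory, run the following command to generate the custom operator project with the host-side and kernel-side operator implementations.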
```
bash install.sh -v Ascendxxxyy    # xxxyy is the chip type actually in use
```
In the ${git_clone_path}/samples/operator/ascendc/0_introduction/1_add_frameworklaunch/CustomOp directory, run the following commands to recompile and deploy the operator.
```
bash build.sh
./build_out/custom_opp_<target_os>_<target_architecture>.run  # name of the .run package in the current directory
```
Switch to the ${git_clone_path}/samples/operator/ascendc/0_introduction/1_add_frameworklaunch/AclNNInvocation directory and run the following command to generate the executable file:
```
./run.sh
```

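Specify the path of the dynamic libraries that the operator depends on so that the .so files can be loaded.

```
export LD_LIBRARY_PATH=$ASCEND_HOME_PATH/opp/vendors/customize/op_api/lib:$LD_LIBRARY_PATH
```

Use msprof op for on-board tuning.

Go to the ${git_clone_path}/samples/operator/ascendc/0_introduction/1_add_frameworklaunch/AclNNInvocation/output directory and run the following command to start on-board tuning.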
```
msprof op --output=./output_data ./execute_add_op
```

The following result directory is generated.

```
OPPROF_20240911145000_YLKFDJDQNXGDTXPH/
├── ArithmeticUtilization.csv
├── dump
├── L2Cache.csv
├── Memory.csv
├── MemoryL0.csv
├── MemoryUB.csv
├── OpBasicInfo.csv
├── PipeUtilization.csv
├── ResourceConflictRatio.csv
└── visualize_data.bin
```

Import the visualize_data.bin file into MindStudio Insight to visualize the on-board results. For details, see "msprof op".
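Each run creates a new OPPROF_ directory, so a script that post-processes the metrics first has to find the latest one. The following is a minimal sketch; `latest_opprof_dir` and `list_metric_files` are hypothetical helpers, and they assume that the directories are created under the path given to --output and that the timestamp in the name has a fixed width.

```python
from pathlib import Path


def latest_opprof_dir(output_dir):
    """Return the newest OPPROF_<timestamp>_<suffix> directory under output_dir, or None."""
    dirs = [p for p in Path(output_dir).glob("OPPROF_*") if p.is_dir()]
    # The timestamp follows the fixed OPPROF_ prefix, so name order is time order.
    return max(dirs, key=lambda p: p.name) if dirs else None


def list_metric_files(opprof_dir):
    """Return the names of the CSV metric files in an OPPROF result directory."""
    return sorted(p.name for p in Path(opprof_dir).glob("*.csv"))
```

For example, `list_metric_files(latest_opprof_dir("./output_data"))` returns the CSV file names of the most recent run.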

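Complete the simulation configuration by following the "Operator Tuning (msProf) > msProf > Preparations" section on msprof op simulator configuration.

Use msprof op simulator for simulation tuning.

Go to the ${git_clone_path}/samples/operator/ascendc/0_introduction/1_add_frameworklaunch/AclNNInvocation/output directory and run the following command to start simulation tuning.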
```
msprof op simulator --soc-version=Ascendxxxyy --output=./output_data ./execute_add_op
```

The following result directory is generated.

```
OPPROF_20240911150827_GYCKQHGDUHJFYICF/
├── dump
└── simulator
    ├── core0.veccore0
    ├── core0.veccore1
    ├── core1.veccore0
    ├── core1.veccore1
    ├── core2.veccore0
    ├── core2.veccore1
    ├── core3.veccore0
    ├── core3.veccore1
    ├── trace.json
    └── visualize_data.bin
```

Import the trace.json and visualize_data.bin files into MindStudio Insight to visualize the simulation results. For details, see "msprof op simulator".
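The entries under the simulator directory follow a core<N>.veccore<M> naming pattern. To see which vector cores produced data, the names can be grouped as sketched below. `group_vector_cores` is a hypothetical helper, and the pattern is inferred only from the listing above.

```python
import re

# Pattern inferred from names such as core0.veccore1 in the listing.
CORE_DIR = re.compile(r"^core(\d+)\.veccore(\d+)$")


def group_vector_cores(names):
    """Map each core index to the sorted vector-core indices found in names."""
    grouped = {}
    for name in names:
        match = CORE_DIR.match(name)
        if match:  # skips trace.json, visualize_data.bin, and anything else
            core, vec = int(match.group(1)), int(match.group(2))
            grouped.setdefault(core, []).append(vec)
    return {core: sorted(vecs) for core, vecs in sorted(grouped.items())}
```

Passing the entry names of the simulator directory, for example from `os.listdir`, yields one dictionary entry per core.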