Overview

This section lists commonly used custom interfaces, covering tensor creation and compute operations.

Table 1 torch_npu API
API name	Description
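(beta) torch_npu._npu_dropout	Computes dropout without using a seed. Similar to torch.dropout, with an implementation optimized for NPU devices.
(beta) torch_npu.copy_memory_	Copies elements from src into the self tensor and returns self.
(beta) torch_npu.empty_with_format	Returns a tensor filled with uninitialized data.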
(beta) torch_npu.fast_gelu	Applies the fast_gelu activation to each element of the input tensor.
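(beta) torch_npu.npu_alloc_float_status	Allocates a tensor to be passed as the input argument in overflow detection mode.
(beta) torch_npu.npu_anchor_response_flags	Generates the responsible flags of anchors in a single feature map.
(beta) torch_npu.npu_apply_adam	Computes the result of an Adam update.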
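(beta) torch_npu.npu_batch_nms	Computes scores for the input boxes per class within each batch, sorts the boxes by score, and removes boxes whose IoU with a higher-scoring box exceeds the threshold (iou_threshold). Supports multi-batch, multi-class processing. The NonMaxSuppression (nms) operation removes redundant input boxes and improves detection accuracy. NonMaxSuppression suppresses elements that are not local maxima while searching for local maxima, and is commonly used in detection models for computer vision tasks.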
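(beta) torch_npu.npu_bert_apply_adam	Computes the result of an Adam update.
(beta) torch_npu.npu_bmmV2	Multiplies matrix "a" by matrix "b" to produce "a*b".
(beta) torch_npu.npu_bounding_box_decode	Generates bounding boxes from rois and deltas. Custom FasterRcnn operator.
(beta) torch_npu.npu_bounding_box_encode	Computes the coordinate deltas between bounding boxes and ground-truth boxes. Custom FasterRcnn operator.
(beta) torch_npu.npu_broadcast	Returns a new view of the self tensor with its singleton dimensions expanded; the result is contiguous. The tensor can also be expanded to more dimensions, with the new dimensions added at the front.
(beta) torch_npu.npu_ciou	Applies an NPU-based CIoU operation. CIoU extends DIoU with an additional penalty term.
(beta) torch_npu.npu_clear_float_status	Clears the flags related to overflow detection.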
(beta) torch_npu.npu_confusion_transpose	Fuses the reshape and transpose operations.
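(beta) torch_npu.npu_conv_transpose2d	Applies a 2D transposed convolution operator over an input image composed of several input planes; this is sometimes also called "deconvolution".
(beta) torch_npu.npu_conv2d	Applies a 2D convolution over an input image composed of several input planes.
(beta) torch_npu.npu_conv3d	Applies a 3D convolution over an input image composed of several input planes.
(beta) torch_npu.npu_convolution	Applies a 2D or 3D convolution over an input image composed of several input planes.
(beta) torch_npu.npu_convolution_transpose	Applies a 2D or 3D transposed convolution operator over an input image composed of several input planes; this is sometimes also called "deconvolution".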
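(beta) torch_npu.npu_deformable_conv2d	Computes the deformable convolution output (deformed convolution output) for the given input.
(beta) torch_npu.npu_diou	Applies an NPU-based DIoU operation. It takes into account the distance between targets as well as the overlap rate between distance and extent, so that regression across different targets or boundaries tends to be stable.
(beta) torch_npu.npu_dtype_cast	Converts the data type (dtype) of a tensor.
(beta) torch_npu.npu_format_cast	Changes the format of an NPU tensor.
(beta) torch_npu.npu_format_cast_	Changes the format of the self tensor in place to match the format of src.
(beta) torch_npu.npu_get_float_status	Gets the overflow detection result.
(beta) torch_npu.npu_giou	First computes the smallest enclosing area of the two boxes and their IoU, then computes the fraction of the enclosing area that belongs to neither box, and finally subtracts this fraction from the IoU to obtain the GIoU.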
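(beta) torch_npu.npu_grid_assign_positive	Performs the gradient computation of position-sensitive region-of-interest pooling.
(beta) torch_npu.npu_gru	Computes DynamicGRUV2.
(beta) torch_npu.npu_ifmr	Computes the ifmr result using the "begin, end, strides" arrays.
(beta) torch_npu.npu_indexing	Computes the indexing result using the "begin, end, strides" arrays.
(beta) torch_npu.npu_iou	Computes the intersection over union (IoU) or intersection over foreground (IoF) from the ground-truth and predicted regions.
(beta) torch_npu.npu_layer_norm_eval	Computes the layer normalization result. Same as torch.nn.functional.layer_norm, with an implementation optimized for NPU devices.
(beta) torch_npu.npu_linear	Multiplies matrix "a" by matrix "b" to produce "a*b".
(beta) torch_npu.npu_lstm	Computes DynamicRNN.
(beta) torch_npu.npu_masked_fill_range	Fills a tensor along one axis within the ranges masked by range.boxes. Custom masked-fill-range operator.
(beta) torch_npu.npu_max	Computes the maximum along dim. Similar to torch.max, with an implementation optimized for NPU devices.
(beta) torch_npu.npu_min	Computes the minimum along dim. Similar to torch.min, with an implementation optimized for NPU devices.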
(beta) torch_npu.npu_mish	Computes the Mish activation of self element-wise, that is, x * tanh(softplus(x)).
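(beta) torch_npu.npu_nms_rotated	Selects a subset of rotated bounding boxes in descending order of score.
(beta) torch_npu.npu_nms_v4	Selects a subset of bounding boxes in descending order of score.
(beta) torch_npu.npu_nms_with_mask	Generates values of 0 or 1 that the nms operator uses to mark valid entries.
(beta) torch_npu.npu_normalize_batch	Performs batch normalization.
(beta) torch_npu.npu_one_hot	Returns a one-hot tensor. Positions given by the indices in input take the value on_value; all other positions take the value off_value.
(beta) torch_npu.npu_pad	Pads a tensor.
(beta) torch_npu.npu_ps_roi_pooling	Performs Position Sensitive ROI Pooling.
(beta) torch_npu.npu_ptiou	Computes the intersection over union (IoU) or intersection over foreground (IoF) from the ground-truth and predicted regions.
(beta) torch_npu.npu_random_choice_with_mask	Shuffles the indices of the non-zero elements.
(beta) torch_npu.npu_reshape	Reshapes a tensor. Only the shape changes; the data is unchanged.
(beta) torch_npu.npu_roi_align	Extracts the ROI feature matrix from a feature map. Custom FasterRcnn operator.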
(beta) torch_npu.npu_rotated_box_decode	Decodes rotated bounding boxes.
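(beta) torch_npu.npu_rotated_box_encode	Encodes rotated bounding boxes.
(beta) torch_npu.npu_rotated_iou	Computes the IoU of rotated boxes.
(beta) torch_npu.npu_rotated_overlaps	Computes the overlap area of rotated boxes.
(beta) torch_npu.npu_scatter	Computes the scatter result along dim. Similar to torch.scatter, with an implementation optimized for NPU devices.
(beta) torch_npu.npu_sign_bits_pack	Packs float-type sign bits for 1-bit Adam into uint8.
(beta) torch_npu.npu_sign_bits_unpack	Unpacks uint8-type sign bits for 1-bit Adam into float.
(beta) torch_npu.npu_silu	Computes the Swish of self.
(beta) torch_npu.npu_slice	Extracts a slice from a tensor.
(beta) torch_npu.npu_softmax_cross_entropy_with_logits	Computes the softmax cross-entropy cost.
(beta) torch_npu.npu_sort_v2	Sorts the elements of the input tensor by value in ascending order along the given dimension, without returning indices. If dim is not set, the last dimension of the input is used. If descending is True, the elements are sorted by value in descending order.
(beta) torch_npu.npu_stride_add	Adds the partial values of two tensors in NC1HWC0 format.
(beta) torch_npu.npu_transpose	Returns a view of the original tensor with its dimensions permuted; the result is contiguous.
(beta) torch_npu.npu_yolo_boxes_encode	Generates bounding boxes from YOLO anchor boxes and ground-truth boxes. Custom mmdetection operator.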
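(beta) torch_npu.npu_fused_attention_score	Implements the fused computation logic of the "Transformer attention score", fusing operations such as matmul, transpose, add, softmax, dropout, batchmatmul, and permute.
(beta) torch_npu.npu_multi_head_attention	Implements the MultiHeadAttention computation logic of the Transformer module.
(beta) torch_npu.npu_rms_norm	RmsNorm is a normalization operation commonly used in large models. Compared with LayerNorm, it omits the mean-subtraction step.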
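torch_npu.npu_anti_quant	Dequantizes INT8 data to FP16.
torch_npu.npu_fusion_attention	Implements the fused computation of the "Transformer Attention Score".
torch_npu.npu_mm_all_reduce_base	In tensor-parallel (TP) sharding scenarios, fuses mm and all_reduce; inside the fused operator, computation and communication run as a parallel pipeline.
torch_npu.npu_ffn	This operator covers both MoeFFN and FFN. Without expert grouping (expert_tokens is empty) it acts as FFN; with expert grouping it acts as MoeFFN. Both are collectively referred to as FFN and belong to the MoE structure.
torch_npu.npu_incre_flash_attention	Incremental FlashAttention (FA) implementation.
torch_npu.npu_prompt_flash_attention	Full (prompt) FlashAttention (FA) implementation.
torch_npu.npu_trans_quant_param	Converts the data type of the quantization parameter scale.
torch_npu.npu_quant_matmul	Performs quantized matrix multiplication. Inputs must have at least 2 and at most 6 dimensions.
torch_npu.npu_weight_quant_batchmatmul	Quantizes the weight input and the output of a matrix multiplication; supports per-tensor, per-channel, and per-group quantization.
torch_npu.npu_grouped_matmul	GroupedMatmul performs grouped matrix multiplication in which each group's matmul may have different dimensions, which makes it a flexible option.
torch_npu.npu_quant_scatter	Quantizes updates first, then updates the values in self with the values from updates along the specified axis and at the specified indices, and saves the result to an output tensor; the data in self is unchanged.
torch_npu.npu_quant_scatter_	Quantizes updates first, then updates the values in self with the values from updates along the specified axis and at the specified indices; the data in self is modified.
torch_npu.npu_scatter_nd_update	Updates the values in self with the values from updates at the specified indices and saves the result to an output tensor; the data in self is unchanged.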
torch_npu.npu_scatter_nd_update_	Updates the values in self with the values from updates at the specified indices; the data in self is modified in place.
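(beta) torch_npu.npu_dropout_with_add_softmax	Implements the functionality of axpy_v2, softmax_v2, and drop_out_domask_v3.
torch_npu.npu_rotary_mul	Implements RotaryEmbedding (rotary position encoding).
torch_npu.npu_scaled_masked_softmax	Computes the Softmax of the input tensor x after it is scaled and masked according to mask.
(beta) torch_npu.npu_swiglu	Provides the swiglu activation function.
(beta) torch_npu.one_	Fills the self tensor with ones.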
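
Several rows in Table 1 (npu_iou, npu_giou, npu_nms_v4, npu_batch_nms) rest on the same box arithmetic. The sketch below illustrates that arithmetic in plain Python so it can be run without an NPU. It does not call torch_npu and makes no claim about how the NPU operators are implemented. The function names, the (x1, y1, x2, y2) box layout, and the sample boxes are assumptions made for illustration only.

```python
# Plain-Python reference sketch; it does NOT call torch_npu.
# Assumed for illustration: boxes are (x1, y1, x2, y2) tuples and all
# function names below are hypothetical helpers, not torch_npu APIs.

def box_area(box):
    """Area of an axis-aligned box."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    union = box_area(box_a) + box_area(box_b) - inter
    return inter / union if union > 0 else 0.0


def giou(box_a, box_b):
    """IoU minus the fraction of the smallest enclosing box covered by neither box."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    union = box_area(box_a) + box_area(box_b) - inter_w * inter_h
    enclosing = box_area((
        min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
        max(box_a[2], box_b[2]), max(box_a[3], box_b[3]),
    ))
    if enclosing <= 0:
        return 0.0
    return iou(box_a, box_b) - (enclosing - union) / enclosing


def greedy_nms(boxes, scores, iou_threshold):
    """Return indices of kept boxes, in descending order of score.

    A box is dropped when its IoU with an already kept (higher-scoring)
    box exceeds iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0), (20.0, 20.0, 30.0, 30.0)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores, iou_threshold=0.5)
print(kept)  # [0, 2]
```

Here the second box overlaps the first with an IoU of 81/119 (about 0.68), which is above the 0.5 threshold, so it is suppressed; the third box does not overlap either of the others and is kept.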
