NPU Custom Operators

Only calling NPU custom operators through torch_npu.xxx is supported. The torch.xxx and torch.Tensor.xxx calling forms have been deprecated; calling through them may produce a warning, as shown in Figure 1.
Figure 1 Warning message
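
For reference, a minimal sketch of the supported calling form (fast_gelu is taken from Table 1 below; an available NPU device is assumed):

    >>> import torch
    >>> import torch_npu                  # registers the torch_npu.* custom operators
    >>> x = torch.rand(2, 3).npu()        # move the input tensor to the NPU device
    >>> y = torch_npu.fast_gelu(x)        # supported: torch_npu.xxx
    >>> # torch.fast_gelu(x) and x.fast_gelu() are the deprecated torch.xxx / torch.Tensor.xxx forms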
Table 1 NPU custom operators

1. torch_npu._npu_dropout
2. torch_npu.copy_memory_
3. torch_npu.empty_with_format
4. torch_npu.fast_gelu
5. torch_npu.npu_alloc_float_status
6. torch_npu.npu_anchor_response_flags
7. torch_npu.npu_apply_adam
8. torch_npu.npu_batch_nms
9. torch_npu.npu_bert_apply_adam
10. torch_npu.npu_bmmV2
11. torch_npu.npu_bounding_box_decode
12. torch_npu.npu_bounding_box_encode
13. torch_npu.npu_broadcast
14. torch_npu.npu_ciou
15. torch_npu.npu_clear_float_status
16. torch_npu.npu_confusion_transpose
17. torch_npu.npu_conv_transpose2d
18. torch_npu.npu_conv2d
19. torch_npu.npu_conv3d
20. torch_npu.npu_convolution
21. torch_npu.npu_convolution_transpose
22. torch_npu.npu_deformable_conv2d
23. torch_npu.npu_diou
24. torch_npu.npu_dtype_cast
25. torch_npu.npu_format_cast
26. torch_npu.npu_format_cast_
27. torch_npu.npu_get_float_status
28. torch_npu.npu_giou
29. torch_npu.npu_grid_assign_positive
30. torch_npu.npu_gru
31. torch_npu.npu_ifmr
32. torch_npu.npu_indexing
33. torch_npu.npu_iou
34. torch_npu.npu_layer_norm_eval
35. torch_npu.npu_linear
36. torch_npu.npu_lstm
37. torch_npu.npu_masked_fill_range
38. torch_npu.npu_max
39. torch_npu.npu_min
40. torch_npu.npu_nms_rotated
41. torch_npu.npu_nms_v4
42. torch_npu.npu_nms_with_mask
43. torch_npu.npu_normalize_batch
44. torch_npu.npu_one_hot
45. torch_npu.npu_pad
46. torch_npu.npu_ps_roi_pooling
47. torch_npu.npu_ptiou
48. torch_npu.npu_random_choice_with_mask
49. torch_npu.npu_reshape
50. torch_npu.npu_roi_align
51. torch_npu.npu_rotated_box_decode
52. torch_npu.npu_rotated_box_encode
53. torch_npu.npu_rotated_iou
54. torch_npu.npu_rotated_overlaps
55. torch_npu.npu_scatter
56. torch_npu.npu_sign_bits_pack
57. torch_npu.npu_sign_bits_unpack
58. torch_npu.npu_silu
59. torch_npu.npu_slice
60. torch_npu.npu_softmax_cross_entropy_with_logits
61. torch_npu.npu_sort_v2
62. torch_npu.npu_stride_add
63. torch_npu.npu_transpose
64. torch_npu.npu_yolo_boxes_encode
65. torch_npu.npu_rotary_mul
66. torch_npu.npu_fused_attention_score
67. torch_npu.npu_scaled_masked_softmax
68. torch_npu.npu_dropout_with_add_softmax
69. torch_npu.npu_multi_head_attention
70. torch_npu.npu_geglu
71. torch_npu.npu_rms_norm

Mapping Relationships

Some parameters of the NPU custom operators have mapping relationships; see the table below.

Table 2 Mapping table (mapped values of the format parameter)

Parameter                  Mapped value
ACL_FORMAT_UNDEFINED       -1
ACL_FORMAT_NCHW            0
ACL_FORMAT_NHWC            1
ACL_FORMAT_ND              2
ACL_FORMAT_NC1HWC0         3
ACL_FORMAT_FRACTAL_Z       4
ACL_FORMAT_NC1HWC0_C04     12
ACL_FORMAT_HWCN            16
ACL_FORMAT_NDHWC           27
ACL_FORMAT_FRACTAL_NZ      29
ACL_FORMAT_NCDHW           30
ACL_FORMAT_NDC1HWC0        32
ACL_FRACTAL_Z_3D           33
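
The mapped values in Table 2 are passed wherever an acl_format argument is expected, for example by torch_npu.npu_format_cast described below. A minimal sketch (assuming an available NPU device):

    >>> import torch
    >>> import torch_npu
    >>> x = torch.rand(2, 3, 4, 5).npu()          # plain NCHW/ND tensor
    >>> x_nz = torch_npu.npu_format_cast(x, 29)   # 29 maps to ACL_FORMAT_FRACTAL_NZ per Table 2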

Detailed Operator Interface Description

torch_npu.npu_apply_adam(beta1_power, beta2_power, lr, beta1, beta2, epsilon, grad, use_locking, use_nesterov, out = (var, m, v))

Computes the Adam optimization result.

torch_npu.npu_convolution_transpose(input, weight, bias, padding, output_padding, stride, dilation, groups) -> Tensor

Applies a 2D or 3D transposed convolution operator over an input image composed of several input planes, a process sometimes also called "deconvolution".

torch_npu.npu_conv_transpose2d(input, weight, bias, padding, output_padding, stride, dilation, groups) -> Tensor

Applies a 2D transposed convolution operator over an input image composed of several input planes, a process sometimes also called "deconvolution".

torch_npu.npu_convolution(input, weight, bias, stride, padding, dilation, groups) -> Tensor

Applies a 2D or 3D convolution over an input image composed of several input planes.

torch_npu.npu_conv2d(input, weight, bias, stride, padding, dilation, groups) -> Tensor

Applies a 2D convolution over an input image composed of several input planes.

torch_npu.npu_conv3d(input, weight, bias, stride, padding, dilation, groups) -> Tensor

Applies a 3D convolution over an input image composed of several input planes.

torch_npu.one_(self) -> Tensor

Fills the self tensor with ones.

torch_npu.npu_sort_v2(self, dim=-1, descending=False, out=None) -> Tensor

Sorts the elements of the input tensor in ascending order along the given dimension, without returning indices. If dim is not set, the last dimension of the input is chosen. If descending is True, the elements are sorted in descending order by value.
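
A minimal usage sketch (illustrative shapes; an NPU device is assumed):

    >>> x = torch.randn(3, 4).npu()
    >>> asc = torch_npu.npu_sort_v2(x)                            # ascending along the last dimension
    >>> desc = torch_npu.npu_sort_v2(x, dim=-1, descending=True)  # descending order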

torch_npu.npu_format_cast(self, acl_format) -> Tensor

Changes the format of an NPU tensor.

torch_npu.npu_format_cast_(self, src) -> Tensor

Changes the format of the self tensor in place to match the format of src.

torch_npu.npu_transpose(self, perm, require_contiguous=True) -> Tensor

Returns a view of the original tensor with its dimensions permuted, and makes the result contiguous.
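
A minimal usage sketch (illustrative shapes; an NPU device is assumed):

    >>> x = torch.randn(2, 3, 5).npu()
    >>> y = torch_npu.npu_transpose(x, (2, 0, 1))   # dimensions permuted, result contiguous
    >>> y.shape
    torch.Size([5, 2, 3])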

torch_npu.npu_broadcast(self, size) -> Tensor

Returns a new view of the self tensor with singleton dimensions expanded to a larger size, and makes the result contiguous.

The tensor can also be expanded to more dimensions, with the new dimensions appended at the front.

torch_npu.npu_dtype_cast(input, dtype) -> Tensor

Performs tensor data type (dtype) conversion.
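
A minimal usage sketch (an NPU device is assumed):

    >>> x = torch.rand(2, 2).npu()                       # float32 by default
    >>> y = torch_npu.npu_dtype_cast(x, torch.float16)   # cast on the NPU; y.dtype == torch.float16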

torch_npu.empty_with_format(size, dtype, layout, device, pin_memory, acl_format)

Returns a tensor filled with uninitialized data.

torch_npu.copy_memory_(dst, src, non_blocking=False) -> Tensor

Copies the elements from src into the self tensor and returns self.

torch_npu.npu_one_hot(input, num_classes=-1, depth=1, on_value=1, off_value=0) -> Tensor

Returns a one-hot tensor. The positions represented by the indices in input take the value on_value, while all other positions take the value off_value.
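
A minimal usage sketch (an NPU device is assumed):

    >>> idx = torch.tensor([0, 2, 1]).int().npu()
    >>> torch_npu.npu_one_hot(idx, depth=3)   # rows [1,0,0], [0,0,1], [0,1,0] with the default on/off values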

torch_npu.npu_stride_add(x1, x2, offset1, offset2, c1_len) -> Tensor

Adds the partial values of two tensors in NC1HWC0 format.

torch_npu.npu_softmax_cross_entropy_with_logits(features, labels) -> Tensor

Computes the softmax cross-entropy cost.

torch_npu.npu_ps_roi_pooling(x, rois, spatial_scale, group_size, output_dim) -> Tensor

Performs Position-Sensitive ROI Pooling.

torch_npu.npu_roi_align(features, rois, spatial_scale, pooled_height, pooled_width, sample_num, roi_end_mode) -> Tensor

Obtains the ROI feature matrix from a feature map. Custom FasterRcnn operator.

torch_npu.npu_nms_v4(boxes, scores, max_output_size, iou_threshold, scores_threshold, pad_to_max_output_size=False) -> (Tensor, Tensor)

Greedily selects a subset of bounding boxes in descending order of score.

torch_npu.npu_nms_rotated(dets, scores, iou_threshold, scores_threshold=0, max_output_size=-1, mode=0) -> (Tensor, Tensor)

Greedily selects a subset of rotated bounding boxes in descending order of score.

torch_npu.npu_lstm(x, weight, bias, seqMask, h, c, has_biases, num_layers, dropout, train, bidirectional, batch_first, flag_seq, direction)

Computes DynamicRNN.

torch_npu.npu_iou(bboxes, gtboxes, mode=0) -> Tensor 
torch_npu.npu_ptiou(bboxes, gtboxes, mode=0) -> Tensor

Computes the intersection over union (IoU) or the intersection over foreground (IoF) between the ground-truth boxes and the predicted regions.
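
A minimal usage sketch (float16 inputs and an NPU device are assumed):

    >>> bboxes = torch.tensor([[0., 0., 10., 10.],
    ...                        [10., 10., 20., 20.]], dtype=torch.float16).npu()
    >>> gtboxes = torch.tensor([[0., 0., 10., 20.]], dtype=torch.float16).npu()
    >>> iou = torch_npu.npu_iou(bboxes, gtboxes, mode=0)   # mode=0 computes IoU, one value per (gtbox, bbox) pair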

torch_npu.npu_pad(input, paddings) -> Tensor

Pads a tensor.

torch_npu.npu_nms_with_mask(input, iou_threshold) -> (Tensor, Tensor, Tensor)

Generates values 0 or 1, used by the nms operator to determine valid bits.

torch_npu.npu_bounding_box_encode(anchor_box, ground_truth_box, means0, means1, means2, means3, stds0, stds1, stds2, stds3) -> Tensor

Computes the coordinate variations between the anchor boxes and the ground-truth boxes. Custom FasterRcnn operator.

torch_npu.npu_bounding_box_decode(rois, deltas, means0, means1, means2, means3, stds0, stds1, stds2, stds3, max_shape, wh_ratio_clip) -> Tensor

Generates bounding boxes from rois and deltas. Custom FasterRcnn operator.

torch_npu.npu_gru(input, hx, weight_input, weight_hidden, bias_input, bias_hidden, seq_length, has_biases, num_layers, dropout, train, bidirectional, batch_first) -> (Tensor, Tensor, Tensor, Tensor, Tensor, Tensor)

Computes DynamicGRUV2.

torch_npu.npu_random_choice_with_mask(x, count=256, seed=0, seed2=0) -> (Tensor, Tensor)

Shuffles the indices of non-zero elements.

torch_npu.npu_batch_nms(self, scores, score_threshold, iou_threshold, max_size_per_class, max_total_size, change_coordinate_frame=False, transpose_box=False) -> (Tensor, Tensor, Tensor, Tensor)

Computes box scores per batch and class, sorts boxes by score, and removes boxes whose overlap with a higher-scoring box exceeds the threshold (iou_threshold); multi-batch and multi-class processing is supported. The NonMaxSuppression (nms) operation effectively removes redundant input boxes and improves detection accuracy. NonMaxSuppression: suppresses elements that are not maxima while searching for local maxima; commonly used in detection models for computer-vision tasks.

torch_npu.npu_slice(self, offsets, size) -> Tensor

Extracts a slice from a tensor.

torch_npu._npu_dropout(self, p) -> (Tensor, Tensor)

Computes dropout without using a seed. Similar to torch.dropout, with an implementation optimized for NPU devices.
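
A minimal usage sketch (an NPU device is assumed):

    >>> x = torch.randn(4, 4).npu()
    >>> out, mask = torch_npu._npu_dropout(x, 0.5)   # p=0.5 drop probability; also returns the generated mask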

torch_npu.npu_indexing(self, begin, end, strides, begin_mask=0, end_mask=0, ellipsis_mask=0, new_axis_mask=0, shrink_axis_mask=0) -> Tensor

Computes the indexing result using the begin, end, and strides arrays.

torch_npu.npu_ifmr(Tensor data, Tensor data_min, Tensor data_max, Tensor cumsum, float min_percentile, float max_percentile, float search_start, float search_end, float search_step, bool with_offset) -> (Tensor, Tensor)

Computes the IFMR result, searching for quantization boundaries based on the data distribution described by data_min, data_max, and cumsum.

torch_npu.npu_max(self, dim, keepdim=False) -> (Tensor, Tensor)

Computes the maximum values along the dimension dim. Similar to torch.max, with an implementation optimized for NPU devices.
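
A minimal usage sketch (an NPU device is assumed):

    >>> x = torch.randn(2, 2, 2).npu()
    >>> values, indices = torch_npu.npu_max(x, dim=2)   # max values along dim 2 plus their indices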

torch_npu.npu_min(self, dim, keepdim=False) -> (Tensor, Tensor)

Computes the minimum values along the dimension dim. Similar to torch.min, with an implementation optimized for NPU devices.

torch_npu.npu_scatter(self, indices, updates, dim) -> Tensor

Computes the scatter result along the dimension dim. Similar to torch.scatter, with an implementation optimized for NPU devices.

torch_npu.npu_layer_norm_eval(input, normalized_shape, weight=None, bias=None, eps=1e-05) -> Tensor

Computes the layer normalization result for evaluation. Same as torch.nn.functional.layer_norm, with an implementation optimized for NPU devices.

torch_npu.npu_alloc_float_status(self) -> Tensor

Generates a one-dimensional tensor containing eight zeros.

torch_npu.npu_get_float_status(self) -> Tensor

Computes the npu_get_float_status operator function, which retrieves the floating-point overflow status.

torch_npu.npu_clear_float_status(self) -> Tensor

Sets the value at address 0x40000 of each core to 0.

torch_npu.npu_confusion_transpose(self, perm, shape, transpose_first) -> Tensor

Performs fused reshape and transpose operations.

torch_npu.npu_bmmV2(self, mat2, output_sizes) -> Tensor

Multiplies matrix "a" by matrix "b", producing "a*b".
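
A minimal usage sketch (an NPU device is assumed; an empty list leaves output_sizes unset):

    >>> a = torch.randn(10, 3, 4).npu()
    >>> b = torch.randn(10, 4, 5).npu()
    >>> c = torch_npu.npu_bmmV2(a, b, [])   # batched matmul, result shape (10, 3, 5)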

torch_npu.fast_gelu(self) -> Tensor

Computes the fast_gelu activation of the input tensor.

torch_npu.npu_deformable_conv2d(self, weight, offset, bias, kernel_size, stride, padding, dilation=[1,1,1,1], groups=1, deformable_groups=1, modulated=True) -> (Tensor, Tensor)

Computes the deformed convolution output with the expected input.

torch_npu.npu_anchor_response_flags(self, featmap_size, stride, num_base_anchors) -> Tensor

Generates the responsible flags of anchors in a single feature map.

torch_npu.npu_yolo_boxes_encode(self, gt_bboxes, stride, performance_mode=False) -> Tensor

Generates bounding boxes from the YOLO anchor boxes (anchor box) and ground-truth boxes (ground-truth box). Custom mmdetection operator.

torch_npu.npu_grid_assign_positive(self, overlaps, box_responsible_flags, max_overlaps, argmax_overlaps, gt_max_overlaps, gt_argmax_overlaps, num_gts, pos_iou_thr, min_pos_iou, gt_max_assign_all) -> Tensor

Performs positive sample assignment for grid anchors. Custom mmdetection operator.

torch_npu.npu_normalize_batch(self, seq_len, normalize_type=0) -> Tensor

Performs batch normalization.

torch_npu.npu_masked_fill_range(self, start, end, value, axis=-1) -> Tensor

Fills the tensor masked by range boxes along the same axis. Custom masked fill range operator.

torch_npu.npu_linear(input, weight, bias=None) -> Tensor

Multiplies the input matrix by the weight matrix and adds bias if provided (a linear layer).
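
A minimal usage sketch (weight laid out as (out_features, in_features), mirroring torch.nn.functional.linear; an NPU device is assumed):

    >>> x = torch.rand(2, 16).npu()
    >>> w = torch.rand(4, 16).npu()
    >>> b = torch.rand(4).npu()
    >>> y = torch_npu.npu_linear(x, w, b)   # result shape (2, 4)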

torch_npu.npu_bert_apply_adam(lr, beta1, beta2, epsilon, grad, max_grad_norm, global_grad_norm, weight_decay, step_size=None, adam_mode=0, *, out=(var,m,v))

Computes the Adam optimization result.

torch_npu.npu_giou(self, gtboxes, trans=False, is_cross=False, mode=0) -> Tensor

First computes the minimum enclosing area of the two boxes and the IoU, then computes the proportion of the enclosing area that is not covered by the two boxes, and finally subtracts this proportion from the IoU to obtain the GIoU.

torch_npu.npu_silu(self) -> Tensor

Computes the SiLU (Swish) of self.

torch_npu.npu_reshape(self, shape, bool can_refresh=False) -> Tensor

Reshapes a tensor. Only the tensor shape is changed; its data remains unchanged.

torch_npu.npu_rotated_overlaps(self, query_boxes, trans=False) -> Tensor

Computes the overlap area of rotated boxes.

torch_npu.npu_rotated_iou(self, query_boxes, trans=False, mode=0, is_cross=True, v_threshold=0.0, e_threshold=0.0) -> Tensor

Computes the IoU of rotated boxes.

torch_npu.npu_rotated_box_encode(anchor_box, gt_bboxes, weight) -> Tensor

Rotated bounding box encoding.

torch_npu.npu_rotated_box_decode(anchor_boxes, deltas, weight) -> Tensor

Rotated bounding box decoding.

torch_npu.npu_ciou(Tensor self, Tensor gtboxes, bool trans=False, bool is_cross=True, int mode=0, bool atan_sub_flag=False) -> Tensor

Applies an NPU-based CIoU operation. Adds a penalty item on the basis of DIoU to propose CIoU.

torch_npu.npu_diou(Tensor self, Tensor gtboxes, bool trans=False, bool is_cross=False, int mode=0) -> Tensor

Applies an NPU-based DIoU operation. Takes into account the distance between targets, as well as the overlap rate of the distance and range, so that different targets or boundaries tend to converge stably.

torch_npu.npu_sign_bits_pack(Tensor self, int size) -> Tensor

Packs 1-bit Adam values of float type into uint8.

torch_npu.npu_sign_bits_unpack(x, dtype, size) -> Tensor

Unpacks 1-bit Adam values of uint8 type into float.

torch_npu.npu_rotary_mul(Tensor x, Tensor r1, Tensor r2) -> Tensor
Implements the RotaryEmbedding rotary position encoding:
x1, x2 = torch.chunk(x, 2, -1)
x_new = torch.cat((-x2, x1), dim=-1)
output = r1 * x + r2 * x_new
  • Parameters:
    • Tensor x: 4-dimensional tensor with shape (B, N, S, D).
    • Tensor r1: 4-dimensional cos-angle tensor with shape (X, X, X, D); broadcasting is supported on any axis except the last.
    • Tensor r2: 4-dimensional sin-angle tensor with shape (X, X, X, D); broadcasting is supported on any axis except the last.
  • Constraints:

    The last dimension of x, r1, and r2 must be a multiple of 64.

  • Example:
        >>> x = torch.rand(2, 2, 5, 128).npu()
        >>> x
        tensor([[[[0.8594, 0.4914, 0.9075,  ..., 0.2126, 0.6520, 0.2206],
              [0.5515, 0.3353, 0.6568,  ..., 0.3686, 0.1457, 0.8528],
              [0.0504, 0.2687, 0.4036,  ..., 0.3032, 0.8262, 0.6302],
              [0.0537, 0.5141, 0.7016,  ..., 0.4948, 0.9778, 0.8535],
              [0.3602, 0.7874, 0.9913,  ..., 0.1474, 0.3422, 0.6830]],
    
            [[0.4641, 0.6254, 0.7415,  ..., 0.1834, 0.1067, 0.7171],
              [0.8084, 0.7570, 0.4728,  ..., 0.4603, 0.4991, 0.1723],
              [0.0483, 0.6931, 0.0935,  ..., 0.7522, 0.0054, 0.1736],
              [0.6196, 0.1028, 0.7076,  ..., 0.2745, 0.9943, 0.6971],
              [0.3267, 0.3748, 0.1232,  ..., 0.0507, 0.4302, 0.6249]]],
     
     
             [[[0.2783, 0.8262, 0.6014,  ..., 0.8040, 0.7986, 0.2831],
              [0.6035, 0.2955, 0.7711,  ..., 0.7464, 0.3739, 0.6637],
              [0.6282, 0.7243, 0.5445,  ..., 0.3755, 0.0533, 0.9468],
              [0.5179, 0.3967, 0.6558,  ..., 0.0267, 0.5549, 0.9707],
              [0.4388, 0.7458, 0.2065,  ..., 0.6080, 0.4242, 0.8879]],
     
             [[0.3428, 0.6976, 0.0970,  ..., 0.9552, 0.3663, 0.2139],
              [0.2019, 0.2452, 0.1142,  ..., 0.3651, 0.6993, 0.5257],
              [0.9636, 0.1691, 0.4807,  ..., 0.9137, 0.3510, 0.0905],
              [0.0177, 0.9496, 0.1560,  ..., 0.7437, 0.9043, 0.0131],
              [0.9699, 0.5352, 0.9763,  ..., 0.1850, 0.2056, 0.0368]]]],
           device='npu:0')
        >>> r1 = torch.rand(1, 2, 1, 128).npu()
        >>> r1
           tensor([[[[0.8433, 0.5262, 0.2608, 0.8501, 0.7187, 0.6944, 0.0193, 0.1507,
               0.0450, 0.2257, 0.4679, 0.8309, 0.4740, 0.8715, 0.7443, 0.3354,
               0.5533, 0.9151, 0.4215, 0.4631, 0.9076, 0.3093, 0.0270, 0.7681,
               0.1800, 0.0847, 0.6965, 0.2059, 0.8806, 0.3987, 0.8446, 0.6225,
               0.1375, 0.8765, 0.5965, 0.3092, 0.0193, 0.9220, 0.4997, 0.8170,
               0.8575, 0.5525, 0.8528, 0.7262, 0.4026, 0.5704, 0.0390, 0.9240,
               0.9780, 0.3927, 0.7343, 0.3922, 0.5004, 0.8561, 0.6021, 0.6530,
               0.6565, 0.9988, 0.4238, 0.0092, 0.5131, 0.5257, 0.1649, 0.0272,
               0.9103, 0.2476, 0.7573, 0.8500, 0.9348, 0.4306, 0.3612, 0.5378,
               0.7141, 0.3559, 0.6620, 0.3335, 0.4000, 0.2479, 0.3490, 0.7000,
               0.5321, 0.3485, 0.9162, 0.9207, 0.3262, 0.7929, 0.1258, 0.6689,
               0.1023, 0.1938, 0.3887, 0.6893, 0.0849, 0.3700, 0.5747, 0.9674,
               0.4520, 0.5313, 0.0377, 0.1202, 0.9326, 0.0442, 0.4651, 0.7036,
               0.3994, 0.9332, 0.5104, 0.0930, 0.4481, 0.8753, 0.5597, 0.6068,
               0.9895, 0.5833, 0.6771, 0.4255, 0.4513, 0.6330, 0.9070, 0.3103,
               0.0609, 0.8202, 0.6031, 0.3628, 0.1118, 0.2747, 0.4521, 0.8347]],
     
             [[0.6759, 0.8744, 0.3595, 0.2361, 0.4899, 0.3769, 0.6809, 0.0101,
               0.0730, 0.0576, 0.5242, 0.5510, 0.9780, 0.4704, 0.9607, 0.1699,
               0.3613, 0.6096, 0.0246, 0.6088, 0.4984, 0.9788, 0.2026, 0.1484,
               0.3086, 0.9697, 0.8166, 0.9566, 0.9874, 0.4547, 0.5250, 0.2041,
               0.7784, 0.4269, 0.0110, 0.6878, 0.6575, 0.3382, 0.1889, 0.8344,
               0.9608, 0.6153, 0.4812, 0.0547, 0.2978, 0.3610, 0.5285, 0.6162,
               0.2123, 0.1364, 0.6027, 0.7450, 0.2485, 0.2149, 0.7849, 0.8886,
               0.0514, 0.9511, 0.4865, 0.8380, 0.6947, 0.2378, 0.5839, 0.8434,
               0.0871, 0.4179, 0.1669, 0.8703, 0.1946, 0.0302, 0.9516, 0.1208,
               0.5780, 0.6859, 0.2405, 0.5083, 0.3872, 0.7649, 0.1329, 0.0252,
               0.2404, 0.5456, 0.7009, 0.6524, 0.7623, 0.5965, 0.0437, 0.4080,
               0.8390, 0.4172, 0.4781, 0.2405, 0.1502, 0.2020, 0.4192, 0.8185,
               0.0899, 0.1961, 0.7368, 0.4798, 0.4303, 0.9281, 0.5410, 0.0620,
               0.8945, 0.3589, 0.5637, 0.4875, 0.1523, 0.9478, 0.9040, 0.3410,
               0.3591, 0.2702, 0.5949, 0.3337, 0.3578, 0.8890, 0.6608, 0.6578,
               0.4953, 0.7975, 0.2891, 0.9552, 0.0092, 0.1293, 0.2362, 0.7821]]]],
           device='npu:0')
     
       >>> r2 = torch.rand(1, 2, 1, 128).npu()
       >>> r2
           tensor([[[[6.4270e-01, 1.3050e-01, 9.6509e-01, 1.4090e-01, 1.8660e-01,
               8.7512e-01, 3.8567e-01, 4.1776e-01, 9.7718e-01, 5.6305e-01,
               6.3091e-01, 4.6385e-01, 1.8142e-01, 3.7779e-01, 3.8012e-01,
               8.1807e-01, 3.3292e-01, 5.8488e-01, 5.8188e-01, 5.7776e-01,
               5.1828e-01, 9.6087e-01, 7.2219e-01, 8.5045e-02, 3.6623e-01,
               3.3758e-01, 7.9666e-01, 6.9932e-01, 9.9202e-01, 2.5493e-01,
               2.3017e-01, 7.9396e-01, 5.0109e-01, 6.5580e-01, 3.2200e-01,
               7.8023e-01, 4.4098e-01, 1.0576e-01, 8.0548e-01, 2.2453e-01,
               1.4705e-01, 8.7682e-02, 4.7264e-01, 8.9034e-02, 8.5720e-01,
               4.7576e-01, 2.8438e-01, 8.6523e-01, 8.1707e-02, 3.0075e-01,
               4.9069e-01, 9.7404e-01, 9.3865e-01, 5.7160e-01, 1.6332e-01,
               4.3868e-01, 5.8658e-01, 5.3993e-01, 3.8271e-02, 9.9662e-01,
               2.2892e-01, 7.8558e-01, 9.4502e-01, 9.7633e-01, 1.7877e-01,
               2.6446e-02, 2.3411e-01, 6.7531e-01, 1.5023e-01, 4.4280e-02,
               1.4457e-01, 3.6683e-01, 4.3424e-01, 7.4145e-01, 8.2433e-01,
               6.8660e-01, 6.7477e-01, 5.5000e-02, 5.1344e-01, 9.3115e-01,
               3.8280e-01, 9.2177e-01, 4.5470e-01, 2.5540e-01, 4.6632e-01,
               8.3960e-01, 4.4320e-01, 1.0808e-01, 7.5544e-01, 4.6372e-01,
               1.4322e-01, 1.9141e-01, 5.5918e-02, 7.0804e-01, 1.8789e-01,
               9.4276e-01, 9.1742e-01, 9.1980e-01, 6.2728e-01, 4.1787e-01,
               7.9545e-01, 9.0569e-01, 7.9123e-01, 9.7596e-01, 7.2507e-01,
               2.3772e-01, 8.2560e-01, 5.9359e-01, 7.1134e-01, 5.1029e-01,
               6.1601e-01, 2.9094e-01, 3.4174e-01, 9.0532e-01, 5.0960e-01,
               3.4441e-01, 7.0498e-01, 4.2729e-01, 7.6714e-01, 6.3755e-01,
               8.4604e-01, 5.9109e-01, 7.9137e-01, 7.5149e-01, 2.2092e-01,
               9.5235e-01, 3.6915e-01, 6.4961e-01]],
     
             [[8.7862e-01, 1.1325e-01, 2.4575e-01, 9.7429e-01, 1.9362e-01,
               8.2297e-01, 3.5184e-02, 5.2755e-01, 7.6429e-01, 2.4700e-01,
               6.2860e-01, 2.4555e-01, 4.4557e-01, 7.0955e-03, 2.0326e-01,
               8.6354e-02, 3.5959e-01, 3.4059e-01, 8.6852e-01, 1.3858e-01,
               6.8500e-01, 1.3601e-01, 7.3152e-01, 8.3474e-01, 2.7017e-01,
               9.8078e-01, 6.1084e-01, 1.6540e-01, 4.3081e-01, 8.5738e-01,
               4.1890e-01, 6.6872e-01, 3.1698e-01, 4.2576e-02, 1.5236e-01,
               2.0526e-01, 1.9493e-01, 6.6122e-03, 1.8332e-01, 5.6981e-01,
               5.4090e-01, 6.0783e-01, 5.8742e-01, 9.1761e-04, 2.0904e-01,
               6.6419e-01, 9.9559e-01, 5.8233e-01, 6.8562e-01, 8.6456e-01,
               9.9931e-01, 3.5637e-01, 2.4642e-01, 2.3428e-02, 6.9037e-01,
               1.7560e-01, 1.8703e-01, 3.5244e-01, 6.3031e-01, 1.8450e-01,
               9.2194e-01, 9.3016e-02, 9.0488e-01, 2.4294e-02, 5.1122e-01,
               5.0793e-01, 7.9585e-01, 7.9594e-02, 5.2137e-01, 9.8359e-01,
               7.5022e-01, 4.1925e-01, 3.3284e-01, 4.7939e-01, 9.9081e-01,
               3.3931e-01, 2.6461e-01, 5.3063e-01, 1.0328e-01, 8.0720e-01,
               9.9480e-01, 3.0833e-01, 5.6780e-01, 3.9551e-01, 6.7176e-01,
               4.8049e-01, 1.5653e-01, 1.7595e-02, 6.6493e-02, 5.1989e-01,
               8.2691e-01, 7.3295e-01, 5.7169e-01, 4.9911e-01, 1.0260e-01,
               5.2307e-01, 7.4247e-01, 1.1682e-01, 5.8123e-01, 7.3496e-02,
               6.4274e-02, 2.4704e-01, 6.0424e-02, 2.6161e-01, 7.7966e-01,
               7.1244e-01, 2.2077e-01, 5.0723e-01, 9.6665e-01, 6.0933e-01,
               8.1332e-01, 3.0749e-01, 2.1297e-02, 3.6734e-01, 9.2542e-01,
               1.3554e-01, 9.7240e-01, 4.4344e-01, 4.2534e-01, 4.6205e-01,
               6.1811e-01, 5.8800e-01, 5.4673e-01, 1.2535e-01, 2.9959e-01,
               4.4890e-01, 2.7185e-01, 5.0243e-01]]]], device='npu:0')
     
        >>> out = torch_npu.npu_rotary_mul(x, r1, r2)
        >>> out
           tensor([[[[ 0.1142,  0.1891, -0.4689,  ...,  0.5704,  0.5375,  0.6079],
              [ 0.2264,  0.1155, -0.7678,  ...,  0.9857,  0.3382,  0.9441],
              [-0.1902,  0.1329, -0.3613,  ...,  0.9793,  0.5628,  0.8669],
              [-0.3349,  0.1532,  0.1124,  ...,  0.3125,  0.6741,  1.1248],
              [-0.0473,  0.2978, -0.6940,  ...,  0.2753,  0.2604,  1.0379]],
     
             [[ 0.0136,  0.4723,  0.1371,  ...,  0.1081,  0.2462,  0.6316],
              [ 0.0769,  0.6558, -0.0734,  ...,  0.2714,  0.2221,  0.2195],
              [-0.3755,  0.5364, -0.1131,  ...,  0.3105,  0.1225,  0.6166],
              [ 0.3535,  0.0164,  0.0095,  ...,  0.1361,  0.2570,  0.5811],
              [-0.2992,  0.2981,  0.0242,  ...,  0.2881,  0.2367,  0.9582]]],
     
     
            [[[ 0.1699,  0.3589, -0.7443,  ...,  0.4751,  0.7291,  0.2717],
              [ 0.3657,  0.0397,  0.1818,  ...,  0.9113,  0.4130,  0.8279],
              [-0.0657,  0.2528, -0.6658,  ...,  0.8184,  0.2057,  1.2864],
              [-0.1058,  0.1859, -0.0998,  ...,  0.0662,  0.5590,  1.0525],
              [ 0.2651,  0.3719, -0.8170,  ...,  0.2789,  0.3916,  1.0407]],
     
             [[-0.5998,  0.5740, -0.0154,  ...,  0.1746,  0.1982,  0.6338],
              [ 0.0766,  0.1790, -0.1490,  ...,  0.4387,  0.2592,  0.4924],
              [ 0.4765,  0.0485, -0.0226,  ...,  0.2219,  0.3445,  0.2265],
              [-0.1006,  0.8073, -0.1540,  ...,  0.1045,  0.2633,  0.2194],
              [ 0.0157,  0.3997,  0.3131,  ...,  0.0538,  0.0647,  0.4821]]]],
           device='npu:0')
torch_npu.npu_fused_attention_score(Tensor query_layer, Tensor key_layer, Tensor value_layer, Tensor attention_mask, Scalar scale, float keep_prob, bool query_transpose=False, bool key_transpose=False, bool bmm_score_transpose_a=False, bool bmm_score_transpose_b=False, bool value_transpose=False, bool dx_transpose=False) -> Tensor
Implements the fused computation logic of the "Transformer attention score", fusing the matmul, transpose, add, softmax, dropout, batchmatmul, and permute computations.
  • Parameters:
    • query_layer: Tensor type; only float16 is supported.
    • key_layer: Tensor type; only float16 is supported.
    • value_layer: Tensor type; only float16 is supported.
    • attention_mask: Tensor type; only float16 is supported.
    • scale: scaling factor, a floating-point scalar.
    • keep_prob: probability of not applying dropout, between 0 and 1, a floating-point number.
    • query_transpose: whether query is transposed, bool type, False by default.
    • key_transpose: whether key is transposed, bool type, False by default.
    • bmm_score_transpose_a: whether the left matrix of the bmm computation is transposed, bool type, False by default.
    • bmm_score_transpose_b: whether the right matrix of the bmm computation is transposed, bool type, False by default.
    • value_transpose: whether value is transposed, bool type, False by default.
    • dx_transpose: whether dx is transposed in the backward computation, bool type, False by default.
  • Constraints:

    The format number of all input tensors must be 29 (ACL_FORMAT_FRACTAL_NZ, see Table 2), and the data type must be FP16.

  • Example:
    >>> import torch
    >>> import torch_npu
    >>> query_layer = torch_npu.npu_format_cast(torch.rand(24, 16, 512, 64).npu(), 29).half()
    >>> key_layer = torch_npu.npu_format_cast(torch.rand(24, 16, 512, 64).npu(), 29).half()
    >>> value_layer = torch_npu.npu_format_cast(torch.rand(24, 16, 512, 64).npu(), 29).half()
    >>> attention_mask = torch_npu.npu_format_cast(torch.rand(24, 16, 512, 512).npu(), 29).half()
    >>> scale = 0.125
    >>> keep_prob = 0.5
    >>> context_layer = torch_npu.npu_fused_attention_score(query_layer, key_layer, value_layer, attention_mask, scale, keep_prob)
    >>> print(context_layer)
            tensor([[0.5063, 0.4900, 0.4951,  ..., 0.5493, 0.5249, 0.5400],
                   [0.4844, 0.4724, 0.4927,  ..., 0.5176, 0.4702, 0.4790],
                   [0.4683, 0.4771, 0.5054,  ..., 0.4917, 0.4614, 0.4739],
                   ...,
                   [0.5137, 0.5010, 0.5078,  ..., 0.4656, 0.4592, 0.5034],
                   [0.5425, 0.5732, 0.5347,  ..., 0.5054, 0.5024, 0.4924],
torch_npu.npu_scaled_masked_softmax(x, mask, scale=1.0, fixed_triu_mask=False) -> Tensor

Computes the softmax result of the input tensor x after scaling and masking by mask.
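
A minimal usage sketch (float16 input and a boolean mask of the same shape are assumed; True marks positions to mask out):

    >>> x = torch.randn(4, 3, 128, 128).half().npu()
    >>> mask = torch.zeros(4, 3, 128, 128).bool().npu()
    >>> y = torch_npu.npu_scaled_masked_softmax(x, mask, 0.56, False)   # scale=0.56, fixed_triu_mask=False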

torch_npu.npu_dropout_with_add_softmax(Tensor self, Tensor x1, Scalar alpha, float prob, int dim) -> (Tensor, Tensor, Tensor)

Implements the axpy_v2, softmax_v2, and drop_out_domask_v3 functions, that is:

y = x1 + self * alpha

Softmax(x_i) = exp(x_i) / Σ_j exp(x_j)

output = elements of x dropped according to mask, with the remaining elements multiplied by (1/prob)
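
A minimal usage sketch following the (Tensor, Tensor, Tensor) signature above (the third output is taken as the final result here; float16 inputs and an NPU device are assumed):

    >>> self_t = torch.rand(16, 16, 128, 128).half().npu()
    >>> x1 = torch.rand(16, 16, 128, 128).half().npu()
    >>> _, _, out = torch_npu.npu_dropout_with_add_softmax(self_t, x1, 0.1, 0.5, -1)   # alpha=0.1, prob=0.5, dim=-1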

torch_npu.npu_multi_head_attention(Tensor query, Tensor key, Tensor value, Tensor query_weight, Tensor key_weight, Tensor value_weight, Tensor attn_mask, Tensor out_proj_weight, Tensor? query_bias, Tensor? key_bias, Tensor? value_bias, Tensor? out_proj_bias, Tensor? dropout_mask, int attn_head_num, int attn_dim_per_head, int src_len, int tgt_len, float dropout_prob, bool softmax_use_float) -> (Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor)

Implements the MultiHeadAttention computation logic in the Transformer module.

torch_npu.npu_geglu(Tensor self, int dim=-1, int approximate=1) -> (Tensor, Tensor)

Performs the GeGLU operation on the input tensor:

GeGLU(A, B) = A ⊗ GeLU(B)

where A and B are obtained by splitting the input self evenly along the specified axis; dim defaults to -1.
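
A minimal usage sketch (float16 input whose last dimension is split in half is assumed):

    >>> x = torch.randn(2, 10, 1024).half().npu()
    >>> y, gelu_b = torch_npu.npu_geglu(x, -1, 1)   # y = A ⊗ GeLU(B); last dim halves from 1024 to 512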

torch_npu.npu_rms_norm(Tensor self, Tensor gamma, float epsilon=1e-06) -> (Tensor, Tensor)

The RmsNorm operator is a normalization operation commonly used in large models. Compared with the LayerNorm operator, it removes the mean-subtraction part. Its computation formula is:

RmsNorm(x_i) = x_i / RMS(x) * gamma_i, where RMS(x) = sqrt((1/n) Σ_{i=1}^{n} x_i^2 + epsilon)
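
A minimal usage sketch following the (Tensor, Tensor) signature above (gamma matches the last dimension of self; the second output is assumed to be the reciprocal RMS statistic):

    >>> x = torch.randn(24, 1, 128).half().npu()
    >>> gamma = torch.randn(128).half().npu()
    >>> y, rstd = torch_npu.npu_rms_norm(x, gamma, 1e-6)   # y: normalized and scaled output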