GatherMask

函数功能

以内置固定模式对应的二进制或者用户自定义输入的Tensor数值对应的二进制为gather mask（数据收集的掩码），从源操作数中选取元素写入目的操作数中。

函数原型

用户自定义模式

        
             template <typename T, typename U, GatherMaskMode mode = defaultGahterMaskMode>
__aicore__ inline void GatherMask(const LocalTensor<T>& dstLocal, const LocalTensor<T>& src0Local, const LocalTensor<U>& src1Pattern, const bool reduceMode, const uint32_t mask, const GatherMaskParams& gatherMaskParams, uint64_t& rsvdCnt)

内置固定模式

        
             template <typename T, GatherMaskMode mode = defaultGahterMaskMode>
__aicore__ inline void GatherMask(const LocalTensor<T>& dstLocal, const LocalTensor<T>& src0Local, const uint8_t src1Pattern, const bool reduceMode, const uint32_t mask, const GatherMaskParams& gatherMaskParams, uint64_t& rsvdCnt)

参数说明

表1 参数说明
参数名称	输入/输出	含义
dstLocal	输出	目的操作数。类型为LocalTensor，支持的TPosition为VECIN/VECCALC/VECOUT。 Atlas推理系列产品AI Core，支持的数据类型为：half/uint16_t/int16_t/float/uint32_t/int32_t Atlas A2训练系列产品/Atlas 800I A2推理产品，支持的数据类型为：half/uint16_t/int16_t/float/uint32_t/int32_t Atlas 200/500 A2推理产品，支持的数据类型为：half/uint16_t/int16_t/float/uint32_t/int32_t
src0Local	输入	源操作数。类型为LocalTensor，支持的TPosition为VECIN/VECCALC/VECOUT。数据类型需要与目的操作数保持一致。 Atlas推理系列产品AI Core，支持的数据类型为：half/uint16_t/int16_t/float/uint32_t/int32_t Atlas A2训练系列产品/Atlas 800I A2推理产品，支持的数据类型为：half/uint16_t/int16_t/float/uint32_t/int32_t Atlas 200/500 A2推理产品，支持的数据类型为：half/uint16_t/int16_t/float/uint32_t/int32_t
src1Pattern	输入	gather mask（数据收集的掩码），分为内置固定模式和用户自定义模式两种，根据内置固定模式对应的二进制或者用户自定义输入的Tensor数值对应的二进制从源操作数中选取元素写入目的操作数中。1为选取，0为不选取。内置固定模式：src1Pattern数据类型为uint8_t，取值范围为[1,7]，所有repeat迭代使用相同的gather mask。 1：01010101…0101 # 每个repeat取偶数索引元素 2：10101010…1010 # 每个repeat取奇数索引元素 3：00010001…0001 # 每个repeat内每四个元素取第一个元素 4：00100010…0010 # 每个repeat内每四个元素取第二个元素， 5：01000100…0100 # 每个repeat内每四个元素取第三个元素 6：10001000…1000 # 每个repeat内每四个元素取第四个元素 7：11111111...1111 # 每个repeat内取全部元素 Atlas推理系列产品AI Core支持模式1-6 Atlas A2训练系列产品/Atlas 800I A2推理产品支持模式1-7 Atlas 200/500 A2推理产品支持模式1-7 用户自定义模式：src1Pattern数据类型为LocalTensor，支持的数据类型为uint16_t/uint32_t，迭代间间隔由src1RepeatStride决定，迭代内src1Pattern连续消耗。支持两种配置：当目的操作数数据类型为half/uint16_t/int16_t时，src1Pattern应为uint16_t数据类型。当目的操作数数据类型为float/uint32_t/int32_t时，src1Pattern应为uint32_t数据类型。
reduceMode	输入	mask模式选择参数，支持的数据类型：bool，取值为： false：mask为normal mode，该模式下mask必须设置为0，src1Pattern支持内置固定模式和用户自定义模式。 true：mask为counter mode。
mask	输入	mask值仅在reduceMode为true时生效。根据reduceMode，分为两种模式： normal mode：该模式下，mask无效，需要设置为0。一次repeat操作128(half/int16_t/uint16_t)或64(float/int32_t/uint32_t)个元素。 counter mode：取值范围[1, 2*32 – 1]。不同的型号counter模式下，参数表示含义不同。 mask值代表每一次repeat计算的元素个数；repeatTimes值生效；总的数据计算量为：repeatTimes mask；src0RepeatStride只在每一次repeat之间(repeat每次的计算量为mask个element)有效，在单次repeat内部，src0BlockStride有效。上述含义说明适用于以下型号： Atlas A2训练系列产品/Atlas 800I A2推理产品 Atlas 200/500 A2推理产品 mask值代表总的计算元素个数。repeatTimes值不生效，指令的迭代次数由源操作数和mask共同决定。src0RepeatStride只在每一次repeat之间(repeat每次的计算量为256bytes)有效，在单次repeat内部，src0BlockStride有效。上述含义说明适用于以下型号： Atlas推理系列产品AI Core
gatherMaskParams	输入	控制操作数地址步长的数据结构。结构体内包含操作数相邻迭代间相同datablock的地址步长，操作数同一迭代内不同datablock的地址步长等参数。数据结构的定义如下： struct GatherMaskParams{ uint8_t src0BlockStride; uint16_t repeatTimes; uint16_t src0RepeatStride; uint8_t src1RepeatStride; }; 相邻迭代间的地址步长参数说明请参考repeatStride（相邻迭代间相同datablock的地址步长）；同一迭代内datablock的地址步长参数说明请参考dataBlockStride（同一迭代内不同datablock的地址步长）。
rsvdCnt	输出	该条指令筛选后保留下来的元素计数，对应dstLocal中有效元素个数，数据类型为uint64_t。

表2 模板参数
参数名称	输入/输出	含义
mode	输入	预留参数，为后续功能做预留，当前提供默认值，用户无需设置该参数。

返回值

无

支持的型号

Atlas推理系列产品AI Core

Atlas A2训练系列产品/Atlas 800I A2推理产品

Atlas 200/500 A2推理产品

注意事项

为了节省地址空间，开发者可以定义一个Tensor，供源操作数与目的操作数同时使用（即地址重叠），相关约束如下：
- 单次迭代内，要求源操作数和目的操作数之间100%重叠，不支持部分重叠。
- 多次迭代间，第N次目的操作数是第N+1次源操作数的情况下，不支持地址重叠。
操作数地址偏移对齐要求请参见通用约束。
若调用该接口前为counter模式，在调用该接口后需要显示设置回counter模式(接口内部结束后会设置为norm模式)。

调用示例

用户自定义Tensor样例

#include "kernel_operator.h"
class KernelGatherMask {
public:
    __aicore__ inline KernelGatherMask () {}
    __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
    {
        src0Global.SetGlobalBuffer((__gm__ uint16_t*)src0Gm);
        src1Global.SetGlobalBuffer((__gm__ uint16_t*)src1Gm);
        dstGlobal.SetGlobalBuffer((__gm__ uint16_t*)dstGm);
        pipe.InitBuffer(inQueueSrc0, 1, 128 * sizeof(uint16_t));
        pipe.InitBuffer(inQueueSrc1, 1, 32 * sizeof(uint16_t));
        pipe.InitBuffer(outQueueDst, 1, 128 * sizeof(uint16_t));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<uint16_t> src0Local = inQueueSrc0.AllocTensor<uint16_t>();
        AscendC::LocalTensor<uint16_t> src1Local = inQueueSrc1.AllocTensor<uint16_t>();
        AscendC::DataCopy(src0Local, src0Global, 128);
        AscendC::DataCopy(src1Local, src1Global, 32);
        inQueueSrc0.EnQue(src0Local);
        inQueueSrc1.EnQue(src1Local);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<uint16_t> src0Local = inQueueSrc0.DeQue<uint16_t>();
        AscendC::LocalTensor<uint16_t> src1Local = inQueueSrc1.DeQue<uint16_t>();
        AscendC::LocalTensor<uint16_t> dstLocal = outQueueDst.AllocTensor<uint16_t>();
 
        uint32_t mask = 128;
        uint64_t rsvdCnt = 0;
        // reduceMode = true; 使用counter模式
        // src0BlockStride = 1; 单次迭代内数据间隔1个datablock，即数据连续读取和写入
        // repeatTimes = 1;该参数在counter模式下不生效
        // src0RepeatStride = 8;源操作数迭代间数据间隔8个datablock
        // src1RepeatStride = 8;源操作数迭代间数据间隔8个datablock
        AscendC::GatherMask (dstLocal, src0Local, src1Local, true, mask, { 1, 1, 8, 8 }, rsvdCnt);
 
        outQueueDst.EnQue<uint16_t>(dstLocal);
        inQueueSrc0.FreeTensor(src0Local);
        inQueueSrc1.FreeTensor(src1Local);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<uint16_t> dstLocal = outQueueDst.DeQue<uint16_t>();
        
        AscendC::DataCopy(dstGlobal, dstLocal, 128);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc0, inQueueSrc1;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<uint16_t> src0Global, src1Global, dstGlobal;
};
extern "C" __global__ __aicore__ void gather_mask_simple_kernel(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
{
    KernelGatherMask op;
    op.Init(src0Gm, src1Gm, dstGm);
    op.Process();
}

结果示例如下：

输入数据(src0Local): [1 2 3 ... 128]
                     // 43690对应的二进制：0b1010 1010 1010 1010
输入数据(src1Local): [43690 43690 43690 43690 43690 43690 43690 43690 0 0 0 0 ...0]
输出数据(dstLocal): [2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 102 104 106 108 110 112 114 116 118 120 122 124 126 128 undefined ..undefined]
输出数据(rsvdCnt): 64

内置固定模式

#include "kernel_operator.h"
class KernelGatherMask {
public:
    __aicore__ inline KernelGatherMask () {}
    __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* dstGm)
    {
        src0Global.SetGlobalBuffer((__gm__ uint16_t*)src0Gm);
        dstGlobal.SetGlobalBuffer((__gm__ uint16_t*)dstGm);
        pipe.InitBuffer(inQueueSrc0, 1, 128 * sizeof(uint16_t));
        pipe.InitBuffer(outQueueDst, 1, 128 * sizeof(uint16_t));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<uint16_t> src0Local = inQueueSrc0.AllocTensor<uint16_t>();
        AscendC::DataCopy(src0Local, src0Global, 128);
        inQueueSrc0.EnQue(src0Local);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<uint16_t> src0Local = inQueueSrc0.DeQue<uint16_t>();
        AscendC::LocalTensor<uint16_t> dstLocal = outQueueDst.AllocTensor<uint16_t>();
 
        uint32_t mask = 0; // normal模式下mask需要设置为0
        uint64_t rsvdCnt = 0; // 用于保存筛选后保留下来的元素个数
        uint8_t src1Pattern = 2; // 内置固定模式
        // reduceMode = false; 使用normal模式
        // src0BlockStride = 1; 单次迭代内数据间隔1个Block，即数据连续读取和写入
        // repeatTimes = 1;重复迭代一次
        // src0RepeatStride = 0;重复一次，故设置为0
        // src1RepeatStride = 0;重复一次，故设置为0
        AscendC::GatherMask(dstLocal, src0Local, src1Pattern, false, mask, { 1, 1, 0, 0 }, rsvdCnt);
 
        outQueueDst.EnQue<uint16_t>(dstLocal);
        inQueueSrc0.FreeTensor(src0Local);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<uint16_t> dstLocal = outQueueDst.DeQue<uint16_t>();
        
        AscendC::DataCopy(dstGlobal, dstLocal, 128);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc0;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<uint16_t> src0Global, dstGlobal;
};
 
extern "C" __global__ __aicore__ void gather_mask_simple_kernel(__gm__ uint8_t* src0Gm, __gm__ uint8_t* dstGm)
{
    KernelGatherMask op;
    op.Init(src0Gm, dstGm);
    op.Process();
}

结果示例如下：

输入数据(src0Local): [1 2 3 ... 128]
输入数据(src1Pattern): src1Pattern = 2;
输出数据(dstLocal): [2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 102 104 106 108 110 112 114 116 118 120 122 124 126 128 undefine ..undefined]
输出数据(rsvdCnt): 64

父主题： 选择指令