Matmul模板参数

功能说明

创建Matmul对象时需要传入：

A、B、C、Bias的参数类型信息，类型信息通过MatmulType来定义，包括：内存逻辑位置、数据格式、数据类型、是否转置、数据排布和是否使能L1复用。
MatmulConfig信息（可选），用于配置Matmul模板信息以及相关的配置参数。不配置默认使能Norm模板。
针对Atlas 200/500 A2推理产品，当前只支持使能默认的Norm模板。
MatmulCallBack回调函数信息（可选），用于配置左右矩阵从GM拷贝到L1、计算结果从CO1拷贝到GM的自定义函数。当前支持如下产品型号：
Atlas A2训练系列产品/Atlas 800I A2推理产品
MatmulPolicy信息（可选），预留参数。

实现原理

以输入矩阵A (GM, ND, half)、矩阵B(GM, ND, half)，输出矩阵C (GM, ND, float)，无Bias场景为例，其中(GM, ND, half)表示数据存放在GM上，数据格式为ND，数据类型为half，描述Matmul高阶API典型场景的内部算法框图，如下图所示。

图1 Matmul算法框图

计算过程分为如下几步：

数据从GM搬到A1：DataCopy每次从矩阵A，搬出一个stepM*baseM*stepKa*baseK的矩阵块a1，循环多次完成矩阵A的搬运；数据从GM搬到B1：DataCopy每次从矩阵B，搬出一个stepKb*baseK*stepN*baseN的矩阵块b1，循环多次完成矩阵B的搬运；
数据从A1搬到A2：LoadData每次从矩阵块a1，搬出一个baseM * baseK的矩阵块a0；数据从B1搬到B2，并完成转置：LoadData每次从矩阵块b1，搬出一个baseK * baseN的矩阵块，并将其转置为baseN * baseK的矩阵块b0；
矩阵乘：每次完成一个矩阵块a0 * b0的计算，得到baseM * baseN的矩阵块co1；
数据从矩阵块co1搬到矩阵块co2： DataCopy每次搬运一块baseM * baseN的矩阵块co1到singleCoreM * singleCoreN的矩阵块co2中；
重复2-4步骤，完成矩阵块a1 * b1的计算；
数据从矩阵块co2搬到矩阵块C：DataCopy每次搬运一块singleCoreM * singleCoreN的矩阵块co2到矩阵块C中；
重复1-6步骤，完成矩阵A * B = C的计算。

函数原型

      
           template <class A_TYPE, class B_TYPE, class C_TYPE, class BIAS_TYPE = C_TYPE, const MatmulConfig& MM_CFG = CFG_NORM, class MM_CB = MatmulCallBackFunc<nullptr, nullptr, nullptr>, MATMUL_POLICY_DEFAULT_OF(MatmulPolicy)>
using Matmul = matmul::MatmulImpl<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE, MM_CFG, MM_CB, MATMUL_POLICY>;

A_TYPE、B_TYPE、C_TYPE类型信息通过MatmulType来定义。
MatmulConfig（可选），Matmul模板信息，具体内容见MatmulConfig。
MM_CB（可选），用于支持不同的搬入搬出需求，实现定制化的搬入搬出功能，如非连续搬入或针对搬出设置不同的数据片段间隔等。MM_CB中3个函数指针分别对应计算结果从CO1拷贝到GM、左矩阵从GM拷贝到L1、右矩阵从GM拷贝到L1的回调函数指针，各个功能回调函数接口定义及参数释义见表2。

MATMUL_POLICY_DEFAULT_OF(MatmulPolicy)（可选），当前为预留参数，无需配置。

        
             #define MATMUL_POLICY_DEFAULT_OF(DEFAULT)      \
template <const auto& = MM_CFG, typename ...> class MATMUL_POLICY = DEFAULT

参数说明

表1 MatmulType参数说明
参数	说明
POSITION	内存逻辑位置针对Atlas A2训练系列产品/Atlas 800I A2推理产品： A矩阵可设置为TPosition::GM，TPosition::VECOUT，TPosition::TSCM B矩阵可设置为TPosition::GM，TPosition::VECOUT，TPosition::TSCM Bias可设置为TPosition::GM，TPosition::VECOUT，TPosition::TSCM C矩阵可设置为TPosition::GM，TPosition::VECIN 针对Atlas推理系列产品AI Core： A矩阵可设置为TPosition::GM，TPosition::VECOUT B矩阵可设置为TPosition::GM，TPosition::VECOUT Bias可设置为TPosition::GM，TPosition::VECOUT C矩阵可设置为TPosition::GM，TPosition::VECIN 针对Atlas 200/500 A2推理产品： A矩阵可设置为TPosition::GM，TPosition::VECOUT，TPosition::TSCM B矩阵可设置为TPosition::GM，TPosition::VECOUT，TPosition::TSCM C矩阵可设置为TPosition::GM，TPosition::VECIN Bias可设置为TPosition::GM，TPosition::VECOUT
CubeFormat	针对Atlas A2训练系列产品/Atlas 800I A2推理产品： A矩阵可设置为CubeFormat::ND，CubeFormat::NZ B矩阵可设置为CubeFormat::ND，CubeFormat::NZ Bias可设置为CubeFormat::ND C矩阵可设置为CubeFormat::ND，CubeFormat::NZ，CubeFormat::ND_ALIGN 针对Atlas推理系列产品AI Core： A矩阵可设置为CubeFormat::ND，CubeFormat::NZ B矩阵可设置为CubeFormat::ND，CubeFormat::NZ Bias可设置为CubeFormat::ND C矩阵可设置为CubeFormat::ND，CubeFormat::NZ，CubeFormat::ND_ALIGN 注意：针对Atlas推理系列产品AI Core，C矩阵设置为CubeFormat::ND时，要求尾轴32字节对齐，比如数据类型是half的情况下，N要求是16的倍数针对Atlas 200/500 A2推理产品： A矩阵可设置为CubeFormat::ND，CubeFormat::NZ B矩阵可设置为CubeFormat::ND，CubeFormat::NZ C矩阵可设置为CubeFormat::ND，CubeFormat::NZ Bias可设置为CubeFormat::ND 注意: 针对Atlas 200/500 A2推理产品，C矩阵设置为TPosition::VECIN或者TPosition::TSCM，CubeFormat::ND时，要求尾轴32字节对齐，比如数据类型是half的情况下，N要求是16的倍数；C矩阵设置为TPosition::VECIN或者TPosition::TSCM，CubeFormat::NZ时，N要求是16的倍数。
TYPE	针对Atlas A2训练系列产品/Atlas 800I A2推理产品： A矩阵可设置为half、float、bfloat16_t 、int8_t、int4b_t B矩阵可设置为half、float、bfloat16_t 、int8_t、int4b_t Bias可设置为half、float、int32_t C矩阵可设置为half、float、bfloat16_t、int32_t、int8_t 针对Atlas推理系列产品AI Core： A矩阵可设置为half、int8_t B矩阵可设置为half、int8_t Bias可设置为float、int32_t C矩阵可设置为half、float、int8_t、int32_t 针对Atlas 200/500 A2推理产品： A矩阵可设置为half、float、bfloat16_t 、int8_t B矩阵可设置为half、float、bfloat16_t 、int8_t Bias矩阵可设置为half、float、int32_t C矩阵可设置为half、float、bfloat16_t、int32_t 注意：A矩阵和B矩阵数据类型需要一致，具体数据类型组合关系请参考表2。
ISTRANS	是否开启使能矩阵转置的功能。 true为开启使能矩阵转置的功能，开启后，分别通过SetTensorA和SetTensorB中的isTransposeA、isTransposeB参数设置A、B矩阵是否转置。若设置A、B矩阵转置，Matmul会认为A矩阵形状为[K, M]，B矩阵形状为[N, K]。 false为不开启使能矩阵转置的功能，通过SetTensorA和SetTensorB不能设置A、B矩阵是否转置。Matmul会认为A矩阵形状为[M, K]，B矩阵形状为[K, N]。默认为false不使能转置。
LAYOUT	表征数据的排布 NONE：默认值，表示不使用BatchMatmul；其他选项表示使用BatchMatmul。 NORMAL：BMNK的数据排布格式。 BSNGD：原始BSH shape做reshape后的数据排布，具体可参考IterateBatch中对该数据排布的介绍。 SBNGD：原始SBH shape做reshape后的数据排布，具体可参考IterateBatch中对该数据排布的介绍。 BNGS1S2：一般为前两种数据排布进行矩阵乘的输出，S1S2数据连续存放，一个S1S2为一个batch的计算数据。
IBSHARE	是否使能IBShare。IBShare的功能是能够复用L1上相同的A矩阵或B矩阵数据。当A矩阵和B矩阵同时使能IBShare时，表示L1上的A矩阵和B矩阵同时复用，此时只支持Norm模板。注意，A矩阵和B矩阵同时使能IBShare的场景，需要满足：同一算子中其它Matmul对象的A矩阵和B矩阵也必须同时使能IBShare；获取矩阵计算结果时，只支持调用IterateAll接口输出到GlobalTensor，即计算结果放置于Global Memory的地址，不能调用GetTensorC等接口。除A、B矩阵同时复用的场景外，与Matmul模板参数中的IBShare模板配合使用，具体参数设置详见表2。

表2 MatmulCallBackFunc回调函数接口及参数说明
回调函数功能	回调函数接口	参数说明
可自定义设置不同的搬出数据片段数目等参数，实现将Matmul计算结果从CO1搬出到GM的功能	void DataCopyOut(const __gm__ void gm, const LocalTensor<int8_t> &co1Local, const void dataCopyOutParams, const uint64_t tilingPtr, const uint64_t dataPtr)	gm：输出的GM地址 co1Local: CO1上的计算结果 dataCopyOutParams： Matmul定义的 DataCopyOutParams结构体指针，供用户参考使用 struct DataCopyOutParams { uint16_t cBurstNum; //传输数据片段数目 uint16_t burstLen; //连续传输数据片段长度 uint16_t srcStride;//源tensor相邻连续数据片段间隔 uint32_t dstStride; // 目的tensor相邻连续数据片段间隔 uint16_t oriNSize; // NZ转ND时，源tensorN方向大小 bool enUnitFlag; // 是否使能UnitFlag uint64_t quantScalar; // 量化场景下量化Scalar的值 uint64_t cbufWorkspaceAddr; //量化场景下量化Tensor地址 } tilingPtr: 用户使用SetUserDefInfo设置的tiling参数地址 dataPtr: 用户使用SetSelfDefineData设置的计算数据地址
可自定义左矩阵搬入首地址、搬运块位置、搬运块大小，实现左矩阵从GM搬入L1的功能	void CopyA1(const LocalTensor<int8_t> &aMatrix, const __gm__ void *gm, int row, int col, int useM, int useK, const uint64_t tilingPtr, const uint64_t dataPtr)	aMatrix: 目标L1Buffer地址 gm：左矩阵GM首地址 row、col：搬运块左上角坐标在左矩阵中的位置 useM、useK：搬运块M、K方向大小 tilingPtr: 用户使用SetUserDefInfo设置的tiling参数地址 dataPtr: 用户使用SetSelfDefineData设置的计算数据地址
可自定义右矩阵搬入首地址、搬运块位置、搬运块大小，实现右矩阵从GM搬入L1的功能	void CopyB1(const LocalTensor<int8_t> &bMatrix, const __gm__ void *gm, int row, int col, int useK, int useN, const uint64_t tilingPtr, const uint64_t dataPtr)	bMatrix: 目标L1Buffer地址 gm：右矩阵GM首地址 row、col：搬运块左上角坐标在右矩阵中的位置 useK、useN：搬运块K、N方向大小 tilingPtr: 用户使用SetUserDefInfo设置的tiling参数地址 dataPtr: 用户使用SetSelfDefineData设置的计算数据地址

表3 MatmulApiStaticTiling常量化Tiling参数说明
参数	数据类型	说明
M, N, Ka, Kb, singleCoreM, singleCoreN, singleCoreK, baseM, baseN, baseK, depthA1, depthB1, stepM， stepN，stepKa，stepKb, isBias, transLength, iterateOrder, dbL0A, dbL0B, dbL0C, shareMode, shareL1Size, shareL0CSize, shareUbSize, batchM, batchN, singleBatchM, singleBatchN,	int	与TCubeTiling结构体中各同名参数含义一致。本结构体中的参数是常量化后的常数值。
cfg	MatmulConfig	Matmul模板的参数配置。

返回值

无

支持的型号

Atlas A2训练系列产品/Atlas 800I A2推理产品

Atlas推理系列产品AI Core

Atlas 200/500 A2推理产品

注意事项

无

调用示例

      
           //用户自定义回调函数
void DataCopyOut(const __gm__ void *gm, const LocalTensor<int8_t> &co1Local, const void *dataCopyOutParams, const uint64_t tilingPtr, const uint64_t dataPtr);
void CopyA1(const AscendC::LocalTensor<int8_t> &aMatrix, const __gm__ void *gm, int row, int col, int useM, int useK, const uint64_t tilingPtr, const uint64_t dataPtr);
void CopyB1(const AscendC::LocalTensor<int8_t> &bMatrix, const __gm__ void *gm, int row, int col, int useK, int useN, const uint64_t tilingPtr, const uint64_t dataPtr);

typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half> aType; 
typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half> bType; 
typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float> cType; 
typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float> biasType; 
matmul::Matmul<aType, bType, cType, biasType, CFG_MDL> mm1; 
matmul::MatmulConfig mmConfig{false, true, false, 128, 128, 64};
mmConfig.enUnitFlag = false;
matmul::Matmul<aType, bType, cType, biasType, mmConfig> mm2;
matmul::Matmul<aType, bType, cType, biasType, CFG_NORM, matmul::MatmulCallBackFunc<DataCopyOut, CopyA1, CopyB1>> mm3;

父主题： Matmul