aclnnAddLayerNorm

Supported Product Models

  • Atlas inference series products
  • Atlas A2 training series products/Atlas 800I A2 inference products

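Interface Prototype

Each operator has a two-stage interface. First call aclnnAddLayerNormGetWorkspaceSize to validate the input parameters and obtain the workspace size required by the computation flow, then call aclnnAddLayerNorm to execute the computation.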

  • aclnnStatus aclnnAddLayerNormGetWorkspaceSize(const aclTensor *x1, const aclTensor *x2, const aclTensor *gamma, const aclTensor *beta, const aclTensor *bias, double epsilon, bool additionalOut, const aclTensor *yOut, const aclTensor *meanOut, const aclTensor *rstdOut, const aclTensor *xOut, uint64_t *workspaceSize, aclOpExecutor **executor)
  • aclnnStatus aclnnAddLayerNorm(void *workspace, uint64_t workspaceSize, aclOpExecutor *executor, aclrtStream stream)

Function Description

  • Operator function: implements AddLayerNorm.
  • Formula:

    $$x = x1 + x2 + bias$$

    $$y = \frac{x - \bar{x}}{\sqrt{Var(x) + eps}} * \gamma + \beta$$
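As a cross-check of the formula, the following is a minimal host-side sketch in plain C++ that evaluates it over the last axis. It is not the operator's implementation. The names AddLayerNormRef and AddLayerNormResult are made up for illustration, and the sketch assumes a single normalized trailing axis and the population variance (as np.var computes by default).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative host-side reference; not the device implementation.
struct AddLayerNormResult {
  std::vector<float> x;     // x1 + x2 (+ bias)
  std::vector<float> y;     // normalized result
  std::vector<float> mean;  // one mean per row
  std::vector<float> rstd;  // one 1 / sqrt(var + eps) per row
};

// rows: product of the leading (non-normalized) dims; cols: last-axis length.
// bias may be empty, gamma-shaped (cols elements), or x1-shaped (rows * cols elements).
AddLayerNormResult AddLayerNormRef(const std::vector<float> &x1, const std::vector<float> &x2,
                                   const std::vector<float> &gamma, const std::vector<float> &beta,
                                   const std::vector<float> &bias, double eps,
                                   std::size_t rows, std::size_t cols) {
  assert(x1.size() == rows * cols && x2.size() == rows * cols);
  assert(gamma.size() == cols && beta.size() == cols);
  assert(bias.empty() || bias.size() == cols || bias.size() == rows * cols);

  AddLayerNormResult r;
  r.x.resize(rows * cols);
  r.y.resize(rows * cols);
  r.mean.resize(rows);
  r.rstd.resize(rows);

  for (std::size_t i = 0; i < rows; ++i) {
    // Step 1: x = x1 + x2 + bias, accumulating the row sum for the mean.
    double sum = 0.0;
    for (std::size_t j = 0; j < cols; ++j) {
      const std::size_t k = i * cols + j;
      double v = static_cast<double>(x1[k]) + static_cast<double>(x2[k]);
      if (bias.size() == cols) {
        v += bias[j];
      } else if (bias.size() == rows * cols) {
        v += bias[k];
      }
      r.x[k] = static_cast<float>(v);
      sum += v;
    }
    const double mean = sum / static_cast<double>(cols);

    // Step 2: population variance of the row, then rstd = (var + eps)^(-0.5).
    double sq = 0.0;
    for (std::size_t j = 0; j < cols; ++j) {
      const double d = static_cast<double>(r.x[i * cols + j]) - mean;
      sq += d * d;
    }
    const double rstd = 1.0 / std::sqrt(sq / static_cast<double>(cols) + eps);
    r.mean[i] = static_cast<float>(mean);
    r.rstd[i] = static_cast<float>(rstd);

    // Step 3: y = (x - mean) * rstd * gamma + beta.
    for (std::size_t j = 0; j < cols; ++j) {
      const std::size_t k = i * cols + j;
      const double norm = (static_cast<double>(r.x[k]) - mean) * rstd;
      r.y[k] = static_cast<float>(norm * gamma[j] + beta[j]);
    }
  }
  return r;
}
```

A reference like this is mainly useful for comparing device results against host results within a tolerance; it makes no attempt to reproduce the rounding behavior of FLOAT16 or BFLOAT16 inputs.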

aclnnAddLayerNormGetWorkspaceSize

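  • Parameters:

    • x1 (aclTensor *, computation input): an input to the addition in AddLayerNorm. The operator computes x1 + x2 + bias and applies layer normalization to the result. Device-side aclTensor; shape supports 1 to 8 dimensions; data format supports ND.
      • Atlas inference series products: data types FLOAT32 and FLOAT16 are supported.
      • Atlas A2 training series products/Atlas 800I A2 inference products: data types FLOAT32, FLOAT16, and BFLOAT16 are supported.
    • x2 (aclTensor *, computation input): an input to the addition in AddLayerNorm. The operator computes x1 + x2 + bias and applies layer normalization to the result. Device-side aclTensor; shape supports 1 to 8 dimensions; data format supports ND.
      • Atlas inference series products: data types FLOAT32 and FLOAT16 are supported.
      • Atlas A2 training series products/Atlas 800I A2 inference products: data types FLOAT32, FLOAT16, and BFLOAT16 are supported.
    • beta (aclTensor *, computation input): corresponds to beta in the LayerNorm formula, the beta parameter of layer normalization. Device-side aclTensor; shape supports 1 to 8 dimensions; data format supports ND; its dimensions must equal the trailing axes of x1/x2.
      • Atlas inference series products: data types FLOAT32 and FLOAT16 are supported.
      • Atlas A2 training series products/Atlas 800I A2 inference products: data types FLOAT32, FLOAT16, and BFLOAT16 are supported.
    • gamma (aclTensor *, computation input): corresponds to gamma in the LayerNorm formula, the gamma parameter of layer normalization. Device-side aclTensor; shape supports 1 to 8 dimensions; data format supports ND; its dimensions must equal the trailing axes of x1/x2, and these trailing axes are the dimensions to be normalized.
      • Atlas inference series products: data types FLOAT32 and FLOAT16 are supported.
      • Atlas A2 training series products/Atlas 800I A2 inference products: data types FLOAT32, FLOAT16, and BFLOAT16 are supported.
    • bias (aclTensor *, computation input): optional input. An input to the addition in AddLayerNorm; the operator computes x1 + x2 + bias and applies layer normalization to the result. Its shape may match either gamma/beta or x1/x2. Device-side aclTensor; shape supports 1 to 8 dimensions; data format supports ND.
      • Atlas inference series products: data types FLOAT32 and FLOAT16 are supported.
      • Atlas A2 training series products/Atlas 800I A2 inference products: data types FLOAT32, FLOAT16, and BFLOAT16 are supported.
    • epsilon (double, computation input): the eps term in the formula, added to the denominator for numerical stability. A host-side scalar of type double; the default value is 1e-5.
    • additionalOut (bool, computation input): whether to output the sum x (x = x1 + x2, plus bias when it is provided). A host-side scalar of type bool.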
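    • meanOut (aclTensor *, computation output): the mean of the (x1 + x2) result computed during LayerNorm. Device-side aclTensor; data type FLOAT32; data format supports ND. Its shape must be broadcastable with x1: the leading dimensions equal the leading dimensions of x1, where the number of leading dimensions is the rank of x1 minus the rank of gamma (these are the dimensions that are not normalized). This output is invalid on Atlas inference series products. Computation logic: mean = np.mean(x1 + x2)
    • rstdOut (aclTensor *, computation output): the rstd result computed during LayerNorm. Device-side aclTensor; data type FLOAT32; data format supports ND. Its shape must be broadcastable with x1 (the leading dimensions equal the leading dimensions of x1). This output is invalid on Atlas inference series products. Computation logic: rstd = np.power((np.var(x1 + x2) + epsilon), (-0.5))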
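    • yOut (aclTensor *, computation output): the LayerNorm result y. Device-side aclTensor; shape must match the inputs x1/x2; data format supports ND.
      • Atlas inference series products: data types FLOAT32 and FLOAT16 are supported.
      • Atlas A2 training series products/Atlas 800I A2 inference products: data types FLOAT32, FLOAT16, and BFLOAT16 are supported.
    • xOut (aclTensor *, computation output): the output x, that is, the result of the addition that LayerNorm is applied to. Device-side aclTensor; shape must match the inputs x1/x2; data format supports ND.
      • Atlas inference series products: data types FLOAT32 and FLOAT16 are supported.
      • Atlas A2 training series products/Atlas 800I A2 inference products: data types FLOAT32, FLOAT16, and BFLOAT16 are supported.
    • workspaceSize (uint64_t *, output): returns the size of the workspace that must be allocated on the Device side.
    • executor (aclOpExecutor **, output): returns the op executor, which contains the operator computation flow.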
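  • Return value:

    aclnnStatus: the returned status code. (See aclnn return codes.)

    The first-stage interface validates the input parameters and reports an error in the following cases:
    161001 (ACLNN_ERR_PARAM_NULLPTR): a required input, an output, or a required attribute is passed as a null pointer.
    161002 (ACLNN_ERR_PARAM_INVALID): the data type of an input or output is outside the supported range.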
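The shape rule for meanOut and rstdOut can be sketched as a small host-side helper. This is an illustration, not part of the aclnn API: the name StatShape is hypothetical, and keeping the normalized axes as size-1 dimensions is an assumption taken from the sample on this page, where x1 has shape {1, 2, 8}, gamma has shape {8}, and the mean/rstd outputs have shape {1, 2, 1}.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: derive the meanOut/rstdOut shape from the x1 and gamma shapes.
// The leading (rank(x1) - rank(gamma)) dims are copied from x1; the normalized
// trailing dims are kept as size 1 so the result stays broadcastable with x1.
std::vector<int64_t> StatShape(const std::vector<int64_t> &xShape,
                               const std::vector<int64_t> &gammaShape) {
  assert(gammaShape.size() <= xShape.size());
  std::vector<int64_t> out = xShape;
  const std::size_t lead = xShape.size() - gammaShape.size();
  for (std::size_t i = lead; i < out.size(); ++i) {
    // gamma must match the trailing axes of x1.
    assert(xShape[i] == gammaShape[i - lead]);
    out[i] = 1;
  }
  return out;
}
```

For example, StatShape({1, 2, 8}, {8}) yields {1, 2, 1}, which matches outputMeanShape and outputRstdShape in the sample code.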

aclnnAddLayerNorm

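  • Parameters:

    • workspace (void *, input): address of the workspace memory allocated on the Device side.
    • workspaceSize (uint64_t, input): size of the workspace allocated on the Device side, obtained from the first-stage interface aclnnAddLayerNormGetWorkspaceSize.
    • executor (aclOpExecutor *, input): the op executor, which contains the operator computation flow.
    • stream (aclrtStream, input): the AscendCL stream on which the task is executed.
  • Return value:

    aclnnStatus: the returned status code. (For details, see aclnn return codes.)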

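Constraints and Limitations

  • Functional scope
    • Supported data types
      • Atlas inference series products: x1, x2, beta, gamma, and bias support FLOAT32 and FLOAT16.
      • Atlas A2 training series products/Atlas 800I A2 inference products: x1, x2, beta, gamma, and bias support FLOAT32, FLOAT16, and BFLOAT16.
      • rstd and mean support FLOAT32.
    • Supported data format: ND.
    • Atlas inference series products: the last-axis length of each of the five inputs x1, x2, beta, gamma, and bias must be at least 32 bytes.
  • Unsupported cases
    • DOUBLE: not supported.
    • Empty tensors: empty input and empty output are not supported.
    • Non-contiguous tensors: non-contiguous inputs are not supported.
  • Boundary value scenarios
    • When the input is inf, the output is inf.
    • When the input is nan, the output is nan.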
  • Supported data type combinations per product (each column gives the data type of the named tensor)
    • Atlas A2 training series products/Atlas 800I A2 inference products

      | x1 | x2 | gamma | beta | bias | y | mean | rstd | x |
      |----------|----------|----------|----------|----------|----------|---------|---------|----------|
      | float32  | float16  | float32  | float32  | float32  | float32  | float32 | float32 | float32  |
      | float32  | bfloat16 | float32  | float32  | float32  | float32  | float32 | float32 | float32  |
      | float16  | float32  | float32  | float32  | float32  | float32  | float32 | float32 | float32  |
      | bfloat16 | float32  | float32  | float32  | float32  | float32  | float32 | float32 | float32  |
      | float16  | float16  | float32  | float32  | float16  | float16  | float32 | float32 | float16  |
      | bfloat16 | bfloat16 | float32  | float32  | bfloat16 | bfloat16 | float32 | float32 | bfloat16 |
      | float16  | float16  | float16  | float16  | float16  | float16  | float32 | float32 | float16  |
      | bfloat16 | bfloat16 | bfloat16 | bfloat16 | bfloat16 | bfloat16 | float32 | float32 | bfloat16 |
      | float32  | float32  | float32  | float32  | float32  | float32  | float32 | float32 | float32  |

    • Atlas inference series products

      | x1 | x2 | gamma | beta | bias | y | mean | rstd | x |
      |---------|---------|---------|---------|---------|---------|---------|---------|---------|
      | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 |
      | float16 | float16 | float16 | float16 | float16 | float16 | float32 | float32 | float16 |
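The constraints above lend themselves to a host-side pre-check before the first-stage interface is called. The following is a sketch only: DType, InputCombo, and the helper names are hypothetical stand-ins rather than CANN types, the tables transcribe just the five input columns listed above, and how a combination should be treated when the optional bias is omitted is not covered.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for aclDataType, limited to the types in the tables.
enum class DType { F32, F16, BF16 };

// Data types of the five inputs, in table column order.
struct InputCombo {
  DType x1, x2, gamma, beta, bias;
};

// Input columns of the Atlas A2 training / Atlas 800I A2 inference table.
const InputCombo kA2Combos[] = {
    {DType::F32, DType::F16, DType::F32, DType::F32, DType::F32},
    {DType::F32, DType::BF16, DType::F32, DType::F32, DType::F32},
    {DType::F16, DType::F32, DType::F32, DType::F32, DType::F32},
    {DType::BF16, DType::F32, DType::F32, DType::F32, DType::F32},
    {DType::F16, DType::F16, DType::F32, DType::F32, DType::F16},
    {DType::BF16, DType::BF16, DType::F32, DType::F32, DType::BF16},
    {DType::F16, DType::F16, DType::F16, DType::F16, DType::F16},
    {DType::BF16, DType::BF16, DType::BF16, DType::BF16, DType::BF16},
    {DType::F32, DType::F32, DType::F32, DType::F32, DType::F32},
};

// Input columns of the Atlas inference series table.
const InputCombo kInferenceCombos[] = {
    {DType::F32, DType::F32, DType::F32, DType::F32, DType::F32},
    {DType::F16, DType::F16, DType::F16, DType::F16, DType::F16},
};

inline bool SameCombo(const InputCombo &a, const InputCombo &b) {
  return a.x1 == b.x1 && a.x2 == b.x2 && a.gamma == b.gamma && a.beta == b.beta && a.bias == b.bias;
}

// Linear search: true if the combination appears in the given table.
template <std::size_t N>
bool IsSupportedCombo(const InputCombo (&table)[N], const InputCombo &c) {
  for (std::size_t i = 0; i < N; ++i) {
    if (SameCombo(table[i], c)) {
      return true;
    }
  }
  return false;
}

// FLOAT32 occupies 4 bytes; FLOAT16 and BFLOAT16 occupy 2 bytes.
inline std::size_t ElemBytes(DType t) { return t == DType::F32 ? 4 : 2; }

// Atlas inference series constraint: the last axis must span at least 32 bytes.
inline bool TailAxisAtLeast32Bytes(int64_t lastDim, DType t) {
  assert(lastDim >= 0);
  return static_cast<std::size_t>(lastDim) * ElemBytes(t) >= 32;
}
```

With these helpers, a last axis of 8 elements satisfies the 32-byte rule for FLOAT32 (8 × 4 = 32 bytes) but not for FLOAT16 (8 × 2 = 16 bytes), where at least 16 elements would be needed.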

Usage Example

The following sample code is for reference only. For the build and run procedure, see Compiling and Running the Sample.

#include <cstdio>
#include <iostream>
#include <vector>
#include "acl/acl.h"
#include "aclnnop/aclnn_add_layer_norm.h"

#define CHECK_RET(cond, return_expr)\
do {                                \
  if (!(cond)) {                    \
    return_expr;                    \
  }                                 \
} while (0)

#define LOG_PRINT(message, ...)   \
    do {                          \
  printf(message, ##__VA_ARGS__); \
} while (0)

int64_t GetShapeSize(const std::vector<int64_t> &shape) {
  int64_t shapeSize = 1;
  for (auto i : shape) {
    shapeSize *= i;
  }
  return shapeSize;
}

int Init(int32_t deviceId, aclrtStream *stream) {
  // Boilerplate: initialize AscendCL
  auto ret = aclInit(nullptr);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclInit failed. ERROR: %d\n", ret); return ret);
  ret = aclrtSetDevice(deviceId);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtSetDevice failed. ERROR: %d\n", ret); return ret);
  ret = aclrtCreateStream(stream);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtCreateStream failed. ERROR: %d\n", ret); return ret);
  return 0;
}

template <typename T>
int CreateAclTensor(const std::vector<T> &hostData, const std::vector<int64_t> &shape, void **deviceAddr,
                    aclDataType dataType, aclTensor **tensor) {
  auto size = GetShapeSize(shape) * sizeof(T);
  // Allocate device-side memory with aclrtMalloc
  auto ret = aclrtMalloc(deviceAddr, size, ACL_MEM_MALLOC_HUGE_FIRST);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtMalloc failed. ERROR: %d\n", ret); return ret);
  // Copy the host-side data to device memory with aclrtMemcpy
  ret = aclrtMemcpy(*deviceAddr, size, hostData.data(), size, ACL_MEMCPY_HOST_TO_DEVICE);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtMemcpy failed. ERROR: %d\n", ret); return ret);

  // Compute the strides of a contiguous tensor
  std::vector<int64_t> strides(shape.size(), 1);
  for (int64_t i = shape.size() - 2; i >= 0; i--) {
    strides[i] = shape[i + 1] * strides[i + 1];
  }

  // Create the aclTensor with aclCreateTensor
  *tensor = aclCreateTensor(shape.data(), shape.size(), dataType, strides.data(), 0, aclFormat::ACL_FORMAT_ND,
                            shape.data(), shape.size(), *deviceAddr);
  return 0;
}

int main() {
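  // 1. (Boilerplate) Initialize the device and stream; see the AscendCL external interface list
  // Set deviceId according to the device actually in use
  int32_t deviceId = 0;
  aclrtStream stream;
  auto ret = Init(deviceId, &stream);
  // Handle the check result as needed
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("Init acl failed. ERROR: %d\n", ret); return ret);

  // 2. Construct the inputs and outputs according to the API definition; this sample runs one case without the optional bias input and one case with it
  // epsilon is declared as double to match the interface prototype
  double eps = 1e-6;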
  bool additionalOut = true;

  std::vector<int64_t> x1Shape = {1, 2, 8};
  std::vector<int64_t> x2Shape = {1, 2, 8};
  std::vector<int64_t> gammaShape = {8};
  std::vector<int64_t> betaShape = {8};
  std::vector<int64_t> biasShape = {8};

  std::vector<int64_t> outputYShape = {1, 2, 8};
  std::vector<int64_t> outputMeanShape = {1, 2, 1};
  std::vector<int64_t> outputRstdShape = {1, 2, 1};
  std::vector<int64_t> outputXShape = {1, 2, 8};

  void *x1DeviceAddr = nullptr;
  void *x2DeviceAddr = nullptr;
  void *betaDeviceAddr = nullptr;
  void *gammaDeviceAddr = nullptr;
  void *biasDeviceAddr = nullptr;

  // Device addresses of the outputs for the call without bias
  void *outputYDeviceAddr = nullptr;
  void *outputMeanDeviceAddr = nullptr;
  void *outputRstdDeviceAddr = nullptr;
  void *outputXDeviceAddr = nullptr;

  // Device addresses of the outputs for the call with bias
  void *outputYDeviceAddrBias = nullptr;
  void *outputMeanDeviceAddrBias = nullptr;
  void *outputRstdDeviceAddrBias = nullptr;
  void *outputXDeviceAddrBias = nullptr;

  aclTensor *x1 = nullptr;
  aclTensor *x2 = nullptr;
  aclTensor *beta = nullptr;
  aclTensor *gamma = nullptr;
  aclTensor *bias = nullptr;

  // Output aclTensors for the call without bias
  aclTensor *outputY = nullptr;
  aclTensor *outputMean = nullptr;
  aclTensor *outputRstd = nullptr;
  aclTensor *outputX = nullptr;

  // Output aclTensors for the call with bias
  aclTensor *outputYBias = nullptr;
  aclTensor *outputMeanBias = nullptr;
  aclTensor *outputRstdBias = nullptr;
  aclTensor *outputXBias = nullptr;

  std::vector<float> x1HostData = {1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2};
  std::vector<float> x2HostData = {4, 4, 4, 4, 4, 4, 4, 4, -3, -3, -3, -3, -3, -3, -3, -3};
  std::vector<float> gammaHostData = {2, 2, 2, 2, 2, 2, 2, 2};
  std::vector<float> betaHostData = {0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1};
  std::vector<float> biasHostData = {0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5};

  // Host data for the outputs of the call without bias
  std::vector<float> outputYHostData(1 * 2 * 8);
  std::vector<float> outputMeanHostData(2);
  std::vector<float> outputRstdHostData(2);
  std::vector<float> outputXHostData(1 * 2 * 8);

  // Host data for the outputs of the call with bias
  std::vector<float> outputYHostDataBias(1 * 2 * 8);
  std::vector<float> outputMeanHostDataBias(2);
  std::vector<float> outputRstdHostDataBias(2);
  std::vector<float> outputXHostDataBias(1 * 2 * 8);

  // Create the input aclTensors
  ret = CreateAclTensor(x1HostData, x1Shape, &x1DeviceAddr, aclDataType::ACL_FLOAT, &x1);
  CHECK_RET(ret == ACL_SUCCESS, return ret);
  ret = CreateAclTensor(x2HostData, x2Shape, &x2DeviceAddr, aclDataType::ACL_FLOAT, &x2);
  CHECK_RET(ret == ACL_SUCCESS, return ret);
  ret = CreateAclTensor(betaHostData, betaShape, &betaDeviceAddr, aclDataType::ACL_FLOAT, &beta);
  CHECK_RET(ret == ACL_SUCCESS, return ret);
  ret = CreateAclTensor(gammaHostData, gammaShape, &gammaDeviceAddr, aclDataType::ACL_FLOAT, &gamma);
  CHECK_RET(ret == ACL_SUCCESS, return ret);
  ret = CreateAclTensor(biasHostData, biasShape, &biasDeviceAddr, aclDataType::ACL_FLOAT, &bias);
  CHECK_RET(ret == ACL_SUCCESS, return ret);

  // Create the output aclTensors for the call without bias
  ret = CreateAclTensor(outputYHostData, outputYShape, &outputYDeviceAddr, aclDataType::ACL_FLOAT, &outputY);
  CHECK_RET(ret == ACL_SUCCESS, return ret);
  ret = CreateAclTensor(outputMeanHostData, outputMeanShape, &outputMeanDeviceAddr, aclDataType::ACL_FLOAT, &outputMean);
  CHECK_RET(ret == ACL_SUCCESS, return ret);
  ret = CreateAclTensor(outputRstdHostData, outputRstdShape, &outputRstdDeviceAddr, aclDataType::ACL_FLOAT, &outputRstd);
  CHECK_RET(ret == ACL_SUCCESS, return ret);
  ret = CreateAclTensor(outputXHostData, outputXShape, &outputXDeviceAddr, aclDataType::ACL_FLOAT, &outputX);
  CHECK_RET(ret == ACL_SUCCESS, return ret);

  // Create the output aclTensors for the call with bias
  ret = CreateAclTensor(outputYHostDataBias, outputYShape, &outputYDeviceAddrBias, aclDataType::ACL_FLOAT, &outputYBias);
  CHECK_RET(ret == ACL_SUCCESS, return ret);
  ret = CreateAclTensor(outputMeanHostDataBias, outputMeanShape, &outputMeanDeviceAddrBias, aclDataType::ACL_FLOAT, &outputMeanBias);
  CHECK_RET(ret == ACL_SUCCESS, return ret);
  ret = CreateAclTensor(outputRstdHostDataBias, outputRstdShape, &outputRstdDeviceAddrBias, aclDataType::ACL_FLOAT, &outputRstdBias);
  CHECK_RET(ret == ACL_SUCCESS, return ret);
  ret = CreateAclTensor(outputXHostDataBias, outputXShape, &outputXDeviceAddrBias, aclDataType::ACL_FLOAT, &outputXBias);
  CHECK_RET(ret == ACL_SUCCESS, return ret);

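  // aclnnAddLayerNorm call examples: one call with bias and one without
  // 3. Call the CANN operator library API; change this to the specific API name

  // 3.1 Example without the optional bias input
  // Call the first-stage interface of aclnnAddLayerNorm
  uint64_t workspaceSize = 0;
  aclOpExecutor *executor;
  LOG_PRINT("\nUse aclnnAddLayerNorm Non-Bias Port.\n");
  // Pass nullptr directly for the bias parameter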
  ret = aclnnAddLayerNormGetWorkspaceSize(x1, x2, gamma, beta, nullptr, eps, additionalOut,
                                      outputY, outputMean, outputRstd, outputX,
                                      &workspaceSize, &executor);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclnnAddLayerNormGetWorkspaceSize failed. ERROR: %d\n", ret);
            return ret);
  // Allocate device memory according to the workspaceSize computed by the first-stage interface
  void *workspaceAddr = nullptr;
  if (workspaceSize > 0) {
    ret = aclrtMalloc(&workspaceAddr, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("allocate workspace failed. ERROR: %d\n", ret); return ret;);
  }
  // Call the second-stage interface aclnnAddLayerNorm
  ret = aclnnAddLayerNorm(workspaceAddr, workspaceSize, executor, stream);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclnnAddLayerNorm failed. ERROR: %d\n", ret); return ret);

  // 3.2 Example with the optional bias input
  // Call the first-stage interface of aclnnAddLayerNorm
  uint64_t workspaceSizeBias = 0;
  aclOpExecutor *executorBias;
  LOG_PRINT("\nUse aclnnAddLayerNorm Bias Port.\n");
  // Pass bias as a regular argument
  ret = aclnnAddLayerNormGetWorkspaceSize(x1, x2, gamma, beta, bias, eps, additionalOut,
                                      outputYBias, outputMeanBias, outputRstdBias, outputXBias,
                                      &workspaceSizeBias, &executorBias);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclnnAddLayerNormGetWorkspaceSize failed. ERROR: %d\n", ret);
            return ret);
  // Allocate device memory according to the workspaceSize computed by the first-stage interface
  void *workspaceAddrBias = nullptr;
  if (workspaceSizeBias > 0) {
    ret = aclrtMalloc(&workspaceAddrBias, workspaceSizeBias, ACL_MEM_MALLOC_HUGE_FIRST);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("allocate workspace failed. ERROR: %d\n", ret); return ret;);
  }
  // Call the second-stage interface aclnnAddLayerNorm
  ret = aclnnAddLayerNorm(workspaceAddrBias, workspaceSizeBias, executorBias, stream);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclnnAddLayerNorm failed. ERROR: %d\n", ret); return ret);

  // 4. (Boilerplate) Synchronize and wait for the tasks to finish
  ret = aclrtSynchronizeStream(stream);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtSynchronizeStream failed. ERROR: %d\n", ret); return ret);

  // 5. Fetch the outputs: copy the results from device memory to the host; adapt to the specific API definition

  // 5.1 Copy out the outputs of the call without bias
  auto outputYSize = GetShapeSize(outputYShape);
  std::vector<float> resultDataY(outputYSize, 0);
  ret = aclrtMemcpy(resultDataY.data(), resultDataY.size() * sizeof(resultDataY[0]), outputYDeviceAddr,
                    outputYSize * sizeof(resultDataY[0]), ACL_MEMCPY_DEVICE_TO_HOST);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("copy result from device to host failed. ERROR: %d\n", ret); return ret);
  LOG_PRINT("==== AddLayerNorm non-bias: y output\n");
  for (int64_t i = 0; i < outputYSize; i++) {
    LOG_PRINT("result[%ld] is: %f\n", i, resultDataY[i]);
  }

  auto outputMeanSize = GetShapeSize(outputMeanShape);
  std::vector<float> resultDataMean(outputMeanSize, 0);
  ret = aclrtMemcpy(resultDataMean.data(), resultDataMean.size() * sizeof(resultDataMean[0]), outputMeanDeviceAddr,
                    outputMeanSize * sizeof(resultDataMean[0]), ACL_MEMCPY_DEVICE_TO_HOST);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("copy result from device to host failed. ERROR: %d\n", ret); return ret);
  LOG_PRINT("==== AddLayerNorm non-bias: mean output\n");
  for (int64_t i = 0; i < outputMeanSize; i++) {
    LOG_PRINT("result[%ld] is: %f\n", i, resultDataMean[i]);
  }

  auto outputRstdSize = GetShapeSize(outputRstdShape);
  std::vector<float> resultDataRstd(outputRstdSize, 0);
  ret = aclrtMemcpy(resultDataRstd.data(), resultDataRstd.size() * sizeof(resultDataRstd[0]), outputRstdDeviceAddr,
                    outputRstdSize * sizeof(resultDataRstd[0]), ACL_MEMCPY_DEVICE_TO_HOST);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("copy result from device to host failed. ERROR: %d\n", ret); return ret);
  LOG_PRINT("==== AddLayerNorm non-bias: rstd output\n");
  for (int64_t i = 0; i < outputRstdSize; i++) {
    LOG_PRINT("result[%ld] is: %f\n", i, resultDataRstd[i]);
  }

  auto outputXSize = GetShapeSize(outputXShape);
  std::vector<float> resultDataX(outputXSize, 0);
  ret = aclrtMemcpy(resultDataX.data(), resultDataX.size() * sizeof(resultDataX[0]), outputXDeviceAddr,
                    outputXSize * sizeof(resultDataX[0]), ACL_MEMCPY_DEVICE_TO_HOST);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("copy result from device to host failed. ERROR: %d\n", ret); return ret);
  LOG_PRINT("==== AddLayerNorm non-bias: x output\n");
  for (int64_t i = 0; i < outputXSize; i++) {
    LOG_PRINT("result[%ld] is: %f\n", i, resultDataX[i]);
  }

  // 5.2 Copy out the outputs of the call with bias
  auto outputYSizeBias = GetShapeSize(outputYShape);
  std::vector<float> resultDataYBias(outputYSizeBias, 0);
  ret = aclrtMemcpy(resultDataYBias.data(), resultDataYBias.size() * sizeof(resultDataYBias[0]), outputYDeviceAddrBias,
                    outputYSizeBias * sizeof(resultDataYBias[0]), ACL_MEMCPY_DEVICE_TO_HOST);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("copy result from device to host failed. ERROR: %d\n", ret); return ret);
  LOG_PRINT("==== AddLayerNorm bias: y output\n");
  for (int64_t i = 0; i < outputYSizeBias; i++) {
    LOG_PRINT("result[%ld] is: %f\n", i, resultDataYBias[i]);
  }

  auto outputMeanSizeBias = GetShapeSize(outputMeanShape);
  std::vector<float> resultDataMeanBias(outputMeanSizeBias, 0);
  ret = aclrtMemcpy(resultDataMeanBias.data(), resultDataMeanBias.size() * sizeof(resultDataMeanBias[0]), outputMeanDeviceAddrBias,
                    outputMeanSizeBias * sizeof(resultDataMeanBias[0]), ACL_MEMCPY_DEVICE_TO_HOST);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("copy result from device to host failed. ERROR: %d\n", ret); return ret);
  LOG_PRINT("==== AddLayerNorm bias: mean output\n");
  for (int64_t i = 0; i < outputMeanSizeBias; i++) {
    LOG_PRINT("result[%ld] is: %f\n", i, resultDataMeanBias[i]);
  }

  auto outputRstdSizeBias = GetShapeSize(outputRstdShape);
  std::vector<float> resultDataRstdBias(outputRstdSizeBias, 0);
  ret = aclrtMemcpy(resultDataRstdBias.data(), resultDataRstdBias.size() * sizeof(resultDataRstdBias[0]), outputRstdDeviceAddrBias,
                    outputRstdSizeBias * sizeof(resultDataRstdBias[0]), ACL_MEMCPY_DEVICE_TO_HOST);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("copy result from device to host failed. ERROR: %d\n", ret); return ret);
  LOG_PRINT("==== AddLayerNorm bias: rstd output\n");
  for (int64_t i = 0; i < outputRstdSizeBias; i++) {
    LOG_PRINT("result[%ld] is: %f\n", i, resultDataRstdBias[i]);
  }

  auto outputXSizeBias = GetShapeSize(outputXShape);
  std::vector<float> resultDataXBias(outputXSizeBias, 0);
  ret = aclrtMemcpy(resultDataXBias.data(), resultDataXBias.size() * sizeof(resultDataXBias[0]), outputXDeviceAddrBias,
                    outputXSizeBias * sizeof(resultDataXBias[0]), ACL_MEMCPY_DEVICE_TO_HOST);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("copy result from device to host failed. ERROR: %d\n", ret); return ret);
  LOG_PRINT("==== AddLayerNorm bias: x output\n");
  for (int64_t i = 0; i < outputXSizeBias; i++) {
    LOG_PRINT("result[%ld] is: %f\n", i, resultDataXBias[i]);
  }


  // 6. Destroy the aclTensor and aclScalar objects; adapt to the specific API definition
  aclDestroyTensor(x1);
  aclDestroyTensor(x2);
  aclDestroyTensor(beta);
  aclDestroyTensor(gamma);
  aclDestroyTensor(bias);

  aclDestroyTensor(outputY);
  aclDestroyTensor(outputMean);
  aclDestroyTensor(outputRstd);
  aclDestroyTensor(outputX);

  aclDestroyTensor(outputYBias);
  aclDestroyTensor(outputMeanBias);
  aclDestroyTensor(outputRstdBias);
  aclDestroyTensor(outputXBias);

  // 7. Release device resources; adapt to the specific API definition
  aclrtFree(x1DeviceAddr);
  aclrtFree(x2DeviceAddr);
  aclrtFree(gammaDeviceAddr);
  aclrtFree(betaDeviceAddr);
  aclrtFree(biasDeviceAddr);

  aclrtFree(outputYDeviceAddr);
  aclrtFree(outputMeanDeviceAddr);
  aclrtFree(outputRstdDeviceAddr);
  aclrtFree(outputXDeviceAddr);

  aclrtFree(outputYDeviceAddrBias);
  aclrtFree(outputMeanDeviceAddrBias);
  aclrtFree(outputRstdDeviceAddrBias);
  aclrtFree(outputXDeviceAddrBias);

  if (workspaceSize > 0) {
    aclrtFree(workspaceAddr);
  }

  if (workspaceSizeBias > 0) {
    aclrtFree(workspaceAddrBias);
  }

  aclrtDestroyStream(stream);
  aclrtResetDevice(deviceId);
  aclFinalize();
  return 0;
}
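The stride computation inside CreateAclTensor can be exercised on its own, without a device. The sketch below repeats the same loop in a standalone function and adds a small offset helper to show what the strides mean; ContiguousStrides and LinearOffset are illustrative names and not part of AscendCL.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Strides (in elements) of a contiguous row-major tensor: the last axis has
// stride 1, and each earlier axis has stride shape[i + 1] * strides[i + 1].
std::vector<int64_t> ContiguousStrides(const std::vector<int64_t> &shape) {
  std::vector<int64_t> strides(shape.size(), 1);
  for (int64_t i = static_cast<int64_t>(shape.size()) - 2; i >= 0; i--) {
    strides[i] = shape[i + 1] * strides[i + 1];
  }
  return strides;
}

// Element offset of a multi-dimensional index under the given strides.
int64_t LinearOffset(const std::vector<int64_t> &index, const std::vector<int64_t> &strides) {
  assert(index.size() == strides.size());
  int64_t offset = 0;
  for (std::size_t i = 0; i < index.size(); ++i) {
    offset += index[i] * strides[i];
  }
  return offset;
}
```

For the sample's x1 shape {1, 2, 8}, this gives strides {16, 8, 1}, so element (0, 1, 3) sits at offset 11 in x1HostData.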