When the activation is geglu/swiglu/reglu, enabling the fused operator for performance is subject to a threshold: only cases in which the vector time of the small operators making up the FFN structure in the whole network reaches 30 us and accounts for 10% or more of the total should try the FFN fused operator. Alternatively, when the small-operator performance is unknown, try enabling FFN and disable it if performance degrades.
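When small-operator timings are unavailable, one way to apply this rule is to time the fused operator directly against an equivalent small-operator composition, as in the sketch below. The geglu composition (gating one half of the first matmul's output with gelu), the weight-shape convention N1 = 2 * K2 for glu-type activations, and the timing harness are all illustrative assumptions; verify them against the operator's constraints.

import time
import torch
import torch_npu

def ffn_small_ops(x, w1, w2):
    # geglu built from small operators: split the first matmul's output
    # in half, gate one half with gelu, then apply the second matmul.
    a, b = torch.matmul(x, w1).chunk(2, dim=-1)
    return torch.matmul(torch.nn.functional.gelu(a) * b, w2)

def avg_time(fn, iters=100):
    torch.npu.synchronize()
    start = time.time()
    for _ in range(iters):
        fn()
    torch.npu.synchronize()
    return (time.time() - start) / iters

x = torch.randn(1954, 2560, dtype=torch.float16).npu()
w1 = torch.randn(2560, 10240, dtype=torch.float16).npu()  # N1 = 2 * K2 (assumed glu convention)
w2 = torch.randn(5120, 2560, dtype=torch.float16).npu()

t_small = avg_time(lambda: ffn_small_ops(x, w1, w2))
t_fused = avg_time(lambda: torch_npu.npu_ffn(x, w1, w2, "geglu"))
print(f"small ops: {t_small * 1e6:.1f} us, fused: {t_fused * 1e6:.1f} us")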
npu_ffn(Tensor x, Tensor weight1, Tensor weight2, str activation, *, int[]? expert_tokens=None, int[]? expert_tokens_index=None, Tensor? bias1=None, Tensor? bias2=None, Tensor? scale=None, Tensor? offset=None, Tensor? deq_scale1=None, Tensor? deq_scale2=None, Tensor? antiquant_scale1=None, Tensor? antiquant_scale2=None, Tensor? antiquant_offset1=None, Tensor? antiquant_offset2=None, int? inner_precise=None, ScalarType? output_dtype=None) -> Tensor
M is the number of tokens, corresponding to BS in a Transformer (B: Batch, the batch size of the input samples; S: Seq-Length, the sequence length of the input samples). K1 is the number of input channels of the first matmul, corresponding to H in a Transformer (Head-Size, the size of the hidden layer). N1 is the number of output channels of the first matmul. K2 is the number of input channels of the second matmul. N2 is the number of output channels of the second matmul, corresponding to H. E is the number of experts in scenarios with experts.
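A minimal sketch of the shape relationships described above, for the non-expert case with a non-glu activation (the concrete sizes are illustrative assumptions taken from the examples below):

import torch

M, K1, N1 = 1954, 2560, 5120     # tokens, first matmul input/output channels
K2, N2 = N1, K1                  # K2 equals N1 here; N2 corresponds to H, like K1
x = torch.randn(M, K1, dtype=torch.float16)         # input:         (M, K1)
weight1 = torch.randn(K1, N1, dtype=torch.float16)  # first matmul:  (K1, N1)
weight2 = torch.randn(K2, N2, dtype=torch.float16)  # second matmul: (K2, N2)
# Output y has shape (M, N2), matching x's leading dimension.
# In the expert scenario the weights carry a leading E dimension:
#   weight1: (E, K1, N1), weight2: (E, K2, N2)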
In the BFLOAT16 non-quantized scenario, the inner_precise parameter can only be set to 0; in the FLOAT16 non-quantized scenario it can be set to 0 or 1; in the quantized or pseudo-quantized scenario both 0 and 1 can be set, but the setting has no effect.
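For instance, a caller might choose inner_precise from the input dtype as sketched below (the pick_inner_precise helper and the specific shapes are hypothetical illustrations; the constraints themselves come from the paragraph above):

import torch
import torch_npu

def pick_inner_precise(dtype):
    # BFLOAT16 non-quantized scenario: only 0 is valid.
    if dtype == torch.bfloat16:
        return 0
    # FLOAT16 non-quantized scenario: either 0 or 1 is valid.
    return 1

x = torch.randn(4, 1280, dtype=torch.bfloat16).npu()
w1 = torch.randn(1280, 5120, dtype=torch.bfloat16).npu()
w2 = torch.randn(5120, 1280, dtype=torch.bfloat16).npu()
y = torch_npu.npu_ffn(x, w1, w2, "fastgelu",
                      inner_precise=pick_inner_precise(x.dtype))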
One output of Tensor type, the y in the formula. Supported data types are FLOAT16 and BFLOAT16; the supported data format is ND. The output dimensions are consistent with those of x.
# Single-operator (eager) invocation
import torch
import torch_npu

# Create the inputs on the host, then move them to the NPU when calling.
cpu_x = torch.randn(1, 1280, dtype=torch.float16)
cpu_weight1 = torch.randn(1280, 10240, dtype=torch.float16)
cpu_weight2 = torch.randn(10240, 1280, dtype=torch.float16)
activation = "fastgelu"
npu_out = torch_npu.npu_ffn(cpu_x.npu(), cpu_weight1.npu(), cpu_weight2.npu(), activation, inner_precise=1)
# Graph mode: capturing the torch API into a graph with torchair
import logging
import os
import torch
import torch_npu
import torchair as tng
from torchair.configs.compiler_config import CompilerConfig
from torchair.core.utils import logger

logger.setLevel(logging.DEBUG)
os.environ["ENABLE_ACLNN"] = "true"

config = CompilerConfig()
config.debug.graph_dump.type = "pbtxt"
npu_backend = tng.get_npu_backend(compiler_config=config)

class MyModel(torch.nn.Module):
    def forward(self, x, weight1, weight2, activation, expert):
        return torch_npu.npu_ffn(x, weight1, weight2, activation,
                                 expert_tokens=expert, inner_precise=1)

cpu_model = MyModel()
cpu_x = torch.randn(1954, 2560, dtype=torch.float16)
cpu_weight1 = torch.randn(16, 2560, 5120, dtype=torch.float16)  # E=16 experts
cpu_weight2 = torch.randn(16, 5120, 2560, dtype=torch.float16)
activation = "fastgelu"
# Per-expert token counts; they sum to M=1954.
expert = [227, 62, 78, 126, 178, 27, 122, 1, 19, 182, 166, 118, 66, 217, 122, 243]

model = cpu_model.npu()
model = torch.compile(model, backend=npu_backend, dynamic=True)
npu_out = model(cpu_x.npu(), cpu_weight1.npu(), cpu_weight2.npu(), activation, expert)