Introduction

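The large-model sparse quantization tool consists of three parts: sparsification, quantization, and compression.

Sparsification: the model sparsification tool uses an algorithm to judge how important each element of the model weights is to the accuracy of the result, and sets to zero the weight values that have little impact on final accuracy.
Quantization: both weights and activations are quantized, converting higher-bit floating-point values to 8 bits, which directly reduces weight size and improves performance.
Compression: the weight compression tool further encodes and compresses the model weights with a compression algorithm, reducing weight size as far as possible and producing compressed weights and index files.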
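The three stages can be pictured with a toy sketch. It is not the tool's algorithm: magnitude-based pruning stands in for the importance estimate, symmetric per-tensor int8 stands in for the quantizer, and a simple nonzero-values-plus-mask encoding stands in for the hardware-specific compression format. All function names are hypothetical.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero the fraction `sparsity` of entries with the smallest |value|."""
    n_zero = int(len(weights) * sparsity)
    # Indices sorted by ascending magnitude; the first n_zero are dropped.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_zero])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization; returns (values, scale)."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0
    return [max(-128, min(127, round(w / scale))) for w in weights], scale


def compress(q_weights):
    """Keep only nonzero values plus a 0/1 mask recording their positions."""
    mask = [1 if q != 0 else 0 for q in q_weights]
    values = [q for q in q_weights if q != 0]
    return values, mask


def decompress(values, mask):
    """Rebuild the dense int8 list from the nonzero values and the mask."""
    it = iter(values)
    return [next(it) if m else 0 for m in mask]


weights = [0.02, -1.27, 0.5, -0.01, 0.9, 0.03, -0.6, 0.001]
sparse = prune_by_magnitude(weights, sparsity=0.5)
q, scale = quantize_int8(sparse)
values, mask = compress(q)
assert decompress(values, mask) == q  # compression is lossless on int8 data
```

In this toy, half of the entries become zero, so the compressed form stores four int8 values and an eight-entry mask instead of eight float values.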

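The compression algorithm is tightly coupled to the hardware; only the Atlas 300I Duo inference card supports sparse quantization.
bfloat16 weights do not support sparse quantization.

Directory structure of the weights after sparsification and quantization: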

├─ config.json
├─ quant_model_weight_w8a8s.safetensors
├─ quant_model_description_w8a8s.json
├─ tokenizer_config.json
├─ tokenizer.json
└─ tokenizer.model

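The quantization output consists of the weight file quant_model_weight_w8a8s.safetensors and the weight description file quant_model_description_w8a8s.json.
The remaining files in the directory are configuration files required for inference and vary slightly between models.

The following shows part of the post-quantization weight description file quant_model_description_w8a8s.json: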

{
  "model_quant_type": "W8A8S",
  "model.embed_tokens.weight": "FLOAT",
  "model.layers.0.self_attn.q_proj.weight": "W8A8S",
  "model.layers.0.self_attn.q_proj.input_scale": "W8A8S",
  "model.layers.0.self_attn.q_proj.input_offset": "W8A8S",
  "model.layers.0.self_attn.q_proj.quant_bias": "W8A8S",
  "model.layers.0.self_attn.q_proj.deq_scale": "W8A8S",
}

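After quantization, each MatMul weight gains input_scale, input_offset, quant_bias, and deq_scale. input_scale and input_offset are used to quantize the activations. MatMul then computes on the quantized activations and the quantized weights. quant_bias and deq_scale are used to dequantize the MatMul result.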
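The role of the four tensors can be sketched in plain Python. This is a minimal illustration under assumptions that the text does not confirm: activations are quantized as round(x / input_scale) + input_offset, quant_bias is added to the integer accumulator, and deq_scale is a per-output-channel float multiplier. The function names are hypothetical, and the sketch uses ordinary Python numbers rather than the on-disk dtypes.

```python
def quantize_activation(x, input_scale, input_offset):
    """Map float activations to int8: round(x / scale) + offset, clamped."""
    return [max(-128, min(127, round(v / input_scale) + input_offset)) for v in x]


def quant_matmul(x, w_q, input_scale, input_offset, quant_bias, deq_scale):
    """x: [k] floats, w_q: [n][k] int8 weights; returns [n] floats."""
    x_q = quantize_activation(x, input_scale, input_offset)
    out = []
    for row, bias, scale in zip(w_q, quant_bias, deq_scale):
        acc = sum(a * w for a, w in zip(x_q, row)) + bias  # integer accumulate
        out.append(acc * scale)                            # dequantize
    return out


# Toy values (assumed): weight scale 0.5 per channel, activation scale 0.01.
w_q = [[2, -1], [4, 3]]                       # int8 weights, shape [n=2][k=2]
input_scale, input_offset = 0.01, 3
# Assumed convention: the bias cancels the activation offset term.
quant_bias = [-input_offset * sum(row) for row in w_q]
deq_scale = [input_scale * 0.5] * 2
y = quant_matmul([0.5, -1.0], w_q, input_scale, input_offset, quant_bias, deq_scale)
```

With these toy values the result matches the float product of x = [0.5, -1.0] with the dequantized weights [[1.0, -0.5], [2.0, 1.5]].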

Directory structure of the weights after compression:

├─ config.json
├─ part0-of-4
│  ├─ quant_model_weight_w8a8sc.safetensors
│  └─ quant_model_description_w8a8sc.json
├─ part1-of-4
│  ├─ quant_model_weight_w8a8sc.safetensors
│  └─ quant_model_description_w8a8sc.json
├─ part2-of-4
│  ├─ quant_model_weight_w8a8sc.safetensors
│  └─ quant_model_description_w8a8sc.json
├─ part3-of-4
│  ├─ quant_model_weight_w8a8sc.safetensors
│  └─ quant_model_description_w8a8sc.json
├─ tokenizer_config.json
├─ tokenizer.json
└─ tokenizer.model

Before compression, the weights are loaded and split across multiple cards; the compression algorithm must run on the split weights.
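Because each card gets its own partN-of-M subdirectory, a quick completeness check of the output can be sketched as follows. The helper names are hypothetical; the directory and file names follow the layout shown above.

```python
from pathlib import Path


def expected_part_files(root, world_size):
    """List the per-rank files expected under a compressed weight directory."""
    files = []
    for rank in range(world_size):
        part = Path(root) / f"part{rank}-of-{world_size}"
        files.append(part / "quant_model_weight_w8a8sc.safetensors")
        files.append(part / "quant_model_description_w8a8sc.json")
    return files


def missing_part_files(root, world_size):
    """Return the expected files that do not exist on disk."""
    return [p for p in expected_part_files(root, world_size) if not p.exists()]
```

Calling missing_part_files(path, 4) on a complete four-card output should return an empty list.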

The following shows part of the post-compression weight description file part0-of-4/quant_model_description_w8a8sc.json:

{
  "model_quant_type": "W8A8SC",
  "model.embed_tokens.weight": "FLOAT",
  "model.layers.0.self_attn.query_key_value.weight": "W8A8SC",
  "model.layers.0.self_attn.query_key_value.index": "W8A8SC",
  "model.layers.0.self_attn.query_key_value.info": "W8A8SC",
  "model.layers.0.self_attn.query_key_value.input_scale": "W8A8S",
  "model.layers.0.self_attn.query_key_value.input_offset": "W8A8S",
  "model.layers.0.self_attn.query_key_value.deq_scale": "W8A8S",
  "model.layers.0.self_attn.query_key_value.quant_bias": "W8A8S",
}

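Compared with quantization alone, each compressed MatMul weight additionally carries index; this compression information is used to restore the weights.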
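To see which tensors are compressed and which are only quantized, the description file can be grouped by its type labels. The snippet below is a small sketch that parses the excerpt above inline (with the trailing comma removed so that it is valid JSON); group_by_quant_type is a hypothetical helper name.

```python
import json
from collections import defaultdict

# Excerpt of the description file shown above, trailing comma removed.
description = json.loads("""
{
  "model_quant_type": "W8A8SC",
  "model.embed_tokens.weight": "FLOAT",
  "model.layers.0.self_attn.query_key_value.weight": "W8A8SC",
  "model.layers.0.self_attn.query_key_value.index": "W8A8SC",
  "model.layers.0.self_attn.query_key_value.info": "W8A8SC",
  "model.layers.0.self_attn.query_key_value.input_scale": "W8A8S",
  "model.layers.0.self_attn.query_key_value.input_offset": "W8A8S",
  "model.layers.0.self_attn.query_key_value.deq_scale": "W8A8S",
  "model.layers.0.self_attn.query_key_value.quant_bias": "W8A8S"
}
""")


def group_by_quant_type(desc):
    """Group tensor names by their quantization label."""
    groups = defaultdict(list)
    for name, qtype in desc.items():
        if name == "model_quant_type":  # global label, not a tensor
            continue
        groups[qtype].append(name)
    return dict(groups)


groups = group_by_quant_type(description)
```

In this excerpt, weight, index, and info carry the W8A8SC label, while the four scale and bias tensors keep the W8A8S label.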

Figure 1 Inference flow with quantized weights

Table 1 dtype and shape of each tensor after quantizing float16 weights (assuming the original weight has shape [n, k])
Tensor	weight	input_scale	input_offset	quant_bias	deq_scale	index
dtype	int8	float16	int8	int32	int64	int8
shape	[x]	[1]	[1]	[n]	[n]	[y]

Here x lies in the range (0, n * k), and y is computed as follows:
y = k_index * n_index * 8
k_index = ceil(k1 / tilingK)
n_index = ceil(n1 / tilingN)
k1 = k / 32
n1 = n / 16
where ceil() rounds up to the nearest integer, and tilingK and tilingN are default parameters of sparse quantization.
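The formula for the index length y in Table 1 can be evaluated with a short helper. The defaults of tilingK and tilingN are not given here, so the sketch takes them as required arguments, and the value 8 used in the example is purely illustrative. The function name is hypothetical.

```python
def index_length(n, k, tiling_k, tiling_n):
    """Length y of the index tensor for a weight of shape [n, k].

    Follows y = k_index * n_index * 8 with k1 = k / 32 and n1 = n / 16.
    ceil((k / 32) / tiling_k) equals ceil(k / (32 * tiling_k)), which is
    computed here with exact integer arithmetic.
    """
    k_index = -(-k // (32 * tiling_k))  # ceil(k1 / tilingK)
    n_index = -(-n // (16 * tiling_n))  # ceil(n1 / tilingN)
    return k_index * n_index * 8


# Illustrative tiling values only; the real defaults are not stated here.
y = index_length(n=4096, k=4096, tiling_k=8, tiling_n=8)  # 16 * 32 * 8 = 4096
```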

Generating Weights

Taking LLaMa-33B as an example:

Use the following commands to generate the W8A8S quantized weights.

cd ${ATB_SPEED_HOME_PATH}
get_down_proj_disable_name() {
    local num_layer=$1
    local disable_names=""
    for ((i=0; i<$num_layer; i++)); do
        disable_names="$disable_names model.layers.$i.mlp.down_proj"
    done
    echo "$disable_names"
}
disable_names=$(get_down_proj_disable_name 60)
python -m examples.convert.model_slim.quantifier --model_path {float weight path} --save_directory {W8A8S quantized weight path} --calib_file $llm_path/examples/convert/model_slim/boolq.jsonl --disable_names $disable_names --act_method 2 --do_smooth True --use_sigma True --is_lowbit True --co_sparse True --w_bit 4 --tokenizer_args '{"padding_side":"left","pad_token":"<unk>"}'

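The commands above contain the optimal parameter configuration for generating LLaMa-33B W8A8SC sparse-quantized weights. Parameter configurations differ between models; refer to each model's Readme file.
After the weights are generated, copy the special_tokens_map.json file from the float weight directory to the W8A8S quantized weight path.
The config.json of the W8A8S quantized weights should contain the quantize field with the value "w8a8s".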
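The two follow-up steps, copying special_tokens_map.json and checking the quantize field, can be scripted. The sketch below uses hypothetical helper names and demonstrates them on temporary directories, not real weight paths.

```python
import json
import shutil
import tempfile
from pathlib import Path


def copy_special_tokens_map(float_dir, quant_dir):
    """Copy special_tokens_map.json from the float weights to the quantized ones."""
    src = Path(float_dir) / "special_tokens_map.json"
    return shutil.copy(src, Path(quant_dir) / src.name)


def check_quantize_field(weight_dir, expected):
    """Return True if config.json has a quantize field equal to `expected`."""
    with open(Path(weight_dir) / "config.json", encoding="utf-8") as f:
        config = json.load(f)
    return config.get("quantize") == expected


# Demonstration on throwaway directories (stand-ins for the real paths).
with tempfile.TemporaryDirectory() as tmp:
    float_dir, quant_dir = Path(tmp) / "float", Path(tmp) / "w8a8s"
    float_dir.mkdir()
    quant_dir.mkdir()
    (float_dir / "special_tokens_map.json").write_text("{}", encoding="utf-8")
    (quant_dir / "config.json").write_text(
        json.dumps({"quantize": "w8a8s"}), encoding="utf-8")
    copy_special_tokens_map(float_dir, quant_dir)
    copied = (quant_dir / "special_tokens_map.json").exists()
    ok = check_quantize_field(quant_dir, "w8a8s")
```

The same check applies to a compressed weight directory by passing "w8a8sc" as the expected value.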

Use the following command to compress the quantized weights and generate the W8A8SC quantized weights.

torchrun --nproc_per_node 2 -m examples.convert.model_slim.sparse_compressor --model_path {W8A8S quantized weight path} --save_directory {W8A8SC quantized weight path}

The config.json of the W8A8SC quantized weights should contain the quantize field with the value "w8a8sc".

Running Inference

Taking LLaMa-33B as an example, you can run a dialogue test with the following commands; the inference prompt is "What's deep learning?".

cd ${ATB_SPEED_HOME_PATH}
bash examples/models/llama/run_pa.sh {W8A8SC quantized weight path}
