LLaMA 65B-PyTorch

LLaMA-33B/65B

The LLaMA model comes from: LLaMA: Open and Efficient Foundation Language Models

Touvron, Hugo, et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv preprint arXiv:2302.13971 (2023).

Training

Hardware configuration for LLaMA-33B/65B training:

Hardware    Configuration
NPU         32 x Ascend NPUs

Dataset

The model is trained on the alpaca dataset.

Script

  1. Clone the repository to your own server
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
mkdir logs
mkdir ckpt
  2. Set up the environment
# python3.8
conda create -n test python=3.8
conda activate test

# Install torch and torch_npu
# ARM
wget https://download.pytorch.org/whl/torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0.post4_XXXXXX-cp38-cp38m-manylinux2014_aarch64.whl

# X86
#pip install torch==2.1.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
#pip install torch_npu-2.1.0.post4_XXXXXX-cp38-cp38m-manylinux2014_aarch64.whl

# Install megatron-core
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core

# Install deepspeed and deepspeed_npu
pip install deepspeed==0.9.2
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
cd deepspeed_npu
pip3 install -e ./
cd ..

# Install the remaining requirements
pip install -r requirements.txt
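As an optional sanity check after installation, you can confirm that torch and torch_npu load together and that an NPU device is visible (a minimal sketch; it assumes torch_npu registers the NPU backend as torch.npu):

# Optional: verify the torch / torch_npu installation
python -c "import torch, torch_npu; print(torch.__version__, torch.npu.is_available())"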
  3. Download the weights

llama-33B weights

mkdir tokenizer
cd ./tokenizer

# git-lfs needs to be installed
git lfs install
git clone https://huggingface.co/pinkmanlove/llama-33b-hf
cd ..

llama-65B weights

mkdir tokenizer
cd ./tokenizer

# git-lfs needs to be installed
git lfs install
git clone https://huggingface.co/pinkmanlove/llama-65b-hf
cd ..
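Since the checkpoints are distributed via git-lfs, it is worth checking that the real weight shards were pulled rather than small LFS pointer files (an optional check; exact sizes depend on the model, but the directories should be tens of GB):

# Optional: the cloned directories should be tens of GB in size
du -sh ./tokenizer/llama-33b-hf ./tokenizer/llama-65b-hf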
  4. Convert the pretrained weights from HuggingFace format to AscendSpeed format

llama-33B

mkdir model_weights

SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
python $SCRIPT_PATH \
      --input-model-dir ./tokenizer \
      --output-model-dir ./model_weights \
      --tensor-model-parallel-size 4 \
      --pipeline-model-parallel-size 4 \
      --merge-mlp \
      --type 30B

llama-65B

mkdir model_weights

SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
python $SCRIPT_PATH \
      --input-model-dir ./tokenizer \
      --output-model-dir ./model_weights \
      --tensor-model-parallel-size 8 \
      --pipeline-model-parallel-size 4 \
      --type 65B
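The --tensor-model-parallel-size and --pipeline-model-parallel-size used for conversion have to match the parallel layout of the training script that later loads the checkpoint. A small, hypothetical shell check for the llama-65B layout on the 32-NPU configuration above:

# Hypothetical check: TP x PP must divide the number of available devices
TP=8; PP=4; WORLD_SIZE=32
(( WORLD_SIZE % (TP * PP) == 0 )) && echo "TP=${TP}, PP=${PP} fits ${WORLD_SIZE} devices"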
  5. Download the dataset
# Download the alpaca dataset
wget https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json

# Download the tokenizer configuration and (optionally) the weights:
# https://huggingface.co/pinkmanlove/llama-33b-hf
# https://huggingface.co/pinkmanlove/llama-65b-hf
# Change "LLaMATokenizer" in tokenizer_config.json to "LlamaTokenizer" (this works around an HF bug)
mkdir dataset
python tools/preprocess_data.py --input alpaca_data.json \
                                --output-prefix dataset/alpaca \
                                --tokenizer-type PretrainedFromHF \
                                --tokenizer-name-or-path llama-33b-hf \
                                --tokenizer-not-use-fast \
                                --handler-name GeneralInstructionHandler
# For llama-65B, use --tokenizer-name-or-path llama-65b-hf instead
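The preprocessing step writes binary document/index files under the --output-prefix. Listing the directory is a quick way to confirm it completed; the exact filenames depend on the handler, so this is only an illustrative check:

# Confirm the preprocessed dataset files were produced
ls -lh dataset/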
  6. Configure the llama-33B/65B pretraining scripts:

AscendSpeed/examples/llama/pretrain_llama_33B_ptd_32p.sh

AscendSpeed/examples/llama/pretrain_llama_65B_ptd_32p.sh

# Modify the ascend-toolkit path
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1

# Configure the vocabulary, data paths, etc.
TOKENIZER_PATH=./dataset/llama_tokenizer # line 16
DATA_PATH=./dataset/llama_text_document # line 17
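TOKENIZER_PATH should point at the HuggingFace tokenizer directory downloaded in step 3, and DATA_PATH at the prefix produced by the preprocessing in step 5. The values below are only an illustrative mapping onto the paths used earlier in this guide; adjust them to wherever your files actually live:

# Example only, assuming the directory layout from steps 3 and 5
TOKENIZER_PATH=./tokenizer/llama-33b-hf
DATA_PATH=./dataset/alpaca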
  7. Launch the pretraining script:

Launch the llama-33B pretraining script: AscendSpeed/examples/llama/pretrain_llama_33B_ptd_32p.sh

bash examples/llama/pretrain_llama_33B_ptd_32p.sh

Launch the llama-65B pretraining script: AscendSpeed/examples/llama/pretrain_llama_65B_ptd_32p.sh

bash examples/llama/pretrain_llama_65B_ptd_32p.sh

Configure the llama-33B/65B pretraining scripts for multi-node training (launch the script on every node of the cluster):

MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=4
NODE_RANK=0
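MASTER_ADDR must be set to the same reachable address of the rank-0 node on every machine, while NODE_RANK differs per node. A minimal sketch for a 4-node cluster (the IP address is a placeholder):

# On every node (placeholder IP of the rank-0 node):
MASTER_ADDR=192.168.1.10
MASTER_PORT=6001
NNODES=4
# Node 0 uses NODE_RANK=0, node 1 uses NODE_RANK=1, ..., node 3 uses NODE_RANK=3
NODE_RANK=0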

The training log looks like the following:

iteration 11/50000 | consumed samples: 5632 | consumed tokens: 11534336 | elapsed time per iteration (ms): 52728.1 | learning rate: 1.499E-05 | global batch size: 512 | lm loss: 1.376514E+01 | loss scale: 65536.0 | grad norm: 459.628 | actual seqlen: 2048 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 9.710 | TFLOPs: 167.52 |
time (ms)

Performance

Throughput

Throughput comparison of LLaMA-33B/65B on Ascend chips and the reference chips:

Device        Model        Throughput (tokens/s/p)
Reference     llama-33B    776
NPUs          llama-33B    621
Reference     llama-65B    426
NPUs          llama-65B    348
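The NPU numbers can be cross-checked against the training log above: tokens per iteration = global batch size x sequence length = 512 x 2048, divided by the elapsed time per iteration and the 32 devices (a back-of-the-envelope check based on the log values shown earlier):

# 512 * 2048 tokens / 52.7281 s / 32 NPUs ~= 621 tokens/s/p (llama-33B NPU row)
awk 'BEGIN { printf "%.0f tokens/s/p\n", 512 * 2048 / 52.7281 / 32 }'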

Accuracy

NPU vs. reference loss and relative error:

LLaMA-33B

[Figure: NPU vs. reference loss]

[Figure: NPU relative error]

LLaMA-65B

[Figure: NPU vs. reference loss]

[Figure: NPU relative error]

Inference

We support inference for text generation with LLaMA-33B and LLaMA-65B. Inference differs from pretraining in that, for example, we need to load the pretrained weights and set the length of the generated samples:

Configure the LLaMA-33B inference script examples/llama/generate_llama_33B_ptd.sh

Configure the LLaMA-65B inference script examples/llama/generate_llama_65B_tp8_pp1.sh

# Modify the model weight path and the tokenizer path
CHECKPOINT=<checkpoint-path>
VOCAB_FILE=<vocabfile-path>
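Illustrative values, assuming the converted weights from step 4 and the tokenizer downloaded in step 3 (adjust to your actual paths):

# Example paths only
CHECKPOINT=./model_weights/
VOCAB_FILE=./tokenizer/llama-33b-hf/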

LLaMA-33B:

bash ./examples/llama/generate_llama_33B_ptd.sh

LLaMA-65B:

bash ./examples/llama/generate_llama_65B_tp8_pp1.sh

Some inference samples are shown below:

LLaMA-33B:

llama-13B_generate.png

LLaMA-65B:

llama-65B_generate.png

Evaluation with baseline datasets

We use the BoolQ benchmark to evaluate our model. The benchmark can be downloaded here.

Configure the LLaMA-33B evaluation script:

    CHECKPOINT=./llama-33b-tp4-pp2/
    VOCAB_FILE=./llama-33b-hf/
    # Configure the task and data path
    DATA_PATH="./boolq/data/test/"
    TASK="boolq"
    # Configure the generation parameters
    python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/evaluation/evaluation_llama.py   \
         --task-data-path $DATA_PATH \
         --task $TASK \
         --seq-length 1024 \
         --max-new-tokens 2 \
         --max-position-embeddings 1024 \
         --tensor-model-parallel-size 4  \
         --pipeline-model-parallel-size 2  \
         --num-layers 60 \
         --hidden-size 6656  \
         --ffn-hidden-size 17920 \
         --load ${CHECKPOINT}  \
         --num-attention-heads 52  \
         --tokenizer-type PretrainedFromHF  \
         --tokenizer-name-or-path ${VOCAB_FILE} \
         --tokenizer-not-use-fast \
         --fp16  \
         --micro-batch-size 1  \
         --position-embedding-type rope \
         --normalization RMSNorm \
         --mlp-layer-fusion \
         --seed 42
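DISTRIBUTED_ARGS is not defined in the snippet above. A typical single-node setting for the TP=4 x PP=2 layout used here, following the usual torch.distributed.launch arguments (an assumption, not taken verbatim from the example script), would be:

# Assumed launcher arguments: 8 processes = TP(4) x PP(2) on one node
DISTRIBUTED_ARGS="--nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 6001"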
# Start the evaluation
# llama-65B evaluation
bash tasks/evaluation/evaluate_llama_65B_tp8_pp1.sh

Evaluation results of LLaMA-33B and LLaMA-65B on Ascend NPUs:

Task      Model        Ascend value    Community value
BoolQ     LLaMA-33B    83.2            83.1
BoolQ     LLaMA-65B    85.7            86.6

Citation

@article{Touvron2023llama,
  title={LLaMA: Open and Efficient Foundation Language Models},
  author={Hugo Touvron*, Thibaut Lavril*, Gautier Izacard*, Xavier Martinet, Marie-Anne Lachaux, Timothee Lacroix, Baptiste Roziere, Naman Goyal,
  Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave*, Guillaume Lample*},
  journal={arXiv preprint arXiv:2302.13971},
  year={2023}}
Before using the model resources and services, please carefully read and fully understand the Ascend Deep Learning Model License Agreement 3.0 (《昇腾深度学习模型许可协议 3.0》).