Usage:

Run the script with bash tasks/evaluation/eval.sh. An example of the arguments configured in the script:

python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation.py   \
    --task-data-path $DATA_PATH \
    --task $TASK \
    --seq-length 2048 \
    --max-new-tokens 1 \
    --max-position-embeddings 2048 \
    --tensor-model-parallel-size 2  \
    --pipeline-model-parallel-size 4  \
    --num-layers 32  \
    --hidden-size 4096  \
    --ffn-hidden-size 11008 \
    --load ${CHECKPOINT}  \
    --num-attention-heads 32  \
    --tokenizer-type PretrainedFromHF  \
    --tokenizer-name-or-path $VOCAB_FILE \
    --tokenizer-not-use-fast \
    --fp16  \
    --micro-batch-size 1  \
    --seed 42 | tee logs/train.log

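Key parameters:

task-data-path: path(s) to the test data of each task; each path is a str.
task: set TASK to the dataset to evaluate. If several values are given, all of them are evaluated. Note: because the datasets need different run parameters, evaluating several datasets in one run is not recommended.
CHECKPOINT: path to the trained model.
max-new-tokens: maximum number of tokens to generate (same meaning as in online inference).
tp and pp: set to the usual values; their product must match the number of devices on a single node.
tokenizer-name-or-path: set VOCAB_FILE to the vocabulary path.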
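
The constraint on tp and pp can be checked before launching. The following is a minimal sketch, assuming that "match" means the product equals the device count and that the node has 8 devices; the helper name check_parallel_sizes is hypothetical and is not part of AscendSpeed.

```python
def check_parallel_sizes(tensor_parallel: int, pipeline_parallel: int,
                         devices_per_node: int) -> bool:
    """Return True when tp * pp equals the number of devices on the node."""
    return tensor_parallel * pipeline_parallel == devices_per_node


# Values from the example command above: tp=2, pp=4.
# The 8-device node is an assumption made for illustration.
if not check_parallel_sizes(2, 4, 8):
    raise ValueError("tp * pp does not match the number of devices")
```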

Overview

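AscendSpeed adds an evaluation module that supports downstream tasks and evaluation datasets for mainstream industry models. Both the downstream tasks and the datasets are extensible. The module depends on the interfaces of the online inference module.

The tasks/evaluation directory provides the eval_api interface and the eval_impl implementation files. A separate evaluation class is provided for each downstream-task dataset, and each class calls the corresponding evaluation interface.

In the eval_impl directory, each .py file corresponds to the evaluation of one dataset. The gsm8k, mmlu, and ceval datasets require few-shot prompting, so few-shot templates for these datasets are provided in the fewshot_template folder in the same directory. They are read and concatenated into the prompt at run time.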
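
The concatenation step can be sketched as below. This is an illustration only: the template string mirrors the CEvalExam default listed in this document, but the helper build_prompt and the sample few-shot text are made up here, and how the real module loads files from fewshot_template is not shown.

```python
def build_prompt(instruction_template: str, fewshot_template: str,
                 question: str) -> str:
    """Fill the few-shot examples and the question into the instruction template."""
    return instruction_template.format(fewshot_template=fewshot_template,
                                       question=question)


# Same shape as the CEvalExam default instruction template.
CEVAL_TEMPLATE = "{fewshot_template}\n\n问：{question}\n答："

# Hypothetical few-shot text; in practice it would be read from a template file.
fewshot = "问：1+1=?\nA. 1\nB. 2\nC. 3\nD. 4\n答：B"
question = "2+2=?\nA. 1\nB. 2\nC. 3\nD. 4"

prompt = build_prompt(CEVAL_TEMPLATE, fewshot, question)
print(prompt)
```

The prompt ends with the answer cue, so with --max-new-tokens 1 the model is expected to produce only the option letter.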

Each .py file defines the implementation class for its task:

class BoolqEval(DatasetEval):
    def __init__(self, test_dir,
                 instruction_template="{passage}\nQuestion: {question}?\nAnswer:"):
        self.test_dir = test_dir
        self.instruction_template = instruction_template

class CEvalExam(DatasetEval):
    def __init__(self, test_dir,
                 instruction_template="{fewshot_template}\n\n问：{question}\n答："):
        self.test_dir = test_dir
        self.instruction_template = instruction_template

class Gsm8kEval(DatasetEval):
    def __init__(self, test_dir,
                 instruction_template="{fewshot_template}\n\n{question}",
                 output_template=r'The answer is (.*?) .'):
        self.test_dir = test_dir
        self.instruction_template = instruction_template
        self.output_template = output_template

class MmluEval(DatasetEval):
    def __init__(self, test_dir,
                 instruction_template="{few_shot_examples}\n\n"
                                      "{question}\nAnswer:",
                 output_template1=r".*(?P<answer>[A|B|C|D])\..*",
                 output_template2=r"(?P<answer>[A|B|C|D])"):
        self.test_dir = test_dir
        self.instruction_template = instruction_template
        self.output_template = [output_template1, output_template2]

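Parameters passed when a task is initialized:

test_dir: str, path to the test data;
instruction_template: str, template for the instruction; whether few-shot examples are needed depends on the specific task;
output_template: str, template (a regular expression) used to match the answer in the model output.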
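
To show how an output template might be applied, here is a minimal sketch that tries the two MmluEval patterns in order and returns the first match. The helper extract_answer and the use of re.search are assumptions for illustration; the actual matching logic in eval_impl may differ.

```python
import re
from typing import List, Optional


def extract_answer(output: str, output_templates: List[str]) -> Optional[str]:
    """Return the 'answer' group of the first template that matches, else None."""
    for pattern in output_templates:
        match = re.search(pattern, output)
        if match:
            return match.group("answer")
    return None


# The two default patterns from MmluEval, copied as listed.
# Note: the character class [A|B|C|D] also accepts a literal '|'.
MMLU_TEMPLATES = [r".*(?P<answer>[A|B|C|D])\..*", r"(?P<answer>[A|B|C|D])"]

print(extract_answer("The answer is B. Because ...", MMLU_TEMPLATES))
print(extract_answer("C", MMLU_TEMPLATES))
```

Trying the stricter pattern first (a letter followed by a period) and falling back to a bare letter keeps the sketch tolerant of both long and single-token outputs.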

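AscendSpeed currently provides the following task types: YES/NO, UNIQUE ANSWER, and MULTIPLE CHOICES (to be updated). Each evaluation dataset class inherits from DatasetEval and implements the eval(llm_chat) method, where llm_chat is the LLM to be evaluated. For now, evaluation is provided for datasets with definite answers, and the reported metric is accuracy = number of correctly answered questions / total number of questions. For example, the score for the BoolQ dataset is computed as follows: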

if rank == 0:
    logging.info("Boolq dataset acc = %d/%d=%e", acc_n, len(boolq_question_list),
                  acc_n / len(boolq_question_list))
    total_n += len(boolq_question_list)
    total_acc_n += acc_n
    answer_result['Boolq_dataset'] = subject_result
    score_datas.append(['Boolq_dataset', len(boolq_question_list), acc_n / len(boolq_question_list)])

Each evaluation class is initialized with the loading path of its input data, the instruction template, and the output template.
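
The accuracy metric described above can be reproduced with a few lines. This is a minimal sketch under the assumption that predictions and reference answers are already normalized strings; the function name accuracy and the sample data are hypothetical.

```python
from typing import List, Tuple


def accuracy(predictions: List[str], references: List[str]) -> Tuple[int, int, float]:
    """Return (correct, total, correct / total) for exact-match answers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    total = len(references)
    correct = sum(pred == ref for pred, ref in zip(predictions, references))
    # Avoid dividing by zero when the dataset is empty.
    return correct, total, (correct / total if total else 0.0)


# Made-up BoolQ-style answers, for illustration only.
acc_n, total_n, acc = accuracy(["True", "False", "True"], ["True", "True", "True"])
print(f"acc = {acc_n}/{total_n} = {acc:.4f}")
```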

Downstream task	Dataset	Answer type	Evaluated models
Common Sense Reasoning	BoolQ (Clark et al., 2019)	T or F	LLAMA
Mathematical reasoning	GSM8K	Unique answer	LLAMA
Massive Multitask Language Understanding	MMLU	Multiple choices	LLAMA, LLAMA2, baichuan
The evaluation results on the three types of downstream tasks are as follows:
