Using Datasets

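The following are two datasets commonly used for performance testing. Two scripts are provided that automatically load the model's tokenizer and convert each dataset to token IDs. Note that the OA data has a long average sequence length and more than 3,000 entries in total. When the model is large (65B or more) and the service is configured with a small MaxBatchSize, running the full dataset is slow and can take several hours.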

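OA dataset

  1. Click the link to obtain the OA dataset.
  2. Convert the dataset to token IDs.

    Use tokenizer_model.encode to encode each text sample into token IDs.

    A sample Python script is shown below: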

    import csv
    import glob
    from pathlib import Path

    import pyarrow.parquet as pq
    from transformers import AutoTokenizer

    # Each sample is truncated to at most this many token IDs.
    MAX_TOKENS = 2048

    def read_oa(dataset_path, tokenizer_model):
        """Tokenize the Chinese ('zh') samples of every parquet file in dataset_path."""
        out_list = []
        for file_path in glob.glob((Path(dataset_path) / "*.parquet").as_posix()):
            data = pq.read_table(file_path).to_pandas()
            # Keep only the rows whose language is Chinese.
            data = data[data['lang'] == 'zh']
            for ques in data['text'].to_list():
                tokens = tokenizer_model.encode(ques)
                # Truncate by the sample's own length (the original compared
                # len(out_list), the number of samples, by mistake).
                out_list.append(tokens[:MAX_TOKENS])
        return out_list

    def save_csv(file_path, out_tokens_list):
        """Write one sample per CSV row, one token ID per column."""
        with open(file_path, 'w', newline='') as csvfile:
            csv_writer = csv.writer(csvfile)
            for row in out_tokens_list:
                csv_writer.writerow(row)

    if __name__ == '__main__':
        model_path = "/data/models/baichuan2-7b"  # directory of the model and its tokenizer
        oa_dir = "/home/xxx/oasst1"               # directory containing the *.parquet files
        save_path = "oa_tokens.csv"
        tokenizer_model = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=True)
        tokens_lists = read_oa(oa_dir, tokenizer_model)
        save_csv(save_path, tokens_lists)
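
Because total runtime depends on how many samples there are and how long they are, it can help to inspect the generated CSV before starting a long test. The following is a minimal sketch, not part of the original script: the helper names load_token_csv and summarize are hypothetical, and the demo writes a small temporary file instead of reading a real oa_tokens.csv. It assumes the file layout produced by save_csv above (one sample per row, one token ID per column).

```python
import csv
import os
import tempfile


def load_token_csv(file_path):
    """Read a token CSV back into a list of integer lists (hypothetical helper).

    Empty rows, which csv.writer emits for empty token lists, are skipped.
    """
    with open(file_path, newline='') as csvfile:
        return [[int(tok) for tok in row] for row in csv.reader(csvfile) if row]


def summarize(token_lists):
    """Return (sample count, average length, maximum length)."""
    lengths = [len(tokens) for tokens in token_lists]
    if not lengths:
        return 0, 0.0, 0
    return len(lengths), sum(lengths) / len(lengths), max(lengths)


# Demo data standing in for a real tokenized dataset.
demo_rows = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_tokens.csv")
    with open(demo_path, 'w', newline='') as csvfile:
        csv.writer(csvfile).writerows(demo_rows)
    loaded = load_token_csv(demo_path)

count, avg_len, max_len = summarize(loaded)
print(count, avg_len, max_len)
```

To check a real output file, call load_token_csv("oa_tokens.csv") and pass the result to summarize. A maximum length above 2048 would indicate that truncation was not applied.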