Supported local data formats: json, jsonl, parquet, txt, csv.
Data is preprocessed by running the script tools/preprocess_data.py.
The following examples illustrate the typical cases.
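For local json/jsonl data, each record is expected to carry its text under the key(s) passed via --json-keys (the default key is typically text; treat this as an assumption and check the script's arguments). A minimal jsonl example under that assumption:

```json
{"text": "Full text of the first document."}
{"text": "Full text of the second document."}
```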
The complete dataset is available on Hugging Face (e.g. THUCNews):
```shell
python tools/preprocess_data.py \
    --input /your/path/to/cnews.train.txt \
    --output-prefix thucnews \
    --dataset-impl mmap \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file /your/path/to/gpt2-merges.txt \
    --vocab /your/path/to/gpt2-vocab.json \
    --append-eod \
    --workers 2
```
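On success, a memory-mapped binary pair is written under the given --output-prefix. Assuming the usual Megatron-style naming of {output-prefix}_{json-key}_{level} (an assumption, not verified against this script), the outputs look roughly like:

```text
thucnews_text_document.bin
thucnews_text_document.idx
```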
Hugging Face provides only a dataset loading script (e.g. wikipedia):
```shell
hf_config_json="./hf_ds_json.json"
cat <<EOT > $hf_config_json
{
    "path": "wikipedia",
    "name": "20220301.en"
}
EOT
```
```shell
python tools/preprocess_data.py \
    --input /home/to/data/wikipedia \
    --output-prefix wikipedia \
    --hf-datasets-params ${hf_config_json} \
    --dataset-impl mmap \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file /your/path/to/gpt2-merges.txt \
    --vocab /your/path/to/gpt2-vocab.json \
    --append-eod \
    --workers 2
```
Building fine-tuning data (e.g. alpaca):
```shell
wget https://huggingface.co/datasets/c-s-ale/alpaca-gpt4-data-zh/tree/main
```
```shell
python tools/preprocess_data.py \
    --input /your/path/to/alpaca_cn/data \
    --handler-name GeneralInstructionHandler \
    --output-prefix alpaca_cn \
    --dataset-impl mmap \
    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path your/path/to/llama_model \
    --tokenizer-not-use-fast \
    --append-eod \
    --workers 8
```
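GeneralInstructionHandler consumes Alpaca-style records. A made-up sample in the usual instruction / input / output schema (the real contents come from the downloaded dataset):

```json
{
    "instruction": "给出三条保持健康的建议。",
    "input": "",
    "output": "1. 保持均衡饮食;2. 规律锻炼;3. 保证充足睡眠。"
}
```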
Key parameters:
- --input: path to the input file/directory, or the name of a Hugging Face dataset.
- --output-prefix: prefix of the generated output files.
- --dataset-impl: output dataset implementation; mmap produces memory-mapped binary files.
- --hf-datasets-params: JSON file with extra arguments passed when loading a Hugging Face dataset.
- --tokenizer-type: tokenizer to use, e.g. GPT2BPETokenizer (with --vocab and --merge-file) or PretrainedFromHF (with --tokenizer-name-or-path).
- --tokenizer-not-use-fast: do not use the fast tokenizer implementation.
- --handler-name: data handler used to convert raw samples, e.g. GeneralInstructionHandler for instruction-tuning data.
- --append-eod: append an end-of-document token after each sample.
- --workers: number of preprocessing worker processes.
For public Hugging Face datasets, the preprocessing flow is as follows:
```python
# A public dataset (or its loading script) is loaded directly via datasets.load_dataset.
raw_datasets = load_dataset(
    args.input,
    split=split_flag,
    num_proc=None if args.streaming else args.workers,
    cache_dir=cache_dir,
    streaming=args.streaming
)
```
```python
# For local inputs, collect the data files under args.input and keep only
# those whose extension matches the detected data format.
data_files = [args.input] if os.path.isfile(args.input) else \
    glob.glob(os.path.join(args.input, '*'))
ext, data_format = _get_data_format(data_files)
filtered_data_files = list(filter(lambda x: x.split('.')[-1] == ext, data_files))
```
For instruction-tuning data, GeneralInstructionHandler wraps each sample with an Alpaca-style prompt template:

```python
class AlpacaTemplate:
    system_token = ""
    user_token = "### Instruction:"
    assistant_token = "### Response:"
    end_token = ""
    system = ("Below is an instruction that describes a task, paired with an input that provides further context. "
              "Write a response that appropriately completes the request. "
              "Please note that you need to think through your response logically and step by step.")
```
The template is rendered into a training prompt by generate_training_prompt:

```python
def generate_training_prompt(self, messages) -> str:
    prompt = self.template.system_token + "\n" + self.template.system + self.template.end_token + "\n"

    for message in messages:
        if message["role"] == self.user_role:
            prompt += self.template.user_token + "\n" + message["content"] + self.template.end_token + "\n"
        else:
            prompt += self.template.assistant_token + "\n" + message["content"] \
                      + self.template.end_token + "\n"

    return prompt
```
The final prompt therefore follows an instruction-token + content + end_token structure for each turn, prefixed by the system prompt.
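For example, a hypothetical two-turn sample (one user instruction, one assistant response) rendered by generate_training_prompt looks roughly like this (the leading blank line comes from the empty system_token):

```text

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. Please note that you need to think through your response logically and step by step.
### Instruction:
Translate "hello" into French.
### Response:
Bonjour
```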
The prompted samples are then tokenized with datasets.map, keeping only the columns listed in json_keys:

```python
def get_tokenized_data(self):
    """get tokenized (and prompted) data"""
    # drop every column that is not one of the requested json_keys,
    # then map the tokenization/prompting step over the remaining samples
    columns = next(iter(self.raw_datasets)).keys()
    remove_columns = list(set(columns) - set(self.args.json_keys))
    proc_kwargs = {} if self.args.streaming else {"num_proc": self.args.workers}
    return self.raw_datasets.map(self._filter, remove_columns=remove_columns, **proc_kwargs)
```
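A minimal sketch (not the repository code; the toy dataset and fake_tokenize below are made up, assuming json_keys == ["text"]) of how remove_columns behaves together with datasets.map:

```python
from datasets import Dataset

# toy dataset with extra columns that are not in json_keys
raw = Dataset.from_list([
    {"text": "hello world", "url": "http://example.com", "title": "demo"},
])

json_keys = ["text"]
columns = next(iter(raw)).keys()
remove_columns = list(set(columns) - set(json_keys))  # ['url', 'title'] (order may vary)

def fake_tokenize(sample):
    # stand-in for the real prompting + tokenization step
    return {"text": sample["text"].split()}

tokenized = raw.map(fake_tokenize, remove_columns=remove_columns)
print(tokenized[0])  # {'text': ['hello', 'world']}
```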