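Preparing the NPU Training Environment

After MindX DL is installed, you can submit a training job with a YAML file to check that the system runs properly.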

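Obtaining the Training Image

Use one of the following methods to obtain a training image:

(Recommended) Download a base training image (for example, ascend-mindspore or mindspore-modelzoo) from the Ascend image repository according to the system architecture (ARM or x86). Modify the base image so that the default user in the container is root (from version 21.0.4 onward, the default user of the base training image is non-root). The base image does not contain training scripts, code, or similar files; during training, these files are usually mapped into the container by mounting.

Alternatively, build your own training image on top of the base training image. For the build procedure, see Building a Container Image Using a Dockerfile (MindSpore).

You can rename the training image, for example, mindspore:b035.

(Recommended) Harden the image

See Container Image Security Hardening.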

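Obtaining the Training Scripts

If you use the resumable training function, download the resnet code from the r1.5 branch of the MindSpore code repository as the training code.
If you do not use the resumable training function, log in to ModelZoo and download the "ResNet-50" training code for the MindSpore framework.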

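Preparing the Dataset

Prepare the dataset for ResNet-50 yourself, and comply with the dataset's terms of use.
As the administrator, upload the dataset to the storage node.
1. Go to the "/data/atlas_dls/public" directory and upload the dataset to any location under it, for example, "/data/atlas_dls/public/dataset/imagenet".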
```
root@ubuntu:/data/atlas_dls/public/dataset/imagenet# pwd
/data/atlas_dls/public/dataset/imagenet
```
2. Run the du -sh command to check the dataset size.
```
root@ubuntu:/data/atlas_dls/public/dataset/imagenet# du -sh
176M
```
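The same check can be scripted before a job is submitted. The following is a minimal sketch, not part of MindX DL: the helper name `dataset_summary` is hypothetical, and the path is only the example location used in this guide.

```python
import os


def dataset_summary(root):
    """Return (file_count, total_bytes) for all regular files under root."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):  # skip broken symlinks
                file_count += 1
                total_bytes += os.path.getsize(path)
    return file_count, total_bytes


if __name__ == "__main__":
    # Example location from this guide; adjust to where the dataset was uploaded.
    root = "/data/atlas_dls/public/dataset/imagenet"
    if not os.path.isdir(root):
        print(f"{root} does not exist")
    else:
        count, size = dataset_summary(root)
        print(f"{count} files, {size / 2**20:.1f} MiB under {root}")
```

The sketch sums apparent file sizes, whereas du reports disk usage, so the two figures can differ slightly.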

Modifying the Training Scripts

Create the code directory. Run the following command on the host:
mkdir /data/atlas_dls/code
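Obtain the training scripts.
1. Decompress the previously downloaded training code locally and upload it to the code directory you created.
  - If the training job does not need resumable training, upload the "ResNet50_for_MindSpore_{version}_code" directory, where version is the code version.
  - If the training job needs resumable training, upload the "resnet" directory under "models-r1.5/models-r1.5/official/cv". The following steps use the "resnet" directory as an example.
2. Go to the "MindXDL-deploy" repository and select the "3.0.RC3" branch. Obtain the "train_start.sh", "utils.sh", and "rank_table.sh" files from the "samples/train" directory and combine them with the "scripts" directory of the training code to build the following directory structure on the host.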
```
root@ubuntu:/data/atlas_dls/code/ResNet50_for_MindSpore_1.4_code/scripts/#
scripts/
├── docker_start.sh
├── run_standalone_train_gpu.sh
├── run_standalone_train.sh
 ...
├── rank_table.sh
├── utils.sh
└── train_start.sh
```
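Before starting a job, you may want to confirm that the three helper scripts named above were actually copied into the "scripts" directory. The following is a minimal sketch under stated assumptions: `missing_scripts` is a hypothetical helper, and the directory in the usage line is only an example for the resumable training case.

```python
import os

# Helper scripts taken from the MindXDL-deploy "samples/train" directory.
REQUIRED_SCRIPTS = ("train_start.sh", "utils.sh", "rank_table.sh")


def missing_scripts(scripts_dir, required=REQUIRED_SCRIPTS):
    """Return the names in `required` that are not regular files in scripts_dir."""
    return [name for name in required
            if not os.path.isfile(os.path.join(scripts_dir, name))]


if __name__ == "__main__":
    # Example path (assumption); adjust to your own code directory.
    scripts_dir = "/data/atlas_dls/code/resnet/scripts"
    missing = missing_scripts(scripts_dir)
    print("missing:", ", ".join(missing) if missing else "none")
```

An empty result means all three scripts are present; any name that is printed still has to be copied from the "samples/train" directory.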

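If you do not use resumable training, skip this step. Otherwise, modify the configuration file "resnet50_imagenet2012_config.yaml" under the "/data/atlas_dls/code/resnet" directory.

Configure the settings for saving and loading models and for saving and loading graph compilation results.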

...
run_distribute: False
enable_profiling: False
data_path: "/cache/data"
output_path: "/cache/train" # Checkpoint save path; change it to suit your environment
load_path: "/cache/checkpoint_path/"
device_target: "Ascend"
checkpoint_path: "./checkpoint/"
checkpoint_file_path: ""
...
net_name: "resnet50"
dataset: "imagenet2012"
device_num: 1
pre_trained: "/job/code/output/checkpoint/ckpt_0" # Path inside the container from which the pre-trained model is loaded (a directory or a file); set it according to your training YAML
run_eval: False
eval_dataset_path: ""
parameter_server: False
filter_weight: False
save_best_ckpt: True
eval_start_epoch: 40
...
network_dataset: "resnet50_imagenet2012"


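# Retraining options
save_graphs: False  # Whether to save graph compilation results
save_graphs_path: "./graphs" # Path for saving graph compilation results
has_trained_epoch: 0 # Number of epochs already trained; default: 0
has_trained_step: 0 # Number of steps already trained; default: 0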
---
# Help text for each configuration item
enable_modelarts: "Whether training on modelarts, default: False"
...
batch_size: "Batch size for training and evaluation"
epoch_size: "Total training epochs."
checkpoint_path: "The location of the checkpoint file."
checkpoint_file_path: "The location of the checkpoint file."
save_graphs: "Whether save graphs during training, default: False."
save_graphs_path: "Path to save graphs."
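The train.py excerpt in this section builds the checkpoint directory by joining output_path, checkpoint_path, and a "ckpt_0" subdirectory. The following is a minimal sketch of how the values shown above compose; `resolve_ckpt_dir` is a hypothetical helper written for illustration and is not part of the training code.

```python
import os


def resolve_ckpt_dir(output_path, checkpoint_path, rank_dir="ckpt_0"):
    """Join the config values as the train.py excerpt does, then normalize the result."""
    return os.path.normpath(os.path.join(output_path, checkpoint_path, rank_dir))


if __name__ == "__main__":
    # Values taken from the configuration shown above.
    print(resolve_ckpt_dir("/cache/train", "./checkpoint/"))
    # -> /cache/train/checkpoint/ckpt_0
```

Because os.path.join discards earlier components when a later one is absolute, an absolute checkpoint_path would bypass output_path entirely, so a relative value such as "./checkpoint/" is the safer choice here.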

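Go to the "/data/atlas_dls/code/resnet/" directory and check the checkpoint saving and loading logic in the "train.py" file. If changes are needed, see the tutorials on the MindSpore official website.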

...
def set_parameter():
    """set_parameter"""
    target = config.device_target
    if target == "CPU":
        config.run_distribute = False


    # init context
    rank_save_graphs_path = os.path.join(config.save_graphs_path, "soma")


    # Whether open graph saving
    config.save_graphs = not config.pre_trained  # Save graph compilation results only when no pre-trained model is configured

    # init context
    if config.mode_name == 'GRAPH':
        if target == "Ascend":
            rank_save_graphs_path = os.path.join(config.save_graphs_path, "soma", str(os.getenv('DEVICE_ID')))
            context.set_context(mode=context.GRAPH_MODE, device_target=target, save_graphs=config.save_graphs,
                                save_graphs_path=rank_save_graphs_path)
        else:
            context.set_context(mode=context.GRAPH_MODE, device_target=target, save_graphs=config.save_graphs)
        set_graph_kernel_context(target, config.net_name)
...
def load_pre_trained_checkpoint():
    """
    Load checkpoint according to pre_trained path.
    """
    param_dict = None
    if config.pre_trained:
        if os.path.isdir(config.pre_trained):
            ckpt_save_dir = os.path.join(config.output_path, config.checkpoint_path, "ckpt_0")
            ckpt_pattern = os.path.join(ckpt_save_dir, "*.ckpt")
            ckpt_files = glob.glob(ckpt_pattern)
            if not ckpt_files:
                logger.warning(f"There is no ckpt file in {ckpt_save_dir}, "
                               f"pre_trained is unsupported.")
            else:
                ckpt_files.sort(key=os.path.getmtime, reverse=True)
                time_stamp = datetime.datetime.now()
                print(f"time stamp {time_stamp.strftime('%Y.%m.%d-%H:%M:%S')}"
                      f" pre trained ckpt model {ckpt_files[0]} loading",
                      flush=True)
                param_dict = load_checkpoint(ckpt_files[0])
        elif os.path.isfile(config.pre_trained):
            param_dict = load_checkpoint(config.pre_trained)
        else:
            print(f"Invalid pre_trained {config.pre_trained} parameter.")
    return param_dict
...
@moxing_wrapper()
def train_net():
    """train net"""
    target = config.device_target
    set_parameter()
    ckpt_param_dict = load_pre_trained_checkpoint()
    dataset = create_dataset(dataset_path=config.data_path, do_train=True, repeat_num=1,
                             batch_size=config.batch_size, train_image_size=config.train_image_size,
                             eval_image_size=config.eval_image_size, target=target,
                             distribute=config.run_distribute)
    step_size = dataset.get_dataset_size()
...
    time_cb = TimeMonitor(data_size=step_size)
    loss_cb = LossCallBack(config.has_trained_epoch)
    cb = [time_cb, loss_cb]
    ckpt_save_dir = set_save_ckpt_dir()
    if config.save_checkpoint:
        ckpt_append_info = [{"epoch_num": config.has_trained_epoch, "step_num": config.has_trained_step}]
        config_ck = CheckpointConfig(save_checkpoint_steps=config.save_checkpoint_epochs * step_size,
                                     keep_checkpoint_max=config.keep_checkpoint_max,
                                     append_info=ckpt_append_info)
        ckpt_cb = ModelCheckpoint(prefix="resnet", directory=ckpt_save_dir, config=config_ck)
        cb += [ckpt_cb]
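
The directory branch of load_pre_trained_checkpoint() above resumes from the most recently modified checkpoint file. The following is a minimal standalone sketch of that selection step, using only the standard library so that it can be tried without MindSpore; `newest_checkpoint` is a hypothetical helper, and the path in the usage line is an assumption based on the example configuration.

```python
import glob
import os


def newest_checkpoint(ckpt_dir):
    """Return the most recently modified *.ckpt file in ckpt_dir, or None if there is none."""
    ckpt_files = glob.glob(os.path.join(ckpt_dir, "*.ckpt"))
    if not ckpt_files:
        return None
    # Equivalent to sorting by modification time in descending order and taking the first entry.
    return max(ckpt_files, key=os.path.getmtime)


if __name__ == "__main__":
    # Example directory (assumption); adjust to your own checkpoint location.
    print(newest_checkpoint("/cache/train/checkpoint/ckpt_0"))
```

Selecting by modification time means that a stale file copied into the directory later would be chosen over a newer training state, so it is worth keeping that directory reserved for checkpoints written by the job itself.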
