Manual Migration for Single-Device Training

The manual migration example in this document is based on the ImageNet training script from the official PyTorch examples, using PyTorch 1.8.1 as an example.

  1. Import the torch_npu module in the main.py script.

    import torch
    import torch_npu
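
    A quick sanity check (not part of the original script) that can be run right after these two imports to confirm the NPU backend is registered and a device is visible:

    # After the two imports above, the NPU backend can be queried.
    print(torch_npu.npu.is_available())    # True if at least one NPU is present
    print(torch_npu.npu.device_count())    # number of NPUs on this node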

  2. Change the GPU count of the node to the NPU count.

    Code location: the main() function in main.py.

    Original code:

    if torch.cuda.is_available():
        ngpus_per_node = torch.cuda.device_count()
    else:
        ngpus_per_node = 1

    Modified code:

    if torch_npu.npu.is_available():
        ngpus_per_node = torch_npu.npu.device_count()
    else:
        ngpus_per_node = 1
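
    The NPU count is then consumed exactly where the GPU count was, for example when the per-device batch size is derived in step 3. A brief illustration, using an assumed global batch size of 256 instead of the script's args.batch_size:

    import torch_npu

    ngpus_per_node = torch_npu.npu.device_count() if torch_npu.npu.is_available() else 1
    # Mirror the args.batch_size adjustment shown in step 3 for an assumed global batch of 256.
    per_device_batch_size = 256 // ngpus_per_node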

  3. Migrate the model and the loss function to the Ascend 910 AI Processor for computation, replacing all CUDA interfaces with NPU interfaces.

    Original code:

    if not torch.cuda.is_available() and not torch.backends.mps.is_available():
        print('using CPU, this will be slow')
    elif args.distributed:
        # For multiprocessing distributed, DistributedDataParallel constructor
        # should always set the single device scope, otherwise,
        # DistributedDataParallel will use all available devices.
        if torch.cuda.is_available():
            if args.gpu is not None:
                torch.cuda.set_device(args.gpu)
                model.cuda(args.gpu)
                # When using a single GPU per process and per
                # DistributedDataParallel, we need to divide the batch size
                # ourselves based on the total number of GPUs of the current node.
                args.batch_size = int(args.batch_size / ngpus_per_node)
                args.workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node)
                model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
            else:
                model.cuda()
                # DistributedDataParallel will divide and allocate batch_size to all
                # available GPUs if device_ids are not set
                model = torch.nn.parallel.DistributedDataParallel(model)
    elif args.gpu is not None and torch.cuda.is_available():
        torch.cuda.set_device(args.gpu)
        model = model.cuda(args.gpu)
    #elif torch.backends.mps.is_available():
        #device = torch.device("mps")
        #model = model.to(device)
    else:
        # DataParallel will divide and allocate batch_size to all available GPUs
        if args.arch.startswith('alexnet') or args.arch.startswith('vgg'):
            model.features = torch.nn.DataParallel(model.features)
            model.cuda()
        else:
            model = torch.nn.DataParallel(model).cuda()

    if torch.cuda.is_available():
        if args.gpu:
            device = torch.device('cuda:{}'.format(args.gpu))
        else:
            device = torch.device("cuda")
    #elif torch.backends.mps.is_available():
        #device = torch.device("mps")
    else:
        device = torch.device("cpu")
    # define loss function (criterion), optimizer, and learning rate scheduler
    criterion = nn.CrossEntropyLoss().to(device)
    Modified code:

    if not torch_npu.npu.is_available() and not torch.backends.mps.is_available():
        print('using CPU, this will be slow')
    elif args.distributed:
        # For multiprocessing distributed, DistributedDataParallel constructor
        # should always set the single device scope, otherwise,
        # DistributedDataParallel will use all available devices.
        if torch_npu.npu.is_available():
            if args.gpu is not None:
                torch_npu.npu.set_device(args.gpu)
                model.npu(args.gpu)
                # When using a single GPU per process and per
                # DistributedDataParallel, we need to divide the batch size
                # ourselves based on the total number of GPUs of the current node.
                args.batch_size = int(args.batch_size / ngpus_per_node)
                args.workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node)
                model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
            else:
                model.npu()
                # DistributedDataParallel will divide and allocate batch_size to all
                # available GPUs if device_ids are not set
                model = torch.nn.parallel.DistributedDataParallel(model)
    elif args.gpu is not None and torch_npu.npu.is_available():
        torch_npu.npu.set_device(args.gpu)
        model = model.npu(args.gpu)
    #elif torch.backends.mps.is_available():
        #device = torch.device("mps")
        #model = model.to(device)
    else:
        # DataParallel will divide and allocate batch_size to all available GPUs
        if args.arch.startswith('alexnet') or args.arch.startswith('vgg'):
            model.features = torch.nn.DataParallel(model.features)
            model.npu()
        else:
            model = torch.nn.DataParallel(model).npu()

    if torch_npu.npu.is_available():
        if args.gpu:
            device = torch.device('npu:{}'.format(args.gpu))
        else:
            device = torch.device("npu")
    #elif torch.backends.mps.is_available():
        #device = torch.device("mps")
    else:
        device = torch.device("cpu")
    # define loss function (criterion), optimizer, and learning rate scheduler
    criterion = nn.CrossEntropyLoss().to(device)
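
    With the distributed branches stripped away, the single-device pattern above reduces to choosing an NPU device, moving the model onto it, and creating the loss function on the same device. A condensed sketch (the ResNet-18 model and device index 0 are assumptions, not taken from the original script):

    import torch
    import torch.nn as nn
    import torch_npu
    import torchvision.models as models

    # Pick the target device; fall back to the CPU when no NPU is present.
    if torch_npu.npu.is_available():
        torch_npu.npu.set_device(0)              # bind this process to NPU 0 (assumed index)
        device = torch.device('npu:0')
    else:
        device = torch.device('cpu')

    model = models.resnet18().to(device)          # model weights now live on the chosen device
    criterion = nn.CrossEntropyLoss().to(device)  # keep the loss function on the same device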

  4. Replace the interface and device used for resuming training from a checkpoint with NPU.

    Original code:

    if args.gpu is None:
        checkpoint = torch.load(args.resume)
    elif torch.cuda.is_available():
        # Map model to be loaded to specified single gpu.
        loc = 'cuda:{}'.format(args.gpu)
        checkpoint = torch.load(args.resume, map_location=loc)

    Modified code:

    if args.gpu is None:
        checkpoint = torch.load(args.resume)
    elif torch_npu.npu.is_available():
        # Map model to be loaded to specified single npu.
        loc = 'npu:{}'.format(args.gpu)
        checkpoint = torch.load(args.resume, map_location=loc)
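
    The same map_location rule applies regardless of which device produced the checkpoint: loading with an 'npu:N' location remaps every stored tensor onto the local NPU. A minimal save-and-resume sketch (the stand-in model and file name are assumptions):

    import torch
    import torch.nn as nn
    import torch_npu

    model = nn.Linear(4, 2).npu()                 # small stand-in model living on the NPU

    # Save a checkpoint as the training loop would.
    torch.save({'epoch': 0, 'state_dict': model.state_dict()}, 'checkpoint.pth.tar')

    # Resume: map_location moves the saved tensors onto NPU 0 while loading.
    checkpoint = torch.load('checkpoint.pth.tar', map_location='npu:0')
    model.load_state_dict(checkpoint['state_dict'])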

  5. Migrate the dataset to the Ascend 910 AI Processor for computation.

    Code location: the run_validate() function inside the validate() function in main.py.

    In the original code, the dataset is loaded and computed on the GPU. Original code:

    if args.gpu is not None and torch.cuda.is_available():
        images = images.cuda(args.gpu, non_blocking=True)
        #if torch.backends.mps.is_available():
            #images = images.to('mps')
            #target = target.to('mps')
        if torch.cuda.is_available():
            target = target.cuda(args.gpu, non_blocking=True)

    Migrate the dataset to the NPU for computation. Modified code:

    if args.gpu is not None and torch_npu.npu.is_available():
        images = images.npu(args.gpu, non_blocking=True)
        #if torch.backends.mps.is_available():
            #images = images.to('mps')
            #target = target.to('mps')
        if torch_npu.npu.is_available():
            target = target.npu(args.gpu, non_blocking=True)
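
    The same movement is needed in the training loop: each batch from the DataLoader arrives in host memory and must be copied to the NPU before the forward pass; non_blocking=True lets the copy overlap with computation when the loader uses pin_memory=True. A condensed sketch in which train_loader, model, criterion and args follow the sample script and are assumed to exist:

    for images, target in train_loader:
        if torch_npu.npu.is_available():
            images = images.npu(args.gpu, non_blocking=True)   # asynchronous host-to-NPU copy
            target = target.npu(args.gpu, non_blocking=True)
        output = model(images)
        loss = criterion(output, target)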

  6. Replace the CUDA interface and GPU device in the loss-averaging function with NPU.

    Code location: the all_reduce() function in class AverageMeter(object).

    Original code:

    def all_reduce(self):
        if torch.cuda.is_available():
            device = torch.device("cuda")
            ……

    Modified code:

    def all_reduce(self):
        if torch_npu.npu.is_available():
            device = torch.device("npu")
            ……
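
    The "……" stands for the rest of the function, which is left unchanged by the migration. For orientation only, a sketch of how the complete NPU-side all_reduce() can look, modeled on the loss-averaging logic of the public PyTorch ImageNet example; the lines after the device selection are an assumption:

    import torch
    import torch.distributed as dist
    import torch_npu

    def all_reduce(self):
        if torch_npu.npu.is_available():
            device = torch.device("npu")
        else:
            device = torch.device("cpu")
        # Reduce sum and count across all workers, then recompute the average locally.
        total = torch.tensor([self.sum, self.count], dtype=torch.float32, device=device)
        dist.all_reduce(total, dist.ReduceOp.SUM, async_op=False)
        self.sum, self.count = total.tolist()
        self.avg = self.sum / self.count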