Migrating Estimator Distributed Scripts
For an Estimator-based distributed script, the migration tool can migrate it directly into a distributed script. If the user's original script is a single-device training script, however, the script produced by the migration tool cannot run distributed training as-is; the user can add distributed support with a small number of manual modifications on top of the migrated script.
Single-device script produced by the migration tool:
def cnn_model_fn(features, labels, mode):
    # Build the network
    xxx
    # Compute the loss
    xxx
    # Configure the Training Op (for TRAIN mode)
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)  # SGD optimizer
        train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())  # Minimize the loss
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

...
hook = hk._LoggerHook(FLAGS)
training_hooks = []
training_hooks.append(hook)
...
estimator.train(train_data_fn, max_steps=num_steps // rank_size, hooks=training_hooks)
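Note that rank_size in the estimator.train call above is not defined by the snippet itself; in Ascend jobs the device count and rank are conventionally supplied through environment variables set by the launcher. A minimal sketch, assuming the usual RANK_SIZE/RANK_ID variable names (RANK_ID is also read by the broadcast hook in method 1 below; RANK_SIZE is an assumption here):

import os

# Assumption: the distributed launcher exports RANK_SIZE and RANK_ID.
# The defaults fall back to single-device values.
rank_size = int(os.getenv('RANK_SIZE', '1'))  # total number of devices in the job
rank_id = int(os.getenv('RANK_ID', '0'))      # rank of this process within the job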
Manually modified script with distributed training support (method 1):
def cnn_model_fn(features, labels, mode):
    # Build the network
    xxx
    # Compute the loss
    xxx
    # Configure the Training Op (for TRAIN mode)
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
        optimizer = npu_distributed_optimizer_wrapper(optimizer)  # Gradient update across devices
        train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

...
hook = hk._LoggerHook(FLAGS)
training_hooks = []
training_hooks.append(hook)
training_hooks.append(NPUBroadcastGlobalVariablesHook(0, int(os.getenv('RANK_ID', '0'))))  # Broadcast variables
...
estimator.train(train_data_fn, max_steps=num_steps, hooks=training_hooks)
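Method 1 makes two changes: npu_distributed_optimizer_wrapper wraps the optimizer so that each device's gradients are aggregated across devices before being applied, and NPUBroadcastGlobalVariablesHook broadcasts the initial values of the global variables from rank 0 so every device starts from identical weights. A minimal sketch of the surrounding setup, assuming the helpers are exposed through the TF Adapter's npu_bridge package via its star import (the exact module layout may differ across CANN versions):

import os
import tensorflow as tf
# Assumption: the TF Adapter exposes the NPU helpers through npu_init;
# verify against the npu_bridge version you have installed.
from npu_bridge.npu_init import *

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
optimizer = npu_distributed_optimizer_wrapper(optimizer)  # aggregate gradients across devices

# Broadcast global variables from root rank 0; this process's rank comes from RANK_ID.
broadcast_hook = NPUBroadcastGlobalVariablesHook(0, int(os.getenv('RANK_ID', '0')))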
Manually modified script with distributed training support (method 2):
def cnn_model_fn(features, labels, mode):
    # Build the network
    xxx
    # Compute the loss
    xxx
    # Configure the Training Op (for TRAIN mode)
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
        optimizer = npu_distributed_optimizer_wrapper(optimizer)  # Gradient update across devices
        train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

...
hook = hk._LoggerHook(FLAGS)
training_hooks = []
training_hooks.append(hook)
...
estimator.train(train_data_fn, max_steps=num_steps, hooks=npu_hooks_append(training_hooks))  # Broadcast variables
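Method 2 replaces the explicit broadcast hook with npu_hooks_append, which returns the user's hook list with the NPU-required hooks (including variable broadcast) appended. A conceptual sketch of its effect, assuming only the broadcast behavior described by the comment above; the real implementation in npu_bridge may append additional hooks:

import os

def npu_hooks_append_sketch(hooks_list=None):
    # Hypothetical stand-in for npu_hooks_append (same import as the method 1
    # sketch above): returns the hook list with the rank-0 variable-broadcast
    # hook appended, so the user does not have to add it manually.
    hooks_list = list(hooks_list) if hooks_list else []
    hooks_list.append(NPUBroadcastGlobalVariablesHook(0, int(os.getenv('RANK_ID', '0'))))
    return hooks_list

Method 2 is the less invasive option: outside cnn_model_fn, the only change is at the estimator.train call site.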
Parent topic: Distributed Training Script Migration