Log/Summary

Background

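The Log and Summary operators are offloaded to the Device side for execution. To capture Device-side Log/Summary information and send the information for the corresponding step back to the Host side for viewing, modify the training script as described in this section.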

Printing Log Information

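In Estimator mode, the system starts a dequeue thread when Log information is sent back to the Host, so Device-side Log information is printed directly and the training script needs no modification: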

print_op = tf.print(loss)
with tf.control_dependencies([print_op]):
    train_op = xxx   # the print op must depend on a node that is actually executed in the graph; otherwise the print node has no effect

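In sess.run mode, no dequeue thread is started when Log information is sent back to the Host. Add the following code to start a dedicated dequeue thread that retrieves the buffered Log information: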

import sys
from threading import Thread

import tensorflow as tf
from npu_bridge.npu_init import *

def dequeue():
    tf.reset_default_graph()
    # Dequeue op that reads buffered Log records from the "_npu_log" channel
    outfeed_log_tensors = npu_ops.outfeed_dequeue_op(
            channel_name="_npu_log",
            output_types=[tf.string],
            output_shapes=[()])
    dequeue_ops = tf.print(outfeed_log_tensors, sys.stderr)
    with tf.Session() as sess:      # either reuse the training session or start a separate one
        i = 0
        while i < max_train_steps:    # max_train_steps is the maximum number of iterations
            sess.run(dequeue_ops)
            i = i + 1

t1 = Thread(target=dequeue)
t1.start()
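
The dequeue thread above follows an ordinary producer/consumer pattern: the training loop produces one Log record per step, and a separate thread drains the buffer for a fixed number of steps. The following is a minimal sketch of that pattern in plain Python only. It does not call the NPU API; `log_queue`, `drain_logs`, and `received` are hypothetical names that stand in for the Device-side log channel, the dequeue function, and the Host-side output.

```python
import queue
import threading

def drain_logs(log_queue, max_train_steps, sink):
    """Take one buffered log record per training step and hand it to sink."""
    for _ in range(max_train_steps):
        record = log_queue.get()   # blocks until a record is available
        sink(record)

# Hypothetical stand-in for the Device-side log channel.
log_queue = queue.Queue()
received = []

# Start the consumer before training, as the dequeue thread is started above.
consumer = threading.Thread(target=drain_logs, args=(log_queue, 3, received.append))
consumer.start()

# Stand-in for the training loop: each step enqueues one log record.
for step in range(3):
    log_queue.put("step %d: loss logged" % step)

consumer.join()
print(received)
```

Because the consumer loops exactly `max_train_steps` times, it exits on its own once every step's record has been read, which is why the loop bound in the dequeue function should match the number of training iterations.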

Then, during training, print Log information with the Assert or Print operator:

print_op = tf.print(loss)
with tf.control_dependencies([print_op]):
    train_op = xxx   # the print op must depend on a node that is actually executed in the graph; otherwise the print node has no effect

Printing Summary Information

In sess.run mode, sending Summary information back to the Host side for viewing is not yet supported.

In Estimator mode, first define a host_call function that contains the Summary information to be collected.
from tensorflow.contrib import summary

def _host_call_fn(gs, loss):
    with summary.create_file_writer(
            "./model", max_queue=1000).as_default():
        # Record a summary at every step
        with summary.always_record_summaries():
        # To record once every 2000 steps instead, use:
        # with summary.record_summaries_every_n_global_steps(2000, global_step=gs):
            summary.scalar("host_call_loss", loss, step=gs)
            return summary.all_summary_ops()
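
To make the difference between the two recording modes concrete, the sketch below lists which steps would be recorded when saving every 2000 steps. It is plain Python, not TensorFlow, and it assumes the "every n global steps" condition is `global_step % n == 0`, as in `tf.contrib.summary`; `should_record` is a hypothetical helper name.

```python
def should_record(global_step, n):
    """Assumed condition: record only when the step is a multiple of n."""
    return global_step % n == 0

# Steps 0..6000 with n=2000; always_record_summaries corresponds to n=1.
recorded = [s for s in range(0, 6001) if should_record(s, 2000)]
every_step = [s for s in range(0, 5) if should_record(s, 1)]
print(recorded)
print(every_step)
```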

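Then pass host_call to the NPUEstimatorSpec constructor. The system starts an enqueue thread when the Summary operator is offloaded to the Device side for execution, and a dequeue thread when Summary information is sent back to the Host. Together they capture Device-side Summary information and send the information for every step, or every N steps, back to the Host side for viewing.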

host_call is a tuple consisting of a function and a list or dictionary of tensors; the function returns a list of tensors. It currently applies to train() and evaluate().

from npu_bridge.npu_init import *

host_call = (_host_call_fn, [global_step, loss])
return NPUEstimatorSpec(mode=tf.estimator.ModeKeys.TRAIN, loss=loss, train_op=train_op, host_call=host_call)
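
The shape of the host_call tuple can be illustrated without TensorFlow. The sketch below shows one plausible way a (function, tensors) pair is unpacked and invoked. It assumes list entries are passed positionally and dictionary entries by keyword, which mirrors the TPUEstimator convention and is not confirmed here for NPUEstimator. `call_host_fn` and `_demo_host_call_fn` are hypothetical names, and plain numbers stand in for tensors.

```python
def call_host_fn(host_call):
    """Unpack a (function, tensors) pair and invoke the function on the tensors."""
    fn, tensors = host_call
    if isinstance(tensors, dict):
        return fn(**tensors)    # assumption: dictionary entries are passed by keyword
    return fn(*tensors)         # assumption: list entries are passed by position

def _demo_host_call_fn(gs, loss):
    # Stand-in for a summary function: returns a list describing what would be recorded.
    return [("host_call_loss", gs, loss)]

list_result = call_host_fn((_demo_host_call_fn, [10, 0.25]))
dict_result = call_host_fn((_demo_host_call_fn, {"gs": 10, "loss": 0.25}))
print(list_result)
print(dict_result)
```

Under these assumptions, the order of the tensor list must match the parameter order of the function, so `[global_step, loss]` pairs with `_host_call_fn(gs, loss)`.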

Complete code example:

from npu_bridge.npu_init import *

# Define a host_call function
from tensorflow.contrib import summary
def _host_call_fn(gs, loss):
    with summary.create_file_writer(
            "./model", max_queue=1000).as_default():
        with summary.always_record_summaries():
            summary.scalar("host_call_loss", loss, step=gs)
            return summary.all_summary_ops()

def input_fn():
    # Build the dataset (omitted)

# Use host_call in model_fn to capture the information to be viewed
def model_fn():
    # Build the forward and backward model (omitted)
    model = ***
    loss = ***
    optimizer = tf.train.MomentumOptimizer(learning_rate=c, momentum=0.9)
    global_step = tf.train.get_or_create_global_step()
    grad_vars = optimizer.compute_gradients(loss)
    minimize_op = optimizer.apply_gradients(grad_vars, global_step)
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    train_op = tf.group(minimize_op, update_ops)
    host_call = (_host_call_fn, [global_step, loss])
    return NPUEstimatorSpec(mode=tf.estimator.ModeKeys.TRAIN, loss=loss, train_op=train_op, host_call=host_call)

run_config = NPURunConfig()

classifier = NPUEstimator(model_fn=model_fn, config=run_config, params={ })
classifier.train(input_fn=lambda: input_fn(), max_steps=1000)