Training Fails Because Variable Memory Exceeds Its Limit
2022/07/26
Problem Information
| Source | Product Category | Product Subcategory | Keywords |
|---|---|---|---|
| Official | Model training | TensorFlow | Network batch size, network scale, out of memory |
Symptom
When the network batch size or the network scale is set too large, training fails with an out-of-memory error:
    [ERROR] GE(179560,python3.7):2020-10-31-11:06:40.656.258 [graphengine/ge/graph/manager/graph_var_manager.cc:285]182539 AssignVarMem: ErrorNo: 1343225857(Parameter's invalid!) Out of memory : current var size[5382237696] exceeds total var size[5368709120]
    [ERROR] GE(179560,python3.7):2020-10-31-11:06:40.656.374 [graphengine/ge/graph/manager/graph_var_manager.cc:504]182539 AssignVarMem: ErrorNo: 1343225860(Internal errors) AssignVarMem by offset failed.
    [ERROR] GE(179560,python3.7):2020-10-31-11:06:40.656.420 [graphengine/ge/graph/build/memory/var_mem_assign_util.cc:65]182539 AssignStaticMemory2Node: ErrorNo: -1(failed)
    [ERROR] GE(179560,python3.7):2020-10-31-11:06:40.669.315 [graphengine/ge/graph/build/memory/memory_assigner.cc:27]182539 AssignMemory: ErrorNo: -1(failed) Memory assigner failed
    [ERROR] GE(179560,python3.7):2020-10-31-11:06:40.669.401 [graphengine/ge/graph/build/model_builder.cc:722]182539 BuildModelForGetTask: ErrorNo: -1(failed) Assign Memory Failed!
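To confirm that a failure is this variable-memory case and to see how far over the limit the job is, the two sizes can be pulled out of the first log line. The following is a minimal sketch: the names `parse_var_oom` and `VAR_OOM` are hypothetical, and the pattern assumes the message wording shown in this log, which may differ in other GE versions.

```python
import re

# Message part of the first log line above (timestamp and file prefix omitted).
LOG_LINE = (
    "AssignVarMem: ErrorNo: 1343225857(Parameter's invalid!) Out of memory : "
    "current var size[5382237696] exceeds total var size[5368709120]"
)

# Assumed pattern, taken from the wording of this particular log.
VAR_OOM = re.compile(r"current var size\[(\d+)\] exceeds total var size\[(\d+)\]")

GIB = 1024 ** 3


def parse_var_oom(line):
    """Return (requested_bytes, limit_bytes), or None if the line does not match."""
    match = VAR_OOM.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


requested, limit = parse_var_oom(LOG_LINE)
print(f"requested: {requested / GIB:.3f} GiB")
print(f"limit:     {limit / GIB:.3f} GiB")
print(f"overshoot: {(requested - limit) / 2**20:.1f} MiB")
```

Here the limit is exactly 5 × 1024³ bytes, so the sizes in this log are counted in units of 1024³ bytes.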
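Cause Analysis
By default, the framework manages variable memory and graph memory as separate pools: 5 GB for variables and 26 GB for graph memory. In the log above, the variables need 5382237696 bytes, which is more than the 5368709120-byte (5 GB) default. When variable memory exceeds its limit, both sizes can be adjusted manually, but the variable and graph memory together must not exceed 31 GB.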
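The arithmetic behind this constraint can be worked through for the job in the error log. This is only an illustrative sketch: rounding the variable pool up to a whole number of 1024³-byte units is a choice made here, not a framework requirement, and the constant names are hypothetical.

```python
import math

GIB = 1024 ** 3
TOTAL_LIMIT = 31 * GIB       # combined cap on variable + graph memory
DEFAULT_VARIABLE = 5 * GIB   # default variable memory
DEFAULT_GRAPH = 26 * GIB     # default graph memory

# "current var size" reported in the error log
requested_variable = 5382237696

# Smallest whole-unit variable pool that holds the requested size (assumed rounding policy)
min_variable = math.ceil(requested_variable / GIB) * GIB

# Whatever remains under the combined cap is the most the graph pool can get
max_graph = TOTAL_LIMIT - min_variable

print("variable pool:", min_variable // GIB)
print("graph pool:   ", max_graph // GIB)
```

Because the defaults already add up to the full 31 GB, any increase in variable memory has to be paid for by an equal reduction in graph memory.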
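Solution
Set graph_memory_max_size and variable_memory_max_size explicitly to resize the graph and variable memory pools. Both values are byte counts passed as strings, for example: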
    from npu_bridge.npu_init import *
    import tensorflow as tf
    from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

    config = tf.ConfigProto()
    custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
    custom_op.name = "NpuOptimizer"
    # 16 GB of graph memory plus 15 GB of variable memory stays within the 31 GB total.
    custom_op.parameter_map["graph_memory_max_size"].s = tf.compat.as_bytes(str(16 * 1024 * 1024 * 1024))
    custom_op.parameter_map["variable_memory_max_size"].s = tf.compat.as_bytes(str(15 * 1024 * 1024 * 1024))
    # Turn off the TensorFlow remapping and memory-optimization rewrites.
    config.graph_options.rewrite_options.remapping = RewriterConfig.OFF
    config.graph_options.rewrite_options.memory_optimization = RewriterConfig.OFF

    with tf.Session(config=config) as sess:
        sess.run(...)
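Since the two sizes must be chosen together, it can help to derive them from one number and reject splits that break the 31 GB cap before a session is built. The sketch below uses a hypothetical helper, `npu_memory_params`, that is not part of npu_bridge. It uses only the standard library and encodes the values with `str(...).encode("utf-8")`, which for these ASCII digit strings should give the same bytes as `tf.compat.as_bytes(str(...))`.

```python
GIB = 1024 ** 3
TOTAL_LIMIT_GIB = 31  # combined cap on variable + graph memory


def npu_memory_params(variable_gib, graph_gib=None):
    """Hypothetical helper: build byte-string values for the two memory options.

    If graph_gib is omitted, the graph pool receives everything left under the cap.
    """
    if graph_gib is None:
        graph_gib = TOTAL_LIMIT_GIB - variable_gib
    if variable_gib <= 0 or graph_gib <= 0:
        raise ValueError("both memory pools must be positive")
    if variable_gib + graph_gib > TOTAL_LIMIT_GIB:
        raise ValueError(
            f"variable ({variable_gib}) + graph ({graph_gib}) exceeds {TOTAL_LIMIT_GIB}"
        )
    return {
        "graph_memory_max_size": str(graph_gib * GIB).encode("utf-8"),
        "variable_memory_max_size": str(variable_gib * GIB).encode("utf-8"),
    }


# Same 16 + 15 split as the configuration above
params = npu_memory_params(variable_gib=15)
print(params)
```

With npu_bridge available, each entry could then be assigned with `custom_op.parameter_map[key].s = value` in place of the two hard-coded lines.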