Invalid RankTable Configuration (EI0004)

Symptom

The execution log reports error EI0004, "The ranktable or rank is invalid", as shown below.

custom_group :None
dtype :float32
data :1
iter :1
profiling :false
pid :1040540
[2024-04-24 06:31:38.087571: F ge_plugin.cc:338] [GePlugin] Initialize ge failed, ret :failed
Error Message is:
EI0004: 2024-04-24-06:31:33.915.634 The ranktable or rank is invalid, Reason:[The ip in ranktable is not a valid ip address]. Please check the configured ranktable. [The ranktable path configured in the training can be found in the plogs.]
        Solution: Try again with a valid cluster configuration in the ranktable file. Ensure that the configuration matches the operating environment.
        TraceBack (most recent call last):
        PluginManager InvokeAll failed. [FUNC:Initialize][FILE:ops_kernel_manager.cc][LINE:89]
        OpsManager initialize failed. [FUNC:InnerInitialize][FILE:gelib.cc][LINE:241]
        GELib::InnerInitialize failed. [FUNC:Initialize][FILE:gelib.cc][LINE:169]
        GEInitialize failed. [FUNC:GEInitialize][FILE:ge_api.cc][LINE:307]

Cause Analysis

This error usually occurs while the rank id or the ranktable is being validated. The common causes are:

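  1. The rank id value is not as expected (an illegally large value, a non-numeric value, or a value exceeding the number of ranks in the ranktable), or it duplicates a rank id already used within the same server.
  2. The ranktable format or parameters are wrong: wrong version, wrong path, wrong format, starting rank is not 0, incorrect number of ranks within a server, or parameters such as server_count, server_list, super_pod_id, server_id, server_index, and ip are empty or misconfigured.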
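Several of the checks listed above can be run offline before starting a job. The following is a minimal sketch, not the validation that the runtime itself performs. It assumes the ranktable has already been loaded from JSON into a dict, and that each entry of server_list carries a "device" list whose items have "device_ip" and "rank_id" fields. Those three field names and the function name check_ranktable are assumptions made for illustration; adjust them to the ranktable version in use.

```python
import ipaddress


def check_ranktable(table: dict) -> list[str]:
    """Return a list of human-readable problems found in a ranktable dict."""
    problems = []
    servers = table.get("server_list") or []
    if not servers:
        problems.append("server_list is empty or missing")

    # server_count should match the number of entries in server_list.
    try:
        if int(table.get("server_count")) != len(servers):
            problems.append("server_count does not match server_list length")
    except (TypeError, ValueError):
        problems.append("server_count is empty or not a number")

    rank_ids = []
    for i, server in enumerate(servers):
        if not server.get("server_id"):
            problems.append(f"server {i}: server_id is empty")
        # "device", "device_ip" and "rank_id" are assumed field names.
        for dev in server.get("device") or []:
            ip = dev.get("device_ip", "")
            try:
                ipaddress.ip_address(ip)
            except ValueError:
                problems.append(f"server {i}: invalid ip {ip!r}")
            try:
                rank_ids.append(int(dev.get("rank_id")))
            except (TypeError, ValueError):
                problems.append(f"server {i}: rank_id is empty or not a number")

    if len(rank_ids) != len(set(rank_ids)):
        problems.append("duplicate rank_id values")
    if rank_ids and min(rank_ids) != 0:
        problems.append("rank ids do not start at 0")
    return problems
```

An empty list means that none of these particular checks failed. It does not guarantee that the ranktable matches the actual cluster environment.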

Solution

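The "Reason" field in the execution log usually indicates where the problem lies. In the sample log, the Reason states that an ip in the ranktable is not a valid ip address. Based on this field, check the rank id or the ranktable passed in through the configuration to locate the faulty item.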
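When logs are long, the Reason field can be pulled out with a small script. The sketch below assumes the message layout seen in the sample log, that is, an error code of the form EI followed by four digits, then a colon, then "Reason:[...]" on the same line. Other messages may be formatted differently, and the helper name extract_reasons is hypothetical.

```python
import re

# Matches e.g. "EI0004: <timestamp> <message>, Reason:[<reason>]" on one line.
_ERROR_RE = re.compile(r"(EI\d{4}):.*?Reason:\[([^\]]*)\]")


def extract_reasons(log_text: str) -> list[tuple[str, str]]:
    """Return (error_code, reason) pairs found in the log text."""
    return _ERROR_RE.findall(log_text)
```

Each returned pair gives the error code and the text inside the Reason brackets, which can then be matched against the causes listed under Cause Analysis.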