Invalid RankTable Configuration (EI0004)
Symptom
The execution log reports error EI0004, "The ranktable or rank is invalid", as shown below.
custom_group :None
dtype :float32
data :1
iter :1
profiling :false
pid :1040540
[2024-04-24 06:31:38.087571: F ge_plugin.cc:338] [GePlugin] Initialize ge failed, ret :failed
Error Message is:
EI0004: 2024-04-24-06:31:33.915.634 The ranktable or rank is invalid, Reason:[The ip in ranktable is not a valid ip address]. Please check the configured ranktable. [The ranktable path configured in the training can be found in the plogs.]
Solution: Try again with a valid cluster configuration in the ranktable file. Ensure that the configuration matches the operating environment.
TraceBack (most recent call last):
PluginManager InvokeAll failed. [FUNC:Initialize][FILE:ops_kernel_manager.cc][LINE:89]
OpsManager initialize failed. [FUNC:InnerInitialize][FILE:gelib.cc][LINE:241]
GELib::InnerInitialize failed. [FUNC:Initialize][FILE:gelib.cc][LINE:169]
GEInitialize failed.[FUNC:GEInitialize][FILE:ge_api.cc][LINE:307]
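Cause Analysis
This error typically occurs while the rank id or the ranktable is being validated. The common causes are:
- The rank id value is not as expected (an illegally large value, a non-numeric value, or a value that exceeds the number of ranks in the ranktable), or it duplicates a rank id that already exists on the same server.
- The ranktable format or parameters are wrong: incorrect version, incorrect path, incorrect format, a starting rank other than 0, an incorrect number of ranks within a server, or parameters such as server_count, server_list, super_pod_id, server_id, server_index, and ip that are empty or misconfigured.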
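As a rough illustration of the checks listed above, the following Python sketch validates a ranktable that has already been loaded into a dict. It is not the HCCL validation logic. The key names device, device_ip, and rank_id, the 0-based contiguous rank numbering, and the function names check_ranktable and parse_rank are assumptions made for this example; adapt them to the ranktable version you actually use.

```python
import ipaddress


def parse_rank(value):
    """Return value as a non-negative int, or None if it is not one."""
    try:
        number = int(str(value))
    except ValueError:
        return None
    return number if number >= 0 else None


def check_ranktable(table, rank_id):
    """Return a list of problems found; an empty list means none were found.

    Assumed layout (hypothetical, for illustration only):
    {"server_count": "1",
     "server_list": [{"server_id": "...",
                      "device": [{"device_ip": "...", "rank_id": "0"}]}]}
    """
    problems = []
    servers = table.get("server_list") or []
    if not servers:
        problems.append("server_list is empty or missing")
    # server_count should agree with the number of entries in server_list.
    if str(table.get("server_count", "")) != str(len(servers)):
        problems.append("server_count does not match the length of server_list")

    rank_ids = []
    for server in servers:
        if not server.get("server_id"):
            problems.append("server_id is empty")
        for device in server.get("device") or []:
            ip = str(device.get("device_ip", ""))
            try:
                ipaddress.ip_address(ip)
            except ValueError:
                problems.append(f"invalid ip: {ip!r}")
            rank = parse_rank(device.get("rank_id", ""))
            if rank is None:
                problems.append(f"invalid rank_id in ranktable: {device.get('rank_id')!r}")
            else:
                rank_ids.append(rank)

    # Assumption: ranks start at 0, are unique, and have no gaps.
    if sorted(rank_ids) != list(range(len(rank_ids))):
        problems.append("rank ids must be unique and run from 0 to N-1")

    # The rank id of the current process must be numeric and within range.
    own_rank = parse_rank(rank_id)
    if own_rank is None:
        problems.append(f"rank id is not a non-negative number: {rank_id!r}")
    elif own_rank >= len(rank_ids):
        problems.append("rank id exceeds the number of ranks in the ranktable")
    return problems
```

Running a check like this before launching a job can surface a malformed IP address or a duplicated rank id earlier than GE initialization does, but the plog remains the authoritative source for the actual failure reason.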
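Solution
The "Reason" field in the execution log usually points to the direction of the problem. Based on that reason, check the rank id or the ranktable that was passed in the configuration to pinpoint the fault.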
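When many logs have to be scanned, pulling out the error code and the "Reason" text can be automated. The sketch below is one possible way to do it with a regular expression; the pattern is inferred only from the sample log in this topic, and the function name extract_reason is hypothetical, so other log layouts may need a different pattern.

```python
import re

# Matches an error code such as "EI0004:" followed later by "Reason:[...]".
# DOTALL lets the match continue if the message is wrapped across lines.
REASON_PATTERN = re.compile(r"(EI\d+):.*?Reason:\[(.*?)\]", re.DOTALL)


def extract_reason(log_text):
    """Return (error_code, reason) from log text, or None if no match is found."""
    match = REASON_PATTERN.search(log_text)
    if match is None:
        return None
    return match.group(1), match.group(2)


sample = (
    "EI0004: 2024-04-24-06:31:33.915.634 The ranktable or rank is invalid, "
    "Reason:[The ip in ranktable is not a valid ip address]. "
    "Please check the configured ranktable."
)
print(extract_reason(sample))
```

For the sample message in this topic, the extracted reason identifies an invalid IP address, so the ip entries in the ranktable are the first thing to check.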