HBM Bit ECC Fault
Symptom
The device slog log (report/*/slog/dev-os-id/[run|debug]/device-os/device-os_*.log) contains the keyword "[fault_manager] event_id: [0x80E01809]".
The device black-box log (under the report/*/hisi_logs directory) contains the error keyword "Hardware Error":
/hisi_logs/device-2/20240420162417-271713000/bbox/kbox.txt:2014[2454.830692] {6}[Hardware Error] Hardware error from APEI Generic Hardware Error Source: 0
/hisi_logs/device-2/20240420162417-271713000/bbox/kbox.txt:2015[2454.830693] {6}[Hardware Error]event severity: recoverable
/hisi_logs/device-2/20240420162417-271713000/bbox/kbox.txt:2016[2454.830694] {6}[Hardware Error] Error 0, type: recoverable
/hisi_logs/device-2/20240420162417-271713000/bbox/kbox.txt:2017[2454.830696] {6}[Hardware Error] section_type: memory error
/hisi_logs/device-2/20240420162417-271713000/bbox/kbox.txt:2018[2454.830697] {6}[Hardware Error] physical_address: 0x0000101efe36d3c0
/hisi_logs/device-2/20240420162417-271713000/bbox/kbox.txt:2019[2454.830699] {6}[Hardware Error] node: 2 card: 259 module: 51 rank: 1 bank: 10 row: 30691 column: 56
/hisi_logs/device-2/20240420162417-271713000/bbox/kbox.txt:2020[2454.830701] {6}[Hardware Error] error_type: 3, multi-bit ECC
/hisi_logs/device-2/20240420162417-271713000/bbox/kbox.txt:2024[2454.830702] {6}[Hardware Error] DIMM location: not present. DMI handle: 0x0000
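The two keyword checks described above can be scripted. The following is a minimal sketch, not an official tool: the function names `find_matches` and `scan_report` are hypothetical, and the glob patterns are assumptions derived from the paths quoted in this section (for example, `dev-os-*` stands in for `dev-os-id`). Adjust them to the actual layout of the collected report.

```python
import re
from pathlib import Path

# Keywords taken from the symptom description in this section.
EVENT_ID_PATTERN = re.compile(
    r"\[fault_manager\]\s*event_id:\s*\[0x80E01809\]", re.IGNORECASE
)
HW_ERROR_KEYWORD = "[Hardware Error]"


def find_matches(lines, source="<memory>"):
    """Return (source, line_number, text) for every line containing either keyword."""
    hits = []
    for number, text in enumerate(lines, start=1):
        if EVENT_ID_PATTERN.search(text) or HW_ERROR_KEYWORD in text:
            hits.append((source, number, text.rstrip("\n")))
    return hits


def scan_report(report_root):
    """Scan slog and black-box logs under a collected report directory.

    Assumption: report_root is the "report" directory referred to above,
    and the directory layout matches the glob patterns below.
    """
    root = Path(report_root)
    candidates = list(root.glob("*/slog/dev-os-*/*/device-os/device-os_*.log"))
    candidates += list(root.glob("*/hisi_logs/**/kbox.txt"))
    hits = []
    for path in candidates:
        # Logs may contain undecodable bytes; replace them instead of failing.
        with path.open(errors="replace") as handle:
            hits.extend(find_matches(handle, source=str(path)))
    return hits
```

Calling `scan_report("report")` returns one tuple per matching line, so an empty list means neither keyword was found in the files the patterns cover.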
Root Cause
The errors in the logs above indicate that an HBM bit ECC fault caused the AI Core ERROR.
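To back this diagnosis with the fields recorded in kbox.txt, the "[Hardware Error]" lines can be parsed into a structured record. The sketch below is illustrative only: `parse_memory_error` and `is_multi_bit_ecc` are hypothetical helper names, the field layout is assumed to match the sample log in this section, and all lines passed in are assumed to belong to a single error record.

```python
import re

# Matches the "{N}[Hardware Error]" prefix and captures the record number and payload.
RECORD_LINE = re.compile(r"\{(\d+)\}\[Hardware Error\]\s*(.*)")
# Matches "key: value" pairs such as "node: 2" or "row: 30691" on the location line.
LOCATION_FIELD = re.compile(r"(node|card|module|rank|bank|row|column):\s*(\d+)")


def parse_memory_error(lines):
    """Collect selected fields of one memory error record from kbox.txt lines."""
    record = {}
    for line in lines:
        match = RECORD_LINE.search(line)
        if not match:
            continue
        payload = match.group(2).strip()
        if payload.startswith("event severity:"):
            record["severity"] = payload.split(":", 1)[1].strip()
        elif payload.startswith("section_type:"):
            record["section_type"] = payload.split(":", 1)[1].strip()
        elif payload.startswith("physical_address:"):
            # int() with base 16 accepts the leading "0x" prefix.
            record["physical_address"] = int(payload.split(":", 1)[1].strip(), 16)
        elif payload.startswith("error_type:"):
            record["error_type"] = payload.split(":", 1)[1].strip()
        elif payload.startswith("node:"):
            record["location"] = {
                key: int(value) for key, value in LOCATION_FIELD.findall(payload)
            }
    return record


def is_multi_bit_ecc(record):
    """True when the record is a memory error flagged as multi-bit ECC."""
    return (
        record.get("section_type") == "memory error"
        and "multi-bit ECC" in record.get("error_type", "")
    )
```

For the sample log above, the parsed record carries `section_type` "memory error" and an `error_type` containing "multi-bit ECC", so `is_multi_bit_ecc` returns `True`. The `location` dictionary holds the node, card, module, rank, bank, row, and column values for reference.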
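Solution
Look up the Health Management Fault Definition (《健康管理故障定义》) for the corresponding version. It describes the HBM bit ECC fault as follows (only some key fields are listed):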
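Event ID | 0x80E01809 |
---|---|
Fault event name | Multi-bit ECC error detected by HBM memory die scrubbing. |
Fault description / possible causes | A multi-bit ECC error triggered by HBMC scrubbing (Patrol Scrubbing and Demand Scrubbing). Possible causes include partial failure of the HBM die and the HBM being unable to retain data properly. |
Fault impact | |
Fault self-handling mode | |
System handling suggestion | No action required. |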