
Pre-Training NPU Environment Check File

File Description

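  • Description: Before training starts, use hccn_tool to query and record each NPU network port's IP address, netmask, packet send/receive statistics, and link history statistics. Also before training starts, use npu-smi to query chip health information.
  • Naming constraint: npu_info_before.txt.
  • Storage path constraint: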

Collection Method

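  • Before training, use hccn_tool to query the environment of each NPU, and save each query command together with its output to the npu_info_before.txt file.
  • The commands and sample output are as follows:
    • Run the following command to query the network health status.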
      /usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -net_health -g
      Sample output:
      net health status: Init
    • Run the following command to query the RoCE physical link status.
      /usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -link -g
      Sample output:
      link status: UP
    • Run the following command to query the RoCE optical module information.
      /usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -optical -g | grep prese
      Sample output:
      present              : present
    • Run the following command to query the interconnect TLS switch configuration.
      /usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -tls -g | grep switch
      Sample output:
      dev_id:0, tls switch[0](0:disable, 1:enable), tls preconfigured[1](0:non-preset, 1:preset), tls alarm time threshold[60]days
    • Run the following command to query the FEC mode.
      /usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -fec -g
      Sample output:
      fec mode: rs FEC mode
    • Run the following command to query the IP address and netmask.
      /usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -ip -g

      Sample output:

      ipaddr:10.xx.xx.10
      netmask:255.255.255.0
    • Run the following command to query the packet send/receive statistics.
      /usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -stat -g

      Sample output:

      packet statistics:
      mac_tx_mac_pause_num:0
      mac_rx_mac_pause_num:0
      mac_tx_pfc_pkt_num:0
      ...
      roce_qp_status_err_num:0
      nic_tx_all_pkg_num:122404
      nic_tx_all_oct_num:16921741
      nic_rx_all_pkg_num:6414803
      nic_rx_all_oct_num:482237805
    • Run the following command to query the link history statistics of the network port.
      /usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -link_stat -g

      Sample output:

      [device 0]current time        : Wed Jun  7 10:08:28 2023
      [device 0]link up count       : 2
      [device 0]link change records :
      [device 0]    Tue Jun  6 16:32:12 2023    LINK UP
      [device 0]    Tue Jun  6 16:32:10 2023    LINK DOWN
      [device 0]    Tue Jun  6 16:31:55 2023    LINK UP

      An example of the stored file is shown below. It covers card 0 only; collect the information for all cards.

      /usr/local/Ascend/driver/tools/hccn_tool -i 0 -net_health -g
      net health status: Init
      
      /usr/local/Ascend/driver/tools/hccn_tool -i 0 -link -g
      link status: UP
      
      /usr/local/Ascend/driver/tools/hccn_tool -i 0 -optical -g | grep prese
      present              : present
      
      /usr/local/Ascend/driver/tools/hccn_tool -i 0 -tls -g | grep switch
      dev_id:0, tls switch[0](0:disable, 1:enable), tls preconfigured[1](0:non-preset, 1:preset), tls alarm time threshold[60]days
      
      /usr/local/Ascend/driver/tools/hccn_tool -i 0 -fec -g
      fec mode: rs FEC mode
      
      /usr/local/Ascend/driver/tools/hccn_tool -i 0 -ip -g
      ipaddr:10.xx.xx.10
      netmask:255.255.255.0
      
      /usr/local/Ascend/driver/tools/hccn_tool -i 0 -stat -g
      packet statistics:
      mac_tx_mac_pause_num:0
      mac_rx_mac_pause_num:0
      mac_tx_pfc_pkt_num:0
      ...
      roce_qp_status_err_num:0
      nic_tx_all_pkg_num:122404
      nic_tx_all_oct_num:16921741
      nic_rx_all_pkg_num:6414803
      nic_rx_all_oct_num:482237805
      
      /usr/local/Ascend/driver/tools/hccn_tool -i 0 -link_stat -g
      [device 0]current time        : Wed Jun  7 10:08:28 2023
      [device 0]link up count       : 2
      [device 0]link change records :
      [device 0]    Tue Jun  6 16:32:12 2023    LINK UP
      [device 0]    Tue Jun  6 16:32:10 2023    LINK DOWN
      [device 0]    Tue Jun  6 16:31:55 2023    LINK UP
      Separate the results of consecutive collection commands with one blank line. For example:
      /usr/local/Ascend/driver/tools/hccn_tool -i 0 -ip -g
      XXXX
      
      /usr/local/Ascend/driver/tools/hccn_tool -i 0 -stat -g
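
The hccn_tool queries listed above can be collected for every card with a small loop. The following is a minimal sketch, not a definitive implementation: the names HCCN_TOOL, SAVE_FILE, record_cmd, and collect_hccn_info are hypothetical, and the number of cards is passed in by the caller because this section does not describe how to discover it. Only the commands shown above are used.

```shell
#!/bin/sh
# Sketch only: HCCN_TOOL, SAVE_FILE, record_cmd and collect_hccn_info are
# illustrative names, not part of hccn_tool itself.
HCCN_TOOL="${HCCN_TOOL:-/usr/local/Ascend/driver/tools/hccn_tool}"
SAVE_FILE="${SAVE_FILE:-npu_info_before.txt}"

# Append the command line, its output, and one blank separator line.
record_cmd() {
    {
        printf '%s\n' "$1"
        eval "$1" 2>&1 || true
        printf '\n'
    } >> "$SAVE_FILE"
}

# Run the documented queries for device IDs 0 .. (count - 1).
collect_hccn_info() {
    hc_count="$1"
    hc_id=0
    while [ "$hc_id" -lt "$hc_count" ]; do
        for hc_query in "-net_health -g" "-link -g" \
            "-optical -g | grep prese" "-tls -g | grep switch" \
            "-fec -g" "-ip -g" "-stat -g" "-link_stat -g"; do
            record_cmd "$HCCN_TOOL -i $hc_id $hc_query"
        done
        hc_id=$((hc_id + 1))
    done
}
```

For example, on a server with eight cards, calling collect_hccn_info 8 appends eight command/output blocks per card to the file, each followed by one blank line.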
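  • Before training, use npu-smi to query chip health information, and save each query command together with its output to the npu_info_before.txt file. The commands and sample output are as follows:
    • Run the following command to query basic information about the training devices.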
      /usr/local/bin/npu-smi info
      Sample output:
      +------------------------------------------------------------------------------------------------+
      | npu-smi 24.1.rc1                 Version: 24.1.rc1                                             |
      +---------------------------+---------------+----------------------------------------------------+
      | NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
      | Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
      +===========================+===============+====================================================+
      ...
      +===========================+===============+====================================================+
      | 7     xxx                 | OK            | 67.0        44                0    / 0             |
      | 0                         | 0000:3D:00.0  | 0           2505 / 15567      0    / 32768         |
      +===========================+===============+====================================================+
      +---------------------------+---------------+----------------------------------------------------+
      | NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
      +===========================+===============+====================================================+
      | No running processes found in NPU 0                                                            |
      +===========================+===============+====================================================+
      ...
      | No running processes found in NPU 7                                                            |
      +===========================+===============+====================================================+
    • Run the following command to query the ECC error counts of the high-bandwidth memory (HBM).
      /usr/local/bin/npu-smi info -i ${device_id} -t ecc
      Sample output:
      NPU ID                                   : 1
      Chip Count                               : 1
      
      DDR Single Bit Error Count               : 0
      DDR Double Bit Error Count               : 0
      DDR Single Bit Aggregate Total Err Cnt   : 0
      DDR Double Bit Aggregate Total Err Cnt   : 0
      DDR Single Bit Isolated Pages Count      : 0
      DDR Double Bit Isolated Pages Count      : 0
      HBM Single Bit Error Count               : 0
      HBM Double Bit Error Count               : 0
      HBM Single Bit Aggregate Total Err Cnt   : 0
      HBM Double Bit Aggregate Total Err Cnt   : 0
      HBM Single Bit Isolated Pages Count      : 0
      HBM Double Bit Isolated Pages Count      : 0
      Chip ID                                  : 0
    • Run the following command to query basic hardware information.
      /usr/local/bin/npu-smi info -i ${device_id} -t board
      Sample output:
      NPU ID                         : 0
      Software Version               : 23.0.5
      Firmware Version               : 7.1.0.7.220
      Compatibility                  : OK
      Board ID                       : 0x02
      PCB ID                         : A
      BOM ID                         : 1
      PCIe Bus Info                  : 0000:61:00.0
      Slot ID                        : 0
      Class ID                       : NA
      PCI Vendor ID                  : 0x19E5
      PCI Device ID                  : 0xD801
      Subsystem Vendor ID            : 0x0200
      Subsystem Device ID            : 0x0100
      Chip Count                     : 1
    • Run the following command to query basic hardware information and the name of the specified chip.
      /usr/local/bin/npu-smi info -i ${device_id} -c 0 -t board
      Sample output:
      NPU ID                         : 0
      Chip ID                        : 0
      Chip Type                      : Ascend
      Chip Name                      : xxx
      Chip Version                   : V1
      Board ID                       : 0x02
      PCB ID                         : NA
      BOM ID                         : 1
      VDie ID                        : 42C711D4 20B03704 4A10C8D4 14CC040A D2102003
      NDie ID                        : 27216594 20401010 4E10C8D4 14CC040A A4102003
      Chip Position ID               : 0
      PCIe Bus Info                  : 0000:61:00.0
      Firmware Version               : 7.1.0.7.220
    • Run the following command to query memory usage.
      /usr/local/bin/npu-smi info -i ${device_id} -t usages
      Sample output:
      NPU ID                         : 0
      Chip Count                     : 1
      
      DDR Capacity(MB)               : 13553
      DDR Usage Rate(%)              : 6
      DDR Hugepages Total(page)      : 0
      DDR Hugepages Usage Rate(%)    : 0
      HBM Capacity(MB)               : 32768
      HBM Usage Rate(%)              : 0
      Aicore Usage Rate(%)           : 0
      Aicpu Usage Rate(%)            : 0
      Ctrlcpu Usage Rate(%)          : 0
      DDR Bandwidth Usage Rate(%)    : 0
      HBM Bandwidth Usage Rate(%)    : 0
      Chip ID                        : 0
    • Run the following command to query chip health information.
      /usr/local/bin/npu-smi info -i ${device_id} -c 0 -t health
      Sample output:
       Health Status                  : OK
       Error Code                     : NA
       Error Information              : NA
      An example of the stored file is shown below. Collect the information for all cards.
      /usr/local/bin/npu-smi info
      +------------------------------------------------------------------------------------------------+
      | npu-smi 23.0.5                   Version: 23.0.5                                               |
      +---------------------------+---------------+----------------------------------------------------+
      | NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
      | Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
      +===========================+===============+====================================================+
      | 0     xxx                 | OK            | 73.1        37                0    / 0             |
      | 0                         | 0000:61:00.0  | 0           920  / 13553      0    / 32768         |
      +===========================+===============+====================================================+
      ...
      +===========================+===============+====================================================+
      | 7     xxx                 | OK            | 67.0        38                0    / 0             |
      | 0                         | 0000:3D:00.0  | 0           2346 / 15567      0    / 32768         |
      +===========================+===============+====================================================+
      +---------------------------+---------------+----------------------------------------------------+
      | NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
      +===========================+===============+====================================================+
      | No running processes found in NPU 0                                                            |
      +===========================+===============+====================================================+
      ...
      +===========================+===============+====================================================+
      | No running processes found in NPU 7                                                            |
      +===========================+===============+====================================================+
      
      /usr/local/bin/npu-smi info -i 0 -c 0 -t health
      Health Status                  : OK
      Error Code                     : NA
      Error Information              : NA
      
      /usr/local/bin/npu-smi info -i 0 -t ecc
      NPU ID                                   : 0
      Chip Count                               : 1
      
      DDR Single Bit Error Count               : 0
      DDR Double Bit Error Count               : 0
      DDR Single Bit Aggregate Total Err Cnt   : 0
      DDR Double Bit Aggregate Total Err Cnt   : 0
      DDR Single Bit Isolated Pages Count      : 0
      DDR Double Bit Isolated Pages Count      : 0
      HBM Single Bit Error Count               : 0
      HBM Double Bit Error Count               : 0
      HBM Single Bit Aggregate Total Err Cnt   : 0
      HBM Double Bit Aggregate Total Err Cnt   : 0
      HBM Single Bit Isolated Pages Count      : 0
      HBM Double Bit Isolated Pages Count      : 0
      Chip ID                                  : 0
      
      /usr/local/bin/npu-smi info -i 0 -t board
      NPU ID                         : 0
      Software Version               : 23.0.5
      Firmware Version               : 7.1.0.7.220
      Compatibility                  : OK
      Board ID                       : 0x02
      PCB ID                         : A
      BOM ID                         : 1
      PCIe Bus Info                  : 0000:61:00.0
      Slot ID                        : 0
      Class ID                       : NA
      PCI Vendor ID                  : 0x19E5
      PCI Device ID                  : 0xD801
      Subsystem Vendor ID            : 0x0200
      Subsystem Device ID            : 0x0100
      Chip Count                     : 1
      
      /usr/local/bin/npu-smi info -i 0 -c 0 -t board
      NPU ID                         : 0
      Chip ID                        : 0
      Chip Type                      : Ascend
      Chip Name                      : xxx
      Chip Version                   : V1
      Board ID                       : 0x02
      PCB ID                         : NA
      BOM ID                         : 1
      VDie ID                        : 42C711D4 20B03704 4A10C8D4 14CC040A D2102003
      NDie ID                        : 27216594 20401010 4E10C8D4 14CC040A A4102003
      Chip Position ID               : 0
      PCIe Bus Info                  : 0000:61:00.0
      Firmware Version               : 7.1.0.7.220
      
      /usr/local/bin/npu-smi info -i 0 -t usages
      NPU ID                         : 0
      Chip Count                     : 1
      
      DDR Capacity(MB)               : 13553
      DDR Usage Rate(%)              : 6
      DDR Hugepages Total(page)      : 0
      DDR Hugepages Usage Rate(%)    : 0
      HBM Capacity(MB)               : 32768
      HBM Usage Rate(%)              : 0
      Aicore Usage Rate(%)           : 0
      Aicpu Usage Rate(%)            : 0
      Ctrlcpu Usage Rate(%)          : 0
      DDR Bandwidth Usage Rate(%)    : 0
      HBM Bandwidth Usage Rate(%)    : 0
      Chip ID                        : 0
      
      /usr/local/bin/npu-smi info -i 0 -c 0 -t health
       Health Status                  : OK
       Error Code                     : NA
       Error Information              : NA
      ...
    • Separate the results of consecutive collection commands with one blank line. For example:
      /usr/local/bin/npu-smi info -i 0 -c 0 -t health
      XXXX
      
      /usr/local/bin/npu-smi info -i 1 -c 0 -t health
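
The one-blank-line rule above can be checked mechanically before the file is handed over. The following is a minimal sketch, assuming that command lines in the file can be recognized by the /usr/local/ prefix used in the examples. The function name check_separators is hypothetical, and commands with a different prefix (such as cat) are not checked.

```shell
#!/bin/sh
# Sketch only: check_separators is an illustrative helper.
# It reports every command line (recognized by the /usr/local/ prefix, an
# assumption) that is not preceded by a blank line, and returns non-zero
# if any such line is found.
check_separators() {
    awk '
        BEGIN { bad = 0; prev = "" }
        /^\/usr\/local\// && NR > 1 && prev != "" {
            printf "line %d: missing blank line before command\n", NR
            bad = 1
        }
        { prev = $0 }
        END { exit bad }
    ' "$1"
}
```

Calling check_separators npu_info_before.txt prints one message per offending line and exits with status 1 if any command block is not separated from the previous one.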
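  • Before training, use the following additional commands to collect further environment information, and save each query command together with its output to the npu_info_before.txt file. The commands and sample output are as follows:
    • Run the following commands to record the current system time.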
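      # save_file: path of the output file, for example npu_info_before.txt
      datetime=$(date "+%Y-%m-%d %H:%M:%S")
      echo "Datetime: $datetime" >> "${save_file}"
      # Write exactly one blank separator line.
      echo "" >> "${save_file}"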
      The line written to the file is similar to the following:
      Datetime: 2024-06-26 01:13:36
    • Run the following command to query the driver version.
      cat /usr/local/Ascend/driver/version.info
      Sample output:
      Version=24.1.rc1
      ascendhal_version=7.35.19
      aicpu_version=1.0
      tdt_version=1.0
      log_version=1.0
      prof_version=2.0
      dvppkernels_version=1.1
      tsfw_version=1.0
      Innerversion=V100R001C15SPC006B220
      compatible_version=[V100R001C30],[V100R001C13],[V100R001C15],[V100R001C17]
      compatible_version_fw=[7.0.0,7.2.99]
    • Run the following command to query the firmware version.
      /usr/local/Ascend/driver/tools/upgrade-tool --device_index -1 --component -1 --version
      Sample output:
      {
      Get component version(7.1.0.7.220) succeed for deviceId(0), componentType(0).
      {"device_id":0, "component":nve, "version":7.1.0.7.220}
      Get component version(7.1.0.7.220) succeed for deviceId(0), componentType(3).
      {"device_id":0, "component":uefi, "version":7.1.0.7.220}
      Get component version(7.1.0.7.220) succeed for deviceId(0), componentType(8).
      {"device_id":0, "component":imu, "version":7.1.0.7.220}
      Get component version(7.1.0.7.220) succeed for deviceId(0), componentType(9).
      {"device_id":0, "component":imp, "version":7.1.0.7.220}
      …
      Get component version(7.1.0.7.220) succeed for deviceId(7), componentType(0).
      {"device_id":7, "component":nve, "version":7.1.0.7.220}
      Get component version(7.1.0.7.220) succeed for deviceId(7), componentType(3).
      {"device_id":7, "component":uefi, "version":7.1.0.7.220}
      Get component version(7.1.0.7.220) succeed for deviceId(7), componentType(8).
      {"device_id":7, "component":imu, "version":7.1.0.7.220}
      Get component version(7.1.0.7.220) succeed for deviceId(7), componentType(9).
      {"device_id":7, "component":imp, "version":7.1.0.7.220}
      }
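
The system time, driver version, and firmware version queries above can be recorded in a single pass. The following is a minimal sketch under stated assumptions: record_system_info and its three parameters are hypothetical, the default paths are the ones shown above, and each command line is written before its output so that the file keeps the command-then-result layout with one blank line between blocks.

```shell
#!/bin/sh
# Sketch only: record_system_info and its parameters are illustrative.
#   $1  output file (for example npu_info_before.txt)
#   $2  driver version file (optional, defaults to the documented path)
#   $3  upgrade tool (optional, defaults to the documented path)
record_system_info() {
    rs_save_file="$1"
    rs_version_file="${2:-/usr/local/Ascend/driver/version.info}"
    rs_upgrade_tool="${3:-/usr/local/Ascend/driver/tools/upgrade-tool}"
    {
        # Current system time, followed by one blank separator line.
        printf 'Datetime: %s\n\n' "$(date '+%Y-%m-%d %H:%M:%S')"

        # Driver version: command line, then file content.
        printf 'cat %s\n' "$rs_version_file"
        cat "$rs_version_file" 2>&1 || true
        printf '\n'

        # Firmware version of all devices and components.
        printf '%s --device_index -1 --component -1 --version\n' "$rs_upgrade_tool"
        "$rs_upgrade_tool" --device_index -1 --component -1 --version 2>&1 || true
        printf '\n'
    } >> "$rs_save_file"
}
```

For example, record_system_info npu_info_before.txt appends the three blocks to the file using the default paths.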