Service Monitoring Metrics Query Interface (Prometheus Format)

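When you access this interface over HTTP, protect it from being called directly by end users, and use it only within a local area network.
To use this interface, make sure the service monitoring switch is enabled before you start the service. The command to enable service monitoring is as follows: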
```
export MIES_SERVICE_MONITOR_MODE=1
```

Interface Function

Queries the service monitoring metrics of the inference service.

Interface Format

Operation type: GET

URL: http://{ip}:{port}/metrics

For {ip} and {port}, use the management-plane IP address and the metrics port number, that is, "managementIpAddress" and "metricsPort".
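As a rough illustration of the URL format above, the following Python sketch builds the metrics URL and issues the GET request using only the standard library. The function names `build_metrics_url` and `fetch_metrics`, the timeout, and the address and port in the usage lines are assumptions made for this example; substitute the "managementIpAddress" and "metricsPort" values from your own deployment.

```python
import urllib.request


def build_metrics_url(management_ip: str, metrics_port: int) -> str:
    """Build http://{ip}:{port}/metrics from the management-plane settings."""
    # IPv6 literals must be wrapped in brackets inside a URL.
    host = f"[{management_ip}]" if ":" in management_ip else management_ip
    return f"http://{host}:{metrics_port}/metrics"


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """Send the GET request and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # urlopen raises HTTPError for 4xx/5xx; this guards other non-200 codes.
        if resp.status != 200:
            raise RuntimeError(f"unexpected status code: {resp.status}")
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Placeholder address and port, chosen only for illustration.
    url = build_metrics_url("127.0.0.1", 9000)
    print(url)
    # body = fetch_metrics(url)  # uncomment when a service is reachable
```

Because the interface should stay inside the local network, a client like this would normally run on a monitoring host in the same network rather than on an end-user machine.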

Request Parameters

None

Usage Example

Request example:

GET http://{ip}:{port}/metrics

Response example:

# HELP exposer_transferred_bytes_total Transferred bytes to metrics services
# TYPE exposer_transferred_bytes_total counter
exposer_transferred_bytes_total 7901
# HELP exposer_scrapes_total Number of times metrics were scraped
# TYPE exposer_scrapes_total counter
exposer_scrapes_total 1
# HELP exposer_request_latencies Latencies of serving scrape requests, in microseconds
# TYPE exposer_request_latencies summary
exposer_request_latencies_count 1
exposer_request_latencies_sum 282
exposer_request_latencies{quantile="0.5"} Nan
exposer_request_latencies{quantile="0.9"} Nan
exposer_request_latencies{quantile="0.99"} Nan
# HELP request_received_total Number of requests received so far.
# TYPE request_received_total counter
request_received_total{model_name="llama2-7b"} 3188
# HELP request_success_total Number of responses sent so far
# TYPE request_success_total counter
request_success_total{model_name="llama2-7b"} 2267
# HELP request_failed_total Number of responses failed so far
# TYPE request_failed_total counter
request_failed_total{model_name="llama2-7b"} 0
# HELP prompt_tokens_total Number of prefill tokens processed.
# TYPE prompt_tokens_total counter
prompt_tokens_total{model_name="llama2-7b"} 9564
# HELP generation_tokens_total Number of generation tokens processed.
# TYPE generation_tokens_total counter
generation_tokens_total{model_name="llama2-7b"} 84425
# HELP avg_prompt_throughput_toks_per_s Average prefill throughput in tokens/s.
# TYPE avg_prompt_throughput_toks_per_s gauge
avg_prompt_throughput_toks_per_s{model_name="llama2-7b"} 0.586739718914032
# HELP avg_generation_throughput_toks_per_s Average generation throughput in tokens/s.
# TYPE avg_generation_throughput_toks_per_s gauge
avg_generation_throughput_toks_per_s{model_name="llama2-7b"} 2.375296831130981
# HELP failed_request_perc Requests failure rate. 1 means 100 percent usage.
# TYPE failed_request_perc gauge
failed_request_perc{model_name="llama2-7b"} 0
# HELP npu_cache_usage_perc NPU KV-cache usage. 1 means 100 percent usage.
# TYPE npu_cache_usage_perc gauge
npu_cache_usage_perc{model_name="llama2-7b"} 1
# HELP cpu_cache_usage_perc CPU KV-cache usage. 1 means 100 percent usage.
# TYPE cpu_cache_usage_perc gauge
cpu_cache_usage_perc{model_name="llama2-7b"} 0
# HELP time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE time_to_first_token_seconds histogram
time_to_first_token_seconds_count{model_name="llama2-7b"} 2523
time_to_first_token_seconds_sum{model_name="llama2-7b"} 9740.00200343132
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="0.001"} 0
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="0.005"} 0
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="0.01"} 0
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="0.02"} 0
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="0.04"} 0
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="0.06"} 10
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="0.08"} 54
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="0.1"} 104
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="0.25"} 256
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="0.5"} 256
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="0.75"} 276
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="1"} 321
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="2.5"} 628
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="5"} 1148
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="7.5"} 2523
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="10"} 2523
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="+Inf"} 2523
# HELP time_per_output_token_seconds Histogram of time per output token in seconds.
# TYPE time_per_output_token_seconds histogram
time_per_output_token_seconds_count{model_name="llama2-7b"} 85800
time_per_output_token_seconds_sum{model_name="llama2-7b"} 4445.857012826018
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="0.01"} 0
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="0.025"} 0
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="0.05"} 3
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="0.075"} 12
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="0.1"} 40283
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="0.15"} 83145
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="0.2"} 83339
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="0.3"} 83339
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="0.4"} 83539
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="0.5"} 85139
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="0.75"} 85740
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="1"} 85800
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="2.5"} 85800
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="+Inf"} 85800
# HELP e2e_request_latency_seconds Histogram of end to end request latency in seconds.
# TYPE e2e_request_latency_seconds histogram
e2e_request_latency_seconds_count{model_name="llama2-7b"} 2267
e2e_request_latency_seconds_sum{model_name="llama2-7b"} 12684.5319980979
e2e_request_latency_seconds_bucket{model_name="llama2-7b",le="1"} 27
e2e_request_latency_seconds_bucket{model_name="llama2-7b",le="2.5"} 268
e2e_request_latency_seconds_bucket{model_name="llama2-7b",le="5"} 712
e2e_request_latency_seconds_bucket{model_name="llama2-7b",le="10"} 2267
e2e_request_latency_seconds_bucket{model_name="llama2-7b",le="15"} 2267
e2e_request_latency_seconds_bucket{model_name="llama2-7b",le="20"} 2267
e2e_request_latency_seconds_bucket{model_name="llama2-7b",le="30"} 2267
e2e_request_latency_seconds_bucket{model_name="llama2-7b",le="40"} 2267
e2e_request_latency_seconds_bucket{model_name="llama2-7b",le="50"} 2267
e2e_request_latency_seconds_bucket{model_name="llama2-7b",le="60"} 2267
e2e_request_latency_seconds_bucket{model_name="llama2-7b",le="+Inf"} 2267
# HELP request_prompt_tokens Number of prefill tokens processed.
# TYPE request_prompt_tokens histogram
request_prompt_tokens_count{model_name="llama2-7b"} 3188
request_prompt_tokens_sum{model_name="llama2-7b"} 9564
request_prompt_tokens_bucket{model_name="llama2-7b",le="10"} 3188
request_prompt_tokens_bucket{model_name="llama2-7b",le="50"} 3188
request_prompt_tokens_bucket{model_name="llama2-7b",le="100"} 3188
request_prompt_tokens_bucket{model_name="llama2-7b",le="200"} 3188
request_prompt_tokens_bucket{model_name="llama2-7b",le="500"} 3188
request_prompt_tokens_bucket{model_name="llama2-7b",le="1000"} 3188
request_prompt_tokens_bucket{model_name="llama2-7b",le="2000"} 3188
request_prompt_tokens_bucket{model_name="llama2-7b",le="5000"} 3188
request_prompt_tokens_bucket{model_name="llama2-7b",le="10000"} 3188
request_prompt_tokens_bucket{model_name="llama2-7b",le="+Inf"} 3188
# HELP request_generation_tokens Number of generation tokens processed.
# TYPE request_generation_tokens histogram
request_generation_tokens_count{model_name="llama2-7b"} 2267
request_generation_tokens_sum{model_name="llama2-7b"} 84425
request_generation_tokens_bucket{model_name="llama2-7b",le="10"} 0
request_generation_tokens_bucket{model_name="llama2-7b",le="50"} 2267
request_generation_tokens_bucket{model_name="llama2-7b",le="100"} 2267
request_generation_tokens_bucket{model_name="llama2-7b",le="200"} 2267
request_generation_tokens_bucket{model_name="llama2-7b",le="500"} 2267
request_generation_tokens_bucket{model_name="llama2-7b",le="1000"} 2267
request_generation_tokens_bucket{model_name="llama2-7b",le="2000"} 2267
request_generation_tokens_bucket{model_name="llama2-7b",le="5000"} 2267
request_generation_tokens_bucket{model_name="llama2-7b",le="10000"} 2267
request_generation_tokens_bucket{model_name="llama2-7b",le="+Inf"} 2267

Response status code: 200
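The response body is plain text: comment lines start with `#`, and every other line is a metric name, an optional label set in braces, and a value. The sketch below shows one way to turn such text into Python tuples. It is a minimal parser written for this example, not an official client; `parse_metrics` is a hypothetical name, and escaped characters inside label values are left as they appear.

```python
import re
from typing import Dict, List, Tuple

# Metric name, optional {labels}, value, optional integer timestamp.
_SAMPLE_RE = re.compile(
    r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
# One label pair such as model_name="llama2-7b".
_LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse_metrics(text: str) -> List[Sample]:
    """Return (name, labels, value) for every sample line in the text."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this sketch does not understand
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        # float() accepts "Nan" and "+Inf" as written in the response.
        samples.append((name, labels, float(raw_value)))
    return samples


if __name__ == "__main__":
    demo = 'request_received_total{model_name="llama2-7b"} 3188\n'
    for sample in parse_metrics(demo):
        print(sample)
```

Keeping the labels as a dictionary makes it straightforward to filter samples by `model_name` or to collect the `le` boundaries of one histogram.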

Output Description

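Parameter	Type	Description
exposer_transferred_bytes_total	Counter	Total number of bytes returned by the service monitoring metrics query interface. Note: This metric will be removed in a later version and is not recommended.
exposer_scrapes_total	Counter	Total number of times the service monitoring metrics query interface has been scraped. Note: This metric will be removed in a later version and is not recommended.
exposer_request_latencies	Summary	Latency of scrape requests to the service monitoring metrics query interface, in microseconds. exposer_request_latencies_count: total number of scrape requests. exposer_request_latencies_sum: total latency of scrape requests, in microseconds. exposer_request_latencies: summary data of the scrape request latency. quantile: the quantile. Note: This metric will be removed in a later version and is not recommended.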
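request_received_total	Counter	Number of inference requests the server has received so far. model_name: name of the model in use, of type string; if there are multiple models, the names are joined with "&". Note: Every model_name in the response example has this meaning.
request_success_total	Counter	Number of requests the server has inferred successfully so far.
request_failed_total	Counter	Number of requests for which inference has failed on the server so far.
prompt_tokens_total	Counter	Number of prefill tokens processed so far.
generation_tokens_total	Counter	Number of generation tokens processed so far.
avg_prompt_throughput_toks_per_s	Gauge	Latest average prefill throughput as of the completion of the previous request, in tokens/s.
avg_generation_throughput_toks_per_s	Gauge	Latest average generation throughput as of the generation of the previous token, in tokens/s.
failed_request_perc	Gauge	Rate of failed requests on the server so far. Failed request rate = number of failed requests / number of received requests; 1 means 100%.
npu_cache_usage_perc	Gauge	Current NPU memory utilization of the KV cache; 1 means 100%.
cpu_cache_usage_perc	Gauge	Current CPU memory utilization of the KV cache; 1 means 100%.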
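time_to_first_token_seconds	Histogram	Time to first token: the time a request takes to generate its first inference token, in seconds. time_to_first_token_seconds_count: number of requests whose time to first token has been recorded so far. time_to_first_token_seconds_sum: sum of the time to first token over all requests recorded so far. time_to_first_token_seconds_bucket: time-to-first-token data of the requests recorded so far, counted per histogram bucket. le: less than or equal to, the upper bound of a histogram bucket. Note: Every le in the response example has this meaning.
time_per_output_token_seconds	Histogram	Token generation latency: the time interval between the generation of two consecutive tokens, in seconds. time_per_output_token_seconds_count: number of tokens whose generation latency has been recorded so far. time_per_output_token_seconds_sum: sum of the generation latency over all tokens recorded so far. time_per_output_token_seconds_bucket: token generation latency data recorded so far, counted per histogram bucket.
e2e_request_latency_seconds	Histogram	End-to-end latency: the time from when a request is received until its execution completes, in seconds. e2e_request_latency_seconds_count: number of requests whose end-to-end latency has been recorded so far. e2e_request_latency_seconds_sum: sum of the end-to-end latency over all requests recorded so far. e2e_request_latency_seconds_bucket: end-to-end latency data of the requests recorded so far, counted per histogram bucket.
request_prompt_tokens	Histogram	Number of input tokens per request: the number of tokens obtained after the request prompt passes through the tokenizer. request_prompt_tokens_count: number of requests recorded for this metric so far. request_prompt_tokens_sum: sum of the input token counts over all requests recorded so far. request_prompt_tokens_bucket: input token counts of the requests recorded so far, counted per histogram bucket.
request_generation_tokens	Histogram	Number of output tokens per request: the number of tokens obtained after model inference for the request. request_generation_tokens_count: number of requests recorded for this metric so far. request_generation_tokens_sum: sum of the output token counts over all requests recorded so far. request_generation_tokens_bucket: output token counts of the requests recorded so far, counted per histogram bucket.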
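The counters and histograms in the table can be combined into derived figures on the client side. The sketch below shows three such calculations: the failed request rate as defined for failed_request_perc, a mean latency from a histogram's `_sum` and `_count`, and an estimated quantile from the cumulative `le` buckets. The helper names are hypothetical, and the quantile estimate assumes values are spread evenly inside each bucket (a simplification similar in spirit to the `histogram_quantile` function in PromQL), so treat its results as approximations.

```python
import math
from typing import List, Tuple


def failure_rate(failed: float, received: float) -> float:
    """Failed requests / received requests; 1.0 means 100%."""
    return failed / received if received else 0.0


def histogram_mean(total_sum: float, count: float) -> float:
    """Average observation, computed from the _sum and _count series."""
    return total_sum / count if count else math.nan


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (le, count) buckets.

    Assumes the first bucket starts at 0 and interpolates linearly
    inside the bucket that contains the target rank.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                return prev_le  # no finite upper bound to interpolate to
            if count == prev_count:
                return le
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_le + (le - prev_le) * fraction
        prev_le, prev_count = le, count
    return prev_le


if __name__ == "__main__":
    # e2e_request_latency_seconds values copied from the response example.
    e2e_buckets = [(1, 27), (2.5, 268), (5, 712), (10, 2267), (15, 2267),
                   (20, 2267), (30, 2267), (40, 2267), (50, 2267),
                   (60, 2267), (math.inf, 2267)]
    print(failure_rate(0, 3188))
    print(histogram_mean(12684.5319980979, 2267))
    print(estimate_quantile(0.5, e2e_buckets))
```

Note that the bucket counts are cumulative: each `le` bucket includes every observation counted in the smaller buckets, which is why the sketch subtracts the previous bucket's count before interpolating.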
