To use this interface, make sure the service monitoring switch is enabled before starting the service. Enable service monitoring with the following command:

```
export MIES_SERVICE_MONITOR_MODE=1
```
Queries the service monitoring metrics of the cluster. The response is returned in Prometheus metrics format.
Operation type: GET
URL: https://{ip}:{port}/metrics
Request parameters: none.
Request sample:

```
GET https://{ip}:{port}/metrics
```
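For reference, a minimal sketch of issuing this request from Python is shown below. The address is a placeholder, and `verify=False` is an assumption made for a self-signed certificate; in production, point `verify` at the service's CA certificate instead:

```python
import requests

# Placeholder address; substitute the actual {ip} and {port} of the service.
url = "https://127.0.0.1:1025/metrics"

# Assumption: the endpoint uses a self-signed certificate, hence verify=False.
# Replace with verify="/path/to/ca.crt" in production deployments.
response = requests.get(url, verify=False, timeout=10)
response.raise_for_status()
print(response.text)  # Prometheus text exposition format, as in the sample below
```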
Response sample:

```
# HELP request_received_total Number of requests received so far.
# TYPE request_received_total counter
request_received_total{model_name="llama2-7b"} 3188
# HELP request_success_total Number of requests proceed successfully so far.
# TYPE request_success_total counter
request_success_total{model_name="llama2-7b"} 2267
# HELP request_failed_total Number of requests failed so far.
# TYPE request_failed_total counter
request_failed_total{model_name="llama2-7b"} 0
# HELP num_preemptions_total Cumulative number of preemption from the engine.
# TYPE num_preemptions_total counter
num_preemptions_total{model_name="llama2-7b"} 637
# HELP num_requests_running Number of requests currently running on NPU.
# TYPE num_requests_running gauge
num_requests_running{model_name="llama2-7b"} 0
# HELP num_requests_waiting Number of requests waiting to be processed.
# TYPE num_requests_waiting gauge
num_requests_waiting{model_name="llama2-7b"} 0
# HELP num_requests_swapped Number of requests swapped to CPU.
# TYPE num_requests_swapped gauge
num_requests_swapped{model_name="llama2-7b"} 0
# HELP prompt_tokens_total Number of prefill tokens processed.
# TYPE prompt_tokens_total counter
prompt_tokens_total{model_name="llama2-7b"} 9564
# HELP generation_tokens_total Number of generation tokens processed.
# TYPE generation_tokens_total counter
generation_tokens_total{model_name="llama2-7b"} 84425
# HELP avg_prompt_throughput_toks_per_s Average prefill throughput in tokens/s.
# TYPE avg_prompt_throughput_toks_per_s gauge
avg_prompt_throughput_toks_per_s{model_name="llama2-7b"} 0.586739718914032
# HELP avg_generation_throughput_toks_per_s Average generation throughput in tokens/s.
# TYPE avg_generation_throughput_toks_per_s gauge
avg_generation_throughput_toks_per_s{model_name="llama2-7b"} 2.375296831130981
# HELP failed_request_perc Requests failure rate. 1 means 100 percent usage.
# TYPE failed_request_perc gauge
failed_request_perc{model_name="llama2-7b"} 0
# HELP npu_cache_usage_perc NPU KV-cache usage. 1 means 100 percent usage.
# TYPE npu_cache_usage_perc gauge
npu_cache_usage_perc{model_name="llama2-7b"} 1
# HELP cpu_cache_usage_perc CPU KV-cache usage. 1 means 100 percent usage.
# TYPE cpu_cache_usage_perc gauge
cpu_cache_usage_perc{model_name="llama2-7b"} 0
# HELP npu_prefix_cache_hit_rate NPU prefix cache block hit rate.
# TYPE npu_prefix_cache_hit_rate gauge
npu_prefix_cache_hit_rate{model_name="llama2-7b"} 0.5
# HELP time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE time_to_first_token_seconds histogram
time_to_first_token_seconds_count{model_name="llama2-7b"} 2523
time_to_first_token_seconds_sum{model_name="llama2-7b"} 9740.00200343132
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="0.001"} 0
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="0.005"} 0
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="0.01"} 0
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="0.02"} 0
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="0.04"} 0
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="0.06"} 10
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="0.08"} 54
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="0.1"} 104
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="0.25"} 256
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="0.5"} 256
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="0.75"} 276
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="1"} 321
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="2.5"} 628
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="5"} 1148
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="7.5"} 2523
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="10"} 2523
time_to_first_token_seconds_bucket{model_name="llama2-7b",le="+Inf"} 2523
# HELP time_per_output_token_seconds Histogram of time per output token in seconds.
# TYPE time_per_output_token_seconds histogram
time_per_output_token_seconds_count{model_name="llama2-7b"} 85800
time_per_output_token_seconds_sum{model_name="llama2-7b"} 4445.857012826018
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="0.01"} 0
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="0.025"} 0
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="0.05"} 3
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="0.075"} 12
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="0.1"} 40283
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="0.15"} 83145
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="0.2"} 83339
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="0.3"} 83339
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="0.4"} 83539
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="0.5"} 85139
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="0.75"} 85740
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="1"} 85800
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="2.5"} 85800
time_per_output_token_seconds_bucket{model_name="llama2-7b",le="+Inf"} 85800
# HELP e2e_request_latency_seconds Histogram of end to end request latency in seconds.
# TYPE e2e_request_latency_seconds histogram
e2e_request_latency_seconds_count{model_name="llama2-7b"} 2267
e2e_request_latency_seconds_sum{model_name="llama2-7b"} 12684.5319980979
e2e_request_latency_seconds_bucket{model_name="llama2-7b",le="1"} 27
e2e_request_latency_seconds_bucket{model_name="llama2-7b",le="2.5"} 268
e2e_request_latency_seconds_bucket{model_name="llama2-7b",le="5"} 712
e2e_request_latency_seconds_bucket{model_name="llama2-7b",le="10"} 2267
e2e_request_latency_seconds_bucket{model_name="llama2-7b",le="15"} 2267
e2e_request_latency_seconds_bucket{model_name="llama2-7b",le="20"} 2267
e2e_request_latency_seconds_bucket{model_name="llama2-7b",le="30"} 2267
e2e_request_latency_seconds_bucket{model_name="llama2-7b",le="40"} 2267
e2e_request_latency_seconds_bucket{model_name="llama2-7b",le="50"} 2267
e2e_request_latency_seconds_bucket{model_name="llama2-7b",le="60"} 2267
e2e_request_latency_seconds_bucket{model_name="llama2-7b",le="+Inf"} 2267
# HELP request_prompt_tokens Number of prefill tokens processed.
# TYPE request_prompt_tokens histogram
request_prompt_tokens_count{model_name="llama2-7b"} 3188
request_prompt_tokens_sum{model_name="llama2-7b"} 9564
request_prompt_tokens_bucket{model_name="llama2-7b",le="10"} 3188
request_prompt_tokens_bucket{model_name="llama2-7b",le="50"} 3188
request_prompt_tokens_bucket{model_name="llama2-7b",le="100"} 3188
request_prompt_tokens_bucket{model_name="llama2-7b",le="200"} 3188
request_prompt_tokens_bucket{model_name="llama2-7b",le="500"} 3188
request_prompt_tokens_bucket{model_name="llama2-7b",le="1000"} 3188
request_prompt_tokens_bucket{model_name="llama2-7b",le="2000"} 3188
request_prompt_tokens_bucket{model_name="llama2-7b",le="5000"} 3188
request_prompt_tokens_bucket{model_name="llama2-7b",le="10000"} 3188
request_prompt_tokens_bucket{model_name="llama2-7b",le="+Inf"} 3188
# HELP request_generation_tokens Number of generation tokens processed.
# TYPE request_generation_tokens histogram
request_generation_tokens_count{model_name="llama2-7b"} 2267
request_generation_tokens_sum{model_name="llama2-7b"} 84425
request_generation_tokens_bucket{model_name="llama2-7b",le="10"} 0
request_generation_tokens_bucket{model_name="llama2-7b",le="50"} 2267
request_generation_tokens_bucket{model_name="llama2-7b",le="100"} 2267
request_generation_tokens_bucket{model_name="llama2-7b",le="200"} 2267
request_generation_tokens_bucket{model_name="llama2-7b",le="500"} 2267
request_generation_tokens_bucket{model_name="llama2-7b",le="1000"} 2267
request_generation_tokens_bucket{model_name="llama2-7b",le="2000"} 2267
request_generation_tokens_bucket{model_name="llama2-7b",le="5000"} 2267
request_generation_tokens_bucket{model_name="llama2-7b",le="10000"} 2267
request_generation_tokens_bucket{model_name="llama2-7b",le="+Inf"} 2267
```
The set of metrics and their content are determined by MindIE Server; the Coordinator only aggregates the metrics from the nodes in the cluster.
| Parameter | Type | Description |
|---|---|---|
| request_received_total | Counter | Number of requests the server has received so far. `model_name`: name of the model in use (string); if multiple models are served, the names are concatenated with "&". Note: every `model_name` label in the response sample has this meaning. |
| request_success_total | Counter | Number of requests the server has executed successfully so far. |
| request_failed_total | Counter | Number of requests whose inference has failed so far. |
| num_requests_running | Gauge | Number of requests the server is currently executing. |
| num_requests_waiting | Gauge | Number of requests currently waiting to be scheduled for execution. |
| num_requests_swapped | Gauge | Number of requests currently swapped out to the CPU. |
| num_preemptions_total | Counter | Cumulative number of request preemptions the server has performed so far. |
| prompt_tokens_total | Counter | Number of prefill tokens processed so far. |
| generation_tokens_total | Counter | Number of generation tokens processed so far. |
| avg_prompt_throughput_toks_per_s | Gauge | Latest average prefill throughput as of the last completed request, in tokens/s. |
| avg_generation_throughput_toks_per_s | Gauge | Latest average generation throughput as of the last generated token, in tokens/s. |
| failed_request_perc | Gauge | Fraction of requests that have failed so far; 1 means 100%. |
| npu_cache_usage_perc | Gauge | Latest NPU KV-cache memory usage as of the last completed request; 1 means 100%. |
| cpu_cache_usage_perc | Gauge | Latest CPU KV-cache memory usage as of the last completed request; 1 means 100%. |
| npu_prefix_cache_hit_rate | Gauge | Prefix cache block hit rate on the NPU; 1 means 100%. |
| time_to_first_token_seconds | Histogram | Time to first token: the time a request takes to generate its first inference token, in seconds. Note: every `le` label in the response sample is the inclusive upper bound of a histogram bucket. |
| time_per_output_token_seconds | Histogram | Per-output-token latency: the interval between two consecutive generated tokens, in seconds. |
| e2e_request_latency_seconds | Histogram | End-to-end latency: the time from when a request is received until it finishes, in seconds. |
| request_prompt_tokens | Histogram | Number of input tokens per request: the token count obtained by running the request's prompt through the tokenizer. |
| request_generation_tokens | Histogram | Number of output tokens per request: the token count produced by model inference. |
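Each histogram metric exposes `_bucket`, `_sum`, and `_count` series, so a running mean can be derived as `_sum / _count`. Below is a minimal sketch of computing those means from the response text with the `prometheus_client` text parser; the choice of parser is an assumption, and any Prometheus text-format parser would do:

```python
from prometheus_client.parser import text_string_to_metric_families

def histogram_means(metrics_text: str) -> dict:
    """Return {(metric_name, model_name): _sum / _count} for every histogram."""
    means = {}
    for family in text_string_to_metric_families(metrics_text):
        if family.type != "histogram":
            continue
        sums, counts = {}, {}
        for sample in family.samples:
            model = sample.labels.get("model_name", "")
            if sample.name.endswith("_sum"):
                sums[model] = sample.value
            elif sample.name.endswith("_count"):
                counts[model] = sample.value
        for model, count in counts.items():
            if count > 0:  # skip models with no observations yet
                means[(family.name, model)] = sums[model] / count
    return means
```

Applied to the response sample above, this yields an average time to first token of 9740.002 / 2523 ≈ 3.86 s and an average per-output-token latency of 4445.857 / 85800 ≈ 0.052 s for llama2-7b.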