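Inference interface

The transformers version in the runtime environment must be 4.34.1 or later; tokenizers in earlier versions do not support the "chat_template" method.
The tokenizer_config.json file under the inference model weight path must contain the "chat_template" field and its implementation.
The function call parameters tool_call_id, tool_calls, tools, and tool_choice are currently supported only by some models, and using them with other models may cause errors. The only model supported at present is ChatGLM3-6B.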
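
The two prerequisites above can be checked before starting the service. The following is a minimal sketch, not part of MindIE: the helper names are hypothetical, and the version string is passed in by the caller (for example `transformers.__version__`) so that the sketch itself does not need transformers installed.

```python
import json
import os
import re

MIN_TRANSFORMERS = (4, 34, 1)  # minimum version stated above


def version_tuple(version: str) -> tuple:
    """Turn '4.34.1' into (4, 34, 1); suffixes such as 'dev0' or 'rc1' are ignored."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def transformers_is_new_enough(version: str) -> bool:
    """True when the given version string is 4.34.1 or later."""
    return version_tuple(version) >= MIN_TRANSFORMERS


def has_chat_template(weight_dir: str) -> bool:
    """True when tokenizer_config.json in weight_dir has a non-empty chat_template."""
    config_path = os.path.join(weight_dir, "tokenizer_config.json")
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    return bool(config.get("chat_template"))
```

In a real environment, `transformers_is_new_enough(transformers.__version__)` and `has_chat_template(<weight path>)` would both be expected to return True.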

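Interface function

Provides text and streaming inference.

Interface format

Operation type: POST

URL: https://{ip}:{port}/v1/chat/completions

For {ip} and {port}, use the service-plane IP address and port number, that is, “ipAddress” and “port”.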
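
As an illustration of the interface format, the sketch below builds the POST request with the Python standard library. The IP address, port, and model name are placeholders, the helper name is hypothetical, and the `Content-Type` header is an assumption that the source does not state. Sending the request is left commented out because it needs a running server and a TLS context suited to the deployment.

```python
import json
import urllib.request


def build_chat_request(ip, port, model, messages, **params):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    url = f"https://{ip}:{port}/v1/chat/completions"
    body = {"model": model, "messages": messages}
    body.update(params)  # optional fields such as stream, max_tokens, top_p
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


# Placeholder address and model name, for illustration only.
req = build_chat_request(
    "127.0.0.1", 1025, "llama_65b",
    [{"role": "user", "content": "Hello"}],
    stream=False, max_tokens=20,
)

# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```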

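Request parameters

Parameter	Required	Description	Value requirements
model	Required	Model name.	Must match the value of modelName in the MindIE Server configuration file.
messages	Required	Structure of the inference request messages.	list type. 0 KB < number of characters contained in messages <= 4 MB; Chinese and English are supported. The number of tokens after tokenization must be less than or equal to the minimum of maxInputTokenLen, maxSeqLen-1, max_position_embeddings, and 1 MB (max_position_embeddings is read from config.json of the model weights; the other parameters are read from the configuration file).
role	Required	Role of the inference request message.	String. Allowed roles: system: system role. user: user role. assistant: assistant role. tool: tool role; tool_call_id must be passed when this role is used.
content	Required	Text of the inference request.	String. content may be omitted only when role is assistant and tool_calls is non-empty; in all other cases it must be non-empty.
tool_calls	Optional	Tool calls generated by the model.	List[dict] type. When role is assistant, it represents the model's calls to tools.
tool_calls.function	Required	The tool called by the model.	dict type. arguments: required, a JSON-formatted string holding the arguments of the function call. name: required, a string holding the name of the called function.
tool_calls.id	Required	ID of a particular tool call made by the model.	String.
tool_calls.type	Required	Type of the called tool.	String; only "function" is supported.
tool_call_id	Required when role is tool; otherwise optional	ID of the model tool call that this message is associated with.	String.
stream	Optional	Specifies whether the result is returned as text inference or streaming inference.	bool; may be null; default false. true: streaming inference. false: text inference.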
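presence_penalty	Optional	Presence penalty, between -2.0 and 2.0. It controls how the model penalizes new tokens based on whether they have already appeared in the text so far. Positive values penalize tokens that have already been used, which increases the likelihood that the model moves on to new topics. Changing this value together with repetition_penalty or frequency_penalty is not recommended.	float; range [-2.0, 2.0]; default 0.0; may be null.
frequency_penalty	Optional	Frequency penalty, between -2.0 and 2.0. It controls how the model penalizes new tokens based on how often they already occur in the text. Positive values penalize frequently used tokens, which reduces the likelihood that the model repeats the same words within a line. Changing this value together with repetition_penalty or presence_penalty is not recommended.	float; range [-2.0, 2.0]; default 0.0; may be null.
repetition_penalty	Optional	Repetition penalty, used to reduce the probability of repeated fragments during text generation. It penalizes text that has already been generated so that the model prefers new, non-repeated content. Changing this value together with presence_penalty or frequency_penalty is not recommended.	float; range (0.0, 2.0]; default 1.0; may be null.
temperature	Optional	Controls the randomness of generation; higher values produce more diverse output. Changing this value together with top_p is not recommended.	float; must be greater than or equal to 0.0; default 1.0; may be null. The larger the value, the more random the result. Values of 0.001 or greater are recommended; values below 0.001 may lead to poor text quality.
top_p	Optional	Controls the range of tokens the model considers during generation: candidate tokens are selected by cumulative probability until the cumulative probability exceeds the given threshold. This also controls the diversity of the generated result. Changing this value together with temperature is not recommended.	float; range (0.0, 1.0]; default 1.0; may be null.
top_k	Optional	Controls the range of tokens the model considers during generation: only the k most probable candidate tokens are considered.	int32; -1 or a value in (0, 2147483647]; may be null. When the field is not set, the default is determined by the backend model (see the notes that follow). A value greater than or equal to vocabSize defaults to vocabSize. If -1 is passed, it is converted to 0 before being passed to the MindIE LLM backend, which treats it as the vocabulary size vocabSize. vocabSize is the value of vocab_size or padded_vocab_size read from config.json under modelWeightPath. Adding vocab_size or padded_vocab_size to config.json is recommended; otherwise inference may fail.
seed	Optional	Random seed for the inference process. The same seed makes inference results reproducible; different seeds make the results more random.	uint64; range [0, 18446744073709551615]; may be null. If the parameter is not passed, the system generates a random seed. A seed close to the maximum value triggers a WARNING, which does not affect use; to remove the WARNING, use a smaller seed.
stop	Optional	Text that stops inference. By default, the output does not include the text in the stop list.	List[string] or string; default null. List[string]: each element is at least 1 character long, and the total length of all elements does not exceed 32768 (32*1024); an empty list is equivalent to null. string: length from 1 to 32768.
stop_token_ids	Optional	List of token IDs that stop inference. The output does not include the token IDs in this list.	List[int32]; default null. If the field is non-null, the list must not contain null elements; elements outside the int32 range are ignored.
include_stop_str_in_output	Optional	Determines whether the generated inference text includes the stop string.	bool; may be null; default false. Not yet supported in the PD scenario. true: include the stop string. false: do not include the stop string. This field is ignored when stop or stop_token_ids is not passed.
skip_special_tokens	Optional	Specifies whether special tokens are skipped in the generated inference text.	bool; may be null; default true. true: skip special tokens. false: keep special tokens.
ignore_eos	Optional	Specifies whether the eos_token end marker is ignored during text generation.	bool; may be null; default false. true: ignore the eos_token end marker. false: do not ignore the eos_token end marker.
max_tokens	Optional	Maximum number of tokens that inference may generate. The actual number is also limited by the maxIterTimes parameter in the configuration file: the number of generated tokens is less than or equal to Min(maxIterTimes, max_tokens).	Integer; range (0, 2147483647]; may be null; defaults to maxIterTimes in the MindIE Server configuration file.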
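tools	Optional	List of tools that may be used.	List[dict]; default null.
tools.type	Required	Tool type.	Only the string "function" is supported.
tools.function	Required	Function description.	dict type.
function.name	Required	Function name.	String.
function.strict	Optional	Whether generated tool calls strictly follow the schema format.	bool; default false.
function.description	Optional	Describes what the function does and how to use it.	String.
function.parameters	Optional	Parameters accepted by the function.	JSON schema format.
parameters.type	Required	Type of the function parameter object.	String; only object is supported.
parameters.properties	Required	Properties of the function parameters. Each key is a user-defined parameter name; each value is a dict that describes the parameter and contains type and description.	dict type.
function.required	Required	List of required function parameters.	List[string].
function.additionalProperties	Optional	Whether extra parameters that are not listed may be used.	bool; default false. true: extra unlisted parameters are allowed. false: extra unlisted parameters are not allowed.
tool_choice	Optional	Controls how the model calls tools.	string or dict; may be null; default "auto". "none": the model does not call any tool and generates a message instead. "auto": the model may generate a message or call one or more tools. "required": the model must call one or more tools. Specifying a particular tool with {"type": "function", "function": {"name": "my_function"}} forces the model to call that tool.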
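
The numeric ranges in the table above can be checked on the client before a request is sent. The following is a minimal sketch under that reading of the table; the function names are hypothetical, type checks are omitted, and the server remains the authority on what it accepts. A field that is absent or null is treated as valid, since the table allows null for these parameters.

```python
INT32_MAX = 2147483647
UINT64_MAX = 18446744073709551615


def validate_sampling_params(params: dict) -> list:
    """Return a list of range violations; an empty list means all checks passed."""
    errors = []

    def check(name, is_valid, rule):
        value = params.get(name)
        # Absent or null values are allowed by the table, so they are skipped.
        if value is not None and not is_valid(value):
            errors.append(f"{name}={value!r} violates: {rule}")

    check("presence_penalty", lambda v: -2.0 <= v <= 2.0, "[-2.0, 2.0]")
    check("frequency_penalty", lambda v: -2.0 <= v <= 2.0, "[-2.0, 2.0]")
    check("repetition_penalty", lambda v: 0.0 < v <= 2.0, "(0.0, 2.0]")
    check("temperature", lambda v: v >= 0.0, ">= 0.0")
    check("top_p", lambda v: 0.0 < v <= 1.0, "(0.0, 1.0]")
    check("top_k", lambda v: v == -1 or 0 < v <= INT32_MAX, "-1 or (0, 2147483647]")
    check("seed", lambda v: 0 <= v <= UINT64_MAX, "[0, 18446744073709551615]")
    check("max_tokens", lambda v: 0 < v <= INT32_MAX, "(0, 2147483647]")
    return errors


def effective_max_tokens(max_tokens, max_iter_times):
    """Upper bound on generated tokens: Min(maxIterTimes, max_tokens).

    A null max_tokens falls back to maxIterTimes, as the table describes.
    """
    return max_iter_times if max_tokens is None else min(max_iter_times, max_tokens)
```

For example, with maxIterTimes set to 512 in the configuration file, a request with max_tokens 20 is bounded by 20 generated tokens, while a request without max_tokens is bounded by 512.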

Usage example

Request sample:

POST https://{ip}:{port}/v1/chat/completions

Request body:

Single-turn conversation:

{
    "model": "gpt-3.5-turbo",
    "messages": [{
        "role": "user",
        "content": "You are a helpful assistant."
    }],
    "stream": false,
    "presence_penalty": 1.03,
    "frequency_penalty": 1.0,
    "repetition_penalty": 1.0,
    "temperature": 0.5,
    "top_p": 0.95,
    "top_k": -1,
    "seed": null,
    "stop": ["stop1", "stop2"],
    "stop_token_ids": [2, 13],
    "include_stop_str_in_output": false,
    "skip_special_tokens": true,
    "ignore_eos": false,
    "max_tokens": 20
}

Multi-turn conversation:

Request sample 1:

{
    "model": "gpt-3.5-turbo",
    "messages": [{
        "role": "system",
        "content": "You are a student who is good at math."
        },
        {
        "role": "user",
        "content": "Hi, can you tell me the delivery date for my order? my order id is 12345."
        }
    ],
    "stream": false,
    "presence_penalty": 1.03,
    "frequency_penalty": 1.0,
    "repetition_penalty": 1.0,
    "temperature": 0.5,
    "top_p": 0.95,
    "top_k": -1,
    "seed": null,
    "stop": ["stop1", "stop2"],
    "stop_token_ids": [2, 13],
    "ignore_eos": false,
    "max_tokens": 1024,
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_delivery_date",
                "strict": true,
                "description": "Get the delivery date for a customer's order. Call this whenever you need to know the delivery date, for example when a customer asks 'Where is my package'",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "order_id": {
                            "type": "string",
                            "description": "The customer's order ID."
                        }
                    },
                    "required": [
                        "order_id"
                    ],
                    "additionalProperties": false
                }
            }
        }
    ],
    "tool_choice": "auto"
}

Request sample 2:

{
    "model": "llama_65b",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful customer support assistant. Use the supplied tools to assist the user."
        },
        {
            "role": "user",
            "content": "Hi, can you tell me the delivery date for my order? my order id is 12345."
        },
        {
            "role": "assistant",
            "tool_calls": [
                {
                    "function": {
                        "arguments": "{\"order_id\": \"12345\"}",
                        "name": "get_delivery_date"
                    },
                    "id": "tool_call_8p2Nk",
                    "type": "function"
                }
            ]
        },
        {
            "role": "tool",
            "content": "the delivery date is 2024.09.10.",
            "tool_call_id": "tool_call_8p2Nk"
        }
    ],
    "stream": false,
    "repetition_penalty": 1.1,
    "temperature": 0.9,
    "top_p": 1,
    "max_tokens": 1024,
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_delivery_date",
                "strict": true,
                "description": "Get the delivery date for a customer's order. Call this whenever you need to know the delivery date, for example when a customer asks 'Where is my package'",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "order_id": {
                            "type": "string",
                            "description": "The customer's order ID."
                        }
                    },
                    "required": [
                        "order_id"
                    ],
                    "additionalProperties": false
                }
            }
        }
    ],
    "tool_choice": "auto"
}
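
The role rules from the request parameter table (a tool message needs tool_call_id, and content may be omitted only for an assistant message that carries tool_calls) can be checked on the client side. The sketch below is an illustration under those rules, with a hypothetical function name; it does not reproduce the server's own validation.

```python
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def validate_messages(messages: list) -> list:
    """Return a list of problems found in messages; empty means the checks passed."""
    errors = []
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            errors.append(f"messages[{i}]: unsupported role {role!r}")
            continue
        tool_calls = msg.get("tool_calls") or []
        # content may be omitted only for an assistant message with tool_calls.
        if not msg.get("content") and not (role == "assistant" and tool_calls):
            errors.append(f"messages[{i}]: content must be non-empty")
        if role == "tool" and not msg.get("tool_call_id"):
            errors.append(f"messages[{i}]: role 'tool' requires tool_call_id")
        for j, call in enumerate(tool_calls):
            function = call.get("function") or {}
            if not call.get("id"):
                errors.append(f"messages[{i}].tool_calls[{j}]: id is required")
            if call.get("type") != "function":
                errors.append(f"messages[{i}].tool_calls[{j}]: type must be 'function'")
            if not function.get("name") or not isinstance(function.get("arguments"), str):
                errors.append(
                    f"messages[{i}].tool_calls[{j}]: function needs name and string arguments"
                )
    return errors
```

Applied to the messages list of request sample 2, this check returns an empty list.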

Response sample:

Text inference (“stream”=false):

Single-turn conversation:

{
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "created": 1677652288,
    "model": "gpt-3.5-turbo-0613",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "\n\nHello there, how may I assist you today?"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 9,
        "completion_tokens": 12,
        "total_tokens": 21
    }
}

Multi-turn conversation:

Response to request sample 1:

{
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "created": 1677652288,
    "model": "gpt-3.5-turbo-0613",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "",
                "tool_calls": [
                    {
                        "function": {
                            "arguments": "{\"order_id\": \"12345\"}",
                            "name": "get_delivery_date"
                        },
                        "id": "call_JwmTNF3O",
                        "type": "function"
                    }
                ]
            },
            "finish_reason": "tool_calls"
        }
    ],
    "usage": {
        "prompt_tokens": 226,
        "completion_tokens": 122,
        "total_tokens": 348
    }
}
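
When finish_reason is "tool_calls", the client is expected to run the requested tool and send the result back, which is what the messages in request sample 2 contain. The sketch below shows one way to do that. The local `get_delivery_date` implementation and the `TOOL_REGISTRY` mapping are hypothetical stand-ins for real business logic.

```python
import json


def get_delivery_date(order_id: str) -> str:
    """Hypothetical local tool; a real one would query an order system."""
    return f"the delivery date for order {order_id} is 2024.09.10."


# Maps tool names declared in "tools" to local callables (illustrative only).
TOOL_REGISTRY = {"get_delivery_date": get_delivery_date}


def run_tool_calls(response: dict) -> list:
    """Return the messages to append to the conversation after a tool call.

    The result holds the assistant turn with its tool_calls, followed by one
    "tool" message per call, each linked through tool_call_id.
    """
    choice = response["choices"][0]
    if choice["finish_reason"] != "tool_calls":
        return []
    tool_calls = choice["message"]["tool_calls"]
    follow_up = [{"role": "assistant", "tool_calls": tool_calls}]
    for call in tool_calls:
        func = TOOL_REGISTRY[call["function"]["name"]]
        kwargs = json.loads(call["function"]["arguments"])  # arguments is a JSON string
        follow_up.append({
            "role": "tool",
            "content": func(**kwargs),
            "tool_call_id": call["id"],
        })
    return follow_up
```

The returned messages are appended to the earlier system and user messages to form the next request.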

Response to request sample 2:

{
    "id": "endpoint_common_25",
    "object": "chat.completion",
    "created": 1728959154,
    "model": "llama_65b",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "\n Your order with ID 12345 is scheduled for delivery on September 10th, 2024.",
                "tool_calls": null
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 265,
        "completion_tokens": 30,
        "total_tokens": 295
    }
}

Streaming inference:

Streaming inference 1 (“stream”=true, returned in SSE format):

data: {"id":"endpoint_common_7","object":"chat.completion.chunk","created":1729614349,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_7","object":"chat.completion.chunk","created":1729614349,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_7","object":"chat.completion.chunk","created":1729614349,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_7","object":"chat.completion.chunk","created":1729614349,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_7","object":"chat.completion.chunk","created":1729614349,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_7","object":"chat.completion.chunk","created":1729614349,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_7","object":"chat.completion.chunk","created":1729614349,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_7","object":"chat.completion.chunk","created":1729614349,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_7","object":"chat.completion.chunk","created":1729614349,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_7","object":"chat.completion.chunk","created":1729614349,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_7","object":"chat.completion.chunk","created":1729614349,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_7","object":"chat.completion.chunk","created":1729614349,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_7","object":"chat.completion.chunk","created":1729614349,"model":"llama_65b","usage":{"prompt_tokens":54,"completion_tokens":13,"total_tokens":67},"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":"stop"}]}

data: [DONE]
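
A stream in this format can be consumed line by line. The sketch below parses lines that have already been read from the response, so it makes no network call. It assumes each delta.content carries only the newly generated text, that usage appears on the final chunk as in the sample, and that the stream ends with `data: [DONE]`; the function names are hypothetical.

```python
import json


def iter_sse_chunks(lines):
    """Yield the decoded JSON object of each 'data:' line until [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip the blank separator lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)


def collect_stream(lines):
    """Concatenate delta.content and return (text, finish_reason, usage)."""
    pieces, finish_reason, usage = [], None, None
    for chunk in iter_sse_chunks(lines):
        choice = chunk["choices"][0]
        pieces.append(choice["delta"].get("content") or "")
        finish_reason = choice["finish_reason"] or finish_reason
        usage = chunk.get("usage", usage)
    return "".join(pieces), finish_reason, usage
```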

Streaming inference 2 (“stream”=true, configuration item “fullTextEnabled”=true, returned in SSE format):

data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"},"finish_reason":null}]}

data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello!"},"finish_reason":null}]}

data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How"},"finish_reason":null}]}

data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can"},"finish_reason":null}]}

data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I"},"finish_reason":null}]}

data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I assist"},"finish_reason":null}]}

data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I assist you"},"finish_reason":null}]}

data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I assist you today"},"finish_reason":null}]}

data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I assist you today?"},"finish_reason":null}]}

data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"llama_65b","full_text":"Hello! How can I assist you today?","usage":{"prompt_tokens":31,"completion_tokens":10,"total_tokens":41},"choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I assist you today?"},"finish_reason":"length"}]}

data: [DONE]
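
In the sample above, each delta.content repeats the full text generated so far, so a client that wants to display only the new part has to take the difference between consecutive chunks. The sketch below does this under the assumption that every content value extends the previous one; the function names are hypothetical.

```python
import json


def _chunks(lines):
    """Yield the decoded JSON object of each 'data:' line until [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)


def iter_new_text(lines):
    """Yield only the text added by each chunk of a cumulative stream."""
    previous = ""
    for chunk in _chunks(lines):
        content = chunk["choices"][0]["delta"].get("content") or ""
        # Fall back to the whole content if it does not extend the previous one.
        yield content[len(previous):] if content.startswith(previous) else content
        previous = content


def final_text(lines):
    """Return full_text when a chunk carries it, else the latest cumulative content."""
    latest = ""
    for chunk in _chunks(lines):
        latest = chunk.get("full_text") or chunk["choices"][0]["delta"].get("content") or latest
    return latest
```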

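Output description

Table 1 Description of text inference results
Parameter	Type	Description
id	string	Request ID.
object	string	Type of the returned result; currently always "chat.completion".
created	integer	Timestamp of the inference request, accurate to the second.
model	string	Inference model used.
choices	list	List of inference results.
index	integer	Index of the choices message; currently it can only be 0.
message	object	Inference message.
role	string	Role; currently always "assistant".
content	string	Inference text result.
tool_calls	list	Tool call output of the model.
function	dict	Description of the function call.
arguments	string	Arguments of the called function, as a JSON-formatted string.
name	string	Name of the called function.
tool_calls.id	string	ID of the model's tool call.
type	string	Tool type; currently only function is supported.
finish_reason	string	Reason the inference ended. stop: the request was actively cancelled or stopped (the user is unaware and the response is discarded); an error occurred while executing the request (the response output is empty and err_msg is non-empty); input validation of the request failed (the response output is empty and err_msg is non-empty); or the request ended normally on the eos end marker. length: the request ended because the maximum sequence length was reached, or because the maximum output length (at request or model granularity) was reached; in both cases the response is the output of the last iteration. tool_calls: the model called a tool.
usage	object	Statistics of the inference result.
prompt_tokens	int	Number of tokens in the prompt text entered by the user.
completion_tokens	int	Number of generated tokens. In the PD scenario, it is the total number of tokens generated by P and D. When the inference length limit of a request takes the value of maxIterTimes, completion_tokens in the D node response is maxIterTimes+1, that is, it also counts the first token produced by P.
total_tokens	int	Total number of request and inference tokens.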

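Table 2 Description of streaming inference results
Parameter	Type	Description
data	object	Result returned by one inference step.
id	string	Request ID.
object	string	Currently always "chat.completion.chunk".
created	integer	Timestamp of the inference request, accurate to the second.
model	string	Inference model used.
full_text	string	Full text result; returned only when the configuration item “fullTextEnabled”=true.
usage	object	Statistics of the inference result.
prompt_tokens	int	Number of tokens in the prompt text entered by the user.
completion_tokens	int	Number of generated tokens. In the PD scenario, it is the total number of tokens generated by P and D. When the inference length limit of a request takes the value of maxIterTimes, completion_tokens in the D node response is maxIterTimes+1, that is, it also counts the first token produced by P.
total_tokens	int	Total number of request and inference tokens.
choices	list	Streaming inference result.
finish_reason	string	Reason the inference ended; returned only with the last inference result. stop: the request was actively cancelled or stopped (the user is unaware and the response is discarded); an error occurred while executing the request (the response output is empty and err_msg is non-empty); input validation of the request failed (the response output is empty and err_msg is non-empty); or the request ended normally on the eos end marker. length: the request ended because the maximum sequence length was reached, or because the maximum output length (at request or model granularity) was reached; in both cases the response is the output of the last iteration.
index	integer	Index of the choices message; currently it can only be 0.
delta	object	Inference result returned.
role	string	Role; currently always "assistant".
content	string	Inference text result.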
