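Inference API

- The transformers version in the runtime environment must be 4.34.1 or later. Tokenizers in earlier versions do not support the "chat_template" method.
- The tokenizer_config.json file in the inference model weight path must contain the "chat_template" field and its implementation.
- The function call parameters tool_call_id, tool_calls, tools, and tool_choice are currently supported only by some models; using them with other models may cause errors. Currently the only supported model is ChatGLM3-6B.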
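
The first two constraints can be checked before the service is started. The following is a minimal sketch, assuming the caller supplies the weight directory; the helper names `version_at_least` and `has_chat_template` are hypothetical and are not part of MindIE or transformers.

```python
import json
import re
from pathlib import Path

MIN_TRANSFORMERS = "4.34.1"  # minimum version stated in the constraints above


def _parse(version: str) -> tuple:
    """Turn '4.40.0.dev0' into (4, 40, 0); non-numeric suffixes are ignored."""
    numbers = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def version_at_least(installed: str, required: str = MIN_TRANSFORMERS) -> bool:
    """Return True if the installed version is not lower than the required one."""
    return _parse(installed) >= _parse(required)


def has_chat_template(weight_dir) -> bool:
    """Return True if tokenizer_config.json exists and has a non-empty chat_template."""
    config_path = Path(weight_dir) / "tokenizer_config.json"
    if not config_path.is_file():
        return False
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return bool(config.get("chat_template"))
```

In a real environment the installed version would typically come from `transformers.__version__`, for example `version_at_least(transformers.__version__)`.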
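API function
Provides text inference and streaming inference.
API format
Operation type: POST
URL: https://{ip}:{port}/v1/chat/completions

For {ip} and {port}, use the service-plane IP address and port number, that is, "ipAddress" and "port".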
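
As an illustration of the format above, the sketch below builds (but does not send) a POST request with the Python standard library. The address `127.0.0.1:1025`, the model name, and the helper `build_chat_request` are placeholders chosen for this example; TLS certificate handling depends on the deployment and is not shown.

```python
import json
import urllib.request


def build_chat_request(ip: str, port: int, model: str, messages: list, **params):
    """Build a POST request for /v1/chat/completions without sending it."""
    url = f"https://{ip}:{port}/v1/chat/completions"
    body = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "127.0.0.1", 1025,  # placeholder address; use your ipAddress and port
    model="llama_65b",  # placeholder; must match modelName on the server
    messages=[{"role": "user", "content": "Hello"}],
    stream=False,
)
# Sending is left to the caller, for example:
# with urllib.request.urlopen(request, context=ssl_context) as resp:
#     reply = json.load(resp)
```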
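Request parameters
Parameter | Required | Description | Value requirements |
---|---|---|---|
model | Required | Model name. | Must match the value of modelName in the MindIE Server configuration file. |
messages | Required | Message structure of the inference request. | List. 0 KB < number of characters in messages <= 4 MB; Chinese and English are supported. The number of tokens after tokenization must be less than or equal to the minimum of maxInputTokenLen, maxSeqLen-1, max_position_embeddings, and 1 MB. max_position_embeddings is read from config.json in the weight files; the other parameters are read from the configuration file. |
role | Required | Role of the inference request message. | String. The samples in this document use system, user, assistant, and tool. |
content | Required | Text of the inference request. | String. |
tool_calls | Optional | Tool calls generated by the model. | List[dict]. When role is assistant, this field represents the model's calls to tools. |
tool_calls.function | Required | The tool called by the model. | dict. |
tool_calls.id | Required | ID of a specific tool call made by the model. | String. |
tool_calls.type | Required | Type of the tool being called. | String; only "function" is supported. |
tool_call_id | Required when role is tool; otherwise optional | ID of the model tool call that this message is associated with. | String. |
stream | Optional | Specifies whether the result is returned as text inference or streaming inference. | bool. Default: false. |
presence_penalty | Optional | Presence penalty, between -2.0 and 2.0. It penalizes new tokens based on whether they have already appeared in the text so far. Positive values penalize words that have already been used, making the model more likely to talk about new topics. | float. Range: [-2.0, 2.0]. Default: 0.0. |
frequency_penalty | Optional | Frequency penalty, between -2.0 and 2.0. It penalizes new tokens based on how often they already occur in the text. Positive values penalize frequently used words, making the model less likely to repeat words within a line. | float. Range: [-2.0, 2.0]. Default: 0.0. |
repetition_penalty | Optional | Repetition penalty, which reduces the probability of repeated segments during text generation. It penalizes text that has already been generated, so the model favors new, non-repeated content. | float. Range: (0.0, 2.0]. Default: 1.0. |
temperature | Optional | Controls the randomness of generation. Higher values produce more diverse output. | float. Range: [0.0, 2.0]. Default: 1.0. The larger the value, the more random the result. Values greater than or equal to 0.001 are recommended; values below 0.001 may result in poor text quality. |
top_p | Optional | Controls the range of candidate tokens considered during generation: candidates are selected by cumulative probability until the cumulative probability exceeds the given threshold. This parameter also controls the diversity of the generated results. | float. Range: (0.0, 1.0]. Default: 1.0. |
top_k | Optional | Controls the range of candidate tokens considered during generation: only the k candidates with the highest probability are considered. | int32. Range: [0, 2147483647]. When the field is not set, the default value is determined by the backend model; for details, see the note. When the value is greater than or equal to vocabSize, vocabSize is used. vocabSize is the value of vocab_size or padded_vocab_size read from config.json under modelWeightPath. Adding vocab_size or padded_vocab_size to config.json is recommended; otherwise inference may fail. |
seed | Optional | Random seed for the inference process. The same seed value makes inference results reproducible; different seed values increase the randomness of the results. | uint_64. Range: [0, 18446744073709551615]. If this parameter is not passed, the system generates a random seed. When seed is close to the maximum value, a WARNING is reported but use is not affected. To remove the WARNING, use a smaller seed value. |
stop | Optional | Text that stops inference. By default, the output does not include the text in the stop list. | List[string] or string. Default: null. |
stop_token_ids | Optional | List of token IDs that stop inference. By default, the output does not include the token IDs in this list. | List[int32]. Elements outside the int32 range are ignored. Default: null. |
include_stop_str_in_output | Optional | Specifies whether the generated inference text includes the stop string. | bool. Default: false. This parameter is not yet supported in PD scenarios. It is ignored when neither stop nor stop_token_ids is passed. |
skip_special_tokens | Optional | Specifies whether special tokens are skipped in the generated inference text. | bool. Default: true. |
ignore_eos | Optional | Specifies whether the eos_token end marker is ignored during text generation. | bool. Default: false. |
max_tokens | Optional | Maximum number of tokens that inference may generate. The actual number of generated tokens is also limited by the maxIterTimes parameter in the configuration file: it is less than or equal to Min(maxIterTimes, max_tokens). | int. Range: (0, 2147483647]. Default: maxIterTimes. |
tools | Optional | List of tools that may be used. | List[dict]. |
tools.type | Required | Tool type. | Only the string "function" is supported. |
tools.function | Required | Function description. | dict. |
function.name | Required | Function name. | String. |
function.strict | Optional | Specifies whether generated tool calls strictly follow the schema format. | bool. Default: false. |
function.description | Optional | Describes what the function does and how to use it. | String. |
function.parameters | Optional | Parameters accepted by the function. | JSON schema format. |
parameters.type | Required | Type of the function parameter properties. | String; only object is supported. |
parameters.properties | Required | Properties of the function parameters. Each key is a user-defined parameter name. Each value is a dict that describes the parameter and contains type and description. | dict. |
function.required | Required | List of required function parameters. | List[string]. |
function.additionalProperties | Optional | Specifies whether additional parameters that are not mentioned are allowed. | bool. Default: false. |
tool_choice | Optional | Controls how the model calls tools. | string or dict; may be null. Default: "auto". Passing {"type": "function", "function": {"name": "my_function"}} specifies a particular tool and forces the model to call it. |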
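
The numeric ranges in the table can be checked on the client before a request is sent. The sketch below is one possible way to do so and is not the validation that the server itself performs; the names `RANGES`, `validate_params`, and `effective_max_tokens` are hypothetical. Each range is encoded as (lower bound, upper bound, whether the lower bound is inclusive); every upper bound in the table is inclusive.

```python
# (low, high, low_inclusive) for each numeric request parameter in the table.
RANGES = {
    "presence_penalty": (-2.0, 2.0, True),
    "frequency_penalty": (-2.0, 2.0, True),
    "repetition_penalty": (0.0, 2.0, False),
    "temperature": (0.0, 2.0, True),
    "top_p": (0.0, 1.0, False),
    "top_k": (0, 2147483647, True),
    "max_tokens": (0, 2147483647, False),
    "seed": (0, 18446744073709551615, True),
}


def validate_params(params: dict) -> list:
    """Return a list of human-readable range violations (empty if none)."""
    errors = []
    for name, (low, high, low_inclusive) in RANGES.items():
        value = params.get(name)
        if value is None:  # optional parameter not set; the default applies
            continue
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{name}: expected a number")
            continue
        above_low = value >= low if low_inclusive else value > low
        if not (above_low and value <= high):
            bracket = "[" if low_inclusive else "("
            errors.append(f"{name}={value} is outside {bracket}{low}, {high}]")
    return errors


def effective_max_tokens(max_tokens, max_iter_times: int) -> int:
    """Upper bound on generated tokens: Min(maxIterTimes, max_tokens)."""
    if max_tokens is None:  # max_tokens defaults to maxIterTimes
        return max_iter_times
    return min(max_iter_times, max_tokens)
```

For example, `validate_params({"top_p": 0.0})` reports a violation because the lower bound of top_p is exclusive, and `effective_max_tokens(20, 512)` gives 20.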
Usage samples
Request samples:
POST https://{ip}:{port}/v1/chat/completions
- Single-turn conversation:
{ "model": "gpt-3.5-turbo", "messages": [{ "role": "user", "content": "You are a helpful assistant." }], "stream": false, "presence_penalty": 1.03, "frequency_penalty": 1.0, "repetition_penalty": 1.0, "temperature": 0.5, "top_p": 0.95, "top_k": 0, "seed": null, "stop": ["stop1", "stop2"], "stop_token_ids": [2, 13], "include_stop_str_in_output": false, "skip_special_tokens": true, "ignore_eos": false, "max_tokens": 20 }
- Multi-turn conversation:
- Request sample 1:
{ "model": "chatglm3-6b", "messages": [{ "role": "system", "content": "You are a helpful customer support assistant. Use the supplied tools to assist the user." }, { "role": "user", "content": "Hi, can you tell me the delivery date for my order? my order id is 12345." } ], "stream": false, "presence_penalty": 1.03, "frequency_penalty": 1.0, "repetition_penalty": 1.0, "temperature": 0.5, "top_p": 0.95, "top_k": 0, "seed": null, "stop": ["stop1", "stop2"], "stop_token_ids": [2], "ignore_eos": false, "max_tokens": 1024, "tools": [ { "type": "function", "function": { "name": "get_delivery_date", "strict": true, "description": "Get the delivery date for a customer's order. Call this whenever you need to know the delivery date, for example when a customer asks 'Where is my package'", "parameters": { "type": "object", "properties": { "order_id": { "type": "string", "description": "The customer's order ID." } }, "required": [ "order_id" ], "additionalProperties": false } } } ], "tool_choice": "auto" }
- Request sample 2:
{ "model": "chatglm3-6b", "messages": [ { "role": "system", "content": "You are a helpful customer support assistant. Use the supplied tools to assist the user." }, { "role": "user", "content": "Hi, can you tell me the delivery date for my order? my order id is 12345." }, { "role": "assistant", "tool_calls": [ { "function": { "arguments": "{\"order_id\": \"12345\"}", "name": "get_delivery_date" }, "id": "tool_call_8p2Nk", "type": "function" } ] }, { "role": "tool", "content": "the delivery date is 2024.09.10.", "tool_call_id": "tool_call_8p2Nk" } ], "stream": false, "repetition_penalty": 1.1, "temperature": 0.9, "top_p": 1, "max_tokens": 1024, "tools": [ { "type": "function", "function": { "name": "get_delivery_date", "strict": true, "description": "Get the delivery date for a customer's order. Call this whenever you need to know the delivery date, for example when a customer asks 'Where is my package'", "parameters": { "type": "object", "properties": { "order_id": { "type": "string", "description": "The customer's order ID." } }, "required": [ "order_id" ], "additionalProperties": false } } } ], "tool_choice": "auto" }
Response samples:
- Text inference ("stream"=false):
- Single-turn conversation:
{ "id": "chatcmpl-123", "object": "chat.completion", "created": 1677652288, "model": "gpt-3.5-turbo-0613", "choices": [ { "index": 0, "message": { "role": "assistant", "content": "\n\nHello there, how may I assist you today?" }, "finish_reason": "stop" } ], "usage": { "prompt_tokens": 9, "completion_tokens": 12, "total_tokens": 21 } }
- Multi-turn conversation:
- Response sample 1:
{ "id": "chatcmpl-123", "object": "chat.completion", "created": 1677652288, "model": "chatglm3-6b", "choices": [ { "index": 0, "message": { "role": "assistant", "content": "", "tool_calls": [ { "function": { "arguments": "{\"order_id\": \"12345\"}", "name": "get_delivery_date" }, "id": "call_JwmTNF3O", "type": "function" } ] }, "finish_reason": "tool_calls" } ], "usage": { "prompt_tokens": 226, "completion_tokens": 122, "total_tokens": 348 } }
- Response sample 2:
{ "id": "endpoint_common_25", "object": "chat.completion", "created": 1728959154, "model": "chatglm3-6b", "choices": [ { "index": 0, "message": { "role": "assistant", "content": "\n Your order with ID 12345 is scheduled for delivery on September 10th, 2024.", "tool_calls": null }, "finish_reason": "stop" } ], "usage": { "prompt_tokens": 265, "completion_tokens": 30, "total_tokens": 295 } }
- Streaming inference:
- Streaming inference 1 ("stream"=true, returned in SSE format):
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":"\t"},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":"\t"},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"llama_65b","usage":{"prompt_tokens":54,"completion_tokens":17,"total_tokens":71},"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":"stop"}]}
data: [DONE]
- Streaming inference 2 ("stream"=true, configuration item "fullTextEnabled"=true, returned in SSE format):
data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"},"finish_reason":null}]}
data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello!"},"finish_reason":null}]}
data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How"},"finish_reason":null}]}
data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can"},"finish_reason":null}]}
data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I"},"finish_reason":null}]}
data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I assist"},"finish_reason":null}]}
data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I assist you"},"finish_reason":null}]}
data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I assist you today"},"finish_reason":null}]}
data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"llama_65b","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I assist you today?"},"finish_reason":null}]}
data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"llama_65b","full_text":"Hello! How can I assist you today?","usage":{"prompt_tokens":31,"completion_tokens":10,"total_tokens":41},"choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I assist you today?"},"finish_reason":"length"}]}
data: [DONE]
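
A client can reassemble the reply from the SSE lines shown in the streaming samples. The sketch below is a minimal illustration; the helper names `iter_sse_chunks` and `collect_text` are hypothetical. It assumes, based on the two samples, that each delta carries only the newly generated piece by default, and carries the full text so far when "fullTextEnabled" is true.

```python
import json


def iter_sse_chunks(lines):
    """Yield decoded JSON chunks from 'data: ...' lines until 'data: [DONE]'."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)


def collect_text(lines, full_text_enabled: bool = False):
    """Return (text, finish_reason, usage) for one streamed reply."""
    pieces, finish_reason, usage = [], None, None
    for chunk in iter_sse_chunks(lines):
        choice = chunk["choices"][0]
        content = choice["delta"].get("content") or ""
        if full_text_enabled:
            pieces = [content]  # assumed cumulative: keep only the latest text
        else:
            pieces.append(content)  # assumed incremental: append the new piece
        finish_reason = choice.get("finish_reason") or finish_reason
        usage = chunk.get("usage") or usage
    return "".join(pieces), finish_reason, usage
```

The `lines` argument can be any iterable of decoded text lines, such as the lines read from an HTTP response body.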
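Output description
Text inference response fields:
Parameter | Type | Description |
---|---|---|
id | string | Request ID. |
object | string | Type of the returned result; currently always "chat.completion". |
created | integer | Timestamp of the inference request, accurate to the second. |
model | string | Inference model used. |
choices | list | List of inference results. |
index | integer | Index of the choices message; currently can only be 0. |
message | object | Inference message. |
role | string | Role; currently always "assistant". |
content | string | Inference text result. |
tool_calls | list | Tool call output of the model. |
function | dict | Function call description. |
arguments | string | Arguments of the called function, as a JSON-formatted string. |
name | string | Name of the called function. |
tool_calls.id | string | ID of the model's tool call. |
type | string | Tool type; currently only function is supported. |
finish_reason | string | Finish reason. |
usage | object | Statistics of the inference result. |
prompt_tokens | int | Token length of the prompt text entered by the user. |
completion_tokens | int | Number of tokens in the inference result. In PD scenarios, this counts the total tokens of the P and D inference results. When the inference length limit of a request takes the value of maxIterTimes, completion_tokens in the D node response is maxIterTimes+1, because the first token from the P inference result is added. |
total_tokens | int | Total number of tokens in the request and the inference result. |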
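
The tool_calls fields described above can be turned into the messages of a follow-up request, mirroring the multi-turn request samples. The sketch below is an illustration under stated assumptions: `get_delivery_date` is a hypothetical local tool that returns a fixed date, and `TOOLS` and `build_follow_up_messages` are names introduced only for this example.

```python
import json


def get_delivery_date(order_id: str) -> str:
    """Hypothetical local tool; a real one would query an order system."""
    return f"the delivery date for order {order_id} is 2024.09.10."


# Maps the function name returned by the model to a local callable.
TOOLS = {"get_delivery_date": get_delivery_date}


def build_follow_up_messages(messages: list, response: dict) -> list:
    """Append the assistant tool calls and their results to the history."""
    message = response["choices"][0]["message"]
    tool_calls = message.get("tool_calls") or []
    if not tool_calls:
        return messages  # plain text answer; nothing to execute
    follow_up = list(messages)
    follow_up.append({"role": "assistant", "tool_calls": tool_calls})
    for call in tool_calls:
        function = call["function"]
        arguments = json.loads(function["arguments"])  # JSON-formatted string
        result = TOOLS[function["name"]](**arguments)
        follow_up.append(
            {"role": "tool", "content": result, "tool_call_id": call["id"]}
        )
    return follow_up
```

The returned list can be sent as the messages of the next request, together with the same tools definition, so that the model can produce its final text answer.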
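
Streaming inference response fields:
Parameter | Type | Description |
---|---|---|
data | object | Result returned by one inference step. |
id | string | Request ID. |
object | string | Currently always "chat.completion.chunk". |
created | integer | Timestamp of the inference request, accurate to the second. |
model | string | Inference model used. |
full_text | string | Full text result; returned only when the configuration item "fullTextEnabled" is true. |
usage | object | Statistics of the inference result. |
prompt_tokens | int | Token length of the prompt text entered by the user. |
completion_tokens | int | Number of tokens in the inference result. In PD scenarios, this counts the total tokens of the P and D inference results. When the inference length limit of a request takes the value of maxIterTimes, completion_tokens in the D node response is maxIterTimes+1, because the first token from the P inference result is added. |
total_tokens | int | Total number of tokens in the request and the inference result. |
choices | list | Streaming inference results. |
index | integer | Index of the choices message; currently can only be 0. |
delta | object | Returned inference result; empty in the last response. |
role | string | Role; currently always "assistant". |
content | string | Inference text result. |
finish_reason | string | Finish reason; returned only with the last inference result. |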