fix: Fix deprecated max_tokens param in openai ChatCompletionRequest #3122
Conversation
Force-pushed from 403eb6c to eed4b36
Can you only update the
Many thanks for addressing #3098 so quickly, and FYI I've tested this branch and it does resolve the issue!
Force-pushed from eed4b36 to 91f7dba
Updated. Removed the frontend part.
Force-pushed from 91f7dba to b48b900
@mickqian I will take a look at this. Thanks!
```python
        # OpenAI does not support top_k, so we drop it here
        if self.regex is not None:
            warnings.warn("Regular expression is not supported in the OpenAI backend.")
        return {
            "max_tokens": self.max_new_tokens,
```
Wondering, what's the definition of `max_tokens` in a non-chat model? And why do we not keep `max_tokens` for non-chat models in the backend?
- The definition of `max_tokens` in completion models is the number of tokens that can be generated in the completion, pretty much the same as `max_completion_tokens`.
- Yes, we should keep both params. Updated.
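For context, a minimal illustration (not part of this PR) of how the two parameter names map onto OpenAI's two APIs; the model names are just examples and `OPENAI_API_KEY` is assumed to be set:

```python
from openai import OpenAI

client = OpenAI()

# Legacy completions API: only max_tokens exists.
client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt="Hello",
    max_tokens=64,
)

# Chat completions API: max_completion_tokens is the newer name for the same budget.
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
    max_completion_tokens=64,
)
```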
Hey. For an embedding model, `max_tokens` means the max sequence length that can be processed. What if the input exceeds that length? Should the sequence be truncated, or should an error be thrown? The same question applies to the chat model. Also, I personally think we should call them generation models and embedding models, since that's what we typically call these models.
- For embedding models, I think both ways will do, depending on the situation. Providing an option sounds good too.
- Yes, `completion model` and `chat completion model` both fall into the category of `generation model`; I thought you were referring to `completion model`. The `is_chat_model` flag is used to distinguish between different generation models, if I'm correct. For sglang.lang, does it involve embedding models (or did I miss something)? If not, `is_chat_model` would probably suffice for generation models in the backend.
Okay, this seems sound to me. But, in my ideal:

`def to_openai_kwargs(self, is_chat_model):`

You mean `is_chat_model` is an attribute of `class OpenAI(BaseBackend)` in python/sglang/lang/backend/openai.py rather than an attribute of `class SglSamplingParams` in python/sglang/lang/ir.py? So this function is no longer:

`def to_openai_kwargs(self):`

Right? That makes sense, but I would prefer the latter if we can.
Yes, it's supposed to be an internal property of a `BaseBackend` (as it directly describes the model), and it is passed to `SglSamplingParams` for generating an OpenAI-compatible request. While adding this field to `SglSamplingParams` does sound good in some cases, I personally reckon `SglSamplingParams` is meant to be model-unaware data that can be sent to different backends, letting the actual backend decide the final OpenAI request. Feel free to correct me.
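A rough sketch of the design being discussed here; the real `BaseBackend` and `SglSamplingParams` have more fields, and everything below is simplified and assumed rather than taken from the PR:

```python
class SglSamplingParams:
    """Model-unaware sampling parameters; the backend decides the final request."""

    def __init__(self, max_new_tokens: int = 128, temperature: float = 1.0):
        self.max_new_tokens = max_new_tokens
        self.temperature = temperature

    def to_openai_kwargs(self, is_chat_model: bool) -> dict:
        # The backend tells us which flavor of the OpenAI API it talks to,
        # so we can pick the matching token-limit parameter name.
        key = "max_completion_tokens" if is_chat_model else "max_tokens"
        return {key: self.max_new_tokens, "temperature": self.temperature}


class OpenAIBackend:
    def __init__(self, model_name: str, is_chat_model: bool):
        self.model_name = model_name
        self.is_chat_model = is_chat_model  # internal property of the backend

    def build_request(self, params: SglSamplingParams) -> dict:
        return {"model": self.model_name, **params.to_openai_kwargs(self.is_chat_model)}


# Example: a chat backend emits max_completion_tokens, a completion backend max_tokens.
print(OpenAIBackend("gpt-4o-mini", is_chat_model=True).build_request(SglSamplingParams(64)))
```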
I agree! Nice work.
Force-pushed from b48b900 to b80b526
- I think we should add a comment in `protocol.py` (or wherever appropriate) about the definition of `max_completion_tokens` and how it differs from the previous `max_tokens`.
- Could you run the docs CI locally? Just `make compile` is enough. The docs CI is currently disabled due to long queue times on CI, but we should still run it locally.
```
$ jupyter nbconvert --to notebook --execute --inplace ./backend/openai_api_completions.ipynb \
    --ExecutePreprocessor.timeout=600 \
    --ExecutePreprocessor.kernel_name=python3 || exit 1;
...
AttributeError                            Traceback (most recent call last)
Cell In[9], line 58
     52 batch_details = client.batches.retrieve(batch_id=batch_job.id)
     54 print_highlight(
     55     f"Batch job details (check {i+1} / {max_checks}) // ID: {batch_details.id} // Status: {batch_details.status} // Created at: {batch_details.created_at} // Input file ID: {batch_details.input_file_id} // Output file ID: {batch_details.output_file_id}"
     56 )
     57 print_highlight(
---> 58     f"<strong>Request counts: Total: {batch_details.request_counts.total} // Completed: {batch_details.request_counts.completed} // Failed: {batch_details.request_counts.failed}</strong>"
     59 )
     61 time.sleep(3)

AttributeError: 'NoneType' object has no attribute 'total'
```

Is this a known issue, or is the error on my side?
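If the root cause is that `request_counts` is still `None` while the batch is validating or in progress (an assumption, not confirmed in this thread), a defensive check would avoid the crash. This is a drop-in sketch for the failing part of the notebook cell, reusing the `client`, `batch_job`, and `print_highlight` already defined there:

```python
batch_details = client.batches.retrieve(batch_id=batch_job.id)
counts = batch_details.request_counts

# request_counts may be None before the batch has progressed far enough.
if counts is not None:
    print_highlight(
        f"Request counts: Total: {counts.total} // "
        f"Completed: {counts.completed} // Failed: {counts.failed}"
    )
else:
    print_highlight(f"Request counts not reported yet (status: {batch_details.status})")
```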
Force-pushed from b80b526 to 82c817b
Fixed, and
Force-pushed from 82c817b to e9aa94a
Will review it today. Stay tuned!
Good improvement. I am wondering, does OpenAI still have `max_tokens` in the chat API right now? For the chat completion API, there should only be `max_completion_tokens`, but I don't know what it is for the chat API.
```
@@ -295,7 +295,12 @@ class ChatCompletionRequest(BaseModel):
    logit_bias: Optional[Dict[str, float]] = None
    logprobs: bool = False
    top_logprobs: Optional[int] = None
    # The maximum number of tokens that can be generated in the chat completion.
    # non-chat-completion models only
```
- Make it a full sentence: "Non-chat-completion models only have max tokens."
- So chat completion models count `max_completion_tokens` and chat models (not completion models) count `max_tokens`, right?
- There's a nuanced difference between "non-chat-completion models only" and "Non-chat-completion models only have max tokens", I'm afraid. Changed to "Only available for non-chat-completion models", is that OK?
- Yes. To be more specific, in OpenAI's legacy completions API (non-chat-completion models only), they only have `max_tokens`. In their chat completions API, they have both params, but:

[screenshot omitted]
Replied in here.
Force-pushed from 10ce5fd to db732ae
@mickqian This should be rebased. Thanks 😂
Force-pushed from 473fa26 to 353c258
@shuaills Could you review this? Thanks!
Force-pushed from 692e269 to f132950
LGTM. @zhaochenyang20 Can you help rerun the CI?
@mickqian @shuaills Keep this CI here. Shuai, please approve at https://github.com/sgl-project/sglang/pull/3122/files. I will rerun the CI.
Force-pushed from 10c16b3 to 01dab60
@mickqian will help to merge after the CI.
Replace it with a newer one: max_completion_tokens
Force-pushed from 01dab60 to aeba666
@mickqian I've rerun the CI. Thanks so much. Do not rebase with main. We can merge this after the CI.
Is this PR going to be merged soon?
@mickqian Can you do this first? Use other parts for different PRs. Do not mix them together.
Motivation

Address #3098.

Modifications

Replace the deprecated `max_tokens` param with a newer one, `max_completion_tokens`, according to here. A usage sketch is below.
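For reference, a usage sketch against an sglang OpenAI-compatible endpoint once this lands; the base URL, port, and model name below are placeholders, not values taken from this PR:

```python
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_completion_tokens=32,  # replaces the deprecated max_tokens
)
print(response.choices[0].message.content)
```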
Checklist