[RL] support abort all and fix abort on waiting queue #6855
Conversation
Hello @zhuzilin, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini here, providing a summary of this pull request. This PR aims to enhance the request abortion functionality. Specifically, it introduces the ability to abort all pending and running requests by sending an empty request ID (`rid`) to the `/abort_request` endpoint. Additionally, it refines the handling of requests that are aborted while still in the waiting queue, ensuring they are properly marked as finished and return an empty result with an 'abort' finish reason, preventing unnecessary prefill operations.
Highlights
- Abort All Requests: Adds support for aborting all requests (both in the waiting queue and currently running/processing) when the `/abort_request` endpoint receives a request with an empty `rid`.
- Improved Waiting Queue Abort Handling: Modifies the internal handling of requests aborted while in the waiting queue. Instead of simply removing them, they are now explicitly marked as finished, and a specific response indicating an 'abort' finish reason is generated and returned, avoiding subsequent processing like prefill.
- API Protocol Update: Updates the OpenAI API protocol definitions to include 'abort' as a valid `finish_reason` for both completion and chat completion stream responses.
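As a hedged client-side illustration of the abort-all convention described above: the `/abort_request` endpoint and the empty-`rid` payload come from this PR's description, while the helper name and the example server URL are assumptions for illustration only.

```python
import json

# Sketch of a client triggering "abort all" (empty rid) versus aborting a
# single request. The payload shape follows this PR; the server URL below
# is hypothetical.
def build_abort_payload(rid: str = "") -> str:
    # Empty rid => abort every request in the waiting queue and running batch.
    return json.dumps({"rid": rid})

abort_all = build_abort_payload()
abort_one = build_abort_payload("request-42")
# e.g. requests.post("http://localhost:30000/abort_request", data=abort_all)
print(abort_all)  # {"rid": ""}
print(abort_one)  # {"rid": "request-42"}
```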
Changelog
- `python/sglang/srt/managers/scheduler.py`
  - Added logic to the `abort_request` method (lines 2036-2038) to identify all requests in the waiting queue for deletion if the received `rid` is empty.
  - Added logic to the `abort_request` method (lines 2055-2057) to mark all non-finished requests in the running and current batches for abortion if the received `rid` is empty.
- `python/sglang/srt/managers/tokenizer_manager.py`
  - Modified the `abort_request` method (line 788) to allow an empty `rid` to bypass the check that the `rid` exists in `rid_to_state`, enabling the 'abort all' signal to be sent to the scheduler.
  - Rewrote the `_handle_abort_req` method (lines 1420-1436) to explicitly set the request state to finished, append a result dictionary with an empty text and an 'abort' finish reason, and set the event, instead of just popping the state. This ensures proper cleanup and client notification for requests aborted before prefill.
- `python/sglang/srt/openai_api/protocol.py`
  - Added 'abort' to the `Literal` type for `finish_reason` in the `CompletionResponseStreamChoice` model (line 213).
  - Added 'abort' to the `Literal` type for `finish_reason` in the `ChatCompletionResponseStreamChoice` model (line 448).
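The scheduler-side behavior in the changelog can be sketched as follows. This is a simplified model, not the real sglang `Scheduler`: the names `waiting_queue`, `running_batch`, and the `to_abort` flag echo the changelog, but requests are modeled as plain dicts here.

```python
# Minimal model of the "abort all vs. abort one" dispatch described above.
class MiniScheduler:
    def __init__(self):
        self.waiting_queue = []  # requests that have not been prefilled yet
        self.running_batch = []  # requests currently decoding

    def abort_request(self, rid: str):
        if rid == "":
            # Abort all: drop the whole waiting queue and flag every
            # unfinished running request for abortion.
            self.waiting_queue.clear()
            for req in self.running_batch:
                if not req.get("finished", False):
                    req["to_abort"] = True
            return
        # Abort one: remove it from the waiting queue if present,
        # otherwise flag it in the running batch.
        self.waiting_queue = [r for r in self.waiting_queue if r["rid"] != rid]
        for req in self.running_batch:
            if req["rid"] == rid and not req.get("finished", False):
                req["to_abort"] = True
```

Requests removed from the waiting queue never reach prefill, which is the second goal this PR states.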
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Requests line up in queue,
Some run, some wait for you.
Send empty ID,
All set free,
Aborted, fresh and new.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request introduces two valuable features: the ability to abort all requests using an empty `rid`, and ensuring that requests aborted from the waiting queue return an empty result without prefill. The changes in `scheduler.py` and `openai_api/protocol.py` look good and directly address these goals.
I've identified a couple of areas in `tokenizer_manager.py` that could be improved, particularly regarding state cleanup and the accuracy of reported `prompt_tokens` for aborted requests. Additionally, adding unit tests for the new abort functionalities would be beneficial for long-term maintainability.
Overall, good work on implementing these features!
Summary of Findings
- Potential Memory Leak in Abort Handling: In `tokenizer_manager.py`, the `_handle_abort_req` method marks aborted requests as finished but doesn't remove their state from `self.rid_to_state`. This could potentially lead to a memory leak over time as aborted request states accumulate. (Commented with `high` severity)
- Accuracy of `prompt_tokens` for Aborted Requests: For requests aborted before prefill, `_handle_abort_req` in `tokenizer_manager.py` sets `prompt_tokens` to 1. This might not accurately reflect the number of tokens processed by the model, potentially impacting metrics. Suggesting to set it to 0 or the actual (unprocessed) prompt length. (Commented with `medium` severity)
- Unit Testing: The PR introduces new functionality for aborting requests (abort all, abort from waiting queue). It would be beneficial to add unit tests to cover these new behaviors to ensure correctness and prevent regressions. (Not commented directly as per severity settings, but important for code quality)
- API Documentation: The `/abort_request` endpoint now supports a new behavior (empty `rid` for aborting all requests). This change should be documented for API users. (Not commented directly as per severity settings, but important for usability)
Merge Readiness
This pull request makes good progress on enhancing request abortion capabilities. However, there are a couple of issues identified (a potential memory leak and metrics accuracy for `prompt_tokens`) that should be addressed before merging. The memory leak is a high-severity concern. Additionally, consider adding unit tests for the new abort functionalities and updating any relevant API documentation.
I am unable to approve pull requests, but I recommend addressing the high and medium severity comments before this PR is merged by other reviewers.
```python
        state = self.rid_to_state[recv_obj.rid]
        state.finished = True
        state.out_list.append(
            {
                "text": "",
                "meta_info": {
                    "id": recv_obj.rid,
                    "finish_reason": {
                        "type": "abort",
                        "message": "Abort before prefill",
                    },
                    "prompt_tokens": 1,
                    "completion_tokens": 0,
                },
            }
        )
        state.event.set()
```
It appears that when a request is handled by `_handle_abort_req`, its state is marked as `finished`, but it's not removed from `self.rid_to_state`. In contrast, `_handle_batch_output` does remove finished requests (e.g., `del self.rid_to_state[rid]` around line 1223 in the full file).
Could this lead to a memory leak if aborted request states accumulate in `self.rid_to_state`? If so, should `del self.rid_to_state[recv_obj.rid]` be added at the end of this handler, similar to how normally completed requests are handled?
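The suggested fix can be sketched as below. This is a simplified model of the handler, not sglang's actual `_handle_abort_req`: the request state is a plain dict here, and `prompt_tokens` is set to 0 per the second review comment. The key addition is the final `del`, mirroring `_handle_batch_output`'s cleanup.

```python
import threading

# Model of an abort handler that both wakes the waiter and frees the
# rid_to_state entry, so aborted states cannot accumulate.
def handle_abort_req(rid_to_state, rid):
    state = rid_to_state[rid]
    state["finished"] = True
    state["out_list"].append(
        {
            "text": "",
            "meta_info": {
                "id": rid,
                "finish_reason": {"type": "abort", "message": "Abort before prefill"},
                "prompt_tokens": 0,  # 0, since no prefill ran for this request
                "completion_tokens": 0,
            },
        }
    )
    state["event"].set()
    # The suggested addition: drop the entry to avoid a leak. Any waiter
    # that already holds a reference to `state` can still read out_list.
    del rid_to_state[rid]
```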
"type": "abort", | ||
"message": "Abort before prefill", | ||
}, | ||
"prompt_tokens": 1, |
The `meta_info` for an aborted request sets `"prompt_tokens": 1`. If a request is aborted "before prefill", it's likely that its prompt tokens haven't been processed by the core inference engine.
Would it be more accurate to set `"prompt_tokens": 0` here, or perhaps use the actual length of the input prompt if it's readily available (e.g., from `state.obj.input_ids` if tokenized, or 0 if not yet tokenized/processed)? Using `1` might be misleading for metrics or accounting if no tokens were actually processed by the model.
"prompt_tokens": 1, | |
"prompt_tokens": 0, |
please rebase
```diff
@@ -2033,6 +2033,9 @@ def abort_request(self, recv_req: AbortReq):
         # Delete requests in the waiting queue
         to_del = []
         for i, req in enumerate(self.waiting_queue):
+            if recv_req.rid == "":
```
use a constant `ABORT_ALL_RID` instead of `""`
}, | ||
} | ||
) | ||
state.event.set() |
will this make sure the state is deleted in `self.rid_to_state`?
This PR will be discarded in favor of #6698. Thanks for the contribution.
Motivation
This PR is trying to achieve 2 goals:
- Support aborting all requests by passing an empty `rid` to `/abort_request`.
- Let requests aborted in the waiting queue of `/generate` return an empty result, without making sglang do prefill on the requests.

Thanks @yitianlian for his contribution on this design.
Modifications
Note that we always return `"text": ""` for requests in the waiting queue; maybe we should also support the case where the user expects `"output_ids": []`?

Thank you for your time on reviewing this PR :)
Checklist