[RL] support abort all and fix abort on waiting queue #6855


Closed · wants to merge 3 commits

Conversation

@zhuzilin (Collaborator) commented Jun 4, 2025

Motivation

This PR aims to achieve two goals:

  1. Abort all requests when an empty rid is sent to /abort_request (see the client-side sketch below).
  2. Make sure that when requests in the waiting queue are aborted, the caller receives an empty result from /generate rather than having sglang run prefill on them.
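
For reference, here is a client-side sketch of the new "abort all" call (the server address and the use of the requests library are assumptions; only the empty-rid payload comes from this PR):

import requests

BASE_URL = "http://localhost:30000"  # assumed server address

# An empty rid is the "abort all" signal introduced by this PR: it aborts
# every request in the waiting queue as well as those currently running.
requests.post(f"{BASE_URL}/abort_request", json={"rid": ""})

# A non-empty rid still aborts only the matching request, as before.
requests.post(f"{BASE_URL}/abort_request", json={"rid": "my-request-id"})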

Thanks to @yitianlian for his contribution to this design.

Modifications

Note that we always return "text": "" for requests in the waiting queue; perhaps we should also support the case where the user expects "output_ids": []?
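
For reference, the result a /generate caller receives for a request aborted in the waiting queue looks roughly like this (field values follow the _handle_abort_req code quoted in the review below; "<rid>" stands in for the actual request id):

# Approximate shape of the /generate result for a request aborted while
# still in the waiting queue.
aborted_result = {
    "text": "",
    "meta_info": {
        "id": "<rid>",
        "finish_reason": {"type": "abort", "message": "Abort before prefill"},
        "prompt_tokens": 1,
        "completion_tokens": 0,
    },
}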

Thank you for taking the time to review this PR :)

Checklist

@gemini-code-assist (bot) left a comment

Hello @zhuzilin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini here, providing a summary of this pull request. This PR aims to enhance the request abortion functionality. Specifically, it introduces the ability to abort all pending and running requests by sending an empty request ID (rid) to the /abort_request endpoint. Additionally, it refines the handling of requests that are aborted while still in the waiting queue, ensuring they are properly marked as finished and return an empty result with an 'abort' finish reason, preventing unnecessary prefill operations.

Highlights

  • Abort All Requests: Adds support for aborting all requests (both in the waiting queue and currently running/processing) when the /abort_request endpoint receives a request with an empty rid.
  • Improved Waiting Queue Abort Handling: Modifies the internal handling of requests aborted while in the waiting queue. Instead of simply removing them, they are now explicitly marked as finished, and a specific response indicating an 'abort' finish reason is generated and returned, avoiding subsequent processing like prefill.
  • API Protocol Update: Updates the OpenAI API protocol definitions to include 'abort' as a valid finish_reason for both completion and chat completion stream responses.

Changelog

  • python/sglang/srt/managers/scheduler.py
    • Added logic to the abort_request method (lines 2036-2038) to identify all requests in the waiting queue for deletion if the received rid is empty.
    • Added logic to the abort_request method (lines 2055-2057) to mark all non-finished requests in the running and current batches for abortion if the received rid is empty.
  • python/sglang/srt/managers/tokenizer_manager.py
    • Modified the abort_request method (line 788) to allow an empty rid to bypass the check if the rid exists in rid_to_state, enabling the 'abort all' signal to be sent to the scheduler.
    • Rewrote the _handle_abort_req method (lines 1420-1436) to explicitly set the request state to finished, append a result dictionary with an empty text and an 'abort' finish reason, and set the event, instead of just popping the state. This ensures proper cleanup and client notification for requests aborted before prefill.
  • python/sglang/srt/openai_api/protocol.py
    • Added 'abort' to the Literal type for finish_reason in the CompletionResponseStreamChoice model (line 213).
    • Added 'abort' to the Literal type for finish_reason in the ChatCompletionResponseStreamChoice model (line 448).
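
The protocol change above amounts to widening a Literal type; a minimal sketch of the updated field follows (the pre-existing literal values and the surrounding model fields are assumptions):

from typing import Literal, Optional

from pydantic import BaseModel

class CompletionResponseStreamChoice(BaseModel):
    # ... other fields elided ...
    # "abort" is the value added by this PR; the other literals are assumed.
    finish_reason: Optional[Literal["stop", "length", "abort"]] = None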

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces two valuable features: the ability to abort all requests using an empty rid and ensuring that requests aborted from the waiting queue return an empty result without prefill. The changes in scheduler.py and openai_api/protocol.py look good and directly address these goals.

I've identified a couple of areas in tokenizer_manager.py that could be improved, particularly regarding state cleanup and the accuracy of reported prompt_tokens for aborted requests. Additionally, adding unit tests for the new abort functionalities would be beneficial for long-term maintainability.

Overall, good work on implementing these features!

Summary of Findings

  • Potential Memory Leak in Abort Handling: In tokenizer_manager.py, the _handle_abort_req method marks aborted requests as finished but doesn't remove their state from self.rid_to_state. This could potentially lead to a memory leak over time as aborted request states accumulate. (Commented with high severity)
  • Accuracy of prompt_tokens for Aborted Requests: For requests aborted before prefill, _handle_abort_req in tokenizer_manager.py sets prompt_tokens to 1. This might not accurately reflect the number of tokens processed by the model, potentially impacting metrics. Suggesting to set it to 0 or the actual (unprocessed) prompt length. (Commented with medium severity)
  • Unit Testing: The PR introduces new functionality for aborting requests (abort all, abort from waiting queue). It would be beneficial to add unit tests to cover these new behaviors to ensure correctness and prevent regressions. (Not commented directly as per severity settings, but important for code quality)
  • API Documentation: The /abort_request endpoint now supports a new behavior (empty rid for aborting all requests). This change should be documented for API users. (Not commented directly as per severity settings, but important for usability)

Merge Readiness

This pull request makes good progress on enhancing request abortion capabilities. However, there are a couple of issues identified (a potential memory leak and metrics accuracy for prompt_tokens) that should be addressed before merging. The memory leak is a high-severity concern. Additionally, consider adding unit tests for the new abort functionalities and updating any relevant API documentation.

I am unable to approve pull requests, but I recommend addressing the high and medium severity comments before this PR is merged by other reviewers.

Comment on lines 1420 to 1436:

state = self.rid_to_state[recv_obj.rid]
state.finished = True
state.out_list.append(
    {
        "text": "",
        "meta_info": {
            "id": recv_obj.rid,
            "finish_reason": {
                "type": "abort",
                "message": "Abort before prefill",
            },
            "prompt_tokens": 1,
            "completion_tokens": 0,
        },
    }
)
state.event.set()


Severity: high

It appears that when a request is handled by _handle_abort_req, its state is marked as finished, but it's not removed from self.rid_to_state. In contrast, _handle_batch_output does remove finished requests (e.g., del self.rid_to_state[rid] around line 1223 in the full file).

Could this lead to a memory leak if aborted request states accumulate in self.rid_to_state? If so, should del self.rid_to_state[recv_obj.rid] be added at the end of this handler, similar to how normally completed requests are handled?
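
A minimal sketch of the handler with the suggested cleanup appended (the method signature is assumed from the surrounding file; whether the deletion is safe depends on how waiters drain state.out_list, which is not shown in this diff):

def _handle_abort_req(self, recv_obj):
    state = self.rid_to_state[recv_obj.rid]
    state.finished = True
    state.out_list.append(
        {
            "text": "",
            "meta_info": {
                "id": recv_obj.rid,
                "finish_reason": {"type": "abort", "message": "Abort before prefill"},
                "prompt_tokens": 1,
                "completion_tokens": 0,
            },
        }
    )
    state.event.set()
    # Suggested addition: drop the entry once the waiter has been signaled,
    # mirroring how _handle_batch_output cleans up finished requests.
    del self.rid_to_state[recv_obj.rid]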

"type": "abort",
"message": "Abort before prefill",
},
"prompt_tokens": 1,


Severity: medium

The meta_info for an aborted request sets "prompt_tokens": 1. If a request is aborted "before prefill", it's likely that its prompt tokens haven't been processed by the core inference engine.

Would it be more accurate to set "prompt_tokens": 0 here, or perhaps use the actual length of the input prompt if it's readily available (e.g., from state.obj.input_ids if tokenized, or 0 if not yet tokenized/processed)? Using 1 might be misleading for metrics or accounting if no tokens were actually processed by the model.

Suggested change:
-            "prompt_tokens": 1,
+            "prompt_tokens": 0,

@zhuzilin zhuzilin changed the title [rl] support abort all and fix abort on waiting queue [RL] support abort all and fix abort on waiting queue Jun 21, 2025
@zhyncs (Member) commented Jun 23, 2025

please rebase

@@ -2033,6 +2033,9 @@ def abort_request(self, recv_req: AbortReq):
         # Delete requests in the waiting queue
         to_del = []
         for i, req in enumerate(self.waiting_queue):
+            if recv_req.rid == "":
A contributor commented:
Use a constant ABORT_ALL_RID instead of "".
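
That is, something along these lines (a sketch; where the constant should live is an assumption):

# Shared sentinel (module location assumed), so the "abort all" signal is
# spelled the same way on the tokenizer-manager and scheduler sides.
ABORT_ALL_RID = ""

# scheduler.py
for i, req in enumerate(self.waiting_queue):
    if recv_req.rid == ABORT_ALL_RID:
        to_del.append(i)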

        },
    }
)
state.event.set()
A contributor commented:

Will this make sure the state is deleted from self.rid_to_state?

@zhaochenyang20 (Collaborator) commented:

This PR will be discarded in favor of #6698. Thanks for the contribution.
