[RL] support abort all and fix abort on waiting queue #6855
Conversation
Hello @zhuzilin, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini here, providing a summary of this pull request. This PR aims to enhance the request abortion functionality. Specifically, it introduces the ability to abort all pending and running requests by sending an empty request ID (`rid`) to the `/abort_request` endpoint. Additionally, it refines the handling of requests that are aborted while still in the waiting queue, ensuring they are properly marked as finished and return an empty result with an 'abort' finish reason, preventing unnecessary prefill operations.
Highlights
- Abort All Requests: Adds support for aborting all requests (both in the waiting queue and currently running/processing) when the `/abort_request` endpoint receives a request with an empty `rid`.
- Improved Waiting Queue Abort Handling: Modifies the internal handling of requests aborted while in the waiting queue. Instead of simply removing them, they are now explicitly marked as finished, and a specific response indicating an 'abort' finish reason is generated and returned, avoiding subsequent processing like prefill.
- API Protocol Update: Updates the OpenAI API protocol definitions to include 'abort' as a valid `finish_reason` for both completion and chat completion stream responses.
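As a hedged client-side illustration of the abort-all convention described above: the `/abort_request` endpoint and the empty-`rid` payload come from this PR's description, while the helper name and the example server URL are assumptions for illustration only.

```python
import json

# Sketch of a client triggering "abort all" (empty rid) versus aborting a
# single request. The payload shape follows this PR; the server URL below
# is hypothetical.
def build_abort_payload(rid: str = "") -> str:
    # Empty rid => abort every request in the waiting queue and running batch.
    return json.dumps({"rid": rid})

abort_all = build_abort_payload()
abort_one = build_abort_payload("request-42")
# e.g. requests.post("http://localhost:30000/abort_request", data=abort_all)
print(abort_all)  # {"rid": ""}
print(abort_one)  # {"rid": "request-42"}
```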
Changelog
- `python/sglang/srt/managers/scheduler.py`
  - Added logic to the `abort_request` method (lines 2036-2038) to identify all requests in the waiting queue for deletion if the received `rid` is empty.
  - Added logic to the `abort_request` method (lines 2055-2057) to mark all non-finished requests in the running and current batches for abortion if the received `rid` is empty.
- `python/sglang/srt/managers/tokenizer_manager.py`
  - Modified the `abort_request` method (line 788) to allow an empty `rid` to bypass the check that the `rid` exists in `rid_to_state`, enabling the 'abort all' signal to be sent to the scheduler.
  - Rewrote the `_handle_abort_req` method (lines 1420-1436) to explicitly set the request state to finished, append a result dictionary with an empty text and an 'abort' finish reason, and set the event, instead of just popping the state. This ensures proper cleanup and client notification for requests aborted before prefill.
- `python/sglang/srt/openai_api/protocol.py`
  - Added 'abort' to the `Literal` type for `finish_reason` in the `CompletionResponseStreamChoice` model (line 213).
  - Added 'abort' to the `Literal` type for `finish_reason` in the `ChatCompletionResponseStreamChoice` model (line 448).
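The scheduler-side behavior in the changelog can be sketched as follows. This is a simplified model, not the real sglang `Scheduler`: the names `waiting_queue`, `running_batch`, and the `to_abort` flag echo the changelog, but requests are modeled as plain dicts here.

```python
# Minimal model of the "abort all vs. abort one" dispatch described above.
class MiniScheduler:
    def __init__(self):
        self.waiting_queue = []  # requests that have not been prefilled yet
        self.running_batch = []  # requests currently decoding

    def abort_request(self, rid: str):
        if rid == "":
            # Abort all: drop the whole waiting queue and flag every
            # unfinished running request for abortion.
            self.waiting_queue.clear()
            for req in self.running_batch:
                if not req.get("finished", False):
                    req["to_abort"] = True
            return
        # Abort one: remove it from the waiting queue if present,
        # otherwise flag it in the running batch.
        self.waiting_queue = [r for r in self.waiting_queue if r["rid"] != rid]
        for req in self.running_batch:
            if req["rid"] == rid and not req.get("finished", False):
                req["to_abort"] = True
```

Requests removed from the waiting queue never reach prefill, which is the second goal this PR states.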
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Requests line up in queue,
Some run, some wait for you.
Send empty ID,
All set free,
Aborted, fresh and new.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request introduces two valuable features: the ability to abort all requests using an empty `rid`, and ensuring that requests aborted from the waiting queue return an empty result without prefill. The changes in `scheduler.py` and `openai_api/protocol.py` look good and directly address these goals.
I've identified a couple of areas in `tokenizer_manager.py` that could be improved, particularly regarding state cleanup and the accuracy of reported `prompt_tokens` for aborted requests. Additionally, adding unit tests for the new abort functionalities would be beneficial for long-term maintainability.
Overall, good work on implementing these features!
Summary of Findings
- Potential Memory Leak in Abort Handling: In `tokenizer_manager.py`, the `_handle_abort_req` method marks aborted requests as finished but doesn't remove their state from `self.rid_to_state`. This could potentially lead to a memory leak over time as aborted request states accumulate. (Commented with `high` severity)
- Accuracy of `prompt_tokens` for Aborted Requests: For requests aborted before prefill, `_handle_abort_req` in `tokenizer_manager.py` sets `prompt_tokens` to 1. This might not accurately reflect the number of tokens processed by the model, potentially impacting metrics. Suggesting to set it to 0 or the actual (unprocessed) prompt length. (Commented with `medium` severity)
- Unit Testing: The PR introduces new functionality for aborting requests (abort all, abort from waiting queue). It would be beneficial to add unit tests to cover these new behaviors to ensure correctness and prevent regressions. (Not commented directly as per severity settings, but important for code quality)
- API Documentation: The `/abort_request` endpoint now supports a new behavior (empty `rid` for aborting all requests). This change should be documented for API users. (Not commented directly as per severity settings, but important for usability)
Merge Readiness
This pull request makes good progress on enhancing request abortion capabilities. However, there are a couple of issues identified (a potential memory leak and metrics accuracy for `prompt_tokens`) that should be addressed before merging. The memory leak is a high-severity concern. Additionally, consider adding unit tests for the new abort functionalities and updating any relevant API documentation.
I am unable to approve pull requests, but I recommend addressing the high and medium severity comments before this PR is merged by other reviewers.
```python
        state = self.rid_to_state[recv_obj.rid]
        state.finished = True
        state.out_list.append(
            {
                "text": "",
                "meta_info": {
                    "id": recv_obj.rid,
                    "finish_reason": {
                        "type": "abort",
                        "message": "Abort before prefill",
                    },
                    "prompt_tokens": 1,
                    "completion_tokens": 0,
                },
            }
        )
        state.event.set()
```
It appears that when a request is handled by `_handle_abort_req`, its state is marked as `finished`, but it's not removed from `self.rid_to_state`. In contrast, `_handle_batch_output` does remove finished requests (e.g., `del self.rid_to_state[rid]` around line 1223 in the full file).
Could this lead to a memory leak if aborted request states accumulate in `self.rid_to_state`? If so, should `del self.rid_to_state[recv_obj.rid]` be added at the end of this handler, similar to how normally completed requests are handled?
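The suggested fix can be sketched as below. This is a simplified model of the handler, not sglang's actual `_handle_abort_req`: the request state is a plain dict here, and `prompt_tokens` is set to 0 per the second review comment. The key addition is the final `del`, mirroring `_handle_batch_output`'s cleanup.

```python
import threading

# Model of an abort handler that both wakes the waiter and frees the
# rid_to_state entry, so aborted states cannot accumulate.
def handle_abort_req(rid_to_state, rid):
    state = rid_to_state[rid]
    state["finished"] = True
    state["out_list"].append(
        {
            "text": "",
            "meta_info": {
                "id": rid,
                "finish_reason": {"type": "abort", "message": "Abort before prefill"},
                "prompt_tokens": 0,  # 0, since no prefill ran for this request
                "completion_tokens": 0,
            },
        }
    )
    state["event"].set()
    # The suggested addition: drop the entry to avoid a leak. Any waiter
    # that already holds a reference to `state` can still read out_list.
    del rid_to_state[rid]
```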
"type": "abort", | ||
"message": "Abort before prefill", | ||
}, | ||
"prompt_tokens": 1, |
The `meta_info` for an aborted request sets `"prompt_tokens": 1`. If a request is aborted "before prefill", it's likely that its prompt tokens haven't been processed by the core inference engine.
Would it be more accurate to set `"prompt_tokens": 0` here, or perhaps use the actual length of the input prompt if it's readily available (e.g., from `state.obj.input_ids` if tokenized, or 0 if not yet tokenized/processed)? Using `1` might be misleading for metrics or accounting if no tokens were actually processed by the model.
"prompt_tokens": 1, | |
"prompt_tokens": 0, |
please rebase
```diff
@@ -2033,6 +2033,9 @@ def abort_request(self, recv_req: AbortReq):
         # Delete requests in the waiting queue
         to_del = []
         for i, req in enumerate(self.waiting_queue):
+            if recv_req.rid == "":
```
use a constant `ABORT_ALL_RID` instead of `""`
}, | ||
} | ||
) | ||
state.event.set() |
will this make sure the state is deleted in `self.rid_to_state`?
This PR will be discarded in favor of #6698. Thanks for the contribution.
Motivation
This PR is trying to achieve 2 goals:
- Support aborting all requests by passing an empty `rid` to `/abort_request`.
- Let requests aborted in the waiting queue of `/generate` return an empty result, without making sglang do prefill on the requests.

Thanks @yitianlian for his contribution on this design.
Modifications
Note that we always return `"text": ""` for requests in the waiting queue; maybe we should also support the case where the user expects `"output_ids": []`?

Thank you for your time on reviewing this PR :)
Checklist