-
Notifications
You must be signed in to change notification settings - Fork 2.3k
[server arg] better arg help. disable chunked prefix cache. #6991
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[server arg] better arg help. disable chunked prefix cache. #6991
Conversation
…ache when disable radix cache
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @jinmingyi1998, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini here, providing a summary of this pull request to help everyone quickly understand the changes. This PR primarily addresses an issue where the chunked prefix cache was not being automatically disabled when the radix prefix cache was turned off. It also includes a minor update to the help text for a command-line argument.
The main motivation, as described by the author, is to ensure consistency in caching behavior when the radix cache is explicitly disabled. If the radix cache is off, the chunked prefix cache should also be off to avoid unexpected behavior or overhead, particularly for certain model types like Deepseek.
Highlights
- Cache Dependency Logic: Added logic in the model runner to automatically disable the chunked prefix cache (
--disable-chunked-prefix-cache
) if the radix prefix cache (--disable-radix-cache
) is enabled. - Argument Help Text Update: Slightly reworded the help text for the
--disable-chunked-prefix-cache
command-line argument for clarity.
Changelog
- python/sglang/srt/model_executor/model_runner.py
- Introduced a check (
if server_args.disable_radix_cache:
) around line 409 to setserver_args.disable_chunked_prefix_cache = True
if the radix cache is disabled. This logic is placed before the existing checks related to the MLA backend and page size.
- Introduced a check (
- python/sglang/srt/server_args.py
- Updated the help string for the
--disable-chunked-prefix-cache
argument on line 1414, changing it from "Disable chunked prefix cache feature for deepseek, which should save overhead for short sequences." to "For Deepseek, Disable chunked-prefix-cache to save overhead for short sequences."
- Updated the help string for the
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command>
or @gemini-code-assist <command>
. Below is a summary of the supported commands.
Feature | Command | Description |
---|---|---|
Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/
folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request correctly ensures that the chunked prefix cache is disabled when the radix cache is disabled, which improves configuration consistency. The update to the help string for --disable-chunked-prefix-cache
also aims to clarify its usage, particularly for Deepseek models. The logical change in model_runner.py
is sound and well-placed. Overall, the changes are clear and address the stated motivation effectively.
Summary of Findings
- Help string phrasing for
--disable-chunked-prefix-cache
: Inpython/sglang/srt/server_args.py
, the updated help stringFor Deepseek, Disable chunked-prefix-cache...
is functional but could be phrased more clearly. Consider alternatives likeFor Deepseek models, disable the chunked prefix cache...
orDisable chunked prefix cache (primarily for Deepseek models)...
for improved readability. This is a low-severity stylistic suggestion and was not added as a direct review comment due to the project's review settings for comment severity.
Merge Readiness
The pull request appears to be in good shape. The core logic change correctly addresses the issue of ensuring the chunked prefix cache is disabled when the radix cache is disabled. The help string update is a minor clarification. I have not found any critical or high-severity issues. Based on this review, the PR seems ready for merging, pending any further review or testing procedures by the maintainers. As an AI, I am not authorized to approve pull requests.
@@ -406,8 +406,9 @@ def model_specific_adjustment(self): | |||
f"Automatically turn of --chunked-prefill-size as it is not supported for " | |||
f"{self.model_config.hf_config.model_type}" | |||
) | |||
|
|||
if not self.use_mla_backend: | |||
if server_args.disable_radix_cache: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not true.. Chunked prefix cache can be used when radix cache is disabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but why? there is no prefix cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not true.. Chunked prefix cache can be used when radix cache is disabled.
when radix cache is disabled, is there any prefix cache?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here prefix cache means kv cache, which is different from radix cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here prefix cache means kv cache, which is different from radix cache.
if (
forward_batch.forward_mode.is_extend()
and not self.disable_chunked_prefix_cache
and not forward_batch.forward_mode.is_target_verify()
and not forward_batch.forward_mode.is_draft_extend()
and (
sum_extend_prefix_lens >= self.chunked_prefix_cache_threshold
or sum_extend_prefix_lens == 0
)
):
return AttnForwardMethod.MHA_CHUNKED_KV
but here use when forward_mode.is_extend()
no kv cache if disable radix-cache.
or you mean there is any other way to prefix caching?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KV cache will still be used if disabling radix-cache. It's just not managed by radix tree, and there will be many repeatedly computed tokens.
disable chunked prefix cache when disable-radix-cache is True
Motivation
when disable radix prefix cache, I found chunked-prefix-cache is True. should be disabled.
related PR: #5113
Modifications
set disable_chunked_prefix_cache when radix prefix cache is disabled.
Checklist