[server arg] better arg help. disable chunked prefix cache. #6991

Closed
5 changes: 3 additions & 2 deletions python/sglang/srt/model_executor/model_runner.py
@@ -406,8 +406,9 @@ def model_specific_adjustment(self):
f"Automatically turn of --chunked-prefill-size as it is not supported for "
f"{self.model_config.hf_config.model_type}"
)

- if not self.use_mla_backend:
+ if server_args.disable_radix_cache:
Collaborator

This is not true. Chunked prefix cache can be used when the radix cache is disabled.

Author

But why? With the radix cache disabled, there is no prefix cache.

Author

> This is not true. Chunked prefix cache can be used when radix cache is disabled.

When the radix cache is disabled, is there any prefix cache at all?

Collaborator

Here, "prefix cache" means the KV cache, which is different from the radix cache.

Author

> Here prefix cache means kv cache, which is different from radix cache.

if (
    forward_batch.forward_mode.is_extend()
    and not self.disable_chunked_prefix_cache
    and not forward_batch.forward_mode.is_target_verify()
    and not forward_batch.forward_mode.is_draft_extend()
    and (
        sum_extend_prefix_lens >= self.chunked_prefix_cache_threshold
        or sum_extend_prefix_lens == 0
    )
):
    return AttnForwardMethod.MHA_CHUNKED_KV

But this path is only taken when forward_mode.is_extend(), and as I understand it there is no KV cache to extend from if the radix cache is disabled.
Or do you mean there is some other way to do prefix caching?

Collaborator

The KV cache is still used when the radix cache is disabled. It is just not managed by the radix tree, so many tokens end up being recomputed.
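
To make the distinction in this thread concrete, here is a toy sketch (not SGLang code). It assumes that one way a request can have a non-empty prefix without any cross-request reuse is chunked prefill, where the KV entries of earlier chunks stay in the cache while later chunks are processed. `ToyRequest`, `chunked_prefill_prefix_lens`, `prefix_len_allows_chunked_kv`, and the threshold value are all hypothetical; only the prefix-length clause is taken from the condition quoted above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a request; not an SGLang class."""
    prompt_len: int
    cached_len: int = 0  # tokens of this request whose KV entries are already stored


def chunked_prefill_prefix_lens(req: ToyRequest, chunk_size: int) -> List[int]:
    """Prefix length seen by each extend step when one prompt is prefilled in chunks."""
    prefix_lens = []
    while req.cached_len < req.prompt_len:
        prefix_lens.append(req.cached_len)
        # The chunk just processed is now part of the cached prefix.
        req.cached_len = min(req.cached_len + chunk_size, req.prompt_len)
    return prefix_lens


def prefix_len_allows_chunked_kv(sum_extend_prefix_lens: int, threshold: int) -> bool:
    """Only the prefix-length clause of the condition quoted in the thread."""
    return sum_extend_prefix_lens >= threshold or sum_extend_prefix_lens == 0


prefix_lens = chunked_prefill_prefix_lens(ToyRequest(prompt_len=10), chunk_size=4)
decisions = [prefix_len_allows_chunked_kv(p, threshold=8) for p in prefix_lens]
print(prefix_lens)  # [0, 4, 8]
print(decisions)    # [True, False, True]
```

In this toy model the prefix grows within a single request even though nothing is shared between requests, which is the situation the Collaborator describes: a KV cache without a radix tree.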

server_args.disable_chunked_prefix_cache = True
+ elif not self.use_mla_backend:
+ server_args.disable_chunked_prefix_cache = True
elif self.page_size > 1:
logger.info("Disable chunked prefix cache when page size > 1.")
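
As a rough restatement of the branch order this hunk proposes, the sketch below expresses it as a pure function. The function name and its parameters are hypothetical and not part of SGLang. The hunk only shows the log line for the `page_size > 1` branch, so the return value there is an assumption inferred from the log message.

```python
def resolve_disable_chunked_prefix_cache(
    disable_radix_cache: bool,
    use_mla_backend: bool,
    page_size: int,
    disable_chunked_prefix_cache: bool = False,
) -> bool:
    """Hypothetical pure-function restatement of the proposed branch order."""
    if disable_radix_cache:
        # Branch added by this PR; disputed in the review thread.
        return True
    if not use_mla_backend:
        return True
    if page_size > 1:
        # Assumption: inferred from the log message; the assignment itself
        # is not visible in the hunk.
        return True
    # Otherwise keep whatever the user passed on the command line.
    return disable_chunked_prefix_cache


print(resolve_disable_chunked_prefix_cache(True, True, 1))   # True
print(resolve_disable_chunked_prefix_cache(False, True, 1))  # False
```

Under the Collaborator's objection, the first branch would be dropped, so a server running with the radix cache disabled on an MLA backend with `page_size == 1` would keep the feature enabled.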
2 changes: 1 addition & 1 deletion python/sglang/srt/server_args.py
@@ -1411,7 +1411,7 @@ def add_cli_args(parser: argparse.ArgumentParser):
parser.add_argument(
"--disable-chunked-prefix-cache",
action="store_true",
- help="Disable chunked prefix cache feature for deepseek, which should save overhead for short sequences.",
+ help="For Deepseek, Disable chunked-prefix-cache to save overhead for short sequences.",
)
parser.add_argument(
"--disable-fast-image-processor",
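
For readers less familiar with `argparse`, the following minimal sketch shows how a `store_true` flag like the one in this hunk behaves. It uses a standalone parser rather than SGLang's `add_cli_args`, and the help string is copied from the new line of the diff.

```python
import argparse

# Standalone parser for illustration only; SGLang registers many more arguments.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--disable-chunked-prefix-cache",
    action="store_true",
    help="For Deepseek, Disable chunked-prefix-cache to save overhead for short sequences.",
)

# Without the flag, a store_true option defaults to False.
default_args = parser.parse_args([])
print(default_args.disable_chunked_prefix_cache)  # False

# Passing the flag sets the attribute to True; dashes become underscores.
flag_args = parser.parse_args(["--disable-chunked-prefix-cache"])
print(flag_args.disable_chunked_prefix_cache)  # True
```

Only the help text changes in this hunk, so the flag's parsing behavior is unaffected.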