[v1] [P/D] Adding LMCache KV connector for v1 #16625
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Does it support xPyD (x prefillers, y decoders)?
@randomseed713 We are working on this. It should be ready sometime this week.
Question: does this rely on Ray for the communication? I tried to run the example in the PR and ran into an issue like this:
@liuzijing2014 This PR doesn't depend on Ray. Can you share your command and environment details? I'm also in vLLM's Slack workspace (name: Yihua Cheng), so feel free to DM me if you are there too.
Does it support multi-node deployment? Which version of LMCache should I install? Which Python version should I use? Which PyTorch version should I use?
> ### Prerequisites
>
> - Install [LMCache](https://github.com/ai-dynamo/lmcache)
Oh thanks for catching. I will submit a PR to fix this.
> os.environ["LMCACHE_REMOTE_SERDE"] = "naive"
>
> prompts = [
>     "Hello, how are you?" * 1000,
Hi @ApostaC, we have a simple question: here the input prompt "Hello, how are you?" is duplicated 1000 times. Does the following feature of LMCache mean the KV cache can be shared only when the input prompts from different requests are exactly the same?

> Flexible KV cache pooling (sharing KV cache across multiple vLLM instances)
TL;DR: The LMCache connector offers the following enhancements on top of LMCache:
- Disaggregated prefill (NIXL-based KV transfer from prefiller to decoder)
- CPU offloading of KV cache
- KV cache sharing across vLLM instances
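For orientation, here is a minimal sketch of how the connector would be turned on through vLLM's `--kv-transfer-config` flag; the connector name `LMCacheConnectorV1` and the exact JSON values shown are assumptions for illustration, not verbatim from this PR:

```bash
# Minimal sketch (assumed connector name and flags): enable the LMCache v1
# connector on a single instance. kv_role can be "kv_producer" (prefiller),
# "kv_consumer" (decoder), or "kv_both" (single instance doing both).
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000 \
  --kv-transfer-config '{"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}'
```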
Example Usage
Disaggregated prefill
LMCache uses NIXL as the underlying KV transmission layer.
Run `cd examples/lmcache/disagg_prefill_lmcache_v1` to get into the `disagg_prefill_lmcache_v1` folder, and then run the disaggregated prefill example script there (a launch sketch follows below).
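For reference, a rough sketch of what a 1-prefiller/1-decoder launch looks like; the ports, device assignments, and connector name are illustrative assumptions, and the real example script in the folder handles wiring the two servers together:

```bash
# Rough sketch (assumed ports/flags) of a 1P1D disaggregated-prefill setup.
# Prefiller: computes KV caches and sends them to the decoder over NIXL.
CUDA_VISIBLE_DEVICES=0 vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8100 \
  --kv-transfer-config '{"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_producer"}' &

# Decoder: receives the transferred KV caches and runs the decode phase.
CUDA_VISIBLE_DEVICES=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8200 \
  --kv-transfer-config '{"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_consumer"}' &
```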
Performance benchmarking:
Environment: 2x H100 with NVLink
Baselines
Workload: Random dataset (see `benchmarks/benchmark_serving.py`):

```bash
python3 benchmark_serving.py --port 9000 --seed $(date +%s) \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name random --random-input-len 8000 --random-output-len 200 \
    --num-prompts 200 --burstiness 100 --request-rate 3.6
```
Comparison result
With LMCache-based PD disaggregation, we can achieve 40% higher tokens per second and 8x better tail inter-token latency.

CPU offloading
Run `cd examples/lmcache/disagg_prefill_lmcache_v1` to get into the `disagg_prefill_lmcache_v1` folder, and then run the CPU offloading example script there (see the sketch below).
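A minimal sketch of what the CPU offloading configuration might look like, assuming LMCache is driven by environment variables; the variable names and the size budget below are illustrative assumptions, not verbatim from the example script:

```bash
# Sketch (assumed env var names): spill KV cache blocks to CPU memory via LMCache.
export LMCACHE_CHUNK_SIZE=256          # tokens per KV cache chunk (assumed)
export LMCACHE_LOCAL_CPU=True          # enable the local CPU cache backend (assumed)
export LMCACHE_MAX_LOCAL_CPU_SIZE=5    # CPU cache budget in GiB (assumed)

vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000 \
  --kv-transfer-config '{"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}'
```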
KV cache sharing
Run `cd examples/lmcache/disagg_prefill_lmcache_v1` to get into the `disagg_prefill_lmcache_v1` folder, and then run the KV cache sharing example script there (see the sketch below).
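A sketch of how sharing might be wired up, assuming a standalone LMCache server acts as the shared KV pool; the `lmcache_server` entry point and the `lm://` URL scheme are assumptions, while `LMCACHE_REMOTE_SERDE="naive"` mirrors the snippet quoted earlier in this thread:

```bash
# Sketch (assumed server entry point and URL scheme): two vLLM instances
# sharing KV caches through one LMCache server.
lmcache_server localhost 8100 &                    # shared KV cache pool (assumed CLI)

export LMCACHE_REMOTE_URL="lm://localhost:8100"    # assumed URL scheme
export LMCACHE_REMOTE_SERDE="naive"                # serializer, as in the quoted snippet

# Both instances read and write KV caches through the same remote pool.
CUDA_VISIBLE_DEVICES=0 vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000 \
  --kv-transfer-config '{"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}' &
CUDA_VISIBLE_DEVICES=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8001 \
  --kv-transfer-config '{"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}' &
```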