
[v1] [P/D] Adding LMCache KV connector for v1 #16625


Merged: 13 commits merged into vllm-project:main on Apr 26, 2025

Conversation

ApostaC (Collaborator) commented Apr 15, 2025

TL;DR: The LMCache connector offers the following enhancements, built on top of LMCache:

  • Fast KV Cache CPU offloading
  • Flexible KV cache pooling (sharing KV cache across multiple vLLM instances)
  • High-performance PD disaggregation powered by NIXL.
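For reference, here is a minimal sketch of how the connector is plugged into vLLM through the KV-transfer config. The connector name "LMCacheConnectorV1" and the role values are assumptions based on this PR; check the connector factory registration for the exact strings.

# Hedged sketch: selecting the LMCache connector via vLLM's KV-transfer config.
from vllm import LLM
from vllm.config import KVTransferConfig

ktc = KVTransferConfig(
    kv_connector="LMCacheConnectorV1",  # assumed registered connector name
    kv_role="kv_both",                  # "kv_producer"/"kv_consumer" for the PD split
)
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", kv_transfer_config=ktc)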

Example Usage

Disaggregated prefill

LMCache uses NIXL as the underlying KV cache transport.
Change into the example folder with cd examples/lmcache/disagg_prefill_lmcache_v1, then run

bash disagg_example_nixl.sh
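For orientation, the script starts a prefill instance and a decode instance with complementary KV-transfer roles (plus a small proxy in front of them). The sketch below is an assumption about that role split; the exact flags live in disagg_example_nixl.sh.

# Hedged sketch of the 1P1D role split behind the example script.
from vllm.config import KVTransferConfig

prefill_cfg = KVTransferConfig(kv_connector="LMCacheConnectorV1", kv_role="kv_producer")
decode_cfg = KVTransferConfig(kv_connector="LMCacheConnectorV1", kv_role="kv_consumer")
# One vLLM instance per GPU is started with each config; KV caches produced during
# prefill are transferred to the decode instance over NIXL via LMCache.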

Performance benchmarking:

Environment: 2x H100 with NVLink

Setups compared

  • 1P1D setup with LMCache + NIXL, 1 GPU each (this PR)
  • 2 separate vLLM instances, 1 GPU each (baseline)

Workload: Random dataset (see benchmarks/benchmark_serving.py):

python3 benchmark_serving.py --port 9000 --seed $(date +%s) \
        --model meta-llama/Llama-3.1-8B-Instruct \
        --dataset-name random --random-input-len 8000 --random-output-len 200 \
        --num-prompts 200 --burstiness 100 --request-rate 3.6

Comparison result

With LMCache-based PD disaggregation, we can achieve 40% higher tokens per second and 8x better tail inter-token latency.

CPU offloading

Change into the example folder with cd examples/lmcache/disagg_prefill_lmcache_v1, then run

python cpu_offload_lmcache_v1.py
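For reference, a minimal CPU-offloading sketch in the spirit of that script; the LMCACHE_* environment variable names and values below are assumptions, so consult cpu_offload_lmcache_v1.py for the exact configuration.

# Hedged sketch: KV cache CPU offloading through the LMCache connector.
import os
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Assumed LMCache settings: token-chunk granularity, local CPU backend enabled,
# and a cap on the CPU cache size (in GB).
os.environ["LMCACHE_CHUNK_SIZE"] = "256"
os.environ["LMCACHE_LOCAL_CPU"] = "True"
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5.0"

ktc = KVTransferConfig(kv_connector="LMCacheConnectorV1", kv_role="kv_both")
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
          kv_transfer_config=ktc,
          gpu_memory_utilization=0.8)

# The second generate() call can reuse the KV cache offloaded to CPU by the first one.
prompt = "Hello, how are you?" * 1000
params = SamplingParams(temperature=0, max_tokens=10)
llm.generate([prompt], params)
llm.generate([prompt], params)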

KV cache sharing

Change into the example folder with cd examples/lmcache/disagg_prefill_lmcache_v1, then run

python kv_cache_sharing_lmcache_v1.py
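A rough sketch of the sharing setup follows. Apart from the LMCACHE_REMOTE_SERDE = "naive" setting, which is also visible in the example snippet quoted later in this thread, the URL scheme, port, and variable names are assumptions; see kv_cache_sharing_lmcache_v1.py for the exact values.

# Hedged sketch: sharing KV cache across vLLM instances through an LMCache server.
import os
from vllm import LLM
from vllm.config import KVTransferConfig

os.environ["LMCACHE_CHUNK_SIZE"] = "256"
os.environ["LMCACHE_LOCAL_CPU"] = "False"
# Assumed address of a separately started lmcache server process.
os.environ["LMCACHE_REMOTE_URL"] = "lm://localhost:65432"
os.environ["LMCACHE_REMOTE_SERDE"] = "naive"

ktc = KVTransferConfig(kv_connector="LMCacheConnectorV1", kv_role="kv_both")

# Instance 1 (e.g. on GPU 0) stores the KV cache of its prompts in the remote pool;
# a second instance pointed at the same LMCACHE_REMOTE_URL can then fetch those
# caches instead of recomputing the prefill.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", kv_transfer_config=ktc)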


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only the fastcheck CI runs, which covers a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

tlrmchlsmth mentioned this pull request on Apr 15, 2025
ApostaC force-pushed the local-dev/lmcache-v1-connector-pr branch from 36c97d1 to f6b9519 on April 17, 2025 03:35
mergify bot added the documentation and v1 labels on Apr 17, 2025
ApostaC force-pushed the local-dev/lmcache-v1-connector-pr branch from f6b9519 to 6a11a8a on April 17, 2025 03:53
ApostaC force-pushed the local-dev/lmcache-v1-connector-pr branch from 6a11a8a to 4162650 on April 17, 2025 20:58
ApostaC marked this pull request as ready for review on April 17, 2025 20:59
randomseed713 commented:
Does it support xpyd?

ApostaC (Collaborator, Author) commented Apr 21, 2025

Does it support xpyd?

@randomseed713 We are working on this. Should be ready sometime this week

liuzijing2014 (Collaborator) commented Apr 22, 2025

Question: does this rely on Ray for the communication? I tried to run the example in the PR and ran into an issue like:

(autoscaler +4m47s) Error: No available node types can fulfill resource request {'GPU': 1.0, 'node:2401:db00:eef0:1120:3520:0:9408:fcef': 0.001}. Add suitable node types to this cluster to resolve this issue.
INFO 04-22 16:19:26 [ray_utils.py:233] Waiting for creating a placement group of specs for 310 seconds. specs=[{'GPU': 1.0, 'node:2401:db00:eef0:1120:3520:0:9408:fcef': 0.001}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` and `ray list nodes` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.

ApostaC (Collaborator, Author) commented Apr 23, 2025

Question: does this rely on Ray to do the communication?

@liuzijing2014 This PR doesn't depend on Ray. Can you share your command and environment details? I'm also in vLLM's slack workspace (name: Yihua Cheng), so feel free to DM me if you are also there.

Huixxi commented Apr 24, 2025

Does it support multi-node setups? Which version of LMCache should I install? Which Python version should I use? Which PyTorch version should I use?
And I hit this error:

from lmcache.experimental.cache_engine import LMCacheEngine
(VllmWorker rank=0 pid=150410) ERROR 04-24 07:04:06 [multiproc_executor.py:435]   File "<frozen importlib._bootstrap_external>", line 883, in exec_module
(VllmWorker rank=3 pid=150413) ERROR 04-24 07:04:06 [multiproc_executor.py:435]     from lmcache.storage_backend.hybrid_backend import \
(VllmWorker rank=2 pid=150412) ERROR 04-24 07:04:06 [multiproc_executor.py:435]   File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
(VllmWorker rank=4 pid=150414) ERROR 04-24 07:04:06 [multiproc_executor.py:435]   File "/home/logs/LMCache/lmcache/experimental/cache_engine.py", line 24, in <module>
(VllmWorker rank=6 pid=150416) ERROR 04-24 07:04:06 [multiproc_executor.py:435]     from lmcache.storage_backend.serde import CreateSerde, Deserializer
(VllmWorker rank=6 pid=150416) ERROR 04-24 07:04:06 [multiproc_executor.py:435]   File "/home/logs/LMCache/lmcache/storage_backend/serde/__init__.py", line 5, in <module>
(VllmWorker rank=6 pid=150416) ERROR 04-24 07:04:06 [multiproc_executor.py:435]     from lmcache.storage_backend.serde.cachegen_decoder import CacheGenDeserializer
(VllmWorker rank=6 pid=150416) ERROR 04-24 07:04:06 [multiproc_executor.py:435]   File "/home/logs/LMCache/lmcache/storage_backend/serde/cachegen_decoder.py", line 4, in <module>
(VllmWorker rank=7 pid=150417) ERROR 04-24 07:04:06 [multiproc_executor.py:435]     return _bootstrap._gcd_import(name[level:], package, level)
(VllmWorker rank=6 pid=150416) ERROR 04-24 07:04:06 [multiproc_executor.py:435]     import torchac_cuda  # type: ignore
(VllmWorker rank=7 pid=150417) ERROR 04-24 07:04:06 [multiproc_executor.py:435]   File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
(VllmWorker rank=6 pid=150416) ERROR 04-24 07:04:06 [multiproc_executor.py:435] ImportError: /home/logs/huanxi/lmcache_venv_hu/lib/python3.10/site-packages/torchac_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c106detail23torchInternalAssertFailEPKcS2_jS2_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

@ApostaC ApostaC changed the title [WIP] [v1] [P/D] Adding LMCache KV connector for v1 [v1] [P/D] Adding LMCache KV connector for v1 Apr 25, 2025
Signed-off-by: KuntaiDu <[email protected]>
@KuntaiDu KuntaiDu enabled auto-merge (squash) April 26, 2025 01:16
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 26, 2025
@KuntaiDu KuntaiDu merged commit 5e83a72 into vllm-project:main Apr 26, 2025
68 checks passed

### Prerequisites

- Install [LMCache](https://github.com/ai-dynamo/lmcache)
A Contributor left a review comment on this line; ApostaC (Collaborator) replied:
Oh thanks for catching. I will submit a PR to fix this.

jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
adobrzyn pushed a commit to HabanaAI/vllm-fork that referenced this pull request Apr 30, 2025
os.environ["LMCACHE_REMOTE_SERDE"] = "naive"

prompts = [
"Hello, how are you?" * 1000,
zejun-chen commented on this snippet on May 6, 2025:

Hi @ApostaC,
We have a simple question: here the input prompt "Hello, how are you?" is repeated 1000 times. Does the following feature of LMCache mean that the KV cache can only be shared when the input prompts from different requests are exactly the same?
Flexible KV cache pooling (sharing KV cache across multiple vLLM instances)

zhaotyer (Contributor) commented May 8, 2025

[1746690477.974497] [llm206:611  :0]     ucp_context.c:1268 UCX  WARN  transports 'cuda_ipc','cuda_copy' are not available, please use one or more of: mm, posix, self, shm, sm, sysv, tcp
Backend UCX was instantiated
Initialized NIXL agent: NixlRole.SENDER
ERROR 05-08 00:47:57 [core.py:396] EngineCore failed to start.
ERROR 05-08 00:47:57 [core.py:396] Traceback (most recent call last):
ERROR 05-08 00:47:57 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
ERROR 05-08 00:47:57 [core.py:396]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 05-08 00:47:57 [core.py:396]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-08 00:47:57 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 329, in __init__
ERROR 05-08 00:47:57 [core.py:396]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 05-08 00:47:57 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 64, in __init__
ERROR 05-08 00:47:57 [core.py:396]     self.model_executor = executor_class(vllm_config)
ERROR 05-08 00:47:57 [core.py:396]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-08 00:47:57 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 05-08 00:47:57 [core.py:396]     self._init_executor()
ERROR 05-08 00:47:57 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 46, in _init_executor
ERROR 05-08 00:47:57 [core.py:396]     self.collective_rpc("init_device")
ERROR 05-08 00:47:57 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 05-08 00:47:57 [core.py:396]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 05-08 00:47:57 [core.py:396]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-08 00:47:57 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2456, in run_method
ERROR 05-08 00:47:57 [core.py:396]     return func(*args, **kwargs)
ERROR 05-08 00:47:57 [core.py:396]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 05-08 00:47:57 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 604, in init_device
ERROR 05-08 00:47:57 [core.py:396]     self.worker.init_device()  # type: ignore
ERROR 05-08 00:47:57 [core.py:396]     ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-08 00:47:57 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 135, in init_device
ERROR 05-08 00:47:57 [core.py:396]     init_worker_distributed_environment(self.vllm_config, self.rank,
ERROR 05-08 00:47:57 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 329, in init_worker_distributed_environment
ERROR 05-08 00:47:57 [core.py:396]     ensure_kv_transfer_initialized(vllm_config)
ERROR 05-08 00:47:57 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_transfer_state.py", line 63, in ensure_kv_transfer_initialized
ERROR 05-08 00:47:57 [core.py:396]     _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector_v1(
ERROR 05-08 00:47:57 [core.py:396]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-08 00:47:57 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/factory.py", line 73, in create_connector_v1
ERROR 05-08 00:47:57 [core.py:396]     return connector_cls(config, role)
ERROR 05-08 00:47:57 [core.py:396]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-08 00:47:57 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/v1/lmcache_connector.py", line 25, in __init__
ERROR 05-08 00:47:57 [core.py:396]     self._lmcache_engine = LMCacheConnectorV1Impl(vllm_config, role, self)
ERROR 05-08 00:47:57 [core.py:396]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-08 00:47:57 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/lmcache/integration/vllm/vllm_v1_adapter.py", line 314, in __init__
ERROR 05-08 00:47:57 [core.py:396]     self.lmcache_engine = init_lmcache_engine(
ERROR 05-08 00:47:57 [core.py:396]                           ^^^^^^^^^^^^^^^^^^^^
ERROR 05-08 00:47:57 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/lmcache/integration/vllm/vllm_adapter.py", line 111, in init_lmcache_engine
ERROR 05-08 00:47:57 [core.py:396]     engine = LMCacheEngineBuilder.get_or_create(ENGINE_NAME, config, metadata,
ERROR 05-08 00:47:57 [core.py:396]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-08 00:47:57 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/cache_engine.py", line 449, in get_or_create
ERROR 05-08 00:47:57 [core.py:396]     engine = LMCacheEngine(config, metadata, memory_allocator,
ERROR 05-08 00:47:57 [core.py:396]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-08 00:47:57 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/cache_engine.py", line 98, in __init__
ERROR 05-08 00:47:57 [core.py:396]     self.storage_manager = DistributedStorageManager(
ERROR 05-08 00:47:57 [core.py:396]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-08 00:47:57 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/storage_backend/storage_manager.py", line 535, in __init__
ERROR 05-08 00:47:57 [core.py:396]     self.storage_backend = NixlBackend.CreateNixlBackend(config, metadata)
ERROR 05-08 00:47:57 [core.py:396]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-08 00:47:57 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/storage_backend/nixl_backend.py", line 412, in CreateNixlBackend
ERROR 05-08 00:47:57 [core.py:396]     backend = NixlBackend(nixl_config)
ERROR 05-08 00:47:57 [core.py:396]               ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-08 00:47:57 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/storage_backend/nixl_backend.py", line 249, in __init__
ERROR 05-08 00:47:57 [core.py:396]     self._nixl_channel = NixlChannel(nixl_config)
ERROR 05-08 00:47:57 [core.py:396]                          ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-08 00:47:57 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/storage_backend/connector/nixl_connector_v2.py", line 454, in __init__
ERROR 05-08 00:47:57 [core.py:396]     self._pipe = NixlPipe(nixl_config, self._side_channel)
ERROR 05-08 00:47:57 [core.py:396]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-08 00:47:57 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/storage_backend/connector/nixl_connector_v2.py", line 190, in __init__
ERROR 05-08 00:47:57 [core.py:396]     self._reg_descs = self._agent.register_memory(self._transfer_buffers)
ERROR 05-08 00:47:57 [core.py:396]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-08 00:47:57 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/nixl/_api.py", line 265, in register_memory
ERROR 05-08 00:47:57 [core.py:396]     self.agent.registerMem(reg_descs, handle_list)
ERROR 05-08 00:47:57 [core.py:396] nixl._bindings.nixlBackendError: NIXL_ERR_BACKEND

RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025
Labels: documentation, ready, v1
10 participants