Commit 99f0efd

ApostaC authored and Yuqi Zhang committed
[v1] [P/D] Adding LMCache KV connector for v1 (vllm-project#16625)
Signed-off-by: Yuqi Zhang <[email protected]>
1 parent 3fa950a commit 99f0efd

12 files changed, +793 -0 lines changed

examples/lmcache/README.md

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
# LMCache Examples

This folder demonstrates how to use LMCache for disaggregated prefilling, CPU offloading and KV cache sharing.

## 1. Disaggregated Prefill in vLLM v1

This example demonstrates how to run LMCache with disaggregated prefill using NIXL on a single node.

### Prerequisites

- Install [LMCache](https://github.com/LMCache/LMCache)
- Install [NIXL](https://github.com/ai-dynamo/nixl)
- At least 2 GPUs
- A valid Hugging Face token (HF_TOKEN) with access to Llama 3.1 8B Instruct

### Usage

Run

`cd disagg_prefill_lmcache_v1`

to enter the `disagg_prefill_lmcache_v1` folder, and then run

```bash
bash disagg_example_nixl.sh
```

to run disaggregated prefill and benchmark the performance.
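
Once the example is running, the proxy exposes an OpenAI-compatible `/v1/completions` endpoint on `localhost:9000` and forwards requests to the prefiller and decoder (see the main script below). As a quick smoke test you can send a request to it yourself; the snippet below is a minimal sketch using the `requests` library, assuming the default host, port and model used by the scripts in this example.

```python
# Minimal smoke test against the proxy started by disagg_example_nixl.sh.
# Host, port and model name are the defaults used by the scripts in this
# example; adjust them if you changed the launch configuration.
import requests

resp = requests.post(
    "http://localhost:9000/v1/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "Explain disaggregated prefill in one sentence.",
        "max_tokens": 64,
        "temperature": 0,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```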

### Components

#### Server Scripts

- `disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh` - Launches the individual vLLM servers for prefill/decode, and also launches the proxy server.
- `disagg_prefill_lmcache_v1/disagg_proxy_server.py` - FastAPI proxy server that coordinates between the prefiller and the decoder.
- `disagg_prefill_lmcache_v1/disagg_example_nixl.sh` - Main script to run the example.

#### Configuration

- `disagg_prefill_lmcache_v1/configs/lmcache-prefiller-config.yaml` - Configuration for the prefiller server.
- `disagg_prefill_lmcache_v1/configs/lmcache-decoder-config.yaml` - Configuration for the decoder server.

#### Log Files

The main script generates several log files:

- `prefiller.log` - Logs from the prefill server
- `decoder.log` - Logs from the decode server
- `proxy.log` - Logs from the proxy server

## 2. CPU Offload Examples

- `cpu_offload_lmcache_v0.py` - CPU offloading implementation for vLLM v0
- `cpu_offload_lmcache_v1.py` - CPU offloading implementation for vLLM v1

## 3. KV Cache Sharing

The `kv_cache_sharing_lmcache_v1.py` example demonstrates how to share KV caches between vLLM v1 instances.
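
The sharing example file itself is not shown in this hunk, so the snippet below is only an illustrative sketch of the idea: both vLLM v1 instances use the `LMCacheConnectorV1` connector (as in the CPU-offload example above) and point at a common LMCache backend. The `LMCACHE_REMOTE_URL` and `LMCACHE_REMOTE_SERDE` variables are assumptions that follow the `LMCACHE_*` naming pattern used in `cpu_offload_lmcache_v1.py`; consult the actual example for the exact setup.

```python
# Illustrative sketch only -- see kv_cache_sharing_lmcache_v1.py for the real
# setup. Each vLLM instance would normally run in its own process on its own
# GPU; the shared LMCache backend address below is a placeholder assumption.
import os

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

os.environ["LMCACHE_USE_EXPERIMENTAL"] = "True"
os.environ["LMCACHE_CHUNK_SIZE"] = "256"
os.environ["LMCACHE_REMOTE_URL"] = "lm://localhost:65432"  # assumed shared backend
os.environ["LMCACHE_REMOTE_SERDE"] = "naive"               # assumed serializer

ktc = KVTransferConfig.from_cli(
    '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}')

# The first instance stores the KV cache of the long shared prefix.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct",
          kv_transfer_config=ktc,
          max_model_len=8000,
          gpu_memory_utilization=0.8)
prompt = "Hello, how are you?" * 1000 + "Hello, my name is"
llm.generate([prompt], SamplingParams(temperature=0, max_tokens=10))
# A second instance configured the same way can then reuse that KV cache
# instead of recomputing the shared prefix.
```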

## 4. Disaggregated Prefill in vLLM v0

The `disaggregated_prefill_lmcache_v0.py` script provides an example of how to run disaggregated prefill in vLLM v0.
examples/lmcache/cpu_offload_lmcache_v1.py

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
# SPDX-License-Identifier: Apache-2.0
"""
This file demonstrates example usage of CPU offloading
with LMCache in vLLM v1.

Note that LMCache needs to be installed to run this example.
Learn more about LMCache in https://github.com/LMCache/LMCache.
"""
import os

from lmcache.experimental.cache_engine import LMCacheEngineBuilder
from lmcache.integration.vllm.utils import ENGINE_NAME

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# LMCache-related environment variables
# Use experimental features in LMCache
os.environ["LMCACHE_USE_EXPERIMENTAL"] = "True"
# LMCache is set to use 256 tokens per chunk
os.environ["LMCACHE_CHUNK_SIZE"] = "256"
# Enable local CPU backend in LMCache
os.environ["LMCACHE_LOCAL_CPU"] = "True"
# Set local CPU memory limit to 5.0 GB
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5.0"

# This example script runs two requests with a shared prefix.
shared_prompt = "Hello, how are you?" * 1000
first_prompt = [
    shared_prompt + "Hello, my name is",
]
second_prompt = [
    shared_prompt + "Tell me a very long story",
]

sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=10)

ktc = KVTransferConfig.from_cli(
    '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}')
# Set GPU memory utilization to 0.8 for an A40 GPU with 40GB
# memory. Reduce the value if your GPU has less memory.
# Note that LMCache is not compatible with chunked prefill for now.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct",
          kv_transfer_config=ktc,
          max_model_len=8000,
          gpu_memory_utilization=0.8)

# Should be able to see logs like the following:
# `LMCache INFO: Storing KV cache for 6006 out of 6006 tokens for request 0`
# This indicates that the KV cache has been stored in LMCache.
outputs = llm.generate(first_prompt, sampling_params)
for output in outputs:
    generated_text = output.outputs[0].text
    print(f"Generated text: {generated_text!r}")

# Clean up lmcache backend
LMCacheEngineBuilder.destroy(ENGINE_NAME)
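
The file defines `second_prompt` but never issues it, even though the comment above mentions two requests with a shared prefix. Continuing from the example, a possible follow-up (a sketch, not part of the committed file) would issue the second request to exercise LMCache's retrieval path:

```python
# Sketch (not in the committed file): issue the second request before calling
# LMCacheEngineBuilder.destroy(), so the shared prefix can be served from the
# CPU cache populated by the first request. LMCache should report retrieving
# the stored prefix in its logs.
outputs = llm.generate(second_prompt, sampling_params)
for output in outputs:
    print(f"Generated text: {output.outputs[0].text!r}")
```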
examples/lmcache/disagg_prefill_lmcache_v1/configs/lmcache-decoder-config.yaml

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
1+
local_cpu: False
2+
max_local_cpu_size: 0
3+
#local_disk:
4+
max_local_disk_size: 0
5+
remote_serde: NULL
6+
7+
enable_nixl: True
8+
nixl_role: "receiver"
9+
nixl_peer_host: "localhost"
10+
nixl_peer_port: 55555
11+
nixl_buffer_size: 1073741824 # 1GB
12+
nixl_buffer_device: "cuda"
13+
nixl_enable_gc: True
examples/lmcache/disagg_prefill_lmcache_v1/configs/lmcache-prefiller-config.yaml

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
1+
local_cpu: False
2+
max_local_cpu_size: 0
3+
#local_disk:
4+
max_local_disk_size: 0
5+
remote_serde: NULL
6+
7+
enable_nixl: True
8+
nixl_role: "sender"
9+
nixl_peer_host: "localhost"
10+
nixl_peer_port: 55555
11+
nixl_buffer_size: 1073741824 # 1GB
12+
nixl_buffer_device: "cuda"
13+
nixl_enable_gc: True
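
The two YAML files configure the LMCache engine inside the corresponding vLLM process and differ only in `nixl_role` (`sender` for the prefiller, `receiver` for the decoder). The snippet below is a rough sketch of how a prefiller instance might pick up its config; the `LMCACHE_CONFIG_FILE` environment variable and the `kv_producer` role are assumptions about how the launcher wires things together, and `disagg_vllm_launcher.sh` in this commit is the authoritative reference.

```python
# Rough sketch: a prefiller vLLM v1 instance using lmcache-prefiller-config.yaml.
# LMCACHE_CONFIG_FILE and kv_producer are assumptions; see disagg_vllm_launcher.sh
# for the actual command lines used in this example.
import os

from vllm import LLM
from vllm.config import KVTransferConfig

os.environ["LMCACHE_USE_EXPERIMENTAL"] = "True"
os.environ["LMCACHE_CONFIG_FILE"] = "configs/lmcache-prefiller-config.yaml"

ktc = KVTransferConfig.from_cli(
    '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_producer"}')

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct",
          kv_transfer_config=ktc,
          max_model_len=8000,
          gpu_memory_utilization=0.8)
```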
examples/lmcache/disagg_prefill_lmcache_v1/disagg_example_nixl.sh

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
#!/bin/bash

echo "Warning: LMCache disaggregated prefill support for vLLM v1 is experimental and subject to change."


PIDS=()

# Switch to the directory of the current script
cd "$(dirname "${BASH_SOURCE[0]}")"

check_hf_token() {
    if [ -z "$HF_TOKEN" ]; then
        echo "HF_TOKEN is not set. Please set it to your Hugging Face token."
        exit 1
    fi
    if [[ "$HF_TOKEN" != hf_* ]]; then
        echo "HF_TOKEN is not a valid Hugging Face token. Please set it to your Hugging Face token."
        exit 1
    fi
    echo "HF_TOKEN is set and valid."
}

check_num_gpus() {
    # Check that at least 2 GPUs are available via nvidia-smi
    num_gpus=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
    if [ "$num_gpus" -lt 2 ]; then
        echo "You need at least 2 GPUs to run disaggregated prefill."
        exit 1
    else
        echo "Found $num_gpus GPUs."
    fi
}

ensure_python_library_installed() {
    echo "Checking if $1 is installed..."
    python -c "import $1" > /dev/null 2>&1
    if [ $? -ne 0 ]; then
        if [ "$1" == "nixl" ]; then
            echo "$1 is not installed. Please refer to https://github.com/ai-dynamo/nixl for installation."
        else
            echo "$1 is not installed. Please install it via pip install $1."
        fi
        exit 1
    else
        echo "$1 is installed."
    fi
}

cleanup() {
    echo "Stopping everything…"
    trap - INT TERM      # prevent re-entrancy
    kill -- -$$          # negative PID == "this whole process group"
    wait                 # reap children so we don't leave zombies
    exit 0
}

wait_for_server() {
    local port=$1
    local timeout_seconds=1200
    local start_time=$(date +%s)

    echo "Waiting for server on port $port..."

    while true; do
        if curl -s "localhost:${port}/v1/completions" > /dev/null; then
            return 0
        fi

        local now=$(date +%s)
        if (( now - start_time >= timeout_seconds )); then
            echo "Timeout waiting for server"
            return 1
        fi

        sleep 1
    done
}


main() {
    check_hf_token
    check_num_gpus
    ensure_python_library_installed lmcache
    ensure_python_library_installed nixl
    ensure_python_library_installed pandas
    ensure_python_library_installed datasets
    ensure_python_library_installed vllm

    trap cleanup INT
    trap cleanup USR1
    trap cleanup TERM

    echo "Launching prefiller, decoder and proxy..."
    echo "Please check prefiller.log, decoder.log and proxy.log for logs."

    bash disagg_vllm_launcher.sh prefiller \
        > >(tee prefiller.log) 2>&1 &
    prefiller_pid=$!
    PIDS+=($prefiller_pid)

    bash disagg_vllm_launcher.sh decoder \
        > >(tee decoder.log) 2>&1 &
    decoder_pid=$!
    PIDS+=($decoder_pid)

    python3 disagg_proxy_server.py \
        --host localhost \
        --port 9000 \
        --prefiller-host localhost \
        --prefiller-port 8100 \
        --decoder-host localhost \
        --decoder-port 8200 \
        > >(tee proxy.log) 2>&1 &
    proxy_pid=$!
    PIDS+=($proxy_pid)

    wait_for_server 8100
    wait_for_server 8200
    wait_for_server 9000

    echo "All servers are up. Starting benchmark..."

    # begin benchmark
    cd ../../../benchmarks/
    python benchmark_serving.py --port 9000 --seed $(date +%s) \
        --model meta-llama/Llama-3.1-8B-Instruct \
        --dataset-name random --random-input-len 7500 --random-output-len 200 \
        --num-prompts 200 --burstiness 100 --request-rate 3.6 | tee benchmark.log

    echo "Benchmarking done. Cleaning up..."

    cleanup

}

main
