Description
Motivation
LMDeploy's 4-bit quantized prefix cache (along with 4-bit AWQ for weights) allows running ~70B models on 48GB of VRAM with good performance in many-user scenarios. The prefix cache can hold more than 40,000 context tokens.
This is very handy, since it's often easier to get a GPU (or dual GPUs) with 48GB of VRAM than it is to get 80GB+ GPUs.
Note that I've benchmarked the output quality/accuracy of the 4-bit prefix cache vs. an unquantized cache, and there was no significant accuracy drop on my internal benchmarks. For my use case, at least, it's a free perf boost.
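To give a sense of why cache precision matters at this scale, here's a rough back-of-envelope sketch I put together (a minimal illustration, assuming the usual Llama 2 70B attention shape of 80 layers, 8 KV heads via GQA, and head dim 128, and ignoring quantization scale/zero-point overhead; it's not LMDeploy's exact accounting):

```python
# Back-of-envelope KV-cache sizing for Llama 2 70B.
# Assumed shape: 80 layers, 8 KV heads (GQA), head_dim 128; scale/zero overhead ignored.
layers, kv_heads, head_dim = 80, 8, 128
elems_per_token = 2 * layers * kv_heads * head_dim  # K and V entries per token

for name, bytes_per_elem in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    per_token_bytes = elems_per_token * bytes_per_elem
    gib_for_40k = 40_000 * per_token_bytes / 2**30
    print(f"{name}: {per_token_bytes / 1024:.0f} KiB/token, "
          f"~{gib_for_40k:.1f} GiB for 40k cached tokens")

# Roughly: fp16 ~320 KiB/token (~12.2 GiB for 40k tokens),
#          int4  ~80 KiB/token  (~3.1 GiB for 40k tokens).
```

With the 4-bit AWQ weights of a 70B model already taking up most of the 48GB, an fp16 cache budget runs out very quickly, while an int4 cache leaves room for 40k+ cached tokens.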
Today I wanted to try comparing SGLang's performance to LMDeploy's, but (for a 70B model on a 48GB GPU) SGLang OOMs even with a small number of concurrent requests.
I'm testing with a Llama 2 70B AWQ model with ~2k-token contexts and ~100-token outputs (a rough load-test sketch follows the two launch commands below):
LMDeploy (handles 20 concurrent requests fine):
Using the latest (openmmlab/lmdeploy:v0.6.0a0-cu12) docker image on a 48GB NVIDIA A40 GPU:
lmdeploy serve api_server lmdeploy/llama2-chat-70b-4bit --server-port 3000 --tp $(nvidia-smi -L | wc -l) --session-len 8192 --model-format awq --enable-prefix-caching --quant-policy 4 --log-level INFO
SGLang (OOM at >=4 concurrent requests):
Using the latest (lmsysorg/sglang:v0.3.0-cu121) docker image on a 48GB NVIDIA A40 GPU:
python3 -m sglang.launch_server --model-path lmdeploy/llama2-chat-70b-4bit --context-length 8192 --host 0.0.0.0 --port 3000 --tp-size $(nvidia-smi -L | wc -l)
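For reproducibility, this is roughly the kind of load I'm generating against both servers (a minimal sketch, assuming the OpenAI-compatible /v1/chat/completions endpoint that both servers serve on port 3000 here; the filler prompt, repeat count, and model name are placeholders rather than my exact benchmark inputs):

```python
# Rough concurrency sketch: N simultaneous chat requests with ~2k-token prompts
# and ~100-token outputs against an OpenAI-compatible server on localhost:3000.
# Assumes `pip install openai`; adjust MODEL to whatever /v1/models reports.
import asyncio
from openai import AsyncOpenAI

CONCURRENCY = 20  # LMDeploy handles this fine; SGLang OOMs for me at >=4
MODEL = "lmdeploy/llama2-chat-70b-4bit"
# Filler text; tweak the repeat count to land near ~2k tokens for your tokenizer.
PROMPT = "Summarize the following text.\n" + ("lorem ipsum " * 700)

client = AsyncOpenAI(base_url="http://localhost:3000/v1", api_key="none")

async def one_request(i: int) -> int:
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=100,
        temperature=0,
    )
    return len(resp.choices[0].message.content or "")

async def main() -> None:
    lengths = await asyncio.gather(*(one_request(i) for i in range(CONCURRENCY)))
    print(f"{len(lengths)} requests completed, avg output chars: {sum(lengths) / len(lengths):.0f}")

asyncio.run(main())
```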
For reference, here are some example OOM logs from SGLang that I'm seeing: https://gist.github.com/josephrocca/1c688e312f5d570ca9a4652485ff6a24
It would be great if SGLang could become competitive with LMDeploy in this type of scenario, and I think it's hard to compete in a many-user scenario without a 4-bit quantized prefix cache.
Related resources
No response