
[Feature] 4-bit quantized prefix cache #1374

Closed
@josephrocca

Description

@josephrocca

Motivation

LMDeploy's 4-bit quantized prefix cache (along with 4-bit AWQ for weights) allows running ~70B models on 48 GB of VRAM with good performance in many-user scenarios. The prefix cache can hold more than 40,000 context tokens.

This is very handy, since it's often easier to get a GPU (or dual GPUs) with 48 GB of VRAM than it is to get 80GB+ GPUs.

Note that I've benchmarked the output quality/accuracy of the 4-bit prefix cache against no quantization, and there was no significant accuracy drop on my internal benchmarks. For my use case, at least, it's a free performance boost.

Today I wanted to compare SGLang's performance to LMDeploy's, but (for a 70B model on a 48 GB GPU) SGLang OOMs with even a small number of concurrent requests.

I'm testing with a Llama 2 70B AWQ model with ~2k-token contexts and ~100-token outputs:

LMDeploy (handles 20 concurrent requests fine):

Using latest (openmmlab/lmdeploy:v0.6.0a0-cu12) docker image on 48GB NVIDIA A40 GPU:

lmdeploy serve api_server lmdeploy/llama2-chat-70b-4bit --server-port 3000 --tp $(nvidia-smi -L | wc -l) --session-len 8192 --model-format awq --enable-prefix-caching --quant-policy 4 --log-level INFO

SGLang (OOM at >=4 concurrent requests):

Using latest (lmsysorg/sglang:v0.3.0-cu121) docker image on 48GB NVIDIA A40 GPU:

python3 -m sglang.launch_server --model-path lmdeploy/llama2-chat-70b-4bit --context-length 8192 --host 0.0.0.0 --port 3000 --tp-size $(nvidia-smi -L | wc -l)

For reference, here's some example OOM logs from SGLang that I'm seeing: https://gist.github.com/josephrocca/1c688e312f5d570ca9a4652485ff6a24
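For context, the workload above can be reproduced with a small harness like the following, which fires concurrent completion requests at the OpenAI-compatible `/v1/completions` endpoint that both LMDeploy's `api_server` and SGLang expose. This is a sketch, not the exact script I used; the endpoint path, payload shape, and prompt contents are assumptions.

```python
# Hypothetical concurrency-test sketch (not the author's actual script):
# sends N parallel completion requests to an OpenAI-compatible server
# such as LMDeploy api_server or sglang.launch_server on port 3000.
import concurrent.futures
import json
import urllib.request


def build_payload(prompt, max_tokens=100):
    # ~100-token outputs, matching the workload described above.
    return {
        "model": "lmdeploy/llama2-chat-70b-4bit",
        "prompt": prompt,
        "max_tokens": max_tokens,
    }


def send(payload, url="http://localhost:3000/v1/completions"):
    # One blocking HTTP POST; the server's continuous batching handles
    # interleaving the concurrent requests.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_concurrent(prompts, workers=20):
    # 20 workers ~ the 20 concurrent requests LMDeploy handled fine
    # (SGLang OOMed at >=4 in my tests).
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(lambda p: send(build_payload(p)), prompts))
```

Each prompt in my tests was a shared ~2k-token prefix, which is exactly the case where a quantized prefix cache pays off.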

It would be great if SGLang could become competitive with LMDeploy in this type of scenario, and I think it's hard to compete in a many-user scenario without a 4-bit quantized prefix cache.

