Description
Motivation
LMDeploy's 4-bit quantized prefix cache (along with 4-bit AWQ for weights) allows running ~70B models on 48GB of VRAM with good performance in many-user scenarios. The prefix cache can hold more than 40,000 context tokens.
This is very handy, since it's often easier to get a GPU (or dual GPUs) with 48GB of VRAM than it is to get 80GB+ GPUs.
Note that I've benchmarked the output quality/accuracy of the 4-bit prefix cache vs. an unquantized cache, and there was no significant accuracy drop on my internal benchmarks. For my use case, at least, it's a free perf boost.
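To give a sense of why cache precision matters at this scale, here's a rough back-of-envelope sketch I put together (a minimal illustration, assuming the usual Llama 2 70B attention shape of 80 layers, 8 KV heads via GQA, and head dim 128, and ignoring quantization scale/zero-point overhead; it's not LMDeploy's exact accounting):

```python
# Back-of-envelope KV-cache sizing for Llama 2 70B.
# Assumed shape: 80 layers, 8 KV heads (GQA), head_dim 128; scale/zero overhead ignored.
layers, kv_heads, head_dim = 80, 8, 128
elems_per_token = 2 * layers * kv_heads * head_dim  # K and V entries per token

for name, bytes_per_elem in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    per_token_bytes = elems_per_token * bytes_per_elem
    gib_for_40k = 40_000 * per_token_bytes / 2**30
    print(f"{name}: {per_token_bytes / 1024:.0f} KiB/token, "
          f"~{gib_for_40k:.1f} GiB for 40k cached tokens")

# Roughly: fp16 ~320 KiB/token (~12.2 GiB for 40k tokens),
#          int4  ~80 KiB/token  (~3.1 GiB for 40k tokens).
```

With the 4-bit AWQ weights of a 70B model already taking up most of the 48GB, an fp16 cache budget runs out very quickly, while an int4 cache leaves room for 40k+ cached tokens.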
Today I wanted to try comparing SGLang's performance to LMDeploy's, but (for a 70B model on a 48GB GPU) SGLang OOMs even with a small number of concurrent requests.
I'm testing with a Llama 2 70B AWQ model with ~2k-token contexts and ~100-token outputs (a rough load-test sketch follows the two launch commands below):
LMDeploy (handles 20 concurrent requests fine):
Using the latest (openmmlab/lmdeploy:v0.6.0a0-cu12) docker image on a 48GB NVIDIA A40 GPU:
lmdeploy serve api_server lmdeploy/llama2-chat-70b-4bit --server-port 3000 --tp $(nvidia-smi -L | wc -l) --session-len 8192 --model-format awq --enable-prefix-caching --quant-policy 4 --log-level INFO
SGLang (OOM at >=4 concurrent requests):
Using the latest (lmsysorg/sglang:v0.3.0-cu121) docker image on a 48GB NVIDIA A40 GPU:
python3 -m sglang.launch_server --model-path lmdeploy/llama2-chat-70b-4bit --context-length 8192 --host 0.0.0.0 --port 3000 --tp-size $(nvidia-smi -L | wc -l)
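For reproducibility, this is roughly the kind of load I'm generating against both servers (a minimal sketch, assuming the OpenAI-compatible /v1/chat/completions endpoint that both servers serve on port 3000 here; the filler prompt, repeat count, and model name are placeholders rather than my exact benchmark inputs):

```python
# Rough concurrency sketch: N simultaneous chat requests with ~2k-token prompts
# and ~100-token outputs against an OpenAI-compatible server on localhost:3000.
# Assumes `pip install openai`; adjust MODEL to whatever /v1/models reports.
import asyncio
from openai import AsyncOpenAI

CONCURRENCY = 20  # LMDeploy handles this fine; SGLang OOMs for me at >=4
MODEL = "lmdeploy/llama2-chat-70b-4bit"
# Filler text; tweak the repeat count to land near ~2k tokens for your tokenizer.
PROMPT = "Summarize the following text.\n" + ("lorem ipsum " * 700)

client = AsyncOpenAI(base_url="http://localhost:3000/v1", api_key="none")

async def one_request(i: int) -> int:
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=100,
        temperature=0,
    )
    return len(resp.choices[0].message.content or "")

async def main() -> None:
    lengths = await asyncio.gather(*(one_request(i) for i in range(CONCURRENCY)))
    print(f"{len(lengths)} requests completed, avg output chars: {sum(lengths) / len(lengths):.0f}")

asyncio.run(main())
```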
For reference, here are some example OOM logs from SGLang that I'm seeing: https://gist.github.com/josephrocca/1c688e312f5d570ca9a4652485ff6a24
It would be great if SGLang could become competitive with LMDeploy in this type of scenario, and I think it's hard to compete in a many-user scenario without a 4-bit quantized prefix cache.
Related resources
No response