Description
Checklist
- 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- 2. Please use English, otherwise it will be closed.
Motivation
The current KV-cache compression support in SGLang includes DoubleSparsity (approximate attention with token selection, #1459) and FP8 quantization (#2786), enabling effective compression for long-context inference. However, there are two significant opportunities to further enhance these capabilities:
1. Support for Quest (Improved Approximate Attention Method) [Short Term]
Recent studies, such as the HashAttention paper [1], indicate that Quest [2] provides a more accurate attention-approximation metric than DoubleSparsity. A more precise approximation would improve accuracy and allow higher sparsity in cases where precision is critical.
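For reference, Quest's selection rule is compact: each KV page stores the per-channel min/max of its keys, and at decode time the query scores each page by an upper bound on its attention logits, then keeps the top pages. Below is a minimal, illustrative PyTorch sketch of that criterion (function and tensor names are hypothetical, not SGLang internals):

```python
import torch

def quest_select_pages(q, k_page_min, k_page_max, num_pages_to_keep):
    """
    q:          [head_dim]             decode-step query for one head
    k_page_min: [num_pages, head_dim]  per-channel min of keys in each page
    k_page_max: [num_pages, head_dim]  per-channel max of keys in each page
    Returns indices of the pages with the largest upper-bound scores.
    """
    # Upper bound of q.k over any key in the page: per channel, take whichever
    # extreme (min or max) maximizes the product with q, then sum over channels.
    ub = torch.maximum(q * k_page_min, q * k_page_max).sum(dim=-1)  # [num_pages]
    return torch.topk(ub, k=min(num_pages_to_keep, ub.numel())).indices
```

The page metadata is small (two vectors per page), so the selection cost stays low while the bound is tighter than a per-channel outlier heuristic.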
2. Support Combining Token Selection with KV-Compression Approaches [Long Term]
DoubleSparsity retains all tokens in memory while selectively loading them for processing. As the context length grows, this retention can strain memory capacity. While CPU offloading provides a feasible workaround, a promising enhancement is to jointly support token selection and quantization. This combined approach would:
- Reduce memory requirements through quantization.
- Reduce latency by fusing dequantization into the attention kernels (a rough sketch follows this list).
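As a rough illustration of the combined approach, the sketch below assumes a simple per-token symmetric int8 layout (not SGLang's actual FP8 cache format) and shows the gather-then-dequantize step that a fused kernel would perform in a single pass over only the selected tokens:

```python
import torch

def gather_dequant_kv(k_q, v_q, k_scale, v_scale, token_idx):
    """
    k_q, v_q:         [num_tokens, head_dim] int8-quantized K / V cache
    k_scale, v_scale: [num_tokens, 1]        per-token dequantization scales
    token_idx:        [num_selected]         indices chosen by the sparsity metric
    """
    # Only the selected tokens are ever dequantized; a fused kernel would do this
    # inside the attention computation instead of materializing k / v here.
    k = k_q[token_idx].float() * k_scale[token_idx]
    v = v_q[token_idx].float() * v_scale[token_idx]
    return k, v
```

The key point is that memory stays compressed for all tokens, and the dequantization cost scales with the number of selected tokens rather than the full context length.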
Additionally, supporting token eviction methods, which permanently drop unimportant tokens, could further address memory constraints. As highlighted in #2510, token eviction methods such as SnapKV [3] and PyramidKV [4] would complement token selection by enabling aggressive memory management for long contexts or resource-limited settings, such as streaming applications or low-resource deployments.
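To make the eviction idea concrete, here is a minimal SnapKV-style sketch (illustrative only; names are hypothetical and details differ from the paper's implementation): attention weights from a trailing observation window vote for which prefix KV positions to keep, and the remaining positions are dropped permanently.

```python
import torch
import torch.nn.functional as F

def snapkv_keep_indices(attn, window, budget, kernel_size=5):
    """
    attn:   [num_heads, q_len, kv_len] prompt attention weights
    window: number of trailing query tokens used as the observation window
    budget: number of prefix KV positions to keep (in addition to the window)
    """
    # Aggregate the attention the observation window pays to each prefix position.
    votes = attn[:, -window:, :-window].sum(dim=(0, 1))            # [kv_len - window]
    # Smooth with pooling so neighboring tokens of important ones are also kept.
    votes = F.avg_pool1d(votes[None, None], kernel_size,
                         stride=1, padding=kernel_size // 2)[0, 0]
    keep_prefix = torch.topk(votes, k=min(budget, votes.numel())).indices
    keep_window = torch.arange(attn.shape[-1] - window, attn.shape[-1])
    return torch.cat([keep_prefix.sort().values, keep_window])
```

Unlike token selection, the dropped positions cannot be recovered later, which is why eviction pairs well with selection as an optional, more aggressive mode.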
Expected Improvement
By incorporating these two enhancements, SGLang can achieve:
- Improved Accuracy: Leveraging Quest for approximate attention will improve accuracy and inference reliability.
- Enhanced Memory Savings: Combining token selection with quantization and enabling token eviction will significantly reduce memory requirements, ensuring scalability across deployment scenarios.
Related resources
- HashAttention: https://arxiv.org/pdf/2412.14468v1
- Quest: https://arxiv.org/abs/2406.10774
- SnapKV: https://arxiv.org/abs/2404.14469
- PyramidKV: https://arxiv.org/abs/2406.02069
cc @merrymercy