Description
Checklist
- 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- 2. Please use English, otherwise it will be closed.
Motivation
The current KV-cache compression support in SGLang includes DoubleSparsity (approximate attention with token selection, #1459) and FP8 quantization (#2786), enabling effective compression for long-context inference. However, there are two significant opportunities to further enhance these capabilities:
1. Support for Quest (Improved Approximate Attention Method) [Short Term]
Recent studies, such as the HashAttention paper [1], indicate that Quest [2] provides a more accurate attention-approximation metric than DoubleSparsity. A more precise approximation would improve accuracy and allow higher sparsity in cases where precision is critical.
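For reference, Quest's selection rule is compact: each KV page stores the per-channel min/max of its keys, and at decode time the query scores each page by an upper bound on its attention logits, then keeps the top pages. Below is a minimal, illustrative PyTorch sketch of that criterion (function and tensor names are hypothetical, not SGLang internals):

```python
import torch

def quest_select_pages(q, k_page_min, k_page_max, num_pages_to_keep):
    """
    q:          [head_dim]             decode-step query for one head
    k_page_min: [num_pages, head_dim]  per-channel min of keys in each page
    k_page_max: [num_pages, head_dim]  per-channel max of keys in each page
    Returns indices of the pages with the largest upper-bound scores.
    """
    # Upper bound of q.k over any key in the page: per channel, take whichever
    # extreme (min or max) maximizes the product with q, then sum over channels.
    ub = torch.maximum(q * k_page_min, q * k_page_max).sum(dim=-1)  # [num_pages]
    return torch.topk(ub, k=min(num_pages_to_keep, ub.numel())).indices
```

The page metadata is small (two vectors per page), so the selection cost stays low while the bound is tighter than a per-channel outlier heuristic.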
2. Support Combining Token Selection with KV-Compression Approaches [Long Term]
DoubleSparsity retains all tokens in memory while selectively loading them for processing. As the context length grows, this retention can strain memory capacity. While CPU offloading provides a feasible workaround, a promising enhancement is to jointly support token selection and quantization. This combined approach would:
- Reduce memory requirements through quantization.
- Reduce latency by fusing dequantization into the attention kernels (a rough sketch follows this list).
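As a rough illustration of the combined approach, the sketch below assumes a simple per-token symmetric int8 layout (not SGLang's actual FP8 cache format) and shows the gather-then-dequantize step that a fused kernel would perform in a single pass over only the selected tokens:

```python
import torch

def gather_dequant_kv(k_q, v_q, k_scale, v_scale, token_idx):
    """
    k_q, v_q:         [num_tokens, head_dim] int8-quantized K / V cache
    k_scale, v_scale: [num_tokens, 1]        per-token dequantization scales
    token_idx:        [num_selected]         indices chosen by the sparsity metric
    """
    # Only the selected tokens are ever dequantized; a fused kernel would do this
    # inside the attention computation instead of materializing k / v here.
    k = k_q[token_idx].float() * k_scale[token_idx]
    v = v_q[token_idx].float() * v_scale[token_idx]
    return k, v
```

The key point is that memory stays compressed for all tokens, and the dequantization cost scales with the number of selected tokens rather than the full context length.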
Additionally, supporting token eviction methods, which permanently drop unimportant tokens, could further address memory constraints. As highlighted in #2510, token eviction methods such as SnapKV [3] and PyramidKV [4] would complement token selection by enabling aggressive memory management for long contexts or resource-limited settings, such as streaming applications or low-resource deployments.
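To make the eviction idea concrete, here is a minimal SnapKV-style sketch (illustrative only; names are hypothetical and details differ from the paper's implementation): attention weights from a trailing observation window vote for which prefix KV positions to keep, and the remaining positions are dropped permanently.

```python
import torch
import torch.nn.functional as F

def snapkv_keep_indices(attn, window, budget, kernel_size=5):
    """
    attn:   [num_heads, q_len, kv_len] prompt attention weights
    window: number of trailing query tokens used as the observation window
    budget: number of prefix KV positions to keep (in addition to the window)
    """
    # Aggregate the attention the observation window pays to each prefix position.
    votes = attn[:, -window:, :-window].sum(dim=(0, 1))            # [kv_len - window]
    # Smooth with pooling so neighboring tokens of important ones are also kept.
    votes = F.avg_pool1d(votes[None, None], kernel_size,
                         stride=1, padding=kernel_size // 2)[0, 0]
    keep_prefix = torch.topk(votes, k=min(budget, votes.numel())).indices
    keep_window = torch.arange(attn.shape[-1] - window, attn.shape[-1])
    return torch.cat([keep_prefix.sort().values, keep_window])
```

Unlike token selection, the dropped positions cannot be recovered later, which is why eviction pairs well with selection as an optional, more aggressive mode.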
Expected Improvement
By incorporating these two enhancements, SGLang can achieve:
- Improved Accuracy: Leveraging Quest for approximate attention will improve accuracy and inference reliability.
- Enhanced Memory Savings: Combining token selection with quantization and enabling token eviction will significantly reduce memory requirements, ensuring scalability across deployment scenarios.
Related resources
- HashAttention: https://arxiv.org/pdf/2412.14468v1
- Quest: https://arxiv.org/abs/2406.10774
- SnapKV: https://arxiv.org/abs/2404.14469
- PyramidKV: https://arxiv.org/abs/2406.02069
cc @merrymercy