Here is the development roadmap for 2024 Q4. Contributions and feedback are welcome (join the bi-weekly development meeting). The previous 2024 Q3 roadmap can be found in #634.
Performance
- Hide CPU overhead with an overlapped scheduler (Faster overlap mode scheduler #1738, Enable overlap by default #2067); a minimal sketch of the overlap idea follows this list
- Support speculative decoding
  - EAGLE (Eagle speculative decoding part 4: Add EAGLE2 worker #2150)
  - Reference-based (Reference speculative decoding #270)
  - Medusa head ([Feature] plan to support medusa? #859)
  - Draft-model based
- Sparse attention (Support double sparsity #1459)
- Faster grammar-parsing library for constrained decoding ([Performance] Support both xgrammar and outlines for constrained decoding #1752)
- Multi-layer radix cache (GPU/CPU/disk) (Hierarchical Caching for SGLang #2693) @xiezhq-hermann
- Improve the performance of mixed chunked prefill; see the draft in Rewrite mixed chunked prefill #1383
- Integrate cuDNN paged attention kernels
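
To illustrate the overlap-scheduler item above: the idea is to prepare batch i+1 on the CPU while the GPU is still executing batch i, so per-step CPU work (batch assembly, memory management, sampling setup) leaves the critical path. A minimal sketch, with hypothetical `schedule_batch`/`run_on_gpu` stand-ins rather than SGLang's actual internals:

```python
import queue
import threading
import time

# Toy stand-ins: schedule_batch() is the CPU-side work we want to hide,
# run_on_gpu() is one decode step of the model forward pass.
def schedule_batch(step: int) -> dict:
    time.sleep(0.002)  # CPU overhead (batching, memory allocation, sampling prep)
    return {"step": step}

def run_on_gpu(batch: dict) -> None:
    time.sleep(0.010)  # GPU forward pass for one decode step

def overlapped_loop(num_steps: int) -> None:
    batches: queue.Queue = queue.Queue(maxsize=2)

    def scheduler() -> None:
        # Prepares batch i+1 while the consumer still runs batch i, so the
        # 2 ms of CPU work overlaps with the 10 ms of GPU work.
        for step in range(num_steps):
            batches.put(schedule_batch(step))
        batches.put(None)

    threading.Thread(target=scheduler, daemon=True).start()
    while (batch := batches.get()) is not None:
        run_on_gpu(batch)

overlapped_loop(num_steps=8)
```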
Parallelism
- Support sequence parallelism ([Feature] Add initial support for sequence parallelism #1436). Related paper
- Support pipeline parallelism.
- Support expert parallelism + data parallelism for DeepSeek/MoE models. @ispobock
  - Data parallelism (Support DP MLA #1970)
  - Expert parallelism ([Feature] Expert parallelism support #1435)
- Implement a better cache-aware load balancer for data parallelism ([router] cache-aware load-balancing router v1 #2114, [Feature] Cache-aware Data Parallel Router #1732); a routing sketch follows this list. @ByronHsu @yichuan520030910320
- Overlap communication in tensor parallelism. @ZhuohaoL
- Support disaggregated serving to separate prefill and decoding.
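
For the cache-aware load balancer above, the core idea is that a request should land on the data-parallel worker whose radix cache already holds the longest matching prefix, falling back to plain load balancing on a miss. A minimal sketch with hypothetical bookkeeping (the real router of #2114 approximates each worker's radix tree rather than remembering raw prompt strings):

```python
def shared_prefix_len(a: str, b: str) -> int:
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i

class CacheAwareRouter:
    def __init__(self, num_workers: int) -> None:
        self.seen_prompts: list[list[str]] = [[] for _ in range(num_workers)]
        self.load: list[int] = [0] * num_workers

    def route(self, prompt: str) -> int:
        # Prefer the worker whose cache shares the longest prefix with this
        # prompt; fall back to the least-loaded worker when nothing matches.
        best_worker, best_match = None, 0
        for w, prompts in enumerate(self.seen_prompts):
            match = max((shared_prefix_len(prompt, p) for p in prompts), default=0)
            if match > best_match:
                best_worker, best_match = w, match
        if best_worker is None:
            best_worker = min(range(len(self.load)), key=self.load.__getitem__)
        self.seen_prompts[best_worker].append(prompt)
        self.load[best_worker] += 1
        return best_worker

router = CacheAwareRouter(num_workers=2)
a = router.route("You are a helpful assistant. Summarize: ...")
b = router.route("You are a helpful assistant. Translate: ...")
assert a == b  # shared system prompt routes to the same worker's cache
```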
Hardware Coverage
- AMD optimizations. cc @HaiShaw
  - CK kernels
  - Set up CI (accuracy/performance) for AMD
- Intel XPU support.
Model Coverage
- Multi-modal models
  - Llama 3.2 Vision (Llama3.2 vision model support #1551)
  - Qwen2-VL (Support qwen2 vl model #1546)
  - DeepSeek-VL2 ([Feature] Support DeepSeek VL 2 #2653)
  - mistralai/Pixtral ([Feature] Support mistralai/Pixtral #2351)
  - GLM-4V (Add GLM-4v Multimodal Model support for SGLang #1641)
  - VILA (https://arxiv.org/abs/2412.04468)
  - InternVL
  - Phi-vision
  - FishSpeech audio model support
  - Ultravox
- Language models
  - Mamba models @rahulbatra85 @HaiShaw
  - xLSTM
- Reward models
New Features
- Integrate with LMCache https://github.com/LMCache/LMCache
- A padded batch mode to make results more deterministic; a padding sketch follows this list
- Performance optimizations for multi-LoRA serving ([LoRA, Performance] Add gemm expand triton kernel for multi-LoRA #1728)
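
On the padded batch mode above: batch-size-dependent kernel shapes can change tiling and floating-point reduction order, so the same request can produce slightly different logits depending on what it happened to be batched with. Padding every batch up to one of a few fixed sizes keeps kernel shapes stable. A minimal sketch, with a hypothetical bucket list:

```python
import torch

BUCKETS = [1, 2, 4, 8, 16, 32]  # hypothetical fixed batch-size buckets

def pad_batch(input_ids: torch.Tensor, pad_token_id: int) -> tuple[torch.Tensor, int]:
    """Pad a (batch, seq_len) batch up to the next fixed bucket size."""
    bs = input_ids.shape[0]
    target = next(b for b in BUCKETS if b >= bs)
    if target == bs:
        return input_ids, bs
    pad = input_ids.new_full((target - bs, input_ids.shape[1]), pad_token_id)
    return torch.cat([input_ids, pad], dim=0), bs

batch = torch.randint(0, 32000, (3, 128))
padded, real_bs = pad_batch(batch, pad_token_id=0)
assert padded.shape[0] == 4
logits = padded.float()   # stand-in for model(padded)
logits = logits[:real_bs] # discard the padding rows after the forward pass
```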
Quantization
- Torchao integration (Add llama implementation with no tensor parallel linears #1561)
- TurboMind operators integration
- More CUTLASS mixed-precision GEMM integration
- KV cache quantization (more formats + scaling factors); a quantization sketch follows this list
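
For the KV cache quantization item, "scaling factor" refers to storing a scale alongside the low-precision cache so values can be mapped back at attention time. A minimal per-tensor FP8-style sketch (real formats also use per-head or per-channel scales):

```python
import torch

FP8_MAX = 448.0  # max representable magnitude in float8_e4m3fn

def quantize_kv(kv: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # Per-tensor scale chosen so the largest value maps to FP8_MAX.
    scale = kv.abs().amax().clamp(min=1e-6) / FP8_MAX
    q = (kv / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale  # the scale must be kept alongside the cache

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float16) * scale

kv = torch.randn(2, 8, 64, dtype=torch.float16)
q, scale = quantize_kv(kv)
print((dequantize_kv(q, scale) - kv).abs().max())  # small quantization error
```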
Server API
- Support directly taking embeddings as input ([Feature] Generation Inputs: input_embeds #745)
- Add APIs for using the inference engine in a single script without launching a separate server; see the examples and the usage sketch after this list
- Support endpoints other than OpenAI (Anthropic, Mistral) in the language frontend
- Better APIs to support RL trainers, including https://github.com/huggingface/trl and https://github.com/OpenRLHF/OpenRLHF @zhaochenyang20
- Support a generalized reward API (adding a linear layer to any causal LM to get a reward) https://github.com/OpenRLHF/OpenRLHF @zhaochenyang20
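
For the single-script engine item, usage would look roughly like the existing examples; this sketch assumes the `sgl.Engine` entry point and a dict of sampling parameters (argument names may differ across versions):

```python
import sglang as sgl

if __name__ == "__main__":
    # Runs the inference engine in-process, with no separate HTTP server.
    llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")
    prompts = ["The capital of France is", "The future of AI is"]
    outputs = llm.generate(prompts, {"temperature": 0.8, "max_new_tokens": 32})
    for prompt, out in zip(prompts, outputs):
        print(prompt, "->", out["text"])
    llm.shutdown()
```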
Observability
- Integrate Grafana / Prometheus; a metrics sketch follows
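
A sketch of what the Prometheus side of this integration could expose, using the `prometheus_client` library; the metric names here are illustrative, not SGLang's actual metric names:

```python
from prometheus_client import Counter, Gauge, start_http_server

# Hypothetical metric names; Grafana dashboards would chart these series.
generated_tokens = Counter("sglang_generated_tokens_total", "Total tokens generated")
running_requests = Gauge("sglang_running_requests", "Requests currently decoding")

start_http_server(9000)  # expose /metrics for Prometheus to scrape
generated_tokens.inc(128)
running_requests.set(4)
```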
Others
- Notebook-style interactive tutorials. @zhaochenyang20
- Compiler mode optimizations for the language (e.g., support sending a full serialized SGL program to the server). @hnyls2002
- Memory pool refactor to better support mixing different attention layers (e.g., interleaved window attention). @Ying1123
- Make vLLM an optional dependency ([Feature] Make vLLM optional in model code #1673). @zhyncs @ByronHsu @yizhang2077