Here is the development roadmap for 2025 H1. Contributions and feedback are welcome (join the bi-weekly development meeting). The previous 2024 Q4 roadmap can be found in #1487.
Focus
- Throughput-oriented large-scale deployment, similar to the DeepSeek inference system
- Long context optimizations
- Low latency speculative decoding
- Reinforcement learning training framework integration
- Kernel optimizations
Parallelism
- Support PD disaggregation @ByronHsu [Roadmap] Prefill and Decoding Disaggregation #4655
- Support expert parallelism and load balancer One branch that contains EPLB + Two Batch Overlap + dependencies #5524
- Support pipeline parallelism @Ying1123 [PP] Add pipeline parallelism #5724
- Support data parallelism attention compatible with all other parallelism strategies Improve DP attention #4390
- Support overlapping communication in TP/EP @tom @Zhuohao-Li Support overlapping two batches #4068
- Improve sgl-router for better data parallelism @Qihang-Zhang (see the routing sketch after this list)
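To make the data-parallelism routing item above concrete, here is a minimal sketch of a cache-aware routing policy a DP router can use: send a request to the worker holding the longest matching cached prefix, otherwise to the least-loaded worker. The `Worker`/`route` names and thresholds are illustrative, not the actual sgl-router API or policy.

```python
# Minimal sketch (not the sgl-router API): cache-aware routing for DP workers.
# A request goes to the worker with the longest cached prefix match; if no
# worker has a useful match, fall back to the least-loaded worker.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Worker:
    name: str
    cached_prefixes: List[str] = field(default_factory=list)  # recently served prompts
    inflight: int = 0  # number of requests currently assigned

def _prefix_match_len(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(prompt: str, workers: List[Worker], min_match: int = 16) -> Worker:
    best, best_len = None, 0
    for w in workers:
        for p in w.cached_prefixes:
            m = _prefix_match_len(prompt, p)
            if m > best_len:
                best, best_len = w, m
    # Only honor the cache hit if the shared prefix is long enough to matter.
    if best is not None and best_len >= min_match:
        target = best
    else:
        target = min(workers, key=lambda w: w.inflight)
    target.inflight += 1
    target.cached_prefixes.append(prompt)
    return target

if __name__ == "__main__":
    workers = [Worker("dp0"), Worker("dp1")]
    first = route("You are a helpful assistant. Summarize: ...", workers)
    second = route("You are a helpful assistant. Translate: ...", workers)
    print(first.name, second.name)  # the shared system prompt steers both to the same worker
```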
Attention Backend
- Support Native FlashAttention3 as Attention Backend: [Roadmap] FlashAttention3 Support as SGLang Attention Backend #4709 @hebiao064 @qingquansong @zcnrex @Fridge003 @yinfan98
- Torch FlexAttention @HaiShaw @ispobock
Caching
- Optimize hierarchical caching (GPU/CPU/disk; see the sketch after this list) Hierarchical Caching for SGLang #2693 Hierarchical Caching supports MLA #4009 @xiezhq-hermann
- Integrate DeepSeek 3FS @yizhang2077
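As a rough illustration of the hierarchical caching item, the sketch below keeps hot KV entries in a GPU tier and demotes cold ones to a CPU tier instead of dropping them, promoting them back on reuse. This is a toy two-tier LRU pool under assumed interfaces, not SGLang's actual hierarchical cache implementation.

```python
# Minimal sketch (assumed interfaces): a two-tier KV-cache pool that keeps hot
# entries on the GPU and demotes cold ones to host memory instead of dropping
# them, so they can be promoted back on reuse.
from collections import OrderedDict

class HierarchicalKVCache:
    def __init__(self, gpu_capacity: int, cpu_capacity: int):
        self.gpu = OrderedDict()  # key -> KV handle, kept in LRU order
        self.cpu = OrderedDict()
        self.gpu_capacity = gpu_capacity
        self.cpu_capacity = cpu_capacity

    def get(self, key):
        if key in self.gpu:                       # GPU hit
            self.gpu.move_to_end(key)
            return self.gpu[key]
        if key in self.cpu:                       # CPU hit: promote back to GPU
            value = self.cpu.pop(key)
            self.put(key, value)
            return value
        return None                               # miss: caller must recompute

    def put(self, key, value):
        self.gpu[key] = value
        self.gpu.move_to_end(key)
        while len(self.gpu) > self.gpu_capacity:  # demote LRU entry to CPU
            old_key, old_value = self.gpu.popitem(last=False)
            self.cpu[old_key] = old_value
            while len(self.cpu) > self.cpu_capacity:  # finally evict from CPU
                self.cpu.popitem(last=False)

cache = HierarchicalKVCache(gpu_capacity=2, cpu_capacity=4)
cache.put("prefix-a", "kv-a")
cache.put("prefix-b", "kv-b")
cache.put("prefix-c", "kv-c")           # "prefix-a" is demoted to the CPU tier
assert cache.get("prefix-a") == "kv-a"  # promoted back, "prefix-b" demoted
```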
Kernel
- Integrate FlashAttention3 [Roadmap] FlashAttention3 Support as SGLang Attention Backend #4709
- Integrate DeepGEMM: linear support deepgemm #4199, Integrate DeepGemm contiguous group gemm into Fused MoE #4343
- Integrate FlashMLA Support FlashMLA backend #4472 Support FlashMLA backend cuda graph #4514
- Integrate cuDNN attention (reference)
- Integrate TransformerEngine layers
- Start to maintain performant attention ops in sgl-kernel
- Start to maintain more sparse attention ops in sgl-kernel
- Integrate Blackwell kernels from flashinfer [Feature] integrate FlashInfer Blackwell kernels #5855
Quantization
- MXFP4 support @HaiShaw
- INT4-FP8 MoE & Fused MoE @HaiShaw @carlushuang ROCm: enable trillion-parameter MoE models with INT4-FP8 single node #4152
- W8A8 (FP8 and INT8) implementation in sgl-kernel, removing the vLLM dependency (see the sketch after this list): Apply sgl w8a8 fp8 kernel #3148, support w8a8 fp8 kernel with CUTLASS #3047
- Integrate AWQ and GPTQ in sgl-kernel, removing the vLLM dependency
- Extend TorchAO support to additional models
- Blackwell FP4 support FP4 weight loading and inference (2/2) #3972
- Optional quantization support using vLLM's implementation (e.g., bnb, gguf)
- Communication quantization
- Unsloth model support @guapisolo @XueyingJia @yyihuang
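For the W8A8 item above, here is a minimal reference sketch of per-tensor FP8 quantization in PyTorch (it assumes a torch version that provides `torch.float8_e4m3fn`). It dequantizes and uses an ordinary matmul for clarity; the real sgl-kernel path uses fused CUTLASS FP8 GEMMs that consume the quantized tensors directly and apply the scales in the epilogue.

```python
# Minimal sketch of per-tensor W8A8 FP8 quantization (illustrative only; the
# sgl-kernel path uses fused CUTLASS GEMMs rather than dequantize-then-matmul).
import torch

FP8_MAX = 448.0  # max representable magnitude for float8_e4m3fn

def quantize_fp8(x: torch.Tensor):
    """Return (fp8 tensor, per-tensor scale) such that x ≈ q.float() * scale."""
    scale = x.abs().max().clamp(min=1e-12) / FP8_MAX
    q = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale

def w8a8_linear(activation: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    a_q, a_scale = quantize_fp8(activation)   # A8: quantize activations
    w_q, w_scale = quantize_fp8(weight)       # W8: quantize weights
    # Reference math: dequantize and use a normal matmul. A fused FP8 GEMM
    # would instead consume a_q / w_q directly and apply the scales at the end.
    return (a_q.float() @ w_q.float().t()) * (a_scale * w_scale)

x = torch.randn(4, 64)
w = torch.randn(128, 64)
out = w8a8_linear(x, w)
print((out - x @ w.t()).abs().max())  # small residual, within FP8 rounding error
```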
RL Framework integration
- veRL integration SGLang + Verl #3852 @fzyzcjy @zhaochenyang20 @ocss884
- Multi-turn RL: Support for multi-turn online RL training volcengine/verl#385 https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/verl/multi-turn/release_log/verl-multiturn-rollout-Release.md @UbeCc @PeterSH6
- Serve as the default engine in AReaL https://github.com/inclusionAI/AReaL
- VLM RLHF @yiranyyu @PeterSH6 @zhaochenyang20 @tongyx361 @shuaills
- Add GRPO support to trl @jhinpan (see the sketch after this list)
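For the GRPO item, the sketch below shows the group-relative advantage computation at the heart of GRPO: sample several completions per prompt, then normalize each reward against its own group's mean and standard deviation. This is only the advantage step; the trl/verl integrations wire it into the full rollout-and-update loop.

```python
# Minimal sketch of GRPO's group-relative advantage. Rewards come from several
# sampled completions per prompt (e.g. rollouts produced by the SGLang engine).
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [num_prompts, group_size] -> advantages of the same shape."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Two prompts, four sampled completions each.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.2, 0.4, 0.6, 0.8]])
print(grpo_advantages(rewards))
```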
Core refactor
- Support page size > 1 (see the paging sketch after this list) Support page size > 1 #4356
- Simplify scheduler.py and model_runner.py to make them more modular
- Integrate CacheTensorManager from https://github.com/ModelTC/lightllm/releases/tag/v1.0.0
- Integrate Cross-Process Request Object from https://github.com/ModelTC/lightllm/releases/tag/v1.0.0
- Remove the dependency of vLLM @zhyncs @ByronHsu @yizhang2077 [Track] progress in removing vLLM dependencies #2245
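To illustrate what "page size > 1" means for the memory pool, the sketch below hands out KV-cache memory in fixed-size pages and keeps a per-request page table mapping token positions to flat cache slots. The `PagedAllocator` class and its methods are illustrative, not SGLang's actual pool classes.

```python
# Minimal sketch of "page size > 1": KV memory is handed out in fixed-size
# pages, and each request keeps a page table mapping token positions to slots.
from typing import Dict, List

class PagedAllocator:
    def __init__(self, num_pages: int, page_size: int):
        self.page_size = page_size
        self.free_pages: List[int] = list(range(num_pages))
        self.page_tables: Dict[str, List[int]] = {}  # request id -> page ids

    def append_token(self, req_id: str, seq_len_after: int) -> int:
        """Reserve room for one more token; return its flat KV-cache slot."""
        pages = self.page_tables.setdefault(req_id, [])
        pos = seq_len_after - 1
        if pos // self.page_size >= len(pages):      # current page is full
            pages.append(self.free_pages.pop())      # grab a fresh page
        page_id = pages[pos // self.page_size]
        return page_id * self.page_size + pos % self.page_size

    def release(self, req_id: str) -> None:
        self.free_pages.extend(self.page_tables.pop(req_id, []))

alloc = PagedAllocator(num_pages=8, page_size=4)
slots = [alloc.append_token("req0", n + 1) for n in range(6)]
print(slots)  # 6 tokens span two pages: 4 slots in the first, 2 in the second
alloc.release("req0")
```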
Speculative decoding
- Optimizations for large batches @FrankLeeeee @yukavio optimize speculative decoding with high throughput #6995
- Adaptive speculative decoding based on batch size (see the sketch after this list)
- Reference-based speculative decoding Reference speculative decoding #270 Speculative decoding with lookahead #2790
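For the adaptive speculative decoding item, a plausible shape of the heuristic is sketched below: draft aggressively when the batch is small and latency-bound, and back off (or disable speculation) as the batch grows and verification becomes compute-bound. The thresholds are made up for illustration.

```python
# Minimal sketch of batch-size-adaptive speculation. With a small batch the GPU
# is latency-bound, so drafting many tokens per step is cheap; with a large
# batch the verify pass is already compute-bound, so speculate less or not at
# all. All thresholds below are illustrative, not tuned values.
def speculative_steps(batch_size: int, max_steps: int = 5) -> int:
    if batch_size <= 4:
        return max_steps              # latency-bound: draft aggressively
    if batch_size <= 32:
        return max(2, max_steps - 2)  # moderate batch: shorter drafts
    if batch_size <= 128:
        return 1                      # only a single draft token still pays off
    return 0                          # throughput-bound: disable speculation

for bs in (1, 8, 64, 256):
    print(bs, speculative_steps(bs))
```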
Multi-LoRA serving
- Add Triton backend for lora kernels @Fridge003 [Feature] Define backends and add Triton backend for Lora #3161
- Support Tensor Parallelism @ShenAo1111 [Feature] Support Tensor Parallelism and Weight Slicing for Lora #4274
- Support cuda graph @Qiaolin-Yu @Beichen-Ma Feat: support cuda graph for LoRA #4115
- Support radix attention @Sunt-ing @jcbjcbjc
- Support embedding layers @Beichen-Ma
- Support Unified Paging @Sunt-ing @jcbjcbjc [Feature] add multi-rank support for Lora #4492
- Optimizing speed with cublas/cutlass kernels @Fridge003 @jcbjcbjc
- Support dynamic loading and unloading @lifuhuang Refactor LoRAManager and LoRAMemoryPool state management logic for dynamic LoRA loading support #7412 Support dynamic LoRA loading / unloading in engine/server API #7446
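As a rough illustration of multi-LoRA serving with dynamic loading and unloading, the sketch below keeps a pool of adapters that can be registered and removed at runtime and adds each request's low-rank delta on top of the shared base projection. `LoRAPool` and its methods are illustrative names, not SGLang's LoRAManager/LoRAMemoryPool API.

```python
# Minimal sketch of multi-LoRA serving: adapters are loaded/unloaded at runtime,
# and each request's delta (B @ A @ x, scaled) is added to the shared base path.
from typing import Optional
import torch

class LoRAPool:
    def __init__(self, base_weight: torch.Tensor):
        self.base_weight = base_weight            # [out_dim, in_dim], shared
        self.adapters = {}                        # name -> (A, B, scaling)

    def load(self, name: str, A: torch.Tensor, B: torch.Tensor, scaling: float):
        self.adapters[name] = (A, B, scaling)     # dynamic load

    def unload(self, name: str):
        self.adapters.pop(name, None)             # dynamic unload frees the slot

    def forward(self, x: torch.Tensor, adapter: Optional[str]) -> torch.Tensor:
        y = x @ self.base_weight.t()              # shared base projection
        if adapter is not None:
            A, B, scaling = self.adapters[adapter]
            y = y + (x @ A.t()) @ B.t() * scaling # low-rank per-request delta
        return y

in_dim, out_dim, rank = 64, 64, 8
pool = LoRAPool(torch.randn(out_dim, in_dim))
pool.load("math", torch.randn(rank, in_dim), torch.randn(out_dim, rank), 0.5)
x = torch.randn(2, in_dim)
print(pool.forward(x, "math").shape, pool.forward(x, None).shape)
pool.unload("math")
```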
Hardware
- Blackwell support feat: add blackwell workflow #5303
- AMD aiter integration @HaiShaw
- Optimized CPU backends
- More backends (Intel XPU, TPU)
Model coverage
- Multi-modal models
- DeepSeek VL2 [Feature] Support DeepSeek VL 2 #2653
- mistralai/Pixtral [Feature] Support mistralai/Pixtral #2351
- GLM 4V Add GLM-4v Multimodal Model support for SGLang #1641
- VILA https://arxiv.org/abs/2412.04468 @Lyken17
- MiniCPM-o model: Minicpmo #3023 @mickqian @yiranyyu @yizhang2077
- Janus-pro model: Support Janus-pro #3203 @mickqian @yizhang2077
- InternVL 2.5 model: Intern vl 2.5 #3351 @mickqian @yizhang2077
- Phi4-multimodal vision Support Phi-4 Multi-Modal (text + vision only) #6494 @lifuhuang
- Upgrade transformers to 4.50.0 @yizhang2077 [Bug Fix] Add partial rotary factor support for Phi-4 and upgrade to transformers v4.50.0 #3984
- Language models
- Mamba models
- Transformers backend [FEAT] Add transformers backend support #5929
Function Calling
- Structural Tag (see the sketch after this list) @minleminzui @shuaills @Ubospica
- Adapter Refactor @CatherineSue @shuaills @Qiaolin-Yu
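For the structural tag item, the sketch below shows what a structural-tag constrained request could look like: generation stays free-form until a trigger string appears, after which the tool-call arguments must follow a JSON schema. The field names follow the xgrammar structural-tag design and should be treated as assumptions rather than the final SGLang API.

```python
# Sketch of a structural-tag constrained request. Field names follow the
# xgrammar structural-tag design and are assumptions, not the final SGLang API.
# Generation is free-form until a trigger such as "<function=" appears; from
# there the arguments must match the tool's JSON schema.
import json

structural_tag = {
    "type": "structural_tag",
    "structures": [
        {
            "begin": "<function=get_weather>",
            "schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
            "end": "</function>",
        }
    ],
    "triggers": ["<function="],
}

# Attached to an otherwise normal chat request, e.g. as a response_format field.
request = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "response_format": structural_tag,
}
print(json.dumps(request, indent=2))
```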
Others
- A padded batch mode to make results more deterministic (see the sketch at the end of this section)
- Add nightly eval CI using lm-eval-harness @XiaotongJiang @PopSoda2002 @ziliangpeng @Monstertail
- Add an open-to-use Grafana dashboard @PopSoda2002 @ziliangpeng
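For the padded batch mode mentioned above, the sketch below pads every batch up to a fixed bucket size so kernels always see the same shapes and reduction orders, trading some wasted compute for reproducible results. The bucket sizes and the `pad_batch` helper are illustrative.

```python
# Minimal sketch of a padded batch mode for determinism: pad each batch up to a
# fixed bucket size so kernels always run with the same shapes (and hence the
# same reduction order), at the cost of some wasted compute on padding rows.
import torch

BUCKETS = (1, 2, 4, 8, 16, 32)  # illustrative bucket sizes

def pad_batch(hidden: torch.Tensor):
    """Pad [batch, dim] activations up to the next bucket; return the real batch size."""
    real = hidden.shape[0]
    target = next(b for b in BUCKETS if b >= real)
    if target == real:
        return hidden, real
    pad = torch.zeros(target - real, hidden.shape[1], dtype=hidden.dtype)
    return torch.cat([hidden, pad], dim=0), real

h, real = pad_batch(torch.randn(5, 16))
print(h.shape, real)  # torch.Size([8, 16]) 5 -- rows beyond `real` are discarded later
```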