Here is the development roadmap for 2025 H1. Contributions and feedback are welcome (join the bi-weekly development meeting). The previous 2024 Q4 roadmap can be found in #1487.
Focus
- Throughput-oriented large-scale deployment, similar to the DeepSeek inference system
- Long context optimizations
- Low latency speculative decoding
- Reinforcement learning training framework integration
- Kernel optimizations
Parallelism
- Support PD disaggregation @ByronHsu [Roadmap] Prefill and Decoding Disaggregation #4655
- Support expert parallelism and load balancer One branch that contains EPLB + Two Batch Overlap + dependencies #5524
- Support pipeline parallelism @Ying1123 [PP] Add pipeline parallelism #5724
- Support data parallelism attention compatible with all other parallelism strategies Improve DP attention #4390
- Support overlapping communication in TP/EP @tom @Zhuohao-Li Support overlapping two batches #4068
- Improve sgl-router for better data parallelism @Qihang-Zhang (see the routing sketch after this list)
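To make the data-parallelism routing item above concrete, here is a minimal sketch of a cache-aware routing policy a DP router can use: send a request to the worker holding the longest matching cached prefix, otherwise to the least-loaded worker. The `Worker`/`route` names and thresholds are illustrative, not the actual sgl-router API or policy.

```python
# Minimal sketch (not the sgl-router API): cache-aware routing for DP workers.
# A request goes to the worker with the longest cached prefix match; if no
# worker has a useful match, fall back to the least-loaded worker.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Worker:
    name: str
    cached_prefixes: List[str] = field(default_factory=list)  # recently served prompts
    inflight: int = 0  # number of requests currently assigned

def _prefix_match_len(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(prompt: str, workers: List[Worker], min_match: int = 16) -> Worker:
    best, best_len = None, 0
    for w in workers:
        for p in w.cached_prefixes:
            m = _prefix_match_len(prompt, p)
            if m > best_len:
                best, best_len = w, m
    # Only honor the cache hit if the shared prefix is long enough to matter.
    if best is not None and best_len >= min_match:
        target = best
    else:
        target = min(workers, key=lambda w: w.inflight)
    target.inflight += 1
    target.cached_prefixes.append(prompt)
    return target

if __name__ == "__main__":
    workers = [Worker("dp0"), Worker("dp1")]
    first = route("You are a helpful assistant. Summarize: ...", workers)
    second = route("You are a helpful assistant. Translate: ...", workers)
    print(first.name, second.name)  # the shared system prompt steers both to the same worker
```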
Attention Backend
- Support Native FlashAttention3 as Attention Backend: [Roadmap] FlashAttention3 Support as SGLang Attention Backend #4709 @hebiao064 @qingquansong @zcnrex @Fridge003 @yinfan98
- Torch FlexAttention @HaiShaw @ispobock
Caching
- Optimize hierarchical caching (GPU/CPU/disk; see the sketch after this list) Hierarchical Caching for SGLang #2693 Hierarchical Caching supports MLA #4009 @xiezhq-hermann
- Integrate DeepSeek 3FS @yizhang2077
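As a rough illustration of the hierarchical caching item, the sketch below keeps hot KV entries in a GPU tier and demotes cold ones to a CPU tier instead of dropping them, promoting them back on reuse. This is a toy two-tier LRU pool under assumed interfaces, not SGLang's actual hierarchical cache implementation.

```python
# Minimal sketch (assumed interfaces): a two-tier KV-cache pool that keeps hot
# entries on the GPU and demotes cold ones to host memory instead of dropping
# them, so they can be promoted back on reuse.
from collections import OrderedDict

class HierarchicalKVCache:
    def __init__(self, gpu_capacity: int, cpu_capacity: int):
        self.gpu = OrderedDict()  # key -> KV handle, kept in LRU order
        self.cpu = OrderedDict()
        self.gpu_capacity = gpu_capacity
        self.cpu_capacity = cpu_capacity

    def get(self, key):
        if key in self.gpu:                       # GPU hit
            self.gpu.move_to_end(key)
            return self.gpu[key]
        if key in self.cpu:                       # CPU hit: promote back to GPU
            value = self.cpu.pop(key)
            self.put(key, value)
            return value
        return None                               # miss: caller must recompute

    def put(self, key, value):
        self.gpu[key] = value
        self.gpu.move_to_end(key)
        while len(self.gpu) > self.gpu_capacity:  # demote LRU entry to CPU
            old_key, old_value = self.gpu.popitem(last=False)
            self.cpu[old_key] = old_value
            while len(self.cpu) > self.cpu_capacity:  # finally evict from CPU
                self.cpu.popitem(last=False)

cache = HierarchicalKVCache(gpu_capacity=2, cpu_capacity=4)
cache.put("prefix-a", "kv-a")
cache.put("prefix-b", "kv-b")
cache.put("prefix-c", "kv-c")           # "prefix-a" is demoted to the CPU tier
assert cache.get("prefix-a") == "kv-a"  # promoted back, "prefix-b" demoted
```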
Kernel
- Integrate FlashAttention3 [Roadmap] FlashAttention3 Support as SGLang Attention Backend #4709
- Integrate DeepGEMM: linear support deepgemm #4199, Integrate DeepGemm contiguous group gemm into Fused MoE #4343
- Integrate FlashMLA Support FlashMLA backend #4472 Support FlashMLA backend cuda graph #4514
- Integrate cuDNN attention (reference)
- Integrate TransformerEngine layers
- Start to maintain performant attention ops in sgl-kernel
- Start to maintain more sparse attention ops in sgl-kernel
- Integrate Blackwell kernels from flashinfer [Feature] integrate FlashInfer Blackwell kernels #5855
Quantization
- MXFP4 support @HaiShaw
- INT4-FP8 MoE & Fused MoE @HaiShaw @carlushuang ROCm: enable trillion-parameter MoE models with INT4-FP8 single node #4152
- W8A8 (FP8 and INT8) implementation in sgl-kernel, removing the vLLM dependency (see the sketch after this list): Apply sgl w8a8 fp8 kernel #3148, support w8a8 fp8 kernel with CUTLASS #3047
- Integrate AWQ and GPTQ in sgl-kernel, removing the vLLM dependency
- Extend TorchAO support to additional models
- Blackwell FP4 support FP4 weight loading and inference (2/2) #3972
- Optional quantization support using vLLM's implementation (e.g., bnb, gguf)
- Communication quantization
- Unsloth model support @guapisolo @XueyingJia @yyihuang
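For the W8A8 item above, here is a minimal reference sketch of per-tensor FP8 quantization in PyTorch (it assumes a torch version that provides `torch.float8_e4m3fn`). It dequantizes and uses an ordinary matmul for clarity; the real sgl-kernel path uses fused CUTLASS FP8 GEMMs that consume the quantized tensors directly and apply the scales in the epilogue.

```python
# Minimal sketch of per-tensor W8A8 FP8 quantization (illustrative only; the
# sgl-kernel path uses fused CUTLASS GEMMs rather than dequantize-then-matmul).
import torch

FP8_MAX = 448.0  # max representable magnitude for float8_e4m3fn

def quantize_fp8(x: torch.Tensor):
    """Return (fp8 tensor, per-tensor scale) such that x ≈ q.float() * scale."""
    scale = x.abs().max().clamp(min=1e-12) / FP8_MAX
    q = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale

def w8a8_linear(activation: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    a_q, a_scale = quantize_fp8(activation)   # A8: quantize activations
    w_q, w_scale = quantize_fp8(weight)       # W8: quantize weights
    # Reference math: dequantize and use a normal matmul. A fused FP8 GEMM
    # would instead consume a_q / w_q directly and apply the scales at the end.
    return (a_q.float() @ w_q.float().t()) * (a_scale * w_scale)

x = torch.randn(4, 64)
w = torch.randn(128, 64)
out = w8a8_linear(x, w)
print((out - x @ w.t()).abs().max())  # small residual, within FP8 rounding error
```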
RL Framework integration
- veRL integration SGLang + Verl #3852 @fzyzcjy @zhaochenyang20 @ocss884
- Multi-turn RL: Support for multi-turn online RL training volcengine/verl#385 https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/verl/multi-turn/release_log/verl-multiturn-rollout-Release.md @UbeCc @PeterSH6
- Serve as the default engine in AReaL https://github.com/inclusionAI/AReaL
- VLM RLHF @yiranyyu @PeterSH6 @zhaochenyang20 @tongyx361 @shuaills
- Add GRPO support to trl @jhinpan (see the sketch after this list)
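For the GRPO item, the sketch below shows the group-relative advantage computation at the heart of GRPO: sample several completions per prompt, then normalize each reward against its own group's mean and standard deviation. This is only the advantage step; the trl/verl integrations wire it into the full rollout-and-update loop.

```python
# Minimal sketch of GRPO's group-relative advantage. Rewards come from several
# sampled completions per prompt (e.g. rollouts produced by the SGLang engine).
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [num_prompts, group_size] -> advantages of the same shape."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Two prompts, four sampled completions each.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.2, 0.4, 0.6, 0.8]])
print(grpo_advantages(rewards))
```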
Core refactor
- Support page size > 1 (see the paging sketch after this list) Support page size > 1 #4356
- Simplify scheduler.py and model_runner.py to make them more modular
- Integrate CacheTensorManager from https://github.com/ModelTC/lightllm/releases/tag/v1.0.0
- Integrate Cross-Process Request Object from https://github.com/ModelTC/lightllm/releases/tag/v1.0.0
- Remove the dependency of vLLM @zhyncs @ByronHsu @yizhang2077 [Track] progress in removing vLLM dependencies #2245
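To illustrate what "page size > 1" means for the memory pool, the sketch below hands out KV-cache memory in fixed-size pages and keeps a per-request page table mapping token positions to flat cache slots. The `PagedAllocator` class and its methods are illustrative, not SGLang's actual pool classes.

```python
# Minimal sketch of "page size > 1": KV memory is handed out in fixed-size
# pages, and each request keeps a page table mapping token positions to slots.
from typing import Dict, List

class PagedAllocator:
    def __init__(self, num_pages: int, page_size: int):
        self.page_size = page_size
        self.free_pages: List[int] = list(range(num_pages))
        self.page_tables: Dict[str, List[int]] = {}  # request id -> page ids

    def append_token(self, req_id: str, seq_len_after: int) -> int:
        """Reserve room for one more token; return its flat KV-cache slot."""
        pages = self.page_tables.setdefault(req_id, [])
        pos = seq_len_after - 1
        if pos // self.page_size >= len(pages):      # current page is full
            pages.append(self.free_pages.pop())      # grab a fresh page
        page_id = pages[pos // self.page_size]
        return page_id * self.page_size + pos % self.page_size

    def release(self, req_id: str) -> None:
        self.free_pages.extend(self.page_tables.pop(req_id, []))

alloc = PagedAllocator(num_pages=8, page_size=4)
slots = [alloc.append_token("req0", n + 1) for n in range(6)]
print(slots)  # 6 tokens span two pages: 4 slots in the first, 2 in the second
alloc.release("req0")
```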
Speculative decoding
- Optimizations for large batches @FrankLeeeee @yukavio optimize speculative decoding with high throughput #6995
- Adaptive speculative decoding based on batch size (see the sketch after this list)
- Reference-based speculative decoding Reference speculative decoding #270 Speculative decoding with lookahead #2790
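For the adaptive speculative decoding item, a plausible shape of the heuristic is sketched below: draft aggressively when the batch is small and latency-bound, and back off (or disable speculation) as the batch grows and verification becomes compute-bound. The thresholds are made up for illustration.

```python
# Minimal sketch of batch-size-adaptive speculation. With a small batch the GPU
# is latency-bound, so drafting many tokens per step is cheap; with a large
# batch the verify pass is already compute-bound, so speculate less or not at
# all. All thresholds below are illustrative, not tuned values.
def speculative_steps(batch_size: int, max_steps: int = 5) -> int:
    if batch_size <= 4:
        return max_steps              # latency-bound: draft aggressively
    if batch_size <= 32:
        return max(2, max_steps - 2)  # moderate batch: shorter drafts
    if batch_size <= 128:
        return 1                      # only a single draft token still pays off
    return 0                          # throughput-bound: disable speculation

for bs in (1, 8, 64, 256):
    print(bs, speculative_steps(bs))
```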
Multi-LoRA serving
- Add Triton backend for lora kernels @Fridge003 [Feature] Define backends and add Triton backend for Lora #3161
- Support Tensor Parallelism @ShenAo1111 [Feature] Support Tensor Parallelism and Weight Slicing for Lora #4274
- Support cuda graph @Qiaolin-Yu @Beichen-Ma Feat: support cuda graph for LoRA #4115
- Support radix attention @Sunt-ing @jcbjcbjc
- Support embedding layers @Beichen-Ma
- Support Unified Paging @Sunt-ing @jcbjcbjc [Feature] add multi-rank support for Lora #4492
- Optimizing speed with cublas/cutlass kernels @Fridge003 @jcbjcbjc
- Support dynamic loading and unloading @lifuhuang Refactor LoRAManager and LoRAMemoryPool state management logic for dynamic LoRA loading support #7412 Support dynamic LoRA loading / unloading in engine/server API #7446
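As a rough illustration of multi-LoRA serving with dynamic loading and unloading, the sketch below keeps a pool of adapters that can be registered and removed at runtime and adds each request's low-rank delta on top of the shared base projection. `LoRAPool` and its methods are illustrative names, not SGLang's LoRAManager/LoRAMemoryPool API.

```python
# Minimal sketch of multi-LoRA serving: adapters are loaded/unloaded at runtime,
# and each request's delta (B @ A @ x, scaled) is added to the shared base path.
from typing import Optional
import torch

class LoRAPool:
    def __init__(self, base_weight: torch.Tensor):
        self.base_weight = base_weight            # [out_dim, in_dim], shared
        self.adapters = {}                        # name -> (A, B, scaling)

    def load(self, name: str, A: torch.Tensor, B: torch.Tensor, scaling: float):
        self.adapters[name] = (A, B, scaling)     # dynamic load

    def unload(self, name: str):
        self.adapters.pop(name, None)             # dynamic unload frees the slot

    def forward(self, x: torch.Tensor, adapter: Optional[str]) -> torch.Tensor:
        y = x @ self.base_weight.t()              # shared base projection
        if adapter is not None:
            A, B, scaling = self.adapters[adapter]
            y = y + (x @ A.t()) @ B.t() * scaling # low-rank per-request delta
        return y

in_dim, out_dim, rank = 64, 64, 8
pool = LoRAPool(torch.randn(out_dim, in_dim))
pool.load("math", torch.randn(rank, in_dim), torch.randn(out_dim, rank), 0.5)
x = torch.randn(2, in_dim)
print(pool.forward(x, "math").shape, pool.forward(x, None).shape)
pool.unload("math")
```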
Hardware
- Blackwell support feat: add blackwell workflow #5303
- AMD aiter integration @HaiShaw
- Optimized CPU backends
- More backends (Intel XPU, TPU)
Model coverage
- Multi-modal models
- DeepSeek VL2 [Feature] Support DeepSeek VL 2 #2653
- mistralai/Pixtral [Feature] Support mistralai/Pixtral #2351
- GLM 4V Add GLM-4v Multimodal Model support for SGLang #1641
- VILA https://arxiv.org/abs/2412.04468 @Lyken17
- MiniCPM-o model: Minicpmo #3023 @mickqian @yiranyyu @yizhang2077
- Janus-pro model: Support Janus-pro #3203 @mickqian @yizhang2077
- InternVL 2.5 model: Intern vl 2.5 #3351 @mickqian @yizhang2077
- Phi4-multimodal vision Support Phi-4 Multi-Modal (text + vision only) #6494 @lifuhuang
- Upgrade transformers to 4.50.0 @yizhang2077 [Bug Fix] Add partial rotary factor support for Phi-4 and upgrade to transformers v4.50.0 #3984
- Language models
- Mamba models
- Transformers backend [FEAT] Add transformers backend support #5929
Function Calling
- Structural Tag (see the sketch after this list) @minleminzui @shuaills @Ubospica
- Adapter Refactor @CatherineSue @shuaills @Qiaolin-Yu
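For the structural tag item, the sketch below shows what a structural-tag constrained request could look like: generation stays free-form until a trigger string appears, after which the tool-call arguments must follow a JSON schema. The field names follow the xgrammar structural-tag design and should be treated as assumptions rather than the final SGLang API.

```python
# Sketch of a structural-tag constrained request. Field names follow the
# xgrammar structural-tag design and are assumptions, not the final SGLang API.
# Generation is free-form until a trigger such as "<function=" appears; from
# there the arguments must match the tool's JSON schema.
import json

structural_tag = {
    "type": "structural_tag",
    "structures": [
        {
            "begin": "<function=get_weather>",
            "schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
            "end": "</function>",
        }
    ],
    "triggers": ["<function="],
}

# Attached to an otherwise normal chat request, e.g. as a response_format field.
request = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "response_format": structural_tag,
}
print(json.dumps(request, indent=2))
```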
Others
- A padded batch mode to make results more deterministic (see the sketch at the end of this section)
- Add nightly eval CI using lm-eval-harness @XiaotongJiang @PopSoda2002 @ziliangpeng @Monstertail
- Add an open-to-use Grafana dashboard @PopSoda2002 @ziliangpeng
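For the padded batch mode mentioned above, the sketch below pads every batch up to a fixed bucket size so kernels always see the same shapes and reduction orders, trading some wasted compute for reproducible results. The bucket sizes and the `pad_batch` helper are illustrative.

```python
# Minimal sketch of a padded batch mode for determinism: pad each batch up to a
# fixed bucket size so kernels always run with the same shapes (and hence the
# same reduction order), at the cost of some wasted compute on padding rows.
import torch

BUCKETS = (1, 2, 4, 8, 16, 32)  # illustrative bucket sizes

def pad_batch(hidden: torch.Tensor):
    """Pad [batch, dim] activations up to the next bucket; return the real batch size."""
    real = hidden.shape[0]
    target = next(b for b in BUCKETS if b >= real)
    if target == real:
        return hidden, real
    pad = torch.zeros(target - real, hidden.shape[1], dtype=hidden.dtype)
    return torch.cat([hidden, pad], dim=0), real

h, real = pad_batch(torch.randn(5, 16))
print(h.shape, real)  # torch.Size([8, 16]) 5 -- rows beyond `real` are discarded later
```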