Here is the development roadmap for 2024 Q4. Contributions and feedback are welcome (join the bi-weekly development meeting). The previous 2024 Q3 roadmap can be found in #634.
Performance
- Hide CPU overhead with an overlapped scheduler (Faster overlap mode scheduler #1738, Enable overlap by default #2067); a minimal sketch of the overlap idea follows this list
- Support speculative decoding
  - EAGLE (Eagle speculative decoding part 4: Add EAGLE2 worker #2150)
  - Reference-based (Reference speculative decoding #270)
  - Medusa head ([Feature] plan to support medusa? #859)
  - Draft-model based
- Sparse attention (Support double sparsity #1459)
- Faster grammar-parsing library for constrained decoding ([Performance] Support both xgrammar and outlines for constrained decoding #1752)
- Multi-layer radix cache (GPU/CPU/disk) (Hierarchical Caching for SGLang #2693) @xiezhq-hermann
- Improve the performance of mixed chunked prefill; see the draft in Rewrite mixed chunked prefill #1383
- Integrate cuDNN paged attention kernels
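
To illustrate the overlap-scheduler item above: the idea is to prepare batch i+1 on the CPU while the GPU is still executing batch i, so per-step CPU work (batch assembly, memory management, sampling setup) leaves the critical path. A minimal sketch, with hypothetical `schedule_batch`/`run_on_gpu` stand-ins rather than SGLang's actual internals:

```python
import queue
import threading
import time

# Toy stand-ins: schedule_batch() is the CPU-side work we want to hide,
# run_on_gpu() is one decode step of the model forward pass.
def schedule_batch(step: int) -> dict:
    time.sleep(0.002)  # CPU overhead (batching, memory allocation, sampling prep)
    return {"step": step}

def run_on_gpu(batch: dict) -> None:
    time.sleep(0.010)  # GPU forward pass for one decode step

def overlapped_loop(num_steps: int) -> None:
    batches: queue.Queue = queue.Queue(maxsize=2)

    def scheduler() -> None:
        # Prepares batch i+1 while the consumer still runs batch i, so the
        # 2 ms of CPU work overlaps with the 10 ms of GPU work.
        for step in range(num_steps):
            batches.put(schedule_batch(step))
        batches.put(None)

    threading.Thread(target=scheduler, daemon=True).start()
    while (batch := batches.get()) is not None:
        run_on_gpu(batch)

overlapped_loop(num_steps=8)
```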
Parallelism
- Support sequence parallelism ([Feature] Add initial support for sequence parallelism #1436). Related paper
- Support pipeline parallelism.
- Support expert parallelism + data parallelism for DeepSeek/MoE models. @ispobock
  - Data parallelism (Support DP MLA #1970)
  - Expert parallelism ([Feature] Expert parallelism support #1435)
- Implement a better cache-aware load balancer for data parallelism ([router] cache-aware load-balancing router v1 #2114, [Feature] Cache-aware Data Parallel Router #1732); a routing sketch follows this list. @ByronHsu @yichuan520030910320
- Overlap communication in tensor parallelism. @ZhuohaoL
- Support disaggregated serving to separate prefill and decoding.
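
For the cache-aware load balancer above, the core idea is that a request should land on the data-parallel worker whose radix cache already holds the longest matching prefix, falling back to plain load balancing on a miss. A minimal sketch with hypothetical bookkeeping (the real router of #2114 approximates each worker's radix tree rather than remembering raw prompt strings):

```python
def shared_prefix_len(a: str, b: str) -> int:
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i

class CacheAwareRouter:
    def __init__(self, num_workers: int) -> None:
        self.seen_prompts: list[list[str]] = [[] for _ in range(num_workers)]
        self.load: list[int] = [0] * num_workers

    def route(self, prompt: str) -> int:
        # Prefer the worker whose cache shares the longest prefix with this
        # prompt; fall back to the least-loaded worker when nothing matches.
        best_worker, best_match = None, 0
        for w, prompts in enumerate(self.seen_prompts):
            match = max((shared_prefix_len(prompt, p) for p in prompts), default=0)
            if match > best_match:
                best_worker, best_match = w, match
        if best_worker is None:
            best_worker = min(range(len(self.load)), key=self.load.__getitem__)
        self.seen_prompts[best_worker].append(prompt)
        self.load[best_worker] += 1
        return best_worker

router = CacheAwareRouter(num_workers=2)
a = router.route("You are a helpful assistant. Summarize: ...")
b = router.route("You are a helpful assistant. Translate: ...")
assert a == b  # shared system prompt routes to the same worker's cache
```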
Hardware Coverage
- AMD optimizations. cc @HaiShaw
  - CK kernels
  - Set up CI (accuracy/performance) for AMD
- Intel XPU support.
Model Coverage
- Multi-modal models
  - Llama 3.2 Vision (Llama3.2 vision model support #1551)
  - Qwen2-VL (Support qwen2 vl model #1546)
  - DeepSeek-VL2 ([Feature] Support DeepSeek VL 2 #2653)
  - mistralai/Pixtral ([Feature] Support mistralai/Pixtral #2351)
  - GLM-4V (Add GLM-4v Multimodal Model support for SGLang #1641)
  - VILA (https://arxiv.org/abs/2412.04468)
  - InternVL
  - Phi-vision
  - FishSpeech audio model support
  - Ultravox
- Language models
  - Mamba models @rahulbatra85 @HaiShaw
  - xLSTM
- Reward models
New Features
- Integrate with LMCache https://github.com/LMCache/LMCache
- A padded batch mode to make results more deterministic; a padding sketch follows this list
- Performance optimizations for multi-LoRA serving ([LoRA, Performance] Add gemm expand triton kernel for multi-LoRA #1728)
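
On the padded batch mode above: batch-size-dependent kernel shapes can change tiling and floating-point reduction order, so the same request can produce slightly different logits depending on what it happened to be batched with. Padding every batch up to one of a few fixed sizes keeps kernel shapes stable. A minimal sketch, with a hypothetical bucket list:

```python
import torch

BUCKETS = [1, 2, 4, 8, 16, 32]  # hypothetical fixed batch-size buckets

def pad_batch(input_ids: torch.Tensor, pad_token_id: int) -> tuple[torch.Tensor, int]:
    """Pad a (batch, seq_len) batch up to the next fixed bucket size."""
    bs = input_ids.shape[0]
    target = next(b for b in BUCKETS if b >= bs)
    if target == bs:
        return input_ids, bs
    pad = input_ids.new_full((target - bs, input_ids.shape[1]), pad_token_id)
    return torch.cat([input_ids, pad], dim=0), bs

batch = torch.randint(0, 32000, (3, 128))
padded, real_bs = pad_batch(batch, pad_token_id=0)
assert padded.shape[0] == 4
logits = padded.float()   # stand-in for model(padded)
logits = logits[:real_bs] # discard the padding rows after the forward pass
```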
Quantization
- Torchao integration (Add llama implementation with no tensor parallel linears #1561)
- TurboMind operators integration
- More CUTLASS mixed-precision GEMM integration
- KV cache quantization (more formats + scaling factors); a quantization sketch follows this list
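
For the KV cache quantization item, "scaling factor" refers to storing a scale alongside the low-precision cache so values can be mapped back at attention time. A minimal per-tensor FP8-style sketch (real formats also use per-head or per-channel scales):

```python
import torch

FP8_MAX = 448.0  # max representable magnitude in float8_e4m3fn

def quantize_kv(kv: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # Per-tensor scale chosen so the largest value maps to FP8_MAX.
    scale = kv.abs().amax().clamp(min=1e-6) / FP8_MAX
    q = (kv / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale  # the scale must be kept alongside the cache

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float16) * scale

kv = torch.randn(2, 8, 64, dtype=torch.float16)
q, scale = quantize_kv(kv)
print((dequantize_kv(q, scale) - kv).abs().max())  # small quantization error
```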
Server API
- Support directly taking embeddings as input ([Feature] Generation Inputs: input_embeds #745)
- Add APIs for using the inference engine in a single script without launching a separate server; see the examples and the usage sketch after this list
- Support endpoints other than OpenAI (Anthropic, Mistral) in the language frontend
- Better APIs to support RL trainers, including https://github.com/huggingface/trl and https://github.com/OpenRLHF/OpenRLHF @zhaochenyang20
- Support a generalized reward API (adding a linear layer to any causal LM to get a reward) https://github.com/OpenRLHF/OpenRLHF @zhaochenyang20
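
For the single-script engine item, usage would look roughly like the existing examples; this sketch assumes the `sgl.Engine` entry point and a dict of sampling parameters (argument names may differ across versions):

```python
import sglang as sgl

if __name__ == "__main__":
    # Runs the inference engine in-process, with no separate HTTP server.
    llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")
    prompts = ["The capital of France is", "The future of AI is"]
    outputs = llm.generate(prompts, {"temperature": 0.8, "max_new_tokens": 32})
    for prompt, out in zip(prompts, outputs):
        print(prompt, "->", out["text"])
    llm.shutdown()
```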
Observability
- Integrate Grafana / Prometheus; a metrics sketch follows
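
A sketch of what the Prometheus side of this integration could expose, using the `prometheus_client` library; the metric names here are illustrative, not SGLang's actual metric names:

```python
from prometheus_client import Counter, Gauge, start_http_server

# Hypothetical metric names; Grafana dashboards would chart these series.
generated_tokens = Counter("sglang_generated_tokens_total", "Total tokens generated")
running_requests = Gauge("sglang_running_requests", "Requests currently decoding")

start_http_server(9000)  # expose /metrics for Prometheus to scrape
generated_tokens.inc(128)
running_requests.set(4)
```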
Others
- Notebook-style interactive tutorials. @zhaochenyang20
- Compiler mode optimizations for the language (e.g., support sending a full serialized SGL program to the server). @hnyls2002
- Memory pool refactor to better support mixing different attention layers (e.g., interleaved window attention). @Ying1123
- Make vLLM an optional dependency ([Feature] Make vLLM optional in model code #1673). @zhyncs @ByronHsu @yizhang2077