## Checklist
- 1. If the issue you raised is not a feature request but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- 2. Please use English; otherwise, the issue will be closed.
## Adoption
SGLang adoption for DeepSeek V3 and R1
## Usage
User Guide for Existing System (Installation & Launch): https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3
Please use the latest version, v0.4.2.post4, and prefer the Docker image: `docker pull lmsysorg/sglang:latest`. A minimal launch-and-query sketch is shown below.
For running on AMD MI300X, use this as a reference: Running DeepSeek-R1 on a single NDv5 MI300X VM.
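
For a quick sanity check after following the guide above, here is a minimal sketch. The launch command mirrors the linked DeepSeek V3 guide; the model path, port, prompt, and timeout are illustrative and assume the server's default OpenAI-compatible endpoint:

```python
# Minimal sketch, assuming the server was started inside the container roughly as
# documented in the linked guide (flags may differ for your hardware/version):
#   python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code
# The server exposes an OpenAI-compatible API on port 30000 by default.
import requests

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "deepseek-ai/DeepSeek-V3",  # illustrative; match the model you launched
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```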
## Features
- Support CUDA Graph @HandH1998 @ispobock
- Support Torch compile @ispobock
- Use BF16 for bmm @zhyncs
- Improve the accuracy for FP8 @HandH1998 @zhyncs @ispobock
- Tuning FP8 GEMM @HandH1998 @zhyncs
- Replace `moe_align_block_size` @HandH1998 @zhyncs @BBuf
- FusedMoE tuning for H200 `E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8.json` @BBuf
- TP+DP Attention @Ying1123
- Support overlap scheduler with DP attention @merrymercy
- Fuse Sigmoid Gate in `moe_kernels.cu` @NovTi @BBuf (torch compile is sufficient for this use case, so the priority and ROI of supporting it are low; closing for now.)
- Support `nextn` speculative decoding @ispobock [Track] DeepSeek V3/R1 nextn progress #3472
- FP8 GEMM CUTLASS implementation @yizhang2077
- Better `fused_experts` @BBuf @zhyncs
- FlashInfer Prefill and MLA Decoding @zhyncs @ispobock
- Integrate DeepGemm: linear support deepgemm #4199; Integrate DeepGemm contiguous group gemm into Fused MoE #4343
- Integrate FlashMLA: Support FlashMLA backend #4472; Support FlashMLA backend cuda graph #4514
- FP8 GEMM Composable Kernel implementation @HaiShaw
- Support Pipeline Parallelism @Ying1123
More items (e.g., PD disaggregation, cache) are tracked in #4042