## Checklist
- 1. If the issue you raised is not a feature request but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- 2. Please use English; otherwise, the issue will be closed.
## Adoption
SGLang adoption for DeepSeek V3 and R1
## Usage
User Guide for Existing System (Installation & Launch): https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3
Please use the latest version, v0.4.2.post4, and prefer the Docker image: `docker pull lmsysorg/sglang:latest`. A minimal launch-and-query sketch is shown below.
For running on AMD MI300X, use this as a reference: Running DeepSeek-R1 on a single NDv5 MI300X VM.
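
For a quick sanity check after following the guide above, here is a minimal sketch. The launch command mirrors the linked DeepSeek V3 guide; the model path, port, prompt, and timeout are illustrative and assume the server's default OpenAI-compatible endpoint:

```python
# Minimal sketch, assuming the server was started inside the container roughly as
# documented in the linked guide (flags may differ for your hardware/version):
#   python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code
# The server exposes an OpenAI-compatible API on port 30000 by default.
import requests

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "deepseek-ai/DeepSeek-V3",  # illustrative; match the model you launched
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```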
## Features
- Support CUDA Graph @HandH1998 @ispobock
- Support Torch compile @ispobock
- Use BF16 for bmm @zhyncs
- Improve the accuracy for FP8 @HandH1998 @zhyncs @ispobock
- Tuning FP8 GEMM @HandH1998 @zhyncs
- Replace `moe_align_block_size` @HandH1998 @zhyncs @BBuf
- FusedMoE tuning for H200 `E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8.json` @BBuf
- TP+DP Attention @Ying1123
- Support overlap scheduler with DP attention @merrymercy
- Fuse Sigmoid Gate in `moe_kernels.cu` @NovTi @BBuf (torch compile is sufficient for this use case, so the priority and ROI of supporting it are low; closing for now.)
- Support `nextn` speculative decoding @ispobock [Track] DeepSeek V3/R1 nextn progress #3472
- FP8 GEMM CUTLASS implementation @yizhang2077
- Better `fused_experts` @BBuf @zhyncs
- FlashInfer Prefill and MLA Decoding @zhyncs @ispobock
- Integrate DeepGemm: linear support deepgemm #4199; Integrate DeepGemm contiguous group gemm into Fused MoE #4343
- Integrate FlashMLA: Support FlashMLA backend #4472; Support FlashMLA backend cuda graph #4514
- FP8 GEMM Composable Kernel implementation @HaiShaw
- Support Pipeline Parallelism @Ying1123
More items (e.g., PD disaggregation, cache) are tracked in #4042