Commit 156be8e

Merge branch 'main' into lora_cuda_graph
2 parents: e4e94b0 + f036582

72 files changed (+1749, -744 lines)


.github/workflows/nightly-test.yml

Lines changed: 0 additions & 1 deletion
@@ -25,7 +25,6 @@ jobs:
       - name: Install dependencies
         run: |
           bash scripts/ci_install_dependency.sh
-          pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus"

       - name: Run test
         timeout-minutes: 120

.github/workflows/pr-test-amd.yml

Lines changed: 6 additions & 6 deletions
@@ -38,12 +38,12 @@ jobs:
         else
           DEVICE_FLAG="--device /dev/dri"
         fi
-        docker pull lmsysorg/sglang:v0.4.5.post2-rocm630
+        docker pull lmsysorg/sglang:v0.4.5.post3-rocm630
         docker run -dt --user root --device=/dev/kfd $DEVICE_FLAG \
           -v ${{ github.workspace }}:/sglang-checkout --ipc=host --group-add video \
           --cap-add=SYS_PTRACE -e HF_TOKEN=${HF_TOKEN} --security-opt seccomp=unconfined \
           -w /sglang-checkout --name ci_sglang \
-          lmsysorg/sglang:v0.4.5.post2-rocm630
+          lmsysorg/sglang:v0.4.5.post3-rocm630

       - name: Install dependencies
         run: |
@@ -82,12 +82,12 @@ jobs:
         else
           DEVICE_FLAG="--device /dev/dri"
         fi
-        docker pull lmsysorg/sglang:v0.4.5.post2-rocm630
+        docker pull lmsysorg/sglang:v0.4.5.post3-rocm630
         docker run -dt --user root --device=/dev/kfd $DEVICE_FLAG \
           -v ${{ github.workspace }}:/sglang-checkout --ipc=host --group-add video \
           --cap-add=SYS_PTRACE -e HF_TOKEN=${{ secrets.AMD_HF_TOKEN }} --security-opt seccomp=unconfined \
           -w /sglang-checkout --name ci_sglang \
-          lmsysorg/sglang:v0.4.5.post2-rocm630
+          lmsysorg/sglang:v0.4.5.post3-rocm630

       - name: Install dependencies
         run: |
@@ -120,12 +120,12 @@ jobs:
         else
           DEVICE_FLAG="--device /dev/dri"
         fi
-        docker pull lmsysorg/sglang:v0.4.5.post2-rocm630
+        docker pull lmsysorg/sglang:v0.4.5.post3-rocm630
         docker run -dt --user root --device=/dev/kfd $DEVICE_FLAG \
           -v ${{ github.workspace }}:/sglang-checkout --ipc=host --group-add video \
           --cap-add=SYS_PTRACE -e HF_TOKEN=${HF_TOKEN} --security-opt seccomp=unconfined \
           -w /sglang-checkout --name ci_sglang \
-          lmsysorg/sglang:v0.4.5.post2-rocm630
+          lmsysorg/sglang:v0.4.5.post3-rocm630

       - name: Install dependencies
         run: |

.github/workflows/pr-test.yml

Lines changed: 24 additions & 4 deletions
@@ -54,7 +54,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        part: [0, 1, 2, 3, 4, 5, 6]
+        part: [0, 1, 2, 3, 4, 5, 6, 7]
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
@@ -64,10 +64,10 @@
           bash scripts/ci_install_dependency.sh

       - name: Run test
-        timeout-minutes: 40
+        timeout-minutes: 30
         run: |
           cd test/srt
-          python3 run_suite.py --suite per-commit --auto-partition-id ${{ matrix.part }} --auto-partition-size 7
+          python3 run_suite.py --suite per-commit --auto-partition-id ${{ matrix.part }} --auto-partition-size 8

   unit-test-backend-2-gpu:
     if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
@@ -87,6 +87,26 @@
           cd test/srt
           python3 run_suite.py --suite per-commit-2-gpu

+  unit-test-backend-8-gpu:
+    if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
+      github.event.pull_request.draft == false
+    runs-on: 8-gpu-runner
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Install dependencies
+        env:
+          FLASHINFER_REPO: ${{ inputs.version == 'nightly' && 'https://flashinfer.ai/whl/nightly/cu124/torch2.5/flashinfer-python' || 'https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python' }}
+        run: |
+          bash scripts/ci_install_dependency.sh
+
+      - name: Run test
+        timeout-minutes: 40
+        run: |
+          cd test/srt
+          python3 run_suite.py --suite per-commit-8-gpu
+
   performance-test-1-gpu-part-1:
     if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
       github.event.pull_request.draft == false
@@ -103,7 +123,7 @@
         timeout-minutes: 10
         run: |
           cd test/srt
-          python3 -m unittest test_bench_one_batch.TestBenchOneBatch.test_bs1
+          python3 -m unittest test_bench_one_batch.TestBenchOneBatch.test_bs1_default

       - name: Benchmark online latency
         timeout-minutes: 10
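For context, the matrix change above grows the per-commit suite from 7 shards to 8, and each shard selects its tests with `--auto-partition-id`/`--auto-partition-size`. A minimal sketch of index-based sharding of this kind, assuming a simple round-robin split (the helper below is illustrative, not the actual logic in `run_suite.py`):

```python
# Illustrative sketch of round-robin test sharding; the real
# run_suite.py --auto-partition implementation may differ.
def auto_partition(tests, partition_id: int, partition_size: int):
    """Return the subset of tests owned by shard `partition_id` out of `partition_size` shards."""
    return [t for i, t in enumerate(tests) if i % partition_size == partition_id]

if __name__ == "__main__":
    tests = [f"test_{i}.py" for i in range(20)]
    # With 8 shards, shard 3 would run tests 3, 11, 19, ...
    print(auto_partition(tests, partition_id=3, partition_size=8))
```

Splitting the same suite across more shards is presumably also why the per-shard `timeout-minutes` can drop from 40 to 30.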

Makefile

Lines changed: 2 additions & 1 deletion
@@ -20,7 +20,8 @@ FILES_TO_UPDATE = docker/Dockerfile.rocm \
 	python/pyproject.toml \
 	python/sglang/version.py \
 	docs/developer/setup_github_runner.md \
-	docs/start/install.md
+	docs/start/install.md \
+	benchmark/deepseek_v3/README.md

 update: ## Update version numbers across project files. Usage: make update <new_version>
 	@if [ -z "$(filter-out $@,$(MAKECMDGOALS))" ]; then \
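The new list entry means `make update <new_version>` now also rewrites the version pinned in `benchmark/deepseek_v3/README.md`. A minimal sketch of what such a bump amounts to, assuming a plain string replacement (`bump_version` is a hypothetical helper; the Makefile's `update` target is the source of truth):

```python
# Sketch of a version bump across FILES_TO_UPDATE; illustrative only.
from pathlib import Path

FILES_TO_UPDATE = [
    "docker/Dockerfile.rocm",
    "python/pyproject.toml",
    "python/sglang/version.py",
    "docs/developer/setup_github_runner.md",
    "docs/start/install.md",
    "benchmark/deepseek_v3/README.md",  # newly covered by this commit
]  # entries shown in the diff above; the Makefile may list more

def bump_version(old: str, new: str) -> None:
    for rel in FILES_TO_UPDATE:
        path = Path(rel)
        path.write_text(path.read_text().replace(old, new))

# Example: bump_version("0.4.5.post3", "0.4.6")
```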

benchmark/deepseek_v3/README.md

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ Add [performance optimization options](#performance-optimization-options) as nee

 ```bash
 # Installation
-pip install "sglang[all]>=0.4.5.post3"
+pip install "sglang[all]>=0.4.6"

 # Launch
 python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code

docker/Dockerfile.rocm

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 # Usage (to build SGLang ROCm docker image):
-# docker build --build-arg SGL_BRANCH=v0.4.5.post3 -t v0.4.5.post3-rocm630 -f Dockerfile.rocm .
+# docker build --build-arg SGL_BRANCH=v0.4.6 -t v0.4.6-rocm630 -f Dockerfile.rocm .

 # default base image
 ARG BASE_IMAGE="rocm/sgl-dev:vllm20250114"

docs/backend/sampling_params.md

Lines changed: 0 additions & 1 deletion
@@ -35,7 +35,6 @@ The `/generate` endpoint accepts the following parameters in JSON format. For de

 * `frequency_penalty: float = 0.0`: Penalizes tokens based on their frequency in generation so far. Must be between `-2` and `2` where negative numbers encourage repeatment of tokens and positive number encourages sampling of new tokens. The scaling of penalization grows linearly with each appearance of a token.
 * `presence_penalty: float = 0.0`: Penalizes tokens if they appeared in the generation so far. Must be between `-2` and `2` where negative numbers encourage repeatment of tokens and positive number encourages sampling of new tokens. The scaling of the penalization is constant if a token occured.
-* `repetition_penalty: float = 0.0`: Penalizes tokens if they appeared in prompt or generation so far. Must be between `0` and `2` where numbers smaller than `1` encourage repeatment of tokens and numbers larger than `1` encourages sampling of new tokens. The penalization scales multiplicatively.
 * `min_new_tokens: int = 0`: Forces the model to generate at least `min_new_tokens` until a stop word or EOS token is sampled. Note that this might lead to unintended behavior, for example, if the distribution is highly skewed towards these tokens.

 ### Constrained decoding
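For intuition on the two penalties that remain documented above, here is a minimal sketch of the usual formulation (frequency scales with the repeat count, presence is a flat one-time offset); this is illustrative only, not SGLang's actual sampling kernel:

```python
# Illustrative logit adjustment for frequency/presence penalties;
# not SGLang's actual implementation.
from collections import Counter

def apply_penalties(logits, generated, frequency_penalty=0.0, presence_penalty=0.0):
    counts = Counter(generated)
    out = dict(logits)
    for tok, n in counts.items():
        if tok in out:
            out[tok] -= frequency_penalty * n   # grows linearly with each appearance
            out[tok] -= presence_penalty        # constant once the token has occurred
    return out

penalized = apply_penalties({"the": 2.0, "cat": 1.5}, ["the", "the", "cat"],
                            frequency_penalty=0.5, presence_penalty=0.2)
print(penalized)  # "the" is penalized more because it appeared twice
```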

docs/backend/server_arguments.md

Lines changed: 0 additions & 1 deletion
@@ -61,7 +61,6 @@ Please consult the documentation below to learn more about the parameters you ma
 * `revision`: Adjust if a specific version of the model should be used.
 * `skip_tokenizer_init`: Set to true to provide the tokens to the engine and get the output tokens directly, typically used in RLHF. Please see this [example for reference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/token_in_token_out/).
 * `json_model_override_args`: Override model config with the provided JSON.
-* `delete_ckpt_after_loading`: Delete the model checkpoint after loading the model.
 * `disable_fast_image_processor`: Adopt base image processor instead of fast image processor(which is by default). For more detail, see: https://huggingface.co/docs/transformers/main/en/main_classes/image_processor#image-processor

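As a usage illustration for the `json_model_override_args` entry kept above, a hedged sketch of passing a config override at launch time (the model name and the overridden field are assumptions for illustration, not taken from this commit):

```python
# Sketch: pass a JSON model-config override to the launch script.
# The field and model chosen here are illustrative assumptions.
import json
import subprocess

override = json.dumps({"max_position_embeddings": 65536})
subprocess.run(
    [
        "python3", "-m", "sglang.launch_server",
        "--model-path", "meta-llama/Llama-3.1-8B-Instruct",
        "--json-model-override-args", override,
    ],
    check=True,
)
```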
docs/developer/setup_github_runner.md

Lines changed: 2 additions & 2 deletions
@@ -11,9 +11,9 @@ docker pull nvidia/cuda:12.1.1-devel-ubuntu22.04
 # Nvidia
 docker run --shm-size 128g -it -v /tmp/huggingface:/hf_home --gpus all nvidia/cuda:12.1.1-devel-ubuntu22.04 /bin/bash
 # AMD
-docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.4.5.post3-rocm630 /bin/bash
+docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.4.6-rocm630 /bin/bash
 # AMD just the last 2 GPUs
-docker run --rm --device=/dev/kfd --device=/dev/dri/renderD176 --device=/dev/dri/renderD184 --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.4.5.post3-rocm630 /bin/bash
+docker run --rm --device=/dev/kfd --device=/dev/dri/renderD176 --device=/dev/dri/renderD184 --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.4.6-rocm630 /bin/bash
 ```

 ### Step 2: Configure the runner by `config.sh`

docs/index.rst

Lines changed: 1 addition & 1 deletion
@@ -20,8 +20,8 @@ The core features include:
    :maxdepth: 1
    :caption: Backend Tutorial

-   references/llama4
    references/deepseek
+   references/llama4
    backend/send_request.ipynb
    backend/openai_api_completions.ipynb
    backend/openai_api_vision.ipynb

docs/references/deepseek.md

Lines changed: 35 additions & 13 deletions
@@ -1,10 +1,13 @@
 # DeepSeek Usage

-SGLang provides several optimizations specifically designed for the DeepSeek model to boost its inference speed. This document outlines current optimizations for DeepSeek. Additionally, the SGLang team is actively developing enhancements for [DeepSeek V3](https://github.com/sgl-project/sglang/issues/2591).
+SGLang provides many optimizations specifically designed for the DeepSeek models, making it the inference engine recommended by the official [DeepSeek team](https://github.com/deepseek-ai/DeepSeek-V3/tree/main?tab=readme-ov-file#62-inference-with-sglang-recommended) from Day 0.
+
+This document outlines current optimizations for DeepSeek.
+Additionally, the SGLang team is actively developing enhancements following this [Roadmap](https://github.com/sgl-project/sglang/issues/2591).

 ## Launch DeepSeek V3 with SGLang

-SGLang is recognized as one of the top engines for [DeepSeek model inference](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3). To run DeepSeek V3/R1 models, the requirements are as follows:
+To run DeepSeek V3/R1 models, the requirements are as follows:

 | Weight Type | Configuration |
 |------------|-------------------|
@@ -60,15 +63,13 @@ Detailed commands for reference:
 - [32 x L40S (int8)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-32-l40s-with-int8-quantization)

 ### Download Weights
-
 If you encounter errors when starting the server, ensure the weights have finished downloading. It's recommended to download them beforehand or restart multiple times until all weights are downloaded. Please refer to [DeepSeek V3](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base#61-inference-with-deepseek-infer-demo-example-only) official guide to download the weights.

 ### Caching `torch.compile`
-
 The DeepSeek series have huge model weights, it takes some time to compile the model with `torch.compile` for the first time if you have added the flag `--enable-torch-compile`. You can refer [here](https://docs.sglang.ai/backend/hyperparameter_tuning.html#try-advanced-options) to optimize the caching of compilation results, so that the cache can be used to speed up the next startup.
-### Launch with One node of 8 H200

-Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended). **Note that Deepseek V3 is already in FP8. So we should not run it with any quantization arguments like `--quantization fp8 --kv-cache-dtype fp8_e5m2`.** Also, `--enable-dp-attention` can be useful to improve for Deepseek V3/R1's throughput. Please refer to [Data Parallelism Attention](https://docs.sglang.ai/references/deepseek.html#multi-head-latent-attention-mla-throughput-optimizations) for detail.
+### Launch with one node of 8 x H200
+Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended). **Note that Deepseek V3 is already in FP8. So we should not run it with any quantization arguments like `--quantization fp8 --kv-cache-dtype fp8_e5m2`.

 ### Running examples on Multi-node

@@ -86,7 +87,7 @@ Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/be

 - **Weight Absorption**: By applying the associative law of matrix multiplication to reorder computation steps, this method balances computation and memory access and improves efficiency in the decoding phase.

-- **MLA Attention Backends**: Currently SGLang supports different optimized MLA attention backends, including [FlashAttention3](https://github.com/Dao-AILab/flash-attention), [Flashinfer](https://docs.flashinfer.ai/api/mla.html), and [Triton](https://github.com/triton-lang/triton) backends. It can be set with `--attention-backend` argument.
+- **MLA Attention Backends**: Currently SGLang supports different optimized MLA attention backends, including [FlashAttention3](https://github.com/Dao-AILab/flash-attention), [Flashinfer](https://docs.flashinfer.ai/api/mla.html), and [Triton](https://github.com/triton-lang/triton) backends. The default FA3 provides good performance across wide workloads.

 - **FP8 Quantization**: W8A8 FP8 and KV Cache FP8 quantization enables efficient FP8 inference. Additionally, we have implemented Batched Matrix Multiplication (BMM) operator to facilitate FP8 inference in MLA with weight absorption.

@@ -100,13 +101,13 @@ Overall, with these optimizations, we have achieved up to **7x** acceleration in
 <img src="https://lmsys.org/images/blog/sglang_v0_3/deepseek_mla.svg" alt="Multi-head Latent Attention for DeepSeek Series Models">
 </p>

-**Usage**: MLA optimization is enabled by default. To disable chunked prefix cache feature for mla, use `disable-chunked-prefix-cache`.
+**Usage**: MLA optimization is enabled by default.

 **Reference**: Check [Blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations) and [Slides](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/lmsys_1st_meetup_deepseek_mla.pdf) for more details.

 ### Data Parallelism Attention

-**Description**: This optimization involves data parallelism (DP) for the MLA attention mechanism of DeepSeek Series Models, which allows for a significant reduction in the KV cache size, enabling larger batch sizes. Each DP worker independently handles different types of batches (prefill, decode, idle), which are then synchronized before and after processing through the Mixture-of-Experts (MoE) layer.
+**Description**: This optimization involves data parallelism (DP) for the MLA attention mechanism of DeepSeek Series Models, which allows for a significant reduction in the KV cache size, enabling larger batch sizes. Each DP worker independently handles different types of batches (prefill, decode, idle), which are then synchronized before and after processing through the Mixture-of-Experts (MoE) layer. If you do not use DP attention, KV cache will be duplicated among all TP ranks.

 <p align="center">
 <img src="https://lmsys.org/images/blog/sglang_v0_4/dp_attention.svg" alt="Data Parallelism Attention for DeepSeek Series Models">
@@ -119,8 +120,8 @@ With data parallelism attention enabled, we have achieved up to **1.9x** decodin
 </p>

 **Usage**:
-- This optimization is aimed at improving throughput and should be used for scenarios with high QPS (Queries Per Second). It can be enabled by `--enable-dp-attention` for DeepSeek models.
-- Since v0.4.4, DP and TP attention can be flexibly combined. For example, to deploy DeepSeek-V3/R1 on 2 node with 8*H100, you can specify `--tp 16` and `--dp 2`, which means for attention part there are 2 DP groups, and in each DP group there are 8 TP groups.
+- Append `--enable-dp-attention --tp 8 --dp 8` to the server arguments when using 8 H200 GPUs. This optimization improves peak throughput in high batch size scenarios where the server is limited by KV cache capacity. However, it is not recommended for low-latency, small-batch use cases.
+- DP and TP attention can be flexibly combined. For example, to deploy DeepSeek-V3/R1 on 2 nodes with 8 H100 GPUs each, you can specify `--enable-dp-attention --tp 16 --dp 2`. This configuration runs attention with 2 DP groups, each containing 8 TP GPUs.

 **Reference**: Check [Blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models).

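Putting the updated usage bullet together with the launch command from the benchmark README earlier in this commit, a single-node launch with DP attention might look like the sketch below (wrapped in `subprocess` only for illustration; the flags are the ones named in the diff, and an 8-GPU node is assumed):

```python
# Sketch: launch DeepSeek-V3 with DP attention enabled, combining the flags
# cited in the usage notes above; assumes a node with 8 GPUs.
import subprocess

subprocess.run(
    [
        "python3", "-m", "sglang.launch_server",
        "--model", "deepseek-ai/DeepSeek-V3",
        "--trust-remote-code",
        "--tp", "8",
        "--dp", "8",
        "--enable-dp-attention",
    ],
    check=True,
)
```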
@@ -192,10 +193,31 @@ Expected Response
 {"id": "62af80528930423a82c806651ec66e7c", "object": "chat.completion", "created": 1744431333, "model": "deepseek-ai/DeepSeek-V3-0324", "choices": [{"index": 0, "message": {"role": "assistant", "content": null, "reasoning_content": null, "tool_calls": [{"id": "0", "type": "function", "function": {"name": "query_weather", "arguments": "{\\"city\\": \\"Guangzhou\\"}"}}]}, "logprobs": null, "finish_reason": "tool_calls", "matched_stop": null}], "usage": {"prompt_tokens": 118, "total_tokens": 140, "completion_tokens": 22, "prompt_tokens_details": null}}

 ```
-
+Sample Streaming Request:
+```
+curl "http://127.0.0.1:30000/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -d '{"temperature": 0, "max_tokens": 100, "model": "deepseek-ai/DeepSeek-V3-0324","stream":true,"tools": [{"type": "function", "function": {"name": "query_weather", "description": "Get weather of an city, the user should supply a city first", "parameters": {"type": "object", "properties": {"city": {"type": "string", "description": "The city, e.g. Beijing"}}, "required": ["city"]}}}], "messages": [{"role": "user", "content": "Hows the weather like in Qingdao today"}]}'
+```
+Expected Streamed Chunks (simplified for clarity):
+```
+data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"{\""}}]}}]}
+data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"city"}}]}}]}
+data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"\":\""}}]}}]}
+data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"Q"}}]}}]}
+data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"ing"}}]}}]}
+data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"dao"}}]}}]}
+data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"\"}"}}]}}]}
+data: {"choices":[{"delta":{"tool_calls":null}}], "finish_reason": "tool_calls"}
+data: [DONE]
+```
+The client needs to concatenate all arguments fragments to reconstruct the complete tool call:
+```
+{"city": "Qingdao"}
+```
 Important Notes:
 1. Use a lower `"temperature"` value for better results.
-2. Currently, the function calling implementation for deepseek is incompatible with streaming requests.
+


 ## FAQ
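Following the streaming example added above, a client-side sketch of collecting the `arguments` fragments with the OpenAI-compatible Python client (endpoint and model are the ones used in the docs diff; error handling is omitted):

```python
# Sketch: reassemble a streamed tool call from an OpenAI-compatible
# /v1/chat/completions stream served by SGLang at 127.0.0.1:30000.
import json
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "query_weather",
        "description": "Get weather of a city, the user should supply a city first",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "The city, e.g. Beijing"}},
            "required": ["city"],
        },
    },
}]

stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[{"role": "user", "content": "Hows the weather like in Qingdao today"}],
    tools=tools,
    temperature=0,
    max_tokens=100,
    stream=True,
)

name, fragments = None, []
for chunk in stream:
    if not chunk.choices:
        continue
    for call in chunk.choices[0].delta.tool_calls or []:
        fn = call.function
        if fn is None:
            continue
        if fn.name:
            name = fn.name
        if fn.arguments:
            fragments.append(fn.arguments)  # e.g. '{"', 'city', '":"', ...

arguments = json.loads("".join(fragments))
print(name, arguments)  # query_weather {'city': 'Qingdao'}
```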
