
Commit 5526f28

merrymercy and lifuhuang authored and committed
Revert "fix some typos" (sgl-project#6244)
1 parent f613d14 commit 5526f28

95 files changed: +276 −276 lines changed


3rdparty/amd/profiling/PROFILING.md

Lines changed: 1 addition & 1 deletion
@@ -356,7 +356,7 @@ client.sh
 # Start profiling via API
 curl http://localhost:30000/start_profile -H "Content-Type: application/json"

-# Benchmark serving using SGLang with a random dataset and tokenizer
+# Benchmark serving using sglang with random dataset and tokenizer
 # Define the log file with a timestamp
 TIMESTAMP=$(date +%Y%m%d_%H%M%S)
 LOGFILE="sglang_client_log_$TIMESTAMP.json"
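
For context, the profiling flow above boils down to wrapping one benchmark run between the profiler start/stop endpoints. A minimal sketch of that wrapper, assuming the server is already listening on port 30000 and that the companion `/stop_profile` endpoint and the `sglang.bench_serving` load generator behave as in recent SGLang releases:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: profile a single benchmark run end to end.
set -euo pipefail

HOST=http://localhost:30000                      # assumed server address
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
LOGFILE="sglang_client_log_$TIMESTAMP.json"

# Start profiling via the HTTP API (same call as in the diff above).
curl -s "$HOST/start_profile" -H "Content-Type: application/json"

# Drive load with a random dataset; flag values are illustrative only.
python3 -m sglang.bench_serving --backend sglang --dataset-name random \
    --num-prompts 64 2>&1 | tee "$LOGFILE"

# Stop profiling (assumed counterpart of /start_profile).
curl -s "$HOST/stop_profile" -H "Content-Type: application/json"
```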

3rdparty/amd/tuning/TUNING.md

Lines changed: 7 additions & 7 deletions
@@ -93,21 +93,21 @@ TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDU
 #Inference with large improvement on AMD GPU
 TORCHINDUCTOR_FREEZING=1 your_script.sh
 ```
-## 4. Fused MoE kernel
-To maximize MoE kernel efficiency, need to use below scripts to find out the best launch configuration
+## 4. Fused MOE kernel
+To maximize moe kernel efficiency, need to use below scripts to find out the best launch configuration

 ### Key parameters:
-- **--model**: what MoE model type to do tuning, it will automatically decide the size of d_model, model_intermediate_size, num_layers
+- **--model**: what moe model type to do tuning, it will automatically decide the size of d_model, model_intermediate_size, num_layers
 - **--tp-size**: simulate the whole model run configuration to set the dimension size using tp correctly
-- **--batch**: M dimension size of MoE kernel, for prefill MoE kernel the value is batch*input_len, for decode MoE kernel the value is batch
+- **--batch**: M dimension size of moe kernel, for prefill moe kernel the value is batch*input_len, for decode moe kernel the value is batch
 - **--dtype**: computation type

 ```bash
 #Tuning
-#for example, we have one case like this "python3 -m sglang.bench_latency --model dummy_grok1/ --load-format dummy --tokenizer-path Xenova/grok-1-tokenizer --tp 8 --batch-size 32 --input 1024 --output 8 --attention-backend triton --sampling-backend pytorch --quantization fp8" to run, it defined batch-size 32 input length 1024 and output length 8, from "--batch" in MoE view point, the prefill batch is 32*1024 = 32768, the decode batch is 32*1(only one output token generated in each run).
-#so we can tune decode MoE use below command
+#for example, we have one case like this "python3 -m sglang.bench_latency --model dummy_grok1/ --load-format dummy --tokenizer-path Xenova/grok-1-tokenizer --tp 8 --batch-size 32 --input 1024 --output 8 --attention-backend triton --sampling-backend pytorch --quantization fp8" to run, it defined batch-size 32 input length 1024 and output length 8, from "--batch" in moe view point, the prefill batch is 32*1024 = 32768, the decode batch is 32*1(only one output token generated in each run).
+#so we can tune decode moe use below command
 python benchmark_moe_rocm.py --model grok1 --tp-size 8 --dtype float8 --batch "32"
-# and use this command to tune prefill MoE
+# and use this command to tune prefill moe
 python benchmark_moe_rocm.py --model grok1 --tp-size 8 --dtype float8 --batch "32768"
 ```
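
The `--batch` arithmetic described in that hunk can be scripted so the prefill and decode tuning runs stay consistent with the serving configuration. A minimal sketch using only the `benchmark_moe_rocm.py` flags shown above; the batch-size and input-length values mirror the `bench_latency` example:

```bash
#!/usr/bin/env bash
# Derive the MoE tuning M dimensions from the serving configuration.
BATCH_SIZE=32     # --batch-size in the bench_latency example
INPUT_LEN=1024    # --input in the bench_latency example

PREFILL_M=$((BATCH_SIZE * INPUT_LEN))   # 32 * 1024 = 32768 tokens per prefill step
DECODE_M=$((BATCH_SIZE * 1))            # one generated token per sequence per decode step

# Tune the decode kernel, then the prefill kernel, with the derived sizes.
python benchmark_moe_rocm.py --model grok1 --tp-size 8 --dtype float8 --batch "$DECODE_M"
python benchmark_moe_rocm.py --model grok1 --tp-size 8 --dtype float8 --batch "$PREFILL_M"
```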

README.md

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ SGLang is a fast serving framework for large language models and vision language
 It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
 The core features include:

-- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, continuous batching, token attention (PagedAttention), speculative decoding, tensor parallelism, chunked prefill, structured outputs, quantization (FP8/INT4/AWQ/GPTQ), and multi-LoRA batching.
+- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, continuous batching, token attention (paged attention), speculative decoding, tensor parallelism, chunked prefill, structured outputs, quantization (FP8/INT4/AWQ/GPTQ), and multi-lora batching.
 - **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
 - **Extensive Model Support**: Supports a wide range of generative models (Llama, Gemma, Mistral, Qwen, DeepSeek, LLaVA, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
 - **Active Community**: SGLang is open-source and backed by an active community with industry adoption.

benchmark/benchmark_vllm_060/README.md

Lines changed: 3 additions & 3 deletions
@@ -1,6 +1,6 @@
 ## How to reproduce the benchmark results for SGLang v0.3.0 compared to vLLM v0.6.0

-In short, with multi step enabled, in online scenarios that we benchmarked, the Median TTFT of vLLM is **3 times** that of SGLang, and the Median ITL is **10 times** that of SGLang. Lower Median TTFT and ITL are better. vLLM's multi-step optimization did not improve throughput while ensuring lower Median TTFT and ITL. Also, under maximum throughput benchmark, if vLLM does not set GPU utilization to 0.95 separately and uses the default configuration instead, its maximum throughput is **lower** than that of SGLang.
+In short, with multi step enabled, in online scenarios that we benchmarked, the Median TTFT of vLLM is **3 times** that of SGLang, and the Median ITL is **10 times** that of SGLang. Lower Median TTFT and ITL are better. vLLM's multi-step optimization did not improve throughput while ensuring lower Median TTFT and ITL. Also, under maximum throughput benchmark, if vLLM does not set gpu util to 0.95 separately and uses the default configuration instead, its maximum throughput is **lower** than that of SGLang.

 ## Online benchmark results

@@ -41,12 +41,12 @@ In short, with multi step enabled, in online scenarios that we benchmarked, the
 ## Installation

 ```bash
-# install SGLang v0.3.0
+# install sglang v0.3.0
 pip install --upgrade pip
 pip install "sglang[all]"==0.3.0
 pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/

-# install vLLM v0.6.0
+# install vllm v0.6.0
 pip install vllm==0.6.0
 ```
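
To reproduce the comparison after installing both stacks, the usual pattern is to launch each server and point the same load generator at it. A rough sketch; the model name, ports, and prompt count are placeholders, and `sglang.bench_serving` is assumed to be the client used for the published numbers:

```bash
# Terminal 1: SGLang v0.3.0 server.
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000

# Terminal 2: vLLM v0.6.0 OpenAI-compatible server.
python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --port 8000

# Terminal 3: send the same workload to each backend.
python3 -m sglang.bench_serving --backend sglang --port 30000 --num-prompts 1000
python3 -m sglang.bench_serving --backend vllm --port 8000 --num-prompts 1000
```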

benchmark/deepseek_v3/README.md

Lines changed: 8 additions & 8 deletions
@@ -45,10 +45,10 @@ Add [performance optimization options](#performance-optimization-options) as nee

 ### Performance Optimization Options

-[MLA optimizations](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations) are enabled by default. Here are some optional optimizations that can be enabled as needed.
+[MLA optimizations](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations) are enabled by default. Here are some optional optimizations can be enabled as needed.

 - [Data Parallelism Attention](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models): For high QPS scenarios, add the `--enable-dp-attention` argument to boost throughput.
-- [Torch.compile Optimization](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#torchcompile-latency-optimizations): Add `--enable-torch-compile` argument to enable it. This will take some time while the server starts. The maximum batch size for torch.compile optimization can be controlled with `--torch-compile-max-bs`. It's recommended to set it between `1` and `8`. (e.g., `--torch-compile-max-bs 8`)
+- [Torch.compile Optimization](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#torchcompile-latency-optimizations): Add `--enable-torch-compile` argument to enable it. This will take some time while server starts. The maximum batch size for torch.compile optimization can be controlled with `--torch-compile-max-bs`. It's recommended to set it between `1` and `8`. (e.g., `--torch-compile-max-bs 8`)

 ### Example: Sending requests with OpenAI API

@@ -90,7 +90,7 @@ If you have two H100 nodes, the usage is similar to the aforementioned H20.

 > **Note that the launch command here does not enable Data Parallelism Attention or `torch.compile` Optimization**. For optimal performance, please refer to the command options in [Performance Optimization Options](#option_args).

-### Example: Serving with two H200\*8 nodes and Docker
+### Example: Serving with two H200\*8 nodes and docker

 There are two H200 nodes, each with 8 GPUs. The first node's IP is `192.168.114.10`, and the second node's IP is `192.168.114.11`. Configure the endpoint to expose it to another Docker container using `--host 0.0.0.0` and `--port 40000`, and set up communications with `--dist-init-addr 192.168.114.10:20000`.
 A single H200 with 8 devices can run DeepSeek V3, the dual H200 setup is just to demonstrate multi-node usage.
@@ -147,7 +147,7 @@ docker run --gpus all \

 To serve DeepSeek-V3 with A100 GPUs, we need to convert the [FP8 model checkpoints](https://huggingface.co/deepseek-ai/DeepSeek-V3) to BF16 with [script](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py) mentioned [here](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py) first.

-Since the BF16 model is over 1.3 TB, we need to prepare four A100 nodes, each with 8 80GB GPUs. Assuming the first node's IP is `10.0.0.1`, and the converted model path is `/path/to/DeepSeek-V3-BF16`, we can run the following commands to launch the server.
+Since the BF16 model is over 1.3 TB, we need to prepare four A100 nodes, each with 8 80GB GPUs. Assume the first node's IP is `10.0.0.1`, and the converted model path is `/path/to/DeepSeek-V3-BF16`, we can have following commands to launch the server.

 ```bash
 # node 1
@@ -178,7 +178,7 @@ python3 -m sglang.bench_one_batch_server --model None --base-url http://10.0.0.1

 ### Example: Serving with 8 A100/A800 with AWQ Quantization

-Add the `--quantization moe_wna16` flag to enable the MoE wna16 kernel for better performance.
+Add `--quantization moe_wna16` flag to enable moe wna16 kernel for better performance.
 One example is as follows:

 ```bash
@@ -188,12 +188,12 @@ python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --

 ### Example: Serving with 16 A100/A800 with int8 Quantization

-There are block-wise and per-channel quantization methods, and the quantization parameters have already been uploaded to HuggingFace. One example is as follows:
+There are block-wise and per-channel quantization methods, and the quantization parameters have already been uploaded to Huggingface. One example is as follows:

 - [meituan/DeepSeek-R1-Block-INT8](https://huggingface.co/meituan/DeepSeek-R1-Block-INT8)
 - [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8)

-Assuming that the master node IP is `MASTER_IP`, checkpoint path is `/path/to/DeepSeek-R1-INT8` and port=5000, we can run the following commands to launch the server:
+Assuming that master node IP is `MASTER_IP`, checkpoint path is `/path/to/DeepSeek-R1-INT8` and port=5000, we can have following commands to launch the server:
 ```bash
 #master
 python3 -m sglang.launch_server \
@@ -225,7 +225,7 @@ Running with per-channel quantization model:

 - [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8)

-Assuming that the master node IP is `MASTER_IP`, checkpoint path is `/path/to/DeepSeek-R1-Channel-INT8` and port=5000, we can run the following commands to launch the server:
+Assuming that master node IP is `MASTER_IP`, checkpoint path is `/path/to/DeepSeek-R1-Channel-INT8` and port=5000, we can have following commands to launch the server:

 ```bash
 #master
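
Once any of the launch commands in this file is up, requests go through the server's OpenAI-compatible endpoint. A minimal sketch; the host, port, and model field are placeholders that must match whatever the launch command used:

```bash
curl -s http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "default",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "max_tokens": 64
      }'
```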

benchmark/gsm8k/README.md

Lines changed: 6 additions & 6 deletions
@@ -1,6 +1,6 @@
-## Run Benchmark
+## Run benchmark

-### Benchmark SGLang
+### Benchmark sglang
 ```
 python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000
 ```
@@ -10,7 +10,7 @@ python3 bench_sglang.py --num-questions 200
 ```


-### Benchmark vLLM
+### Benchmark vllm
 ```
 python3 -m vllm.entrypoints.api_server --tokenizer-mode auto --model meta-llama/Llama-2-7b-chat-hf --disable-log-requests --port 21000
 ```
@@ -20,7 +20,7 @@ python3 bench_other.py --num-questions 200 --backend vllm
 ```


-### Benchmark LightLLM
+### Benchmark lightllm
 ```
 # A10G
 python -m lightllm.server.api_server --tokenizer_mode auto --model_dir ~/model_weights/llama-2-7b-chat-hf --max_total_token_num 16000 --port 22000
@@ -31,13 +31,13 @@ python3 bench_other.py --num-questions 200 --backend lightllm
 ```


-### Benchmark Guidance
+### Benchmark guidance
 ```
 python3 bench_other.py --num-questions 200 --backend guidance --parallel 1 --n-ctx 4096 --model-path path/to/gguf
 ```


-### Benchmark LMQL
+### Benchmark lmql
 ```
 CUDA_VISIBLE_DEVICES=0,1 lmql serve-model meta-llama/Llama-2-7b-chat-hf --cuda --port 23000
 ```
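
The server launch and the client script in this diff are usually combined into a single run. A rough end-to-end sketch; the background launch, `/health` polling, and cleanup are assumptions rather than part of the README:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Launch the server in the background and remember its PID.
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 &
SERVER_PID=$!

# Wait until the server responds (assumed /health endpoint).
until curl -sf http://localhost:30000/health > /dev/null; do sleep 5; done

# Run the GSM8K client from the diff above.
python3 bench_sglang.py --num-questions 200

# Shut the server down.
kill "$SERVER_PID"
```

The same pattern applies to the other benchmark READMEs touched by this commit (hellaswag, mmlu, mtbench, multi_chain_reasoning); only the client script and its flags change.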

benchmark/hellaswag/README.md

Lines changed: 5 additions & 5 deletions
@@ -1,6 +1,6 @@
 ## Run benchmark

-### Benchmark SGLang
+### Benchmark sglang
 ```
 python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000
 ```
@@ -10,7 +10,7 @@ python3 bench_sglang.py --num-questions 200
 ```


-### Benchmark vLLM
+### Benchmark vllm
 ```
 python3 -m vllm.entrypoints.api_server --tokenizer-mode auto --model meta-llama/Llama-2-7b-chat-hf --disable-log-requests --port 21000
 ```
@@ -20,7 +20,7 @@ python3 bench_other.py --num-questions 200 --backend vllm
 ```


-### Benchmark LightLLM
+### Benchmark lightllm
 ```
 # A10G
 python -m lightllm.server.api_server --tokenizer_mode auto --model_dir ~/model_weights/llama-2-7b-chat-hf --max_total_token_num 16000 --port 22000
@@ -31,13 +31,13 @@ python3 bench_other.py --num-questions 200 --backend lightllm
 ```


-### Benchmark Guidance
+### Benchmark guidance
 ```
 CUDA_VISIBLE_DEVICES=0,1 python3 bench_other.py --num-questions 200 --backend guidance --parallel 1 --n-ctx 4096 --model-path path/to/gguf
 ```


-### Benchmark LMQL
+### Benchmark lmql
 ```
 lmql serve-model meta-llama/Llama-2-7b-chat-hf --cuda --port 23000
 ```

benchmark/kernels/fused_moe_triton/README.md

Lines changed: 2 additions & 2 deletions
@@ -4,7 +4,7 @@ This directory contains benchmarking tools for MoE (Mixture of Experts) kernels.

 ### Tuning Tool

-- `tuning_fused_moe_triton.py`: A tool for tuning the `fused_moe_triton` kernel. Adapted from [vLLM's benchmark_moe.py](https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_moe.py), with added support for various model architectures.
+- `tuning_fused_moe_triton.py`: A tool for tuning the `fused_moe_triton` kernel. Adapted from [vllm's benchmark_moe.py](https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_moe.py), with added support for various model architectures.

 Example usage:
 ```bash
@@ -48,7 +48,7 @@ After tuning, a configuration file (e.g., `E=64,N=640,device_name=NVIDIA_GeForce

 ### Performance Comparison Tool

-- `benchmark_vllm_vs_sglang_fused_moe_triton.py`: A tool for comparing the performance of fused MoE kernels between vLLM and SGLang implementations. Supports various model architectures and data types.
+- `benchmark_vllm_vs_sglang_fused_moe_triton.py`: A tool for comparing the performance of fused MoE kernels between vllm and sglang implementations. Supports various model architectures and data types.

 Example usage:
 ```bash
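
The "Example usage" blocks are truncated in this view. For orientation only, invocations typically look something like the following; the model id and every flag here are assumptions carried over from the vLLM script these tools were adapted from, so check each script's `--help` before relying on them:

```bash
# Tune the fused MoE Triton kernel for a given model and TP size (assumed flags).
python3 tuning_fused_moe_triton.py --model Qwen/Qwen2-57B-A14B-Instruct --tp-size 4 --dtype fp8_w8a8 --tune

# Compare the vLLM and SGLang fused MoE kernels on the same shapes (assumed flags).
python3 benchmark_vllm_vs_sglang_fused_moe_triton.py --model Qwen/Qwen2-57B-A14B-Instruct --tp-size 4
```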

benchmark/mmlu/README.md

Lines changed: 7 additions & 7 deletions
@@ -1,11 +1,11 @@
-## Download Data
+## Download data
 ```
 bash download_data.sh
 ```

-## Run Benchmark
+## Run benchmark

-### Benchmark SGLang
+### Benchmark sglang
 ```
 python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000
 ```
@@ -19,7 +19,7 @@ python3 bench_sglang.py --nsub 10
 python3 bench_sglang.py --backend gpt-3.5-turbo --parallel 8
 ```

-### Benchmark vLLM
+### Benchmark vllm
 ```
 python3 -m vllm.entrypoints.api_server --tokenizer-mode auto --model meta-llama/Llama-2-7b-chat-hf --disable-log-requests --port 21000
 ```
@@ -29,7 +29,7 @@ python3 bench_other.py --nsub 10 --backend vllm
 ```


-### Benchmark LightLLM
+### Benchmark lightllm
 ```
 # A10G
 python -m lightllm.server.api_server --tokenizer_mode auto --model_dir ~/model_weights/llama-2-7b-chat-hf --max_total_token_num 16000 --port 22000
@@ -43,13 +43,13 @@ python3 bench_other.py --nsub 10 --backend lightllm
 ```


-### Benchmark Guidance
+### Benchmark guidance
 ```
 python3 bench_other.py --nsub 10 --backend guidance --parallel 1 --n-ctx 4096 --model-path path/to/gguf
 ```


-### Benchmark LMQL
+### Benchmark lmql
 ```
 CUDA_VISIBLE_DEVICES=0,1 lmql serve-model meta-llama/Llama-2-7b-chat-hf --cuda --port 23000
 ```

benchmark/mtbench/README.md

Lines changed: 5 additions & 5 deletions
@@ -4,9 +4,9 @@
 wget -O question.jsonl https://raw.githubusercontent.com/lm-sys/FastChat/main/fastchat/llm_judge/data/mt_bench/question.jsonl
 ```

-## Run Benchmark
+## Run benchmark

-### Benchmark SGLang
+### Benchmark sglang
 ```
 python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000
 ```
@@ -15,7 +15,7 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
 python3 bench_sglang.py --num-questions 80
 ```

-### Benchmark SGLang EAGLE
+### Benchmark sglang EAGLE
 ```
 python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3-8B-Instruct --speculative-algo EAGLE \
 --speculative-draft lmsys/sglang-EAGLE-LLaMA3-Instruct-8B --speculative-num-steps 5 \
@@ -27,7 +27,7 @@ python3 bench_sglang_eagle.py --num-questions 80 --parallel 1
 ```


-### Benchmark vLLM
+### Benchmark vllm
 ```
 python3 -m vllm.entrypoints.api_server --tokenizer-mode auto --model meta-llama/Llama-2-7b-chat-hf --disable-log-requests --port 21000
 ```
@@ -37,7 +37,7 @@ python3 bench_other.py --num-questions 80 --backend vllm
 ```


-### Benchmark LightLLM
+### Benchmark lightllm
 ```
 # A10G
 python -m lightllm.server.api_server --tokenizer_mode auto --model_dir ~/model_weights/llama-2-7b-chat-hf --max_total_token_num 16000 --port 22000

benchmark/multi_chain_reasoning/README.md

Lines changed: 7 additions & 7 deletions
@@ -1,11 +1,11 @@
-## Download Data
+## Download data
 ```
 wget https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl
 ```

-## Run Benchmark
+## Run benchmark

-### Benchmark SGLang
+### Benchmark sglang
 ```
 python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --schedule-conservativeness 1.3
 ```
@@ -16,7 +16,7 @@ python3 bench_sglang.py --num-questions 32 --parallel 1
 ```


-### Benchmark vLLM
+### Benchmark vllm
 ```
 python3 -m vllm.entrypoints.api_server --tokenizer-mode auto --model meta-llama/Llama-2-7b-chat-hf --disable-log-requests --port 21000
 ```
@@ -26,7 +26,7 @@ python3 bench_other.py --num-questions 64 --backend vllm
 ```


-### Benchmark LightLLM
+### Benchmark lightllm
 ```
 # A10G
 python -m lightllm.server.api_server --tokenizer_mode auto --model_dir ~/model_weights/llama-2-7b-chat-hf --max_total_token_num 16000 --port 22000
@@ -37,12 +37,12 @@ python3 bench_other.py --num-questions 64 --backend lightllm
 ```


-### Benchmark Guidance
+### Benchmark guidance
 ```
 python3 bench_other.py --num-questions 8 --backend guidance --parallel 1 --n-ctx 4096 --model-path path/to/gguf
 ```

-### Benchmark LMQL
+### Benchmark lmql

 ```
 python3 bench_other.py --num-questions 64 --backend lmql --parallel 1
