Description
Using the main branch
NOTE: The feature is already on main, but performance on the main branch still needs some improvements. It will be good after a few already-opened PRs land: PR 6680, 6727, and 6728.
NOTE: I will try other configs, such as 4 nodes for P and 9 nodes for D, later. Updated: see below.
Environment Preparation
Using SGLang and DeepEP on master is sufficient. Also remember to upgrade Mooncake.
4P + 9D experiments
Start server
The DeepEP config can be tuned with #6742.
# prefill nodes
MC_TE_METRIC=true SGLANG_TBO_DEBUG=1 python3 -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-ib-device mlx5_1 --disaggregation-mode prefill --dist-init-addr 10.5.55.3:5757 --nnodes 4 --node-rank 0 --tp-size 32 --dp-size 32 --enable-dp-attention --decode-log-interval 1 --enable-deepep-moe --page-size 1 --host 0.0.0.0 --trust-remote-code --moe-dense-tp-size 1 --enable-dp-lm-head --disable-radix-cache --watchdog-timeout 1000000 --enable-two-batch-overlap --deepep-mode normal --mem-fraction-static 0.85 --chunked-prefill-size 524288 --max-running-requests 8192 --max-total-tokens 131072 --context-length 8192 --init-expert-location YOUR_PATH --ep-num-redundant-experts 32 --ep-dispatch-algorithm dynamic --eplb-algorithm deepseek --deepep-config YOUR_PATH
# decode nodes
MC_TE_METRIC=true SGLANG_TBO_DEBUG=1 python3 -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-ib-device mlx5_1 --disaggregation-mode decode --dist-init-addr 10.5.55.7:5757 --nnodes 9 --node-rank 0 --tp-size 72 --dp-size 72 --enable-dp-attention --decode-log-interval 1 --enable-deepep-moe --page-size 1 --host 0.0.0.0 --trust-remote-code --moe-dense-tp-size 1 --enable-dp-lm-head --disable-radix-cache --watchdog-timeout 1000000 --enable-two-batch-overlap --deepep-mode low_latency --mem-fraction-static 0.835 --max-running-requests 18432 --context-length 4500 --init-expert-location YOUR_PATH --ep-num-redundant-experts 32 --cuda-graph-bs 256 --num-reserved-decode-tokens YOUR_VALUE
# load balancer
python3 -m sglang.srt.disaggregation.mini_lb --prefill "http://YOUR_FIRST_PREFILL_NODE_IP:30000" --decode "http://YOUR_FIRST_DECODE_NODE_IP:30000"
Benchmark for prefill
# benchmark
python3 -m sglang.bench_one_batch_server --model-path ${model_path} --base-url http://YOUR_IP:8000 --batch-size 8192 --input-len 4096 --output-len 5 --skip-warmup
Benchmark for decode
- It is suggested to use 3 prefill nodes and 9 decode nodes to reproduce our results, since 9 decode nodes is half the size of that in DeepSeek's blog.
- `SGLANG_HACK_PD_DECODE_NUM_RESERVED_DECODE_TOKENS` can be set to `benchmark-output-len + 2` to maximize the batch size.
- The example below demonstrates how to use the slow_down debug feature to stress-test decode nodes when there are not enough prefill nodes. If your test workload has enough prefill nodes, this can be omitted.
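The reserved-token rule above is simple arithmetic; the sketch below just spells it out (`reserved_decode_tokens` is a hypothetical helper name, not part of SGLang):

```python
def reserved_decode_tokens(benchmark_output_len: int) -> int:
    """Reserve output_len + 2 decode tokens per request, per the rule above,
    so the decode batch size can be maximized without running out of slots."""
    return benchmark_output_len + 2

# For a benchmark with --output-len 100 this gives 102, which matches the
# SGLANG_HACK_PD_DECODE_NUM_RESERVED_DECODE_TOKENS=102 used later in this post.
print(reserved_decode_tokens(100))  # prints 102
```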
# slow down D nodes
curl -H "Content-Type: application/json" -d '{"forward_sleep_time": 90.0}' -X POST "http://YOUR_FIRST_DECODE_NODE_IP:30000/slow_down"
# start benchmark; do not wait for this to finish before running the next line
python3 -m sglang.bench_one_batch_server --model-path /dev/shm/DeepSeek-V3-0324 --base-url http://10.10.37.16:7000 --batch-size 40000 --input-len 2000 --output-len 100 --skip-warmup
# after some time (e.g. 10 minutes), the D nodes are saturated; then run this command
# finish slowing down D nodes
curl -H "Content-Type: application/json" -d '{"forward_sleep_time": null}' -X POST "http://YOUR_FIRST_DECODE_NODE_IP:30000/slow_down"
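The slow-down toggling above can also be scripted. The sketch below assumes only the `/slow_down` endpoint and JSON body shown in the curl commands; `post_slow_down` is a hypothetical helper, not part of SGLang, and the node URL is a placeholder:

```python
import json
import urllib.request

def slow_down_payload(forward_sleep_time):
    """Build the JSON body for /slow_down; None restores normal forward speed."""
    return json.dumps({"forward_sleep_time": forward_sleep_time})

def post_slow_down(decode_node_url, forward_sleep_time):
    """POST the payload to the first decode node and return the HTTP status."""
    req = urllib.request.Request(
        f"{decode_node_url}/slow_down",
        data=slow_down_payload(forward_sleep_time).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# post_slow_down("http://YOUR_FIRST_DECODE_NODE_IP:30000", 90.0)  # slow down
# post_slow_down("http://YOUR_FIRST_DECODE_NODE_IP:30000", None)  # restore
```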
4P + 9D + dynamic EPLB
These are preliminary tests; there may still be room for improvement.
# prefill
MC_TE_METRIC=true SGLANG_TORCH_PROFILER_DIR=/host_home/temp_sglang_server2local SGLANG_TBO_DEBUG=1 PYTHONUNBUFFERED=1 python3 -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-ib-device mlx5_1 --disaggregation-mode prefill --dist-init-addr 10.5.55.3:5757 --nnodes 4 --node-rank 0 --tp-size 32 --dp-size 32 --enable-dp-attention --decode-log-interval 1 --enable-deepep-moe --page-size 1 --host 0.0.0.0 --trust-remote-code --moe-dense-tp-size 1 --enable-dp-lm-head --disable-radix-cache --watchdog-timeout 1000000 --enable-two-batch-overlap --deepep-mode normal --mem-fraction-static 0.85 --chunked-prefill-size 524288 --max-running-requests 8192 --max-total-tokens 65536 --context-length 8192 --enable-eplb --ep-num-redundant-experts 32 --eplb-rebalance-num-iterations YOUR_VALUE --ep-dispatch-algorithm dynamic --deepep-config YOUR_PATH
# decode
MC_TE_METRIC=true SGLANG_TORCH_PROFILER_DIR=/host_home/temp_sglang_server2local SGLANG_TBO_DEBUG=1 PYTHONUNBUFFERED=1 python3 -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-ib-device mlx5_1 --disaggregation-mode decode --dist-init-addr 10.5.55.7:5757 --nnodes 9 --node-rank 0 --tp-size 72 --dp-size 72 --enable-dp-attention --decode-log-interval 1 --enable-deepep-moe --page-size 1 --host 0.0.0.0 --trust-remote-code --moe-dense-tp-size 1 --enable-dp-lm-head --disable-radix-cache --watchdog-timeout 1000000 --enable-two-batch-overlap --deepep-mode low_latency --mem-fraction-static 0.82 --max-running-requests 18432 --context-length 4500 --enable-eplb --ep-num-redundant-experts 32 --eplb-rebalance-num-iterations YOUR_VALUE --cuda-graph-bs 256 --num-reserved-decode-tokens YOUR_VALUE
Create expert distribution data
Requires PRs 6964 and 6967.
# prefill
SGLANG_DISAGGREGATION_THREAD_POOL_SIZE=4 MC_TE_METRIC=true SGLANG_TORCH_PROFILER_DIR=/host_home/temp_sglang_server2local SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR=/host_home/temp_sglang_server2local SGLANG_TBO_DEBUG=1 PYTHONUNBUFFERED=1 python3 -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-ib-device mlx5_1 --disaggregation-mode prefill --dist-init-addr 10.5.55.1:5757 --nnodes 4 --node-rank 0 --tp-size 32 --dp-size 32 --enable-dp-attention --decode-log-interval 1 --enable-deepep-moe --page-size 1 --host 0.0.0.0 --trust-remote-code --moe-dense-tp-size 1 --enable-dp-lm-head --disable-radix-cache --watchdog-timeout 1000000 --enable-two-batch-overlap --expert-distribution-recorder-mode stat --disable-overlap-schedule --expert-distribution-recorder-buffer-size -1 --deepep-mode normal --mem-fraction-static 0.82 --chunked-prefill-size 524288 --max-running-requests 8192 --max-total-tokens 131072 --context-length 8192 --ep-num-redundant-experts 32 --ep-dispatch-algorithm dynamic --eplb-algorithm deepseek --deepep-config /host_home/primary_synced/tom_sglang_server/misc/deepep_vp.json
# decode
MC_TE_METRIC=true SGLANG_TORCH_PROFILER_DIR=/host_home/temp_sglang_server2local SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR=/host_home/temp_sglang_server2local SGLANG_TBO_DEBUG=1 PYTHONUNBUFFERED=1 python3 -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-ib-device mlx5_1 --disaggregation-mode decode --dist-init-addr 10.5.55.5:5757 --nnodes 9 --node-rank 0 --tp-size 72 --dp-size 72 --enable-dp-attention --decode-log-interval 1 --enable-deepep-moe --page-size 1 --host 0.0.0.0 --trust-remote-code --moe-dense-tp-size 1 --enable-dp-lm-head --disable-radix-cache --watchdog-timeout 1000000 --enable-two-batch-overlap --expert-distribution-recorder-mode stat --disable-overlap-schedule --expert-distribution-recorder-buffer-size -1 --deepep-mode low_latency --mem-fraction-static 0.81 --max-running-requests 18432 --context-length 4500 --ep-num-redundant-experts 32 --cuda-graph-bs 256 --num-reserved-decode-tokens YOUR_VALUE
curl -X POST -H 'Content-Type: application/json' 'http://10.5.55.1:30000/start_expert_distribution_record' -d '{}'
curl -X POST -H 'Content-Type: application/json' 'http://10.5.55.5:30000/start_expert_distribution_record' -d '{}'
curl -X POST -H 'Content-Type: application/json' 'http://10.5.55.5:30000/slow_down' -d '{"forward_sleep_time": 90.0}'
python3 -m sglang.bench_one_batch_server --base-url http://10.5.55.1:8000 --model-path /dev/shm/DeepSeek-V3-0324 --batch-size 40000 --input-len 2000 --output-len 100 --skip-warmup
# after a while
curl -X POST -H 'Content-Type: application/json' 'http://10.5.55.5:30000/slow_down' -d '{"forward_sleep_time": null}'
# after a while
curl -X POST -H 'Content-Type: application/json' 'http://10.5.55.1:30000/dump_expert_distribution_record' -d '{}'
curl -X POST -H 'Content-Type: application/json' 'http://10.5.55.5:30000/dump_expert_distribution_record' -d '{}'
Then you will get one `.pt` file for prefill and one for decode. They can be passed to `--init-expert-location`.
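The start/dump sequence above can be wrapped in a small driver. Only the endpoints shown in the curl commands are assumed; the helper names and node URLs below are illustrative:

```python
import json
import urllib.request

def post_json(url, body):
    """POST a JSON body and return the HTTP status."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

def record_url(node_url, action):
    """Build the expert-distribution endpoint URL for one node."""
    assert action in ("start_expert_distribution_record",
                      "dump_expert_distribution_record")
    return f"{node_url}/{action}"

# nodes = ["http://10.5.55.1:30000", "http://10.5.55.5:30000"]
# for n in nodes:
#     post_json(record_url(n, "start_expert_distribution_record"), {})
# ... run the benchmark and toggle slow_down as shown above ...
# for n in nodes:
#     post_json(record_url(n, "dump_expert_distribution_record"), {})
```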
Using the blog branch
Environment Preparation
- Install SGLang on the branch https://github.com/sgl-project/sglang/tree/deepseek_ep
  - One branch that contains EPLB + Two Batch Overlap + dependencies: #5524 (EDIT: do not use this branch, since I am adding more code to it after the blog; please use deepseek_ep instead)
- Install DeepEP on the branch from deepseek-ai/DeepEP#142 ("Fix DeepEP cannot be used together with code that needs GIL such as Mooncake transfer engine"). 2025.05.08 UPDATE: using the latest DeepEP main directly is enough, since my PR has been merged.
- Install the latest Mooncake.
- It is suggested to use this Dockerfile https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile.deepep to prepare the dependencies of DeepEP.
Stress-testing Prefill Nodes
# prefill nodes
MC_TE_METRIC=true SGLANG_HACK_DEEPEP_NEW_MODE=0 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ${model_path} --disaggregation-mode prefill --disaggregation-ib-device ${device_name} --host ${node_ip} --trust-remote-code --dist-init-addr ${master_ip}:5757 --nnodes ${num_prefill} --node-rank ${node_rank} --tp-size $((${num_prefill}*8)) --dp-size $((${num_prefill}*8)) --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.85 --chunked-prefill-size $((${num_prefill}*131072)) --max-running-requests $((${num_prefill}*2048)) --max-total-tokens 131072 --context-length 8192 --init-expert-location YOUR_EXPERT_LOCATION_HERE --ep-num-redundant-experts 32 --enable-two-batch-overlap --moe-dense-tp-size 1 --disable-radix-cache --ep-dispatch-algorithm random
# decode nodes
SGLANG_HACK_DEEPEP_NEW_MODE=0 SGLANG_HACK_PD_DECODE_NUM_RESERVED_DECODE_TOKENS=102 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ${model_path} --disaggregation-mode decode --disaggregation-ib-device ${device_name} --host ${node_ip} --trust-remote-code --dist-init-addr ${master_ip}:5757 --nnodes ${num_decode} --node-rank ${node_rank} --tp-size $((${num_decode}*8)) --dp-size $((${num_decode}*8)) --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.82 --max-running-requests $((${num_decode}*1024)) --context-length 4500 --init-expert-location YOUR_EXPERT_LOCATION_HERE --enable-two-batch-overlap --moe-dense-tp-size 1 --cuda-graph-bs 128 --disable-radix-cache --decode-log-interval 1
# load balancer
python3 -m sglang.srt.disaggregation.mini_lb --prefill "http://YOUR_FIRST_PREFILL_NODE_IP:30000" --decode "http://YOUR_FIRST_DECODE_NODE_IP:30000"
# benchmark
python3 -m sglang.bench_one_batch_server --model-path ${model_path} --base-url http://YOUR_IP:8000 --batch-size 8192 --input-len 4096 --output-len 5 --skip-warmup
Stress-testing Decode Nodes
# prefill nodes
SGLANG_HACK_DEEPEP_NEW_MODE=0 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ${model_path} --disaggregation-mode prefill --disaggregation-ib-device ${device_name} --host ${node_ip} --trust-remote-code --dist-init-addr ${master_ip}:5050 --nnodes ${num_prefill} --node-rank ${node_rank} --tp-size $((${num_prefill}*8)) --dp-size $((${num_prefill}*8)) --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.85 --chunked-prefill-size $((${num_prefill}*65536)) --max-running-requests $((${num_prefill}*2048)) --max-total-tokens 131076 --context-length 8192 --init-expert-location YOUR_EXPERT_LOCATION_HERE --ep-num-redundant-experts 32 --enable-two-batch-overlap --moe-dense-tp-size 1 --disable-radix-cache
# decode nodes
SGLANG_HACK_DEEPEP_NEW_MODE=0 SGLANG_HACK_PD_DECODE_NUM_RESERVED_DECODE_TOKENS=YOUR_NUM_HERE SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ${model_path} --disaggregation-mode decode --disaggregation-ib-device ${device_name} --host ${node_ip} --trust-remote-code --dist-init-addr ${master_ip}:5050 --nnodes ${num_decode} --node-rank ${node_rank} --tp-size $((${num_decode}*8)) --dp-size $((${num_decode}*8)) --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.846 --chunked-prefill-size 81920 --max-running-requests $((${num_decode}*2048)) --context-length 4096 --init-expert-location YOUR_EXPERT_LOCATION_HERE --ep-num-redundant-experts 32 --enable-two-batch-overlap --moe-dense-tp-size 1 --cuda-graph-bs 256 --disable-radix-cache --decode-log-interval 1
# load balancer
python3 -m sglang.srt.disaggregation.mini_lb --prefill "http://YOUR_FIRST_PREFILL_NODE_IP:30000" --decode "http://YOUR_FIRST_DECODE_NODE_IP:30000"
# slow down D nodes
curl -H "Content-Type: application/json" -d '{"forward_sleep_time": 90.0}' -X POST "http://YOUR_FIRST_DECODE_NODE_IP:30000/slow_down"
# start benchmark; do not wait for this to finish before running the next line
python3 -m sglang.bench_one_batch_server --model-path /dev/shm/DeepSeek-V3-0324 --base-url http://10.10.37.16:7000 --batch-size 40000 --input-len 2000 --output-len 100 --skip-warmup
# after some time (e.g. 10 minutes), the D nodes are saturated; then run this command
# finish slowing down D nodes
curl -H "Content-Type: application/json" -d '{"forward_sleep_time": null}' -X POST "http://YOUR_FIRST_DECODE_NODE_IP:30000/slow_down"
Analyzing Results
Since we are stress-testing one side (P or D), we need to look at the server logs instead of the benchmark script's output.
- Prefill: for log lines like `Prefill batch. ... #new-token: 16384 ... gap_latency: 2.561`, the performance is `16384 / 2.561` tokens/second/device.
- Decode: the result can be read from `gen throughput (token/s)` in the logs.
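The prefill arithmetic above can be automated by grepping the server log. A minimal sketch, assuming only the log line format shown (`#new-token: N ... gap_latency: T`); `prefill_throughput` is a hypothetical helper:

```python
import re

def prefill_throughput(log_line):
    """Extract #new-token and gap_latency from a prefill log line and return
    tokens/second/device, as computed in the example above."""
    new_tokens = int(re.search(r"#new-token: (\d+)", log_line).group(1))
    gap = float(re.search(r"gap_latency: ([\d.]+)", log_line).group(1))
    return new_tokens / gap

line = "Prefill batch. ... #new-token: 16384 ... gap_latency: 2.561"
print(round(prefill_throughput(line)))  # prints 6398 (tokens/second/device)
```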
Remarks
- Please ensure the batch size is full and avoid padding, because the performance is suboptimal otherwise due to a bug we will address soon.
- For example, to ensure a batch size of 256 for 72 decode GPUs, it is reasonable to send 40000 requests.
- The sample command above only captures a CUDA graph of size 256 to save memory, which can be modified to suit your scenarios.
- For optimal performance, you may need to tune components such as DeepEP on your cluster.
- DeepGEMM warmup during execution will make overall performance look slow, and should be excluded from the analysis.
- We rushed in the last few days, so the code currently contains many hacks. We will clean it up when merging into master.
- For expert distribution statistics, our experiments use the same input/output data, provided here for reproducibility: attachment_ep_statistics.zip
- To debug prefill performance, it may be useful to temporarily use `--ep-dispatch-algorithm fake_grouped_uniform` to simulate a fake, perfect EPLB; the result should match the corresponding performance reported in the blog.
- To analyze performance, it is suggested to use the logs instead of the benchmark script's output, because the script's output includes the starting and ending phases, where the system is not fully utilized and is slow.
Report Template
If you face any issues, feel free to discuss them here or in the Slack channel, and it would be great to provide the following information:
- Full command to start server and benchmark
- Logs of all server nodes and benchmark