
Instruction for Running DeepSeek with Large-scale PD and EP #6017


Using main branch

NOTE: The feature is already on main, but its performance still needs some improvements on the main branch. It will be good after a few already-opened PRs land: PR 6680, 6727, 6728.

NOTE: I will try other configurations, such as 4 nodes for P and 9 nodes for D, later. (Updated: see the 4P + 9D sections below.)

Environment Preparation

Using SGLang and DeepEP on master is sufficient. Also remember to upgrade Mooncake.
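
A minimal sketch of one possible setup flow (commands are illustrative; see each project's README for exact steps and prerequisites such as NVSHMEM for DeepEP):

# Sketch: environment setup (illustrative; adapt to your cluster)
pip install --upgrade mooncake-transfer-engine        # upgrade Mooncake
git clone https://github.com/sgl-project/sglang.git
cd sglang && pip install -e "python[all]" && cd ..
git clone https://github.com/deepseek-ai/DeepEP.git   # requires NVSHMEM; see its README
cd DeepEP && python setup.py install && cd ..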

4P + 9D experiments

Start server
The DeepEP config can be tuned via #6742.

# prefill nodes
MC_TE_METRIC=true SGLANG_TBO_DEBUG=1 python3 -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-ib-device mlx5_1 --disaggregation-mode prefill --dist-init-addr 10.5.55.3:5757 --nnodes 4 --node-rank 0 --tp-size 32 --dp-size 32 --enable-dp-attention --decode-log-interval 1 --enable-deepep-moe --page-size 1 --host 0.0.0.0 --trust-remote-code --moe-dense-tp-size 1 --enable-dp-lm-head --disable-radix-cache --watchdog-timeout 1000000 --enable-two-batch-overlap --deepep-mode normal --mem-fraction-static 0.85 --chunked-prefill-size 524288 --max-running-requests 8192 --max-total-tokens 131072 --context-length 8192 --init-expert-location YOUR_PATH --ep-num-redundant-experts 32 --ep-dispatch-algorithm dynamic --eplb-algorithm deepseek --deepep-config YOUR_PATH

# decode nodes
MC_TE_METRIC=true SGLANG_TBO_DEBUG=1 python3 -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-ib-device mlx5_1 --disaggregation-mode decode --dist-init-addr 10.5.55.7:5757 --nnodes 9 --node-rank 0 --tp-size 72 --dp-size 72 --enable-dp-attention --decode-log-interval 1 --enable-deepep-moe --page-size 1 --host 0.0.0.0 --trust-remote-code --moe-dense-tp-size 1 --enable-dp-lm-head --disable-radix-cache --watchdog-timeout 1000000 --enable-two-batch-overlap --deepep-mode low_latency --mem-fraction-static 0.835 --max-running-requests 18432 --context-length 4500 --init-expert-location YOUR_PATH --ep-num-redundant-experts 32 --cuda-graph-bs 256 --num-reserved-decode-tokens YOUR_VALUE

# load balancer
python3 -m sglang.srt.disaggregation.mini_lb --prefill "http://YOUR_FIRST_PREFILL_NODE_IP:30000" --decode "http://YOUR_FIRST_DECODE_NODE_IP:30000"
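
Once all three components are up, a quick sanity check is to send a single request through the load balancer (the prompt is arbitrary; this assumes the load balancer listens on port 8000, matching the benchmark command below):

# sanity-check the PD setup through the load balancer
curl -H "Content-Type: application/json" \
  -d '{"text": "The capital of France is", "sampling_params": {"max_new_tokens": 16}}' \
  -X POST "http://YOUR_LB_IP:8000/generate"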

Benchmark for prefill

# benchmark
python3 -m sglang.bench_one_batch_server --model-path ${model_path} --base-url http://YOUR_IP:8000 --batch-size 8192 --input-len 4096 --output-len 5 --skip-warmup

Benchmark for decode

  • It is suggested to use 3 prefill nodes and 9 decode nodes to reproduce our results, since 9 decode nodes is half the size of the deployment in DeepSeek’s blog.
  • SGLANG_HACK_PD_DECODE_NUM_RESERVED_DECODE_TOKENS can be set to benchmark-output-len + 2 to maximize the batch size.
  • The example below demonstrates how to use the slow_down debug feature to stress-test decode nodes when there are not enough prefill nodes. If your workload has enough prefill nodes, this step can be omitted.
# slow down D nodes
curl -H "Content-Type: application/json" -d '{"forward_sleep_time": 90.0}' -X POST "http://YOUR_FIRST_DECODE_NODE_IP:30000/slow_down"

# start benchmark; do not wait for this to finish before running the next line
python3 -m sglang.bench_one_batch_server --model-path /dev/shm/DeepSeek-V3-0324 --base-url http://10.10.37.16:7000 --batch-size 40000 --input-len 2000 --output-len 100 --skip-warmup

# after some time (e.g. 10 minutes), the D nodes are saturated; then run this command
# finish slowing down D nodes
curl -H "Content-Type: application/json" -d '{"forward_sleep_time": null}' -X POST "http://YOUR_FIRST_DECODE_NODE_IP:30000/slow_down"
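
The three steps above can also be scripted end to end; a sketch (IPs, URLs, and the 10-minute wait are illustrative):

# Sketch: automate slow-down -> benchmark -> release (values illustrative)
DECODE_URL="http://YOUR_FIRST_DECODE_NODE_IP:30000"
curl -H "Content-Type: application/json" -d '{"forward_sleep_time": 90.0}' -X POST "$DECODE_URL/slow_down"
python3 -m sglang.bench_one_batch_server --model-path /dev/shm/DeepSeek-V3-0324 \
  --base-url http://YOUR_LB_IP:8000 --batch-size 40000 --input-len 2000 --output-len 100 --skip-warmup &
sleep 600   # wait until the D nodes are saturated (~10 minutes)
curl -H "Content-Type: application/json" -d '{"forward_sleep_time": null}' -X POST "$DECODE_URL/slow_down"
wait        # let the benchmark run to completion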

4P + 9D + dynamic EPLB

These are preliminary tests; there may still be room for improvement.

# prefill
MC_TE_METRIC=true SGLANG_TORCH_PROFILER_DIR=/host_home/temp_sglang_server2local SGLANG_TBO_DEBUG=1 PYTHONUNBUFFERED=1 python3 -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-ib-device mlx5_1 --disaggregation-mode prefill --dist-init-addr 10.5.55.3:5757 --nnodes 4 --node-rank 0 --tp-size 32 --dp-size 32 --enable-dp-attention --decode-log-interval 1 --enable-deepep-moe --page-size 1 --host 0.0.0.0 --trust-remote-code --moe-dense-tp-size 1 --enable-dp-lm-head --disable-radix-cache --watchdog-timeout 1000000 --enable-two-batch-overlap --deepep-mode normal --mem-fraction-static 0.85 --chunked-prefill-size 524288 --max-running-requests 8192 --max-total-tokens 65536 --context-length 8192 --enable-eplb --ep-num-redundant-experts 32 --eplb-rebalance-num-iterations YOUR_VALUE --ep-dispatch-algorithm dynamic --deepep-config YOUR_PATH

# decode
MC_TE_METRIC=true SGLANG_TORCH_PROFILER_DIR=/host_home/temp_sglang_server2local SGLANG_TBO_DEBUG=1 PYTHONUNBUFFERED=1 python3 -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-ib-device mlx5_1 --disaggregation-mode decode --dist-init-addr 10.5.55.7:5757 --nnodes 9 --node-rank 0 --tp-size 72 --dp-size 72 --enable-dp-attention --decode-log-interval 1 --enable-deepep-moe --page-size 1 --host 0.0.0.0 --trust-remote-code --moe-dense-tp-size 1 --enable-dp-lm-head --disable-radix-cache --watchdog-timeout 1000000 --enable-two-batch-overlap --deepep-mode low_latency --mem-fraction-static 0.82 --max-running-requests 18432 --context-length 4500 --enable-eplb --ep-num-redundant-experts 32 --eplb-rebalance-num-iterations YOUR_VALUE --cuda-graph-bs 256  --num-reserved-decode-tokens YOUR_VALUE

Create expert distribution data

Requires PR 6964 and PR 6967.

# prefill
SGLANG_DISAGGREGATION_THREAD_POOL_SIZE=4 MC_TE_METRIC=true SGLANG_TORCH_PROFILER_DIR=/host_home/temp_sglang_server2local SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR=/host_home/temp_sglang_server2local SGLANG_TBO_DEBUG=1 PYTHONUNBUFFERED=1 python3 -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-ib-device mlx5_1 --disaggregation-mode prefill --dist-init-addr 10.5.55.1:5757 --nnodes 4 --node-rank 0 --tp-size 32 --dp-size 32 --enable-dp-attention --decode-log-interval 1 --enable-deepep-moe --page-size 1 --host 0.0.0.0 --trust-remote-code --moe-dense-tp-size 1 --enable-dp-lm-head --disable-radix-cache --watchdog-timeout 1000000 --enable-two-batch-overlap --expert-distribution-recorder-mode stat --disable-overlap-schedule --expert-distribution-recorder-buffer-size -1 --deepep-mode normal --mem-fraction-static 0.82 --chunked-prefill-size 524288 --max-running-requests 8192 --max-total-tokens 131072 --context-length 8192 --ep-num-redundant-experts 32 --ep-dispatch-algorithm dynamic --eplb-algorithm deepseek --deepep-config /host_home/primary_synced/tom_sglang_server/misc/deepep_vp.json

# decode
MC_TE_METRIC=true SGLANG_TORCH_PROFILER_DIR=/host_home/temp_sglang_server2local SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR=/host_home/temp_sglang_server2local SGLANG_TBO_DEBUG=1 PYTHONUNBUFFERED=1 python3 -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-ib-device mlx5_1 --disaggregation-mode decode --dist-init-addr 10.5.55.5:5757 --nnodes 9 --node-rank 0 --tp-size 72 --dp-size 72 --enable-dp-attention --decode-log-interval 1 --enable-deepep-moe --page-size 1 --host 0.0.0.0 --trust-remote-code --moe-dense-tp-size 1 --enable-dp-lm-head --disable-radix-cache --watchdog-timeout 1000000 --enable-two-batch-overlap --expert-distribution-recorder-mode stat --disable-overlap-schedule --expert-distribution-recorder-buffer-size -1 --deepep-mode low_latency --mem-fraction-static 0.81 --max-running-requests 18432 --context-length 4500 --ep-num-redundant-experts 32 --cuda-graph-bs 256  --num-reserved-decode-tokens YOUR_VALUE

curl -X POST -H 'Content-Type: application/json' 'http://10.5.55.1:30000/start_expert_distribution_record' -d '{}' 
curl -X POST -H 'Content-Type: application/json' 'http://10.5.55.5:30000/start_expert_distribution_record' -d '{}' 
curl -X POST -H 'Content-Type: application/json' 'http://10.5.55.5:30000/slow_down' -d '{"forward_sleep_time": 90.0}' 
python3 -m sglang.bench_one_batch_server  --base-url http://10.5.55.1:8000 --model-path /dev/shm/DeepSeek-V3-0324 --batch-size 40000 --input-len 2000 --output-len 100 --skip-warmup 
# after a while
curl -X POST -H 'Content-Type: application/json' 'http://10.5.55.5:30000/slow_down' -d '{"forward_sleep_time": null}' 
# after a while
curl -X POST -H 'Content-Type: application/json' 'http://10.5.55.1:30000/dump_expert_distribution_record' -d '{}' 
curl -X POST -H 'Content-Type: application/json' 'http://10.5.55.5:30000/dump_expert_distribution_record' -d '{}' 

You will then get one .pt file for prefill and one for decode; they can be passed to --init-expert-location.
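
To take a quick look at a dump before feeding it back in, something like the following works (the filename is hypothetical and the object layout may differ across SGLang versions):

# Sketch: inspect a dumped expert-distribution file (filename hypothetical)
python3 - <<'EOF'
import torch
data = torch.load("expert_distribution_dump.pt", map_location="cpu")
print(type(data))
if isinstance(data, dict):
    for key, value in data.items():
        # print tensor shapes where available, raw values otherwise
        print(key, getattr(value, "shape", value))
EOF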

Using the blog branch

Environment Preparation

It is suggested to use this Dockerfile https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile.deepep to prepare the dependencies of DeepEP.
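
A sketch of building and entering the container (image tag, mounts, and flags are illustrative):

# Sketch: build and enter the DeepEP container (run from the sglang repo root)
docker build -f docker/Dockerfile.deepep -t sglang-deepep .
docker run --gpus all --network host --ipc=host --shm-size 32g \
  -v /dev/shm/DeepSeek-V3-0324:/dev/shm/DeepSeek-V3-0324 \
  -it sglang-deepep /bin/bash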

Stress-testing Prefill Nodes
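
The commands below are parameterized by shell variables; example values follow (illustrative, adapt to your cluster; tp/dp size is num_nodes * 8 because each node has 8 GPUs):

# Sketch: example variable values for the commands below (illustrative)
model_path=/dev/shm/DeepSeek-V3-0324
device_name=mlx5_1                        # InfiniBand device for KV transfer
node_ip=$(hostname -I | awk '{print $1}') # this node's IP
master_ip=10.5.55.3                       # IP of the node with node-rank 0
num_prefill=4                             # number of prefill nodes
num_decode=9                              # number of decode nodes
node_rank=0                               # set per node: 0 .. nnodes-1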

# prefill nodes
MC_TE_METRIC=true SGLANG_HACK_DEEPEP_NEW_MODE=0 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ${model_path} --disaggregation-mode prefill --disaggregation-ib-device ${device_name} --host ${node_ip} --trust-remote-code --dist-init-addr ${master_ip}:5757 --nnodes ${num_prefill} --node-rank ${node_rank} --tp-size $((${num_prefill}*8)) --dp-size $((${num_prefill}*8)) --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.85 --chunked-prefill-size $((${num_prefill}*131072)) --max-running-requests $((${num_prefill}*2048)) --max-total-tokens 131072 --context-length 8192 --init-expert-location YOUR_EXPERT_LOCATION_HERE --ep-num-redundant-experts 32 --enable-two-batch-overlap --moe-dense-tp-size 1 --disable-radix-cache --ep-dispatch-algorithm random

# decode nodes
SGLANG_HACK_DEEPEP_NEW_MODE=0 SGLANG_HACK_PD_DECODE_NUM_RESERVED_DECODE_TOKENS=102 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ${model_path} --disaggregation-mode decode --disaggregation-ib-device ${device_name} --host ${node_ip} --trust-remote-code --dist-init-addr ${master_ip}:5757 --nnodes ${num_decode} --node-rank ${node_rank} --tp-size $((${num_decode}*8)) --dp-size $((${num_decode}*8)) --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.82 --max-running-requests $((${num_decode}*1024)) --context-length 4500 --init-expert-location YOUR_EXPERT_LOCATION_HERE --enable-two-batch-overlap --moe-dense-tp-size 1 --cuda-graph-bs 128 --disable-radix-cache --decode-log-interval 1

# load balancer
python3 -m sglang.srt.disaggregation.mini_lb --prefill "http://YOUR_FIRST_PREFILL_NODE_IP:30000" --decode "http://YOUR_FIRST_DECODE_NODE_IP:30000"

# benchmark
python3 -m sglang.bench_one_batch_server --model-path ${model_path} --base-url http://YOUR_IP:8000 --batch-size 8192 --input-len 4096 --output-len 5 --skip-warmup

Stress-testing Decode Nodes

# prefill nodes
SGLANG_HACK_DEEPEP_NEW_MODE=0 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ${model_path} --disaggregation-mode prefill --disaggregation-ib-device ${device_name} --host ${node_ip} --trust-remote-code --dist-init-addr ${master_ip}:5050 --nnodes ${num_prefill} --node-rank ${node_rank} --tp-size $((${num_prefill}*8)) --dp-size $((${num_prefill}*8)) --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.85 --chunked-prefill-size $((${num_prefill}*65536)) --max-running-requests $((${num_prefill}*2048)) --max-total-tokens 131076 --context-length 8192 --init-expert-location YOUR_EXPERT_LOCATION_HERE --ep-num-redundant-experts 32 --enable-two-batch-overlap --moe-dense-tp-size 1 --disable-radix-cache

# decode nodes
SGLANG_HACK_DEEPEP_NEW_MODE=0 SGLANG_HACK_PD_DECODE_NUM_RESERVED_DECODE_TOKENS=YOUR_NUM_HERE SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ${model_path} --disaggregation-mode decode --disaggregation-ib-device ${device_name} --host ${node_ip} --trust-remote-code --dist-init-addr ${master_ip}:5050 --nnodes ${num_decode} --node-rank ${node_rank} --tp-size $((${num_decode}*8)) --dp-size $((${num_decode}*8)) --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.846 --chunked-prefill-size 81920 --max-running-requests $((${num_decode}*2048)) --context-length 4096 --init-expert-location YOUR_EXPERT_LOCATION_HERE --ep-num-redundant-experts 32 --enable-two-batch-overlap --moe-dense-tp-size 1 --cuda-graph-bs 256 --disable-radix-cache --decode-log-interval 1

# load balancer
python3 -m sglang.srt.disaggregation.mini_lb --prefill "http://YOUR_FIRST_PREFILL_NODE_IP:30000" --decode "http://YOUR_FIRST_DECODE_NODE_IP:30000"

# slow down D nodes
curl -H "Content-Type: application/json" -d '{"forward_sleep_time": 90.0}' -X POST "http://YOUR_FIRST_DECODE_NODE_IP:30000/slow_down"

# start benchmark; do not wait for this to finish before running the next line
python3 -m sglang.bench_one_batch_server --model-path /dev/shm/DeepSeek-V3-0324 --base-url http://10.10.37.16:7000 --batch-size 40000 --input-len 2000 --output-len 100 --skip-warmup

# after some time (e.g. 10 minutes), the D nodes are saturated; then run this command
# finish slowing down D nodes
curl -H "Content-Type: application/json" -d '{"forward_sleep_time": null}' -X POST "http://YOUR_FIRST_DECODE_NODE_IP:30000/slow_down"

Analyzing Results

Since we are stress-testing only one side (P or D), we need to look at the server logs instead of the benchmark script output.

  • Prefill: for log lines like Prefill batch. ... #new-token: 16384 ... gap_latency: 2.561, the throughput is 16384 / 2.561 ≈ 6398 token/second/device (see the sketch after this list).
  • Decode: read the result from gen throughput (token/s) in the logs.
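
A sketch of extracting the prefill number from a node’s log (the log file name is illustrative; the regex keys match the fields shown above):

# Sketch: per-device prefill throughput from server logs
python3 - prefill_node0.log <<'EOF'
import re, sys
for line in open(sys.argv[1]):
    m = re.search(r"#new-token: (\d+).*gap_latency: ([\d.]+)", line)
    if m:  # e.g. 16384 / 2.561 ~= 6398 token/second/device
        print(f"{int(m.group(1)) / float(m.group(2)):.1f} token/s/device")
EOF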

Remarks

  • Please ensure the batch size is full and avoid padding; otherwise performance is suboptimal due to a bug we will address soon.
    • For example, to ensure a batch size of 256 on 72 decode GPUs, it is reasonable to send 40000 requests (see the arithmetic sketch after this list).
  • The sample commands above only capture a CUDA graph of size 256 to save memory; this can be modified to suit your scenario.
  • For optimal performance, you may need to tune components such as DeepEP on your cluster.
  • DeepGEMM warmup during execution makes overall performance look slower than it is and should be excluded from the analysis.
  • We rushed in the last few days, so the code currently contains many hacks; we will clean it up when merging into master.
  • For expert distribution statistics, our experiments use the same input/output data as above; we provide the statistics for reproducibility: attachment_ep_statistics.zip
  • To debug prefill performance, it may be useful to temporarily use --ep-dispatch-algorithm fake_grouped_uniform to simulate a perfect EPLB; the result should then match the corresponding performance reported in the blog.
  • To analyze performance, it is suggested to use the logs instead of the benchmark script output, because the script output includes the starting and ending phases, during which the system is not fully utilized and is slow.
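
For the batch-size remark above, the arithmetic is simply:

# Sketch: arithmetic behind the 40000-request suggestion
echo $((72 * 256))   # 18432 concurrent requests fill every decode GPU's batch of 256;
                     # sending 40000 (> 2 x 18432) requests keeps batches full mid-run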

Report Template

If you face any issues, feel free to discuss here or in the Slack channel. It would be great if you could provide the following information:

  • Full command to start server and benchmark
  • Logs of all server nodes and benchmark
