[Bug] When dp < ep, DeepEP MoE raises an error when running on 4*H800 nodes #5656

Closed
@zhangml

Description

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

In the 4x8xH800 environment, when dp=8 and tp=32, the following error occurs:
```
[2025-04-23 03:12:13 DP1 TP5] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 275, in __init__
    self.capture()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 359, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 451, in capture_one_batch_size
    run_once()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 444, in run_once
    logits_output = forward(input_ids, forward_batch.positions, forward_batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1466, in forward
    hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1390, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1173, in forward
    return self.forward_ffn_with_scattered_input(
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1293, in forward_ffn_with_scattered_input
    hidden_states, residual = self.post_attention_layernorm(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/custom_op.py", line 18, in forward
    return self._forward_method(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/layernorm.py", line 71, in forward_cuda
    fused_add_rmsnorm(x, residual, self.weight.data, self.variance_epsilon)
  File "/usr/local/lib/python3.10/dist-packages/sgl_kernel/elementwise.py", line 74, in fused_add_rmsnorm
    torch.ops.sgl_kernel.fused_add_rmsnorm.default(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 723, in __call__
    return self._op(*args, **kwargs)
RuntimeError: CHECK_EQ(input.size(0), residual.size(0)) failed. 32 vs 128

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2001, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 261, in __init__
    self.tp_worker = TpWorkerClass(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 75, in __init__
    self.model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 181, in __init__
    self.initialize(min_per_gpu_memory)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 219, in initialize
    self.init_cuda_graphs()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 980, in init_cuda_graphs
    self.cuda_graph_runner = CudaGraphRunner(self)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 277, in __init__
    raise Exception(
Exception: Capture cuda graph failed: CHECK_EQ(input.size(0), residual.size(0)) failed. 32 vs 128
```
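The mismatched row counts line up with the launch arguments: with --tp 32 and --dp 8 there are tp/dp = 4 attention-TP ranks per DP group, and --cuda-graph-max-bs 128 divided by 4 is 32. A minimal sketch of that arithmetic, assuming the scattered-input FFN path splits the captured batch across the attention-TP group while the residual still carries the full batch (the variable names below are illustrative, not sglang's actual code):

```python
# Hypothetical sketch of the "32 vs 128" mismatch seen in the traceback.
# Names are illustrative only; they do not mirror sglang internals.
tp_size = 32        # --tp 32
dp_size = 8         # --dp 8
capture_bs = 128    # --cuda-graph-max-bs 128

attn_tp_size = tp_size // dp_size            # 4 TP ranks share each DP group
scattered_rows = capture_bs // attn_tp_size  # 128 / 4 = 32 rows per rank

# If hidden_states arrives scattered (32 rows) while residual still holds the
# full captured batch (128 rows), fused_add_rmsnorm's CHECK_EQ on dim 0 fails.
assert (scattered_rows, capture_bs) == (32, 128)
```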

Reproduction

python3 -m sglang.launch_server --model-path /code/llm-benchmark-script/data/raw/DeepSeek-V3 --host 0.0.0.0 --port 6178 --tp 32 --dp 8 --enable-dp-attention --disable-radix-cache --trust-remote-code --chunked-prefill-size 4096 --enable-deepep-moe --max-running-requests 128 --mem-fraction-static 0.8 --stream-output --deepep-mode low_latency --moe-dense-tp-size 1 --cuda-graph-max-bs 128 --dist-init-addr xxxx:20000 --nnodes 4 --node-rank 0
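(This is the command for node rank 0; presumably the same command is launched on the remaining three nodes with --node-rank 1, 2, and 3 and the same --dist-init-addr.)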

Environment

4x8xH800 (4 nodes with 8 H800 GPUs each, 32 GPUs in total)
