[Bug] When dp < ep, DeepEP MoE raises an error when running on 4*H800 nodes #5656

Closed
@zhangml

Description

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

In the 4x8xH800 environment, when dp=8 and tp=32, the following error occurs:
```
[2025-04-23 03:12:13 DP1 TP5] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 275, in __init__
    self.capture()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 359, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 451, in capture_one_batch_size
    run_once()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 444, in run_once
    logits_output = forward(input_ids, forward_batch.positions, forward_batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1466, in forward
    hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1390, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1173, in forward
    return self.forward_ffn_with_scattered_input(
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1293, in forward_ffn_with_scattered_input
    hidden_states, residual = self.post_attention_layernorm(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/custom_op.py", line 18, in forward
    return self._forward_method(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/layernorm.py", line 71, in forward_cuda
    fused_add_rmsnorm(x, residual, self.weight.data, self.variance_epsilon)
  File "/usr/local/lib/python3.10/dist-packages/sgl_kernel/elementwise.py", line 74, in fused_add_rmsnorm
    torch.ops.sgl_kernel.fused_add_rmsnorm.default(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 723, in __call__
    return self._op(*args, **kwargs)
RuntimeError: CHECK_EQ(input.size(0), residual.size(0)) failed. 32 vs 128

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2001, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 261, in __init__
    self.tp_worker = TpWorkerClass(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 75, in __init__
    self.model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 181, in __init__
    self.initialize(min_per_gpu_memory)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 219, in initialize
    self.init_cuda_graphs()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 980, in init_cuda_graphs
    self.cuda_graph_runner = CudaGraphRunner(self)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 277, in __init__
    raise Exception(
Exception: Capture cuda graph failed: CHECK_EQ(input.size(0), residual.size(0)) failed. 32 vs 128
```
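The mismatched row counts line up with the launch arguments: with --tp 32 and --dp 8 there are tp/dp = 4 attention-TP ranks per DP group, and --cuda-graph-max-bs 128 divided by 4 is 32. A minimal sketch of that arithmetic, assuming the scattered-input FFN path splits the captured batch across the attention-TP group while the residual still carries the full batch (the variable names below are illustrative, not sglang's actual code):

```python
# Hypothetical sketch of the "32 vs 128" mismatch seen in the traceback.
# Names are illustrative only; they do not mirror sglang internals.
tp_size = 32        # --tp 32
dp_size = 8         # --dp 8
capture_bs = 128    # --cuda-graph-max-bs 128

attn_tp_size = tp_size // dp_size            # 4 TP ranks share each DP group
scattered_rows = capture_bs // attn_tp_size  # 128 / 4 = 32 rows per rank

# If hidden_states arrives scattered (32 rows) while residual still holds the
# full captured batch (128 rows), fused_add_rmsnorm's CHECK_EQ on dim 0 fails.
assert (scattered_rows, capture_bs) == (32, 128)
```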

Reproduction

python3 -m sglang.launch_server --model-path /code/llm-benchmark-script/data/raw/DeepSeek-V3 --host 0.0.0.0 --port 6178 --tp 32 --dp 8 --enable-dp-attention --disable-radix-cache --trust-remote-code --chunked-prefill-size 4096 --enable-deepep-moe --max-running-requests 128 --mem-fraction-static 0.8 --stream-output --deepep-mode low_latency --moe-dense-tp-size 1 --cuda-graph-max-bs 128 --dist-init-addr xxxx:20000 --nnodes 4 --node-rank 0
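(This is the command for node rank 0; presumably the same command is launched on the remaining three nodes with --node-rank 1, 2, and 3 and the same --dist-init-addr.)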

Environment

4x8xH800 (4 nodes with 8 H800 GPUs each, 32 GPUs in total)
