Your current environment
4 DGX nodes with 8×H100 each, running current main of vLLM.
🐛 Describe the bug
I am trying to run DeepSeek-R1 (DSR1) with 32-way data-parallel expert parallelism (DEP32) using the DeepEP high-throughput (HT) kernels. My launch command looks like this (on node 2):
VLLM_ALL2ALL_BACKEND="deepep_high_throughput" \
VLLM_USE_DEEP_GEMM=1 \
VLLM_RANDOMIZE_DP_DUMMY_INPUTS=1 \
vllm serve deepseek-ai/DeepSeek-R1 \
--data_parallel_size 32 \
--data-parallel-size-local 8 \
--enable-expert-parallel \
--max-model-len 10240 \
--enforce-eager \
--data-parallel-address eos0391 \
--data-parallel-rpc-port 13345 \
--data-parallel-start-rank 8 \
--headless \
| tee ./dsr1_dep32_node2.log
I am able to run with a max model len of 128, but not even 10240 on 32 H100s due to OOM, which doesn't seem right. The low-latency kernel works fine. Is this expected?
I saw that #19298 addressed this issue, but I still get OOM errors @varun-sundar-rabindranath.
I have attached the logs; the OOM error appears in the node 4 log.
dsr1_dep32_node4.log
dsr1_dep32_node3.log
dsr1_dep32_node2.log
dsr1_dep32_node1.log
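For reference, here is a back-of-envelope estimate of the KV-cache cost of the requested context length. This is only a sketch: the MLA dimensions (kv_lora_rank=512, qk_rope_head_dim=64, 61 layers) are taken from the published DeepSeek-V3/R1 architecture, and bf16 storage is an assumption (an fp8 KV cache would halve it). It suggests the jump from 128 to 10240 tokens should cost well under 1 GiB per sequence, so the OOM is unlikely to be explained by KV cache alone:

```python
# Hedged sketch: MLA KV-cache size per token for DeepSeek-R1.
# MLA caches only the compressed latent (kv_lora_rank) plus the
# decoupled RoPE key (qk_rope_head_dim) per layer, not full K/V heads.
KV_LORA_RANK = 512       # from the DeepSeek-V3/R1 config
QK_ROPE_HEAD_DIM = 64    # from the DeepSeek-V3/R1 config
NUM_LAYERS = 61          # from the DeepSeek-V3/R1 config
BYTES_PER_ELEM = 2       # assumption: bf16 cache (fp8 would be 1 byte)

per_token = (KV_LORA_RANK + QK_ROPE_HEAD_DIM) * NUM_LAYERS * BYTES_PER_ELEM
per_seq = per_token * 10240  # one full 10240-token sequence

print(f"per token: {per_token} bytes (~{per_token / 1024:.1f} KiB)")
print(f"per 10240-token sequence: {per_seq / 2**30:.2f} GiB")
# prints roughly 68.6 KiB per token and ~0.67 GiB per sequence
```

If this arithmetic is right, the extra memory pressure with the HT kernels more likely comes from activation/dispatch buffers sized by max-model-len rather than from the KV cache itself.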
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.